Standardizing Model Training and Feature Processing in Recommendation Systems

This article describes a standardized workflow for feature collection, configuration, processing, and model training/prediction in large‑scale recommendation systems, using CSV‑based definitions and code generation to ensure consistency between offline training and online serving while reducing manual coding effort.

DataFunTalk

The talk, presented by senior Tencent engineer Liang Chao, covers how to standardize model training and usage in recommendation systems, where click‑through‑rate (CTR) estimation models are both critical and complex.

In a typical recommendation pipeline, millions of items are filtered to a few dozen candidates through three steps: candidate generation using user, item, and context features; ranking with CTR or preference models; and final selection applying business rules and diversity constraints.
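The three-step funnel can be sketched in a few lines. This is a minimal illustration, not the pipeline from the talk: the item fields, the tag-overlap recall rule, the stored `ctr` value standing in for a model score, and the per-category cap are all assumptions chosen to keep the example small.

```python
from collections import Counter

def recall(items, user_tags, limit):
    """Step 1: keep items that share at least one tag with the user."""
    matched = [it for it in items if user_tags & set(it["tags"])]
    return matched[:limit]

def rank(candidates, score_fn):
    """Step 2: order candidates by predicted CTR (score_fn stands in for a model)."""
    return sorted(candidates, key=score_fn, reverse=True)

def select(ranked, k, max_per_category):
    """Step 3: take the top k while capping how many items share a category."""
    seen = Counter()
    chosen = []
    for it in ranked:
        if seen[it["category"]] < max_per_category:
            chosen.append(it)
            seen[it["category"]] += 1
        if len(chosen) == k:
            break
    return chosen

# Hypothetical catalog; a real system would start from millions of items.
ITEMS = [
    {"id": 1, "category": "sports", "tags": ["nba"], "ctr": 0.30},
    {"id": 2, "category": "sports", "tags": ["nba", "finals"], "ctr": 0.25},
    {"id": 3, "category": "sports", "tags": ["nba"], "ctr": 0.20},
    {"id": 4, "category": "tech", "tags": ["ai"], "ctr": 0.15},
    {"id": 5, "category": "food", "tags": ["nba", "snacks"], "ctr": 0.10},
]

def recommend(user_tags, k=3):
    candidates = recall(ITEMS, user_tags, limit=100)
    ranked = rank(candidates, score_fn=lambda it: it["ctr"])
    return [it["id"] for it in select(ranked, k, max_per_category=2)]

print(recommend({"nba"}))  # [1, 2, 5]
```

Item 3 scores higher than item 5 but is skipped because two sports items are already selected, which is the kind of diversity constraint the final step applies.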

Key challenges include rapid feature iteration, maintaining consistency between online and offline feature handling, and supporting multiple model types. To address these, the team introduced a CSV‑driven feature framework that defines each feature’s name, type, position, and processing logic, allowing feature configuration, collection, and processing to be managed without extensive hand‑written code.
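To make the idea concrete, here is a small sketch of what loading such a CSV could look like. The column names (`name`, `type`, `position`, `process`), the example rows, and the `FeatureSpec` class are hypothetical; the talk summary does not show the actual file layout.

```python
import csv
import io
from dataclasses import dataclass

# Hypothetical feature configuration; the real column layout is not given in the source.
FEATURE_CSV = """name,type,position,process
user_age,int,0,bucket
user_tags,sparse_string,1,hash
item_category,string,2,hash
item_keywords,sparse_string,3,hash
"""

VALID_TYPES = {"int", "sparse_int", "string", "sparse_string"}

@dataclass(frozen=True)
class FeatureSpec:
    name: str
    type: str
    position: int
    process: str

def load_feature_specs(text):
    """Parse the CSV and reject unknown types or duplicated positions."""
    specs = []
    for row in csv.DictReader(io.StringIO(text)):
        if row["type"] not in VALID_TYPES:
            raise ValueError(f"unknown feature type: {row['type']}")
        specs.append(
            FeatureSpec(row["name"], row["type"], int(row["position"]), row["process"])
        )
    positions = [s.position for s in specs]
    if len(set(positions)) != len(positions):
        raise ValueError("duplicate feature position")
    return specs

SPECS = load_feature_specs(FEATURE_CSV)
for spec in SPECS:
    print(spec)
```

Validating the file once at load time means a mistyped type or a reused position fails early, before any sample is generated from it.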

The framework standardizes four feature types (int, sparse int, string, and sparse string), each inheriting from a base Feature class, so that online logs and offline training samples are serialized and deserialized identically. Feature processing follows three steps: filling, value/weight transformation, and joint vector transformation, which the talk illustrates with a tag feature example.
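The sketch below shows one way the type hierarchy and the three processing steps could fit together for a tag feature. Everything concrete here is an assumption: the JSON wire format, the CRC32 hashing, the weight clipping, and the reading of "joint vector transformation" as offsetting each feature's ids by its position are illustrative choices, not details confirmed by the talk.

```python
import json
import zlib

class Feature:
    """Base class: all feature types share one wire format (JSON is an assumption)."""
    type_name = "base"

    def __init__(self, name, value):
        self.name = name
        self.value = value

    def serialize(self):
        return json.dumps({"name": self.name, "type": self.type_name, "value": self.value})

    @staticmethod
    def deserialize(text):
        data = json.loads(text)
        return FEATURE_TYPES[data["type"]](data["name"], data["value"])

class IntFeature(Feature):
    type_name = "int"

class SparseIntFeature(Feature):
    type_name = "sparse_int"

class StringFeature(Feature):
    type_name = "string"

class SparseStringFeature(Feature):
    type_name = "sparse_string"  # value is a {key: weight} mapping

FEATURE_TYPES = {
    cls.type_name: cls
    for cls in (IntFeature, SparseIntFeature, StringFeature, SparseStringFeature)
}

def fill(feature, default):
    """Step 1, filling: substitute a default when the raw value is missing or empty."""
    return feature.value if feature.value else dict(default)

def transform(values, hash_space, max_weight):
    """Step 2, value/weight transformation: hash keys to ids and clip weights."""
    out = {}
    for key, weight in values.items():
        idx = zlib.crc32(key.encode("utf-8")) % hash_space
        out[idx] = out.get(idx, 0.0) + min(float(weight), max_weight)
    return out

def to_joint_vector(per_feature, hash_space):
    """Step 3, joint vector: offset each feature's ids by its position (assumed meaning)."""
    joint = {}
    for position, values in per_feature.items():
        for idx, weight in values.items():
            joint[position * hash_space + idx] = weight
    return joint

# The same object survives a round trip through the shared wire format.
raw = SparseStringFeature("user_tags", {"nba": 3.0, "ai": 0.5})
restored = Feature.deserialize(raw.serialize())

filled = fill(restored, default={"unknown": 1.0})
hashed = transform(filled, hash_space=1000, max_weight=1.0)
vector = to_joint_vector({1: hashed}, hash_space=1000)
print(vector)
```

Because the online logger and the offline sample builder would both go through `serialize` and `deserialize`, there is a single code path that can drift, which is the consistency property the paragraph above describes.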

Training samples are generated in libsvm or sparse tensor format. The CSV is converted into C++ header files, which are compiled into executables or shared objects that transform raw logs into training data. When the CSV changes, dynamic compilation rebuilds this code and integrates it with custom TensorFlow operators.
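As a rough illustration of both halves of this step, the sketch below formats one sample as a libsvm line and emits a tiny C++ header from a list of (name, position) pairs. The header contents, the `FeaturePosition` enum, and the `k_` prefix are hypothetical; the source does not show what the generated headers actually contain, and the compile step is left out.

```python
def to_libsvm_line(label, vector):
    """Format one sample as '<label> <index>:<value> ...' with ascending indices."""
    parts = [f"{idx}:{vector[idx]:g}" for idx in sorted(vector)]
    return " ".join([str(label)] + parts)

def generate_cpp_header(specs):
    """Emit a C++ header that pins each feature name to its position (assumed layout)."""
    lines = [
        "#pragma once",
        "",
        "// Generated from the feature CSV; do not edit by hand.",
        "enum FeaturePosition {",
    ]
    for name, position in specs:
        lines.append(f"  k_{name} = {position},")
    lines.append("};")
    return "\n".join(lines) + "\n"

print(to_libsvm_line(1, {7: 1.0, 3: 0.5}))  # 1 3:0.5 7:1
print(generate_cpp_header([("user_age", 0), ("user_tags", 1)]))
```

Generating the header from the same CSV that drives sample generation is what keeps the C++ side and the training data in agreement about feature positions.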

Additional components include feature monitoring (e.g., tag interest distribution), sample filtering and weighting to mitigate bot traffic, and a full system flow that ties feature configuration, code generation, online prediction, offline training, and re‑ranking together.
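Sample filtering and weighting might look like the following sketch. The `daily_actions` field and both thresholds are invented for illustration; the talk summary does not say which signals or cutoffs are used to detect bot traffic.

```python
def weight_samples(samples, max_daily_actions=500, heavy_user_actions=100):
    """Drop samples from suspected bots and down-weight very active users.

    Both thresholds are hypothetical and would need tuning on real traffic.
    """
    kept = []
    for sample in samples:
        actions = sample["daily_actions"]
        if actions > max_daily_actions:
            continue  # treated as bot traffic and removed from training
        if actions <= heavy_user_actions:
            weight = 1.0
        else:
            weight = heavy_user_actions / actions
        kept.append({**sample, "weight": weight})
    return kept

SAMPLES = [
    {"user": "a", "daily_actions": 20, "label": 1},
    {"user": "b", "daily_actions": 200, "label": 0},
    {"user": "c", "daily_actions": 5000, "label": 1},
]

for row in weight_samples(SAMPLES):
    print(row["user"], row["weight"])
```

Down-weighting rather than dropping heavy users keeps their signal in the training set while limiting how much any single account can dominate it.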

Overall, according to the talk, CSV‑based standardization greatly reduces manual coding, improves efficiency, and cuts down on bugs across the feature engineering and model iteration lifecycle of recommendation systems.
