How Feizhu Upgraded Its Recommendation Engine from Linear to End‑to‑End Deep Models
This article details the evolution of Feizhu's "Guess You Like" ranking system, moving from a linear FTRL model to several end‑to‑end deep learning versions—including PALM, FB‑PALM, and GLA—highlighting technical challenges, architectural changes, and measurable performance gains.
Introduction
Feizhu's "Guess You Like" ranking model was upgraded from a linear FTRL model to an end‑to‑end deep model, undergoing multiple version iterations such as PALM, FB‑PALM, and GLA, with various technical improvements documented.
Problem Analysis
While the existing feature set was extensive, the linear model left room for richer feature interactions. End‑to‑end deep models can cross features both implicitly and explicitly, but they demand careful feature selection, handling of high‑dimensional ID features, and attention to generalization, over‑fitting during training, and online serving latency.
Model Iterations
PALM (Pure Adaptive L2 Model)
Initial attempts to migrate all features to the deep side of a Wide‑and‑Deep architecture yielded limited offline gains. Pure deep models suffered from over‑fitting, especially with high‑dimensional sparse ID features.
High‑dimensional ID features (item ID, user ID, trigger ID) required adaptive L2 regularization to prevent over‑fitting.
Lookup (hit) features performed better when represented as dense values rather than zero‑filled embeddings.
Warm‑up + Adam + learning‑rate decay provided the largest offline improvement.
Batch normalization after embedding and before fully‑connected layers stabilized training and improved convergence.
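The training recipe above can be sketched in a few lines. This is a minimal illustration, not Feizhu's actual code: the linear warm‑up followed by exponential decay is one standard way to combine "warm‑up + learning‑rate decay", and scaling the L2 penalty inversely with an ID's occurrence count is one common reading of "adaptive L2" for sparse ID embeddings (rare IDs, which over‑fit most easily, are regularized hardest). All function names and default values here are illustrative assumptions.

```python
import math

def warmup_decay_lr(step, base_lr=1e-3, warmup_steps=1000,
                    decay_rate=0.96, decay_steps=1000):
    """Linear warm-up to base_lr, then exponential decay (illustrative schedule)."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * decay_rate ** ((step - warmup_steps) / decay_steps)

def adaptive_l2_penalty(emb_norms_sq, id_counts, base_lambda=1e-5):
    """One plausible 'adaptive L2' variant: the penalty on each ID's embedding
    (given here as its squared norm) shrinks as that ID is seen more often."""
    return sum(base_lambda / math.sqrt(1 + c) * n2
               for n2, c in zip(emb_norms_sq, id_counts))
```

With such a schedule, early updates are small (stabilizing the randomly initialized embeddings) and the rate tapers off once training settles, which matches the article's observation that this combination gave the largest offline gain.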
Model structure: (diagram from the original article omitted)
The loss function is pointwise, and offline evaluation uses T+1 AUC (training through day T, evaluating on day T+1).
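The two evaluation ingredients can be made concrete with a small sketch. This is a generic illustration, not the team's code: a pointwise loss treats each impression as an independent binary (click / no‑click) example, and AUC can be computed directly as the fraction of (positive, negative) pairs the model ranks correctly.

```python
import math

def pointwise_logloss(labels, probs, eps=1e-7):
    """Average binary cross-entropy over (click label, predicted CTR) pairs."""
    return -sum(y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
                for y, p in zip(labels, probs)) / len(labels)

def auc(labels, scores):
    """Rank-based AUC: fraction of (positive, negative) pairs ranked correctly,
    counting ties as half correct."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    correct = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return correct / (len(pos) * len(neg))
```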
FB‑PALM (FeedBack‑PALM)
After PALM, the team added real‑time click and non‑click behavior sequences. Various deep CTR blocks (DCN, DeepFM, XDeepFM, etc.) were tested but offered only marginal gains over the pure deep baseline.
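Of the blocks the team tried, DCN's cross layer is the most compact to show. The sketch below is the published DCN cross‑layer formula, x_{l+1} = x_0 · (x_l · w) + b + x_l, not Feizhu's implementation; it builds explicit higher‑order feature interactions with only O(d) parameters per layer, which is why such blocks are cheap to bolt onto a deep baseline.

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (x_l . w) + b + x_l.
    x0 is the original input; stacking l layers yields interactions
    of degree l + 1 with O(d) parameters per layer."""
    return x0 * (xl @ w) + b + xl
```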
Key enhancements:
Incorporated short‑term global item click sequences and exposure‑without‑click sequences as additional inputs.
Used attention pooling over these sequences, concatenated with pure deep inputs, and fed into a multi‑layer feed‑forward network.
Results: uCTR +1.0% and pCTR +1.5% compared with the pure deep model.
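The attention pooling in FB‑PALM can be sketched as follows. This is a minimal dot‑product variant (the article does not specify the scoring function): each behavior embedding in the click or non‑click sequence is weighted by its similarity to the candidate item, so behaviors relevant to the current candidate dominate the pooled vector.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_pool(query, sequence):
    """Dot-product attention pooling: weight each behavior embedding
    (rows of `sequence`) by its similarity to the candidate item `query`,
    then return the weighted sum plus the weights themselves."""
    scores = sequence @ query            # (seq_len,)
    weights = softmax(scores)
    return weights @ sequence, weights   # pooled: (dim,), weights: (seq_len,)
```

In FB‑PALM the pooled vectors from the click and non‑click sequences are concatenated with the pure deep inputs before the feed‑forward layers.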
GLA (Global Local Attention‑PALM)
To address coverage gaps, a full‑site behavior sequence (including flights, trains, hotels, and items) was introduced. The previous additive attention pooling was replaced with a transformer‑plus‑attention pipeline to capture intra‑sequence relations.
Full‑site sequences use only high‑coverage attributes (destination, category, POI, tag, behavior type) without ID features to avoid over‑fitting.
Sequence pooling combines Multi‑CNN extraction with attention, sharing parameters across different sequence types.
Transformer layers precede additive attention for item click and non‑click sequences, capturing self‑attention among sequence elements.
Results: uCTR +1.0% and pCTR +3.0% over the FB‑PALM model.
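The transformer‑then‑additive‑attention pipeline of GLA can be sketched as below. This is a simplified single‑head version without learned projections (the article does not give the exact architecture): self‑attention first lets every behavior in the sequence attend to every other one, and additive attention then pools the contextualized sequence against the candidate item. All shapes and parameter names are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Scaled dot-product self-attention (single head, no projections):
    each row of X attends to all rows, capturing intra-sequence relations."""
    d = X.shape[-1]
    A = softmax(X @ X.T / np.sqrt(d), axis=-1)
    return A @ X

def additive_attention_pool(H, query, W, v):
    """Additive attention over transformer outputs H: score each position
    with v . tanh([h; q] @ W), then pool by the softmax weights."""
    q = np.broadcast_to(query, H.shape)
    scores = np.tanh(np.concatenate([H, q], axis=-1) @ W) @ v
    return softmax(scores) @ H
```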
Other Attempts
First‑order neighbor information and pretrained embeddings (text, image, DeepWalk) gave large gains on the pure deep baseline but contributed little when stacked on later versions.
Time‑aware attention was explored by adding temporal features to the attention mechanism; it yielded modest offline improvements but was not deployed.
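One simple way to make attention time‑aware, in the spirit of the experiment above, is to penalize the attention score by the age of each behavior. The decay form below (subtracting a log‑age term) is an illustrative assumption, not the variant the team tested: older behaviors, all else being equal, receive lower weight.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def time_aware_attention_pool(query, sequence, ages, alpha=0.1):
    """Dot-product attention with a time penalty: subtract alpha * log(1 + age)
    from each score, so that with equal content similarity an older
    behavior gets a smaller attention weight."""
    scores = sequence @ query - alpha * np.log1p(ages)
    weights = softmax(scores)
    return weights @ sequence, weights
```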
Future Outlook
Key challenges remain in learning robust models across heterogeneous data sources and exploring explicit feature crossing to further boost performance.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
