How Alibaba’s UC Team Boosted Short‑Video Recommendations with FM+GBM
This article details the evolution of Alibaba's short‑video feed ranking system, from a Wide&Deep CTR model to a hybrid Factorization‑Machine and Gradient‑Boosted‑Tree approach, describing feature engineering, model architecture, experimental results, lessons learned, and future directions toward duration‑based relevance.
Background
Short‑video feeds dominate mobile internet traffic, relying on algorithmic distribution to achieve personalized, per‑user results. The recommendation pipeline consists of trigger recall, ranking, and re‑ranking, with the ranking layer serving as a crucial bridge between recall and final presentation.
Current Model and Improvements
The initial ranking model used a Wide&Deep architecture optimized for click‑through‑rate (CTR). By incorporating video playback duration as a feature and adopting a multi‑objective optimization of click + duration, the team achieved notable gains, such as increased average watch time and better handling of diverse distribution scenarios.
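The summary does not say how the click and duration objectives were combined. One common way to fold watch time into a click model, shown here purely as an assumed illustration, is to weight clicked samples by a function of their playback duration inside a log loss. The names `duration_weight`, `weighted_log_loss`, and the 60-second cap are hypothetical.

```python
import math


def duration_weight(watch_seconds, cap_seconds=60.0):
    """Hypothetical weight: 1.0 for a zero-length view, growing with log watch time up to a cap."""
    clipped = min(max(watch_seconds, 0.0), cap_seconds)
    return 1.0 + math.log1p(clipped)


def weighted_log_loss(samples, predict):
    """Duration-weighted log loss.

    samples: list of (features, clicked, watch_seconds), clicked in {0, 1}
    predict: callable mapping features -> predicted click probability
    """
    total, weight_sum = 0.0, 0.0
    for features, clicked, watch_seconds in samples:
        # Clamp the prediction so the logarithms stay finite.
        p = min(max(predict(features), 1e-7), 1.0 - 1e-7)
        # Assumption: only clicked samples carry a duration signal.
        w = duration_weight(watch_seconds) if clicked else 1.0
        total += -w * (math.log(p) if clicked else math.log(1.0 - p))
        weight_sum += w
    return total / weight_sum
```

Under this weighting, under-predicting a click that led to a long view costs more than under-predicting a click that was abandoned immediately, which nudges the model toward videos users actually watch.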
Model Evolution
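To overcome the limitations of Wide&Deep, the team introduced Factorization Machines (FM). Instead of learning a separate weight for every pairwise feature cross, FM gives each feature a low‑dimensional latent vector of size k and models each interaction weight as the inner product of two such vectors. This cuts the number of interaction parameters from O(N²) to O(N·k), lets the pairwise term be computed in O(N·k) time, and improves generalization to feature pairs that rarely co‑occur in training. FM thus extends a linear model with second‑order interactions while keeping evaluation cost linear in the number of features.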
The FM formulation for second‑order interactions is illustrated in the diagram above.
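As a minimal sketch of the standard second-order FM (not the team's implementation), the score is w0 + Σ_i w_i·x_i + Σ_{i<j} ⟨v_i, v_j⟩·x_i·x_j. The pairwise sum can be rewritten per latent dimension so that it costs O(N·k) instead of O(N²·k). The function names below are illustrative, and the naive version exists only to check the fast one.

```python
def fm_score(x, w0, w, V):
    """Second-order FM score in O(N*k).

    Uses the identity
      sum_{i<j} <V[i], V[j]> x[i] x[j]
        = 0.5 * sum_f [ (sum_i V[i][f] x[i])^2 - sum_i (V[i][f] x[i])^2 ]
    x: feature values, w0: bias, w: linear weights, V: N x k latent vectors.
    """
    n, k = len(x), len(V[0])
    linear = w0 + sum(w[i] * x[i] for i in range(n))
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] * x[i] for i in range(n))
        s_sq = sum((V[i][f] * x[i]) ** 2 for i in range(n))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_score_naive(x, w0, w, V):
    """Reference O(N^2 * k) version that enumerates every feature pair."""
    n, k = len(x), len(V[0])
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(V[i][f] * V[j][f] for f in range(k))
            score += dot * x[i] * x[j]
    return score
```

Both functions return the same value up to floating-point rounding; only the fast form is practical when N is large and x is sparse.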
FM+GBM First Phase (Pure GBM)
In the first phase, an experimental framework and data pipeline were built. GBM combined sub‑model scores (Wide&Deep, LR), click/duration signals, and simple matching features. A new GBMScorer component was added to the ranking server to:
Decide via traffic bucketing whether to apply GBM scoring.
Normalize features and feed them back to the log server for offline training.
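The two responsibilities above can be sketched as follows. This is an assumed shape for such a component, not the actual GBMScorer: the hash-based bucketing, the min-max normalization, the in-memory `feature_log` standing in for the log server, and all parameter names are hypothetical.

```python
import hashlib


class GBMScorer:
    """Illustrative stand-in for the ranking-server component; not the team's code."""

    def __init__(self, gbm_predict, feature_ranges, experiment_buckets, num_buckets=100):
        self.gbm_predict = gbm_predict        # callable: normalized features -> score
        self.feature_ranges = feature_ranges  # name -> (min, max), assumed computed offline
        self.experiment_buckets = set(experiment_buckets)
        self.num_buckets = num_buckets
        self.feature_log = []                 # stand-in for the log server

    def bucket_of(self, user_id):
        # Stable hash so a given user always lands in the same traffic bucket.
        digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
        return int(digest, 16) % self.num_buckets

    def normalize(self, features):
        # Min-max scale each known feature into [0, 1], clipping outliers.
        normalized = {}
        for name, (low, high) in self.feature_ranges.items():
            value = features.get(name, low)
            span = high - low
            scaled = (value - low) / span if span > 0 else 0.0
            normalized[name] = min(max(scaled, 0.0), 1.0)
        return normalized

    def score(self, user_id, item_id, features, fallback_score):
        normalized = self.normalize(features)
        # Assumption: features are logged for all traffic so offline training can use them.
        self.feature_log.append({"user": user_id, "item": item_id, "features": normalized})
        if self.bucket_of(user_id) in self.experiment_buckets:
            return self.gbm_predict(normalized)
        return fallback_score
```

Logging the normalized features at serving time means offline training sees the same values the online model saw, which avoids a separate offline feature-reconstruction step.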
Key lessons: sample selection and hyper‑parameter choices (e.g., tree depth = 6) matter; AUC should be evaluated at fine granularity, per request; and features, especially user‑related ones, need to be normalized for the model to converge.
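Per-request AUC can be sketched as below, assuming each logged row carries a request identifier. The row layout, the impression-weighted averaging, and the function names are assumptions made for illustration; the source does not describe the exact metric.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC by pairwise comparison (ties count 0.5); None if only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_request_auc(rows):
    """Impression-weighted mean of AUC computed separately within each request.

    rows: iterable of (request_id, label, score) with label in {0, 1}.
    Requests containing only clicks or only non-clicks are skipped.
    """
    groups = defaultdict(list)
    for request_id, label, score in rows:
        groups[request_id].append((label, score))
    weighted, total = 0.0, 0
    for items in groups.values():
        value = auc([y for y, _ in items], [s for _, s in items])
        if value is None:
            continue
        weighted += value * len(items)
        total += len(items)
    return weighted / total if total else None
```

Grouping by request measures how well the model orders candidates shown together, which is what a ranker is asked to do; a global AUC also rewards or penalizes score differences between unrelated requests.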
FM+GBM Second Phase
The second phase introduced additional signals, addressing two main challenges: sparse bag‑of‑words features that fail to capture semantic similarity, and low coverage of structured video metadata. Embedding‑based representations from the Wide&Deep (WD) model partially mitigated these issues, but they required version alignment across model updates.
To align embeddings, the system retains the two most recent versions and selects the latest aligned vectors during online inference, ensuring consistency despite a 4–6 hour training window.
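A minimal sketch of this version-alignment idea follows, assuming integer model versions and in-memory dictionaries. `VersionedEmbeddingStore` and its methods are hypothetical names, not the production interface.

```python
class VersionedEmbeddingStore:
    """Keeps user and item embeddings for only the newest `max_versions` model versions."""

    def __init__(self, max_versions=2):
        self.max_versions = max_versions
        self.user_vecs = {}  # version -> {user_id: vector}
        self.item_vecs = {}  # version -> {item_id: vector}

    def publish(self, version, user_vecs=None, item_vecs=None):
        # Merge newly exported vectors into the given version.
        self.user_vecs.setdefault(version, {}).update(user_vecs or {})
        self.item_vecs.setdefault(version, {}).update(item_vecs or {})
        # Evict everything older than the newest `max_versions` versions.
        keep = sorted(self.user_vecs, reverse=True)[: self.max_versions]
        for table in (self.user_vecs, self.item_vecs):
            for old in [v for v in table if v not in keep]:
                del table[old]

    def aligned_pair(self, user_id, item_id):
        # Newest version first; both vectors must come from the same model version.
        for version in sorted(self.user_vecs, reverse=True):
            user_vec = self.user_vecs[version].get(user_id)
            item_vec = self.item_vecs[version].get(item_id)
            if user_vec is not None and item_vec is not None:
                return version, user_vec, item_vec
        return None
```

Falling back to the previous version when the newest one is only partially published keeps user and item vectors in the same embedding space while a retrain is still rolling out.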
Results and Conclusions
Combining FM and GBM yielded a 10% offline AUC improvement and a 6% increase in online CTR and per‑user clicks. The ranking stack now follows an LR → Wide&Deep → FM+GBM funnel, balancing model complexity, feature richness, and computational cost. Future work will shift focus from click prediction to duration prediction to better capture true user interest.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
