58 Tongzhen Home Feed Recommendation System: Architecture, Features, and Evolution
This talk details the design, data pipeline, feature engineering, model evolution, and operational insights of the 58 Tongzhen home feed recommendation system, covering its architecture, localization strategies, recall and ranking models, online learning, and future directions for AI-driven content delivery in the down‑market.
The presentation introduces the 58 Tongzhen business, a strategic initiative of 58.com targeting the down‑market (county and township level) with a local information platform serving over 100 million users nationwide.
Market analysis shows that down‑market users are predominantly aged 20‑50, earn lower absolute incomes but retain relatively high disposable income, spend significant time on mobile devices, and prefer short‑video, news, and social apps.
58 Tongzhen's recommendation scenario focuses on the home‑page Feed, delivering multi‑category information (news, jobs, real estate, cars, social) in a single feed flow; news accounts for about 90% of the content.
System Architecture: The overall architecture consists of a data layer, computation layer, algorithm layer, logic layer, and application layer. It integrates business, log, and label data, and supports machine learning, deep learning, and NLP for recall and click‑through‑rate (CTR) prediction. The architecture is modular, decoupled, and supports AB testing.
Data & Feature Engineering: Data sources include business data, behavior logs, and tags. Features are categorized into user features, content features, cross features, and context features. Text tags are generated using a customized BERT model trained on millions of news articles. Additional features include semantic, implicit, and key‑entity tags, as well as low‑quality content detection using perplexity.
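The perplexity idea above can be sketched with a toy language model: fluent text scores low perplexity under a model trained on normal articles, garbled or spammy text scores high. This sketch uses an add-alpha smoothed bigram model as a stand-in for the talk's BERT-based scorer; the function names and the threshold are illustrative, not from the talk.

```python
import math
from collections import Counter

def train_bigram_lm(corpus, alpha=1.0):
    """Fit an add-alpha smoothed bigram LM over tokenized sentences.
    (A stand-in for a neural scorer; the idea is identical: fluent text
    gets low perplexity, garbled text gets high perplexity.)"""
    bigrams, unigrams, vocab = Counter(), Counter(), set()
    for toks in corpus:
        vocab.update(toks)
        for a, b in zip(toks, toks[1:]):
            bigrams[(a, b)] += 1
            unigrams[a] += 1
    v = len(vocab)

    def prob(a, b):
        return (bigrams[(a, b)] + alpha) / (unigrams[a] + alpha * v)

    return prob

def perplexity(prob, toks):
    """exp of the mean negative log-probability over the bigrams of toks."""
    pairs = list(zip(toks, toks[1:]))
    nll = -sum(math.log(prob(a, b)) for a, b in pairs)
    return math.exp(nll / len(pairs))

def is_low_quality(prob, toks, threshold):
    """Flag text whose perplexity exceeds a tuned threshold."""
    return perplexity(prob, toks) > threshold
```

In practice the threshold would be tuned on labeled low-quality samples rather than set by hand.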
The feature pipeline processes data via Hive/Kafka ingestion, Flink/Spark ETL, and stores offline features in Hive/HDFS while pushing online features to caches (Redis, WTable) for real‑time use.
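The online side of that pipeline can be pictured as a key-value cache that ranking servers read at request time. The sketch below is an in-memory stand-in for Redis/WTable; the key schema (`user:{uid}:feat`) and TTL policy are assumptions for illustration, not the production layout.

```python
import json
import time

class OnlineFeatureCache:
    """In-memory stand-in for an online feature store (Redis / WTable).
    Keys, serialization, and TTL policy here are illustrative."""

    def __init__(self):
        self._store = {}

    def put_user_features(self, uid, feats, ttl_s=3600):
        # Serialize once at write time so ranking-side reads stay cheap.
        self._store[f"user:{uid}:feat"] = (json.dumps(feats), time.time() + ttl_s)

    def get_user_features(self, uid):
        entry = self._store.get(f"user:{uid}:feat")
        if entry is None or time.time() > entry[1]:
            return None  # miss or expired: caller falls back to default features
        return json.loads(entry[0])
```

A Flink/Spark job would call `put_user_features` as it emits fresh features, while the ranking service calls `get_user_features` per request.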
Recall Strategies: Multi‑path recall combines offline (CF, Attention, DeepFM) and real‑time strategies such as precise user profiling, content similarity, machine‑learning recall, regional hotspot recall, and freshness (race) strategies. User clustering with Word2Vec embeddings and K‑means is used to capture multiple interests.
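The multi-interest clustering step can be sketched as K-means over pre-trained Word2Vec item embeddings: one user's clicked items are split into several clusters, and each centroid then serves as a separate query for similar content. This is a minimal pure-Python sketch with deterministic first-k initialization; production code would use k-means++ and a vectorized library.

```python
def kmeans(vectors, k, iters=20):
    """Minimal K-means over item-embedding vectors (tuples of floats),
    used to split one user's history into k interest clusters.
    Deterministic first-k init for clarity, not k-means++."""
    centroids = [tuple(v) for v in vectors[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to its nearest centroid (squared L2).
            nearest = min(range(k),
                          key=lambda c: sum((a - b) ** 2
                                            for a, b in zip(v, centroids[c])))
            clusters[nearest].append(v)
        for i, pts in enumerate(clusters):
            if pts:  # keep the old centroid if a cluster empties
                centroids[i] = tuple(sum(d) / len(pts) for d in zip(*pts))
    return centroids, clusters
```

Each resulting centroid can then be matched against the item-embedding index, so a user with distinct interests (say, jobs and used cars) gets candidates from both, instead of one averaged interest vector washing both out.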
Ranking Models: The ranking pipeline evolved from rule‑based sorting to tree‑based (GBDT/XGBoost) + linear models, then to deep models (XDeepFM) and finally to online learning. XDeepFM integrates linear, CIN (Compressed Interaction Network), and DNN components to capture high‑order feature interactions. Model hyper‑parameters are tuned via grid search and cross‑validation.
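The CIN component can be made concrete with one layer's forward pass: each output feature map is a weighted compression of the pairwise Hadamard interactions between the previous layer's maps and the base field embeddings. A NumPy sketch of a single layer (shapes and names chosen for illustration):

```python
import numpy as np

def cin_layer(x0, xk, w):
    """One Compressed Interaction Network layer from xDeepFM (sketch).
    x0: (m, d)    base field-embedding matrix
    xk: (h, d)    feature maps from the previous CIN layer (x0 at layer 0)
    w:  (n, h, m) learned compression weights for n output maps
    Returns the next layer's feature maps, shape (n, d)."""
    # Pairwise Hadamard interactions along the embedding dimension d.
    z = np.einsum('hd,md->hmd', xk, x0)   # (h, m, d)
    # Compress the h*m interaction maps into n output maps.
    return np.einsum('nhm,hmd->nd', w, z)  # (n, d)
```

Stacking such layers (with x0 fixed and xk advancing) yields explicit, vector-wise high-order interactions, which the full model combines with its linear and DNN parts before the final sigmoid.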
Online learning collects real‑time user feedback, updates models continuously, and improves CTR and user experience with minute‑level model refreshes.
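The online-learning loop above can be sketched as per-event SGD on a sparse logistic CTR model: every impression with its click label nudges the weights, so the serving model tracks feedback at minute granularity. This is a minimal sketch, not the production trainer (which would likely use FTRL-style regularization and distributed parameter storage); all names here are illustrative.

```python
import math

class OnlineLogisticCTR:
    """Per-event SGD update for a sparse logistic CTR model (sketch)."""

    def __init__(self, lr=0.05):
        self.w = {}   # sparse weights: feature name -> value
        self.lr = lr

    def predict(self, feats):
        z = sum(self.w.get(f, 0.0) * v for f, v in feats.items())
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, feats, clicked):
        # Gradient of log loss is (p - y) per active feature.
        g = self.predict(feats) - (1.0 if clicked else 0.0)
        for f, v in feats.items():
            self.w[f] = self.w.get(f, 0.0) - self.lr * g * v
```

A streaming job (e.g. reading the click log from Kafka) would call `update` per joined impression-click event and periodically push `w` to the servers.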
Evaluation & Results: Extensive AB testing shows CTR improvements of over 220%, per‑user click increases of 170%, and next‑day retention gains of 92% compared to the baseline. The system processes millions of daily interactions, with training cycles of about one hour for 7‑day data.
Key Takeaways: Data‑driven understanding of business goals, robust feature engineering, iterative recall optimization, meticulous experimentation, and attention to detail are crucial for sustained recommendation performance.
Future Plans: Deploy multi‑objective optimization (ESMM‑style), deepen user intent modeling, explore reinforcement learning and graph neural networks, expand content tagging and knowledge graph construction, and continue enhancing the recommendation pipeline.
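The ESMM idea mentioned above can be sketched in one forward pass: two towers over shared inputs, where the downstream head is supervised through the product pCTCVR = pCTR × pCVR, so the CVR tower trains over the full impression space rather than clicked items only. Linear towers stand in here for the real shared-embedding networks; all weights and names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def esmm_forward(x, w_ctr, w_cvr):
    """ESMM-style forward pass (sketch). Two linear towers stand in for
    the shared-embedding networks; the conversion signal is supervised
    through pCTCVR = pCTR * pCVR over all impressions."""
    p_ctr = sigmoid(sum(wi * xi for wi, xi in zip(w_ctr, x)))
    p_cvr = sigmoid(sum(wi * xi for wi, xi in zip(w_cvr, x)))
    return p_ctr, p_cvr, p_ctr * p_cvr
```

Training would apply one log loss on p_ctr against clicks and another on p_ctr * p_cvr against conversions, which sidesteps the sample-selection bias of training CVR on clicked items alone.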
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.