Intelligent Recommendation System for 58 Tongzhen: Architecture, Data, Features, and Model Evolution
This article describes how 58 Tongzhen uses AI techniques, including data pipelines, feature engineering, multiple recall and ranking models, and A/B testing, to build a personalized feed recommendation system for down-market users (lower-tier cities and rural areas). It covers the overall architecture, data sources, model iterations, performance gains, and future directions.
Background AI is a strategic technology driving productivity and new services, yet adoption remains uneven across regions; fourth‑ and fifth‑tier cities and rural areas still face information gaps, creating a strong demand for AI‑enabled solutions.
58 Tongzhen, a key strategic business covering over 10,000 town stations in 31 provinces and serving more than 100 million users, aims to provide precise local information by combining private traffic from town‑level site owners with public traffic from the 58 local app, using AI to improve user profiling, conversion, and experience.
Scenario Overview The Tongzhen intelligent recommendation system uses a feed-style UI to deliver multi-category content (news, jobs, housing, cars, social) to down-market users, with the goals of user growth, efficient conversion, and long-term retention.
Overall Architecture The system consists of five layers: data foundation, data computation, algorithm strategy, logic, and application. It ingests business, log, and label data, applies machine‑learning, deep‑learning, and NLP techniques for recall and click‑through‑rate (CTR) prediction, and merges top‑N results across categories for the homepage feed.
Data & Features Core data includes business transactions, behavior logs, and user profile tags. Feature engineering turns raw logs into training samples through cleaning, sampling, feature combination, transformation, and discretization. Content sources cover news (text, image, video) and classified listings (jobs, housing, cars, etc.). Tags are extracted with BERT-based models (for example, job title extraction reaches 93% accuracy, while location and housing attributes reach around 80%). The feature set also includes semantic, latent-semantic, spatio-temporal, and quality features, the last of which come from classifiers that detect low-quality and vulgar content.
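The transformation, combination, and discretization steps can be illustrated with a minimal sketch. The feature names, bucket boundaries, and record layout below are hypothetical and chosen only for illustration; the article does not specify the actual features or thresholds.

```python
import bisect
import math

# Hypothetical bucket boundaries for a "clicks in the last 7 days" count.
CLICK_BOUNDARIES = [1, 3, 10, 30]


def bucketize(value, boundaries):
    """Discretize: return the bucket index of value (0..len(boundaries))."""
    return bisect.bisect_right(boundaries, value)


def log_transform(value):
    """Transform: compress a heavy-tailed non-negative value with log(1 + x)."""
    return math.log1p(value)


def cross(*parts):
    """Combine: join categorical values into a single crossed feature."""
    return "_x_".join(str(p) for p in parts)


def build_features(raw):
    """Turn one cleaned log record (a dict) into model-ready features."""
    return {
        "clicks_7d_bucket": bucketize(raw["clicks_7d"], CLICK_BOUNDARIES),
        "dwell_log": log_transform(raw["dwell_seconds"]),
        "region_x_category": cross(raw["region"], raw["category"]),
    }


example = build_features(
    {"clicks_7d": 5, "dwell_seconds": 0, "region": "town_001", "category": "jobs"}
)
print(example)
```

Discretizing counts into a handful of buckets makes the feature usable by linear models and less sensitive to outliers, at the cost of losing resolution inside each bucket.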
Algorithm Models – Recall Multiple recall strategies are employed: user‑profile tag recall, text‑similarity recall (TF‑IDF + Word2Vec), algorithmic model recall (ItemCF, Attention, DeepFM), hotspot recall (regional and global), and bandit‑based cold‑start recall. Real‑time and offline pipelines run on Kafka, Spark Streaming, and Flink.
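Of the recall strategies listed, ItemCF is the easiest to show in a few lines. The sketch below uses a plain co-occurrence similarity normalized by item popularity; the click data and item ids are made up, and the production system may well use a different similarity measure or weighting.

```python
from collections import defaultdict
from itertools import combinations
import math


def item_similarity(user_items):
    """Co-occurrence similarity between items, normalized by item popularity.

    user_items maps a user id to the set of items that user clicked.
    """
    co_counts = defaultdict(lambda: defaultdict(int))
    item_counts = defaultdict(int)
    for items in user_items.values():
        for item in items:
            item_counts[item] += 1
        for a, b in combinations(sorted(items), 2):
            co_counts[a][b] += 1
            co_counts[b][a] += 1
    return {
        a: {
            b: c / math.sqrt(item_counts[a] * item_counts[b])
            for b, c in neighbours.items()
        }
        for a, neighbours in co_counts.items()
    }


def recall(user, user_items, sim, top_n=3):
    """Recall unseen items most similar to what the user already clicked."""
    seen = user_items[user]
    scores = defaultdict(float)
    for item in seen:
        for other, s in sim.get(item, {}).items():
            if other not in seen:
                scores[other] += s
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]


# Hypothetical click history.
clicks = {
    "u1": {"job_a", "job_b"},
    "u2": {"job_a", "job_b", "house_c"},
    "u3": {"job_b", "house_c"},
    "u4": {"job_a"},
}
sim = item_similarity(clicks)
candidates = recall("u4", clicks, sim)
print(candidates)
```

In a multi-channel setup like the one described, the output of such a function would be one candidate list among several, merged with tag, text-similarity, hotspot, and cold-start candidates before ranking.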
Algorithm Models – Ranking Ranking evolved through four stages: (1) rule-based sorting; (2) a tree model feeding a linear model (GBDT+LR, later XGBoost+LR), with feature sampling and regularization; (3) deep learning models (DeepFM, xDeepFM) that capture higher-order feature interactions; (4) fusion of XGBoost+LR and xDeepFM, which yielded roughly 5% additional CTR lift.
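The idea behind the tree-plus-linear stage is that each tree acts as a learned feature transform: the leaf a sample lands in becomes a one-hot feature for logistic regression. The sketch below shows only that encoding step with hand-written toy trees. The split features, thresholds, and LR weights are all hypothetical; in practice the trees and weights would be learned by a boosting library and an LR trainer.

```python
import math


# Toy "trees": each maps a feature dict to a leaf index. Splits are made up.
def tree_0(x):
    if x["age_days"] < 1:
        return 0
    return 1 if x["ctr_7d"] < 0.05 else 2


def tree_1(x):
    return 0 if x["same_region"] else 1


TREES = [(tree_0, 3), (tree_1, 2)]  # (tree function, number of leaves)


def leaf_one_hot(x):
    """Concatenate one one-hot block per tree, marking the leaf x falls into."""
    encoded = []
    for tree, n_leaves in TREES:
        block = [0] * n_leaves
        block[tree(x)] = 1
        encoded.extend(block)
    return encoded


def predict_ctr(x, weights, bias):
    """Logistic regression over the leaf encoding."""
    z = bias + sum(w * v for w, v in zip(weights, leaf_one_hot(x)))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical LR parameters, one weight per leaf (3 + 2 = 5).
WEIGHTS = [0.8, -0.6, 0.3, 0.5, -0.4]
BIAS = -2.0

item = {"age_days": 3, "ctr_7d": 0.08, "same_region": True}
encoding = leaf_one_hot(item)
p = predict_ctr(item, WEIGHTS, BIAS)
print(encoding, p)
```

Because each leaf corresponds to a conjunction of split conditions, the linear model effectively gets automatically discovered feature crosses, which is the usual motivation for this combination.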
Fusion Control Model A re‑ranking layer balances traffic and diversity across content categories by normalizing scores from category‑specific models and applying weighted aggregation based on user‑region preferences and business rules.
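A minimal sketch of this kind of re-ranking, assuming min-max normalization, a single weight per category, and a per-category cap as the only business rule. None of these specifics come from the article; the function names, weights, and scores are illustrative.

```python
def min_max_normalize(scored):
    """Rescale (item, score) pairs from one category model into [0, 1]."""
    if not scored:
        return []
    scores = [s for _, s in scored]
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [(item, 1.0) for item, _ in scored]
    return [(item, (s - lo) / (hi - lo)) for item, s in scored]


def fuse(per_category, category_weights, top_n, max_per_category):
    """Merge category lists into one feed using weighted, normalized scores."""
    pool = []
    for category, scored in per_category.items():
        weight = category_weights.get(category, 1.0)
        for item, norm in min_max_normalize(scored):
            pool.append((weight * norm, category, item))
    # Highest fused score first; ties broken by item id for determinism.
    pool.sort(key=lambda t: (-t[0], t[2]))

    feed, used = [], {}
    for _, category, item in pool:
        if used.get(category, 0) >= max_per_category:
            continue  # hypothetical diversity rule: cap items per category
        feed.append(item)
        used[category] = used.get(category, 0) + 1
        if len(feed) == top_n:
            break
    return feed


# Raw scores live on different scales because they come from different models.
per_category = {
    "news": [("n1", 0.30), ("n2", 0.20), ("n3", 0.10)],
    "jobs": [("j1", 8.0), ("j2", 5.0)],
}
# Hypothetical user-region preference weights.
weights = {"news": 0.6, "jobs": 1.0}
feed = fuse(per_category, weights, top_n=3, max_per_category=2)
print(feed)
```

Normalizing before weighting matters here: without it, the jobs model's larger raw scores would dominate the feed regardless of the preference weights.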
AB Testing & Evaluation An A/B testing platform splits traffic orthogonally across the recall, ranking, fusion, and presentation layers, so that experiments in different layers do not interfere with one another, and it supports UV- or PV-based splits with dynamic configuration. Continuous offline and online evaluation has raised overall CTR by about 175% since April, an improvement reflected in the detailed trend charts.
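One common way to get orthogonal splits is to hash the unit id together with a layer-specific salt, so that a user's bucket in one layer is statistically independent of their bucket in another. The sketch below assumes that approach; the layer names follow the text, but the hashing scheme, bucket count, and experiment names are assumptions rather than details of the 58 platform.

```python
import hashlib

LAYERS = ("recall", "ranking", "fusion", "presentation")


def bucket(unit_id, layer, n_buckets=100):
    """Map a user id (UV split) or request id (PV split) to a stable bucket."""
    digest = hashlib.md5(f"{layer}:{unit_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def assign(unit_id, experiments):
    """Pick one experiment per layer.

    experiments maps a layer to a list of (name, start, end) tuples, where
    [start, end) is the bucket range that experiment owns in that layer.
    """
    result = {}
    for layer in LAYERS:
        b = bucket(unit_id, layer)
        result[layer] = "default"
        for name, start, end in experiments.get(layer, []):
            if start <= b < end:
                result[layer] = name
                break
    return result


# Hypothetical configuration: two experiments running in different layers.
experiments = {
    "recall": [("bandit_cold_start", 0, 20)],
    "ranking": [("xgb_lr", 0, 50), ("xdeepfm", 50, 100)],
}
assignment = assign("user_12345", experiments)
print(assignment)
```

Passing a user id gives a UV-based split (the same user always sees the same variant), while passing a per-request id gives a PV-based split. Changing the bucket ranges in the configuration reallocates traffic without code changes.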
Conclusion & Future Work The system emphasizes robust user profiling and feature extraction, addressing the challenges of heterogeneous, locally‑focused content. Future plans include deeper user intent mining, richer contextual tags, new network architectures (including vision and reinforcement learning), multi‑objective optimization, and expansion of the user‑profile knowledge graph.
References 1. https://arxiv.org/pdf/1803.05170.pdf 2. https://blog.csdn.net/yfreedomliTHU/article/details/91386734
Author Yan Wenchang – 58 Algorithm Architect & Technical Committee Member.