How Toutiao Scales Personalized News: Architecture and Recommendation Engine
This article outlines Toutiao's rapid growth, its data pipelines, user modeling, recommendation system, storage solutions, and push notification strategies that together enable personalized news delivery to hundreds of millions of users.
Product Background
Toutiao, founded in March 2012, grew from a handful of engineers to over 200 staff within four years, expanding its product line from short jokes to news, special sales, and movies.
The app delivers personalized news and reports 500 million registered users, 48 million daily active users, and 5 billion daily page views, with users spending more than 65 minutes per session.
Technical and Architecture Evolution
Toutiao’s technology stack processes around 10,000 original news articles daily, along with content from various websites, novels, and blogs. Crawlers collect the data, which is then manually filtered for sensitive content.
Text analysis extracts categories, tags, topics, regional information, popularity, and weight for each article.
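As a rough illustration of what per-article feature extraction can look like, the sketch below assigns a category and tags with a plain keyword lookup. It is a toy stand-in, not Toutiao's text-analysis pipeline: the `CATEGORY_KEYWORDS` table, the `extract_features` function, and the sample sentence are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword lists; a production system would use trained classifiers.
CATEGORY_KEYWORDS = {
    "sports": {"match", "league", "coach"},
    "finance": {"stock", "market", "bank"},
}


def extract_features(text, max_tags=3):
    """Return a category and up to max_tags keyword tags for one article."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(words)

    # Score each category by how often its keywords occur in the text.
    category_scores = {
        category: sum(counts[word] for word in keywords)
        for category, keywords in CATEGORY_KEYWORDS.items()
    }
    best = max(category_scores, key=category_scores.get)
    category = best if category_scores[best] > 0 else "unknown"

    # Tags are the known keywords found in the text, most frequent first.
    known = set().union(*CATEGORY_KEYWORDS.values())
    tags = [word for word, _ in counts.most_common() if word in known]
    return {"category": category, "tags": tags[:max_tags]}


features = extract_features(
    "The coach praised the league after the match. The league agreed."
)
```

Regional information, popularity, and weight would be further fields on the same record; they are left out here because the source does not say how they are computed.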
User Modeling
User actions are logged in real time using tools such as Scribe, Flume, and Kafka. Interest mining leverages Hadoop and Storm, and the resulting models are stored in MySQL/MongoDB (with read‑write separation) and cached in Memcached/Redis.
By 2015 the user‑modeling cluster comprised roughly 7,000 machines, handling dimensions like subscriptions, tags, and article push strategies, enabling continuous recommendation.
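The step from logged actions to an interest model can be sketched as a simple aggregation. This is a minimal in-memory example under stated assumptions: the `Event` record, the `ACTION_WEIGHTS` values, and `build_profiles` are hypothetical, and the real pipeline would run on Hadoop or Storm rather than over a Python list.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical action weights; the source does not say how actions are scored.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}


@dataclass(frozen=True)
class Event:
    user_id: str
    action: str   # "click", "share", or "dislike"
    tags: tuple   # tags of the article the user acted on


def build_profiles(events):
    """Aggregate logged events into per-user tag scores."""
    profiles = defaultdict(lambda: defaultdict(float))
    for event in events:
        weight = ACTION_WEIGHTS.get(event.action, 0.0)  # unknown actions count as 0
        for tag in event.tags:
            profiles[event.user_id][tag] += weight
    return {user: dict(scores) for user, scores in profiles.items()}


events = [
    Event("u1", "click", ("sports", "nba")),
    Event("u1", "share", ("nba",)),
    Event("u1", "dislike", ("finance",)),
    Event("u2", "click", ("finance",)),
]
profiles = build_profiles(events)
```

The resulting dictionaries are the kind of record that would then be written to the database and cached for the recommendation stage.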
Cold Start for New Users
Toutiao identifies new users via device, OS, and app version, and enriches profiles using social logins (e.g., Weibo) to capture friends, followers, and activity.
Additional signals include installed apps, device models, browser bookmarks, and subscribed channels (movies, jokes, products, etc.).
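One way to turn these cold-start signals into a starting profile is to map them to interest weights before any reading history exists. The sketch below assumes a hypothetical `APP_INTERESTS` lookup table and made-up weights; it only illustrates the idea of seeding a profile from installed apps and subscribed channels.

```python
# Hypothetical mapping from installed apps to interest tags.
APP_INTERESTS = {
    "com.example.stocks": ["finance"],
    "com.example.football": ["sports"],
}


def cold_start_profile(installed_apps, subscribed_channels, default_weight=0.5):
    """Seed interest weights for a user who has no reading history yet."""
    profile = {}
    # Interests inferred from installed apps get a modest starting weight.
    for app in installed_apps:
        for tag in APP_INTERESTS.get(app, []):
            profile[tag] = max(profile.get(tag, 0.0), default_weight)
    # Explicit subscriptions are treated as a stronger signal (an assumption).
    for channel in subscribed_channels:
        profile[channel] = 1.0
    return profile


seed = cold_start_profile(
    installed_apps=["com.example.stocks", "com.example.unknown"],
    subscribed_channels=["movies", "finance"],
)
```

Unknown apps are simply ignored, so the profile degrades gracefully when few signals are available.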
Recommendation System
The core recommendation engine includes automatic and semi‑automatic components:
Automatic: candidate generation, user matching (e.g., location, profile), and push task creation, requiring high‑throughput, massive‑scale delivery.
Semi‑automatic: candidate selection combined with user behavior signals.
Channels are organized by classification, interest tags, keywords, and text analysis, and are supported by more than 300 classifiers that continue to evolve.
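The automatic path (candidate generation followed by user matching on location and profile) can be sketched as a filter-then-rank step. The `Article` and `User` types, the scoring formula, and the 0.1 popularity factor are assumptions made for illustration, not Toutiao's ranking model.

```python
from dataclasses import dataclass, field


@dataclass
class Article:
    article_id: str
    tags: set
    region: str = ""          # empty string means not region-specific
    popularity: float = 0.0


@dataclass
class User:
    user_id: str
    city: str
    interests: dict = field(default_factory=dict)  # tag -> weight


def score(user, article):
    """Sum of interest weights for matching tags plus a small popularity bonus."""
    interest = sum(user.interests.get(tag, 0.0) for tag in article.tags)
    return interest + 0.1 * article.popularity


def recommend(user, candidates, limit=2):
    # Location matching: drop articles targeted at a different region.
    eligible = [a for a in candidates if a.region in ("", user.city)]
    # Profile matching: rank the remaining candidates by score.
    ranked = sorted(eligible, key=lambda a: score(user, a), reverse=True)
    return [a.article_id for a in ranked[:limit]]


user = User("u1", city="Beijing", interests={"nba": 4.0, "sports": 1.0})
candidates = [
    Article("a1", {"nba", "sports"}, popularity=2.0),
    Article("a2", {"finance"}, popularity=9.0),
    Article("a3", {"sports"}, region="Shanghai", popularity=5.0),
    Article("a4", {"sports"}, region="Beijing", popularity=1.0),
]
top = recommend(user, candidates)
```

Here `a3` is filtered out by region before scoring, and the popular but off-interest `a2` ranks below both sports articles.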
Data Storage
Persistent storage uses MySQL or MongoDB together with Memcached/Redis, employing many databases and large‑memory instances, and experimenting with SSDs.
Images are stored in the database and served via a CDN.
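Pairing a database with Memcached or Redis is commonly done with a cache-aside read path. The sketch below shows that pattern using plain dictionaries as stand-ins for both stores; the `ProfileStore` class is hypothetical, and the source does not say which caching strategy Toutiao actually uses.

```python
class ProfileStore:
    """Cache-aside reads: try the cache first, fall back to the database."""

    def __init__(self, database, cache):
        self.database = database   # stand-in for MySQL or MongoDB
        self.cache = cache         # stand-in for Memcached or Redis
        self.db_reads = 0          # counts cache misses that hit the database

    def get(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]
        self.db_reads += 1
        profile = self.database.get(user_id)
        if profile is not None:
            self.cache[user_id] = profile   # populate the cache on a miss
        return profile

    def put(self, user_id, profile):
        # Write to the database, then invalidate so the next read reloads it.
        self.database[user_id] = profile
        self.cache.pop(user_id, None)


store = ProfileStore(database={"u1": {"nba": 4.0}}, cache={})
first = store.get("u1")    # miss: loaded from the database
second = store.get("u1")   # hit: served from the cache
```

Invalidating on write rather than updating the cache keeps the sketch simple; a real deployment would also need expiry times and a policy for concurrent writers.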
Message Push
Push notifications increase daily active users (DAU) by about 20%, and their absence can reduce DAU by roughly 10% (2015 data). Key metrics include click-through rate, click count, app uninstalls, and the number of users who disable push.
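These metrics are simple ratios over the number of delivered pushes. A minimal sketch follows, with a hypothetical `PushStats` record and made-up numbers that are not Toutiao data:

```python
from dataclasses import dataclass


@dataclass
class PushStats:
    delivered: int       # pushes that reached a device
    clicks: int          # pushes that were opened
    uninstalls: int      # app uninstalls attributed to the push
    push_disabled: int   # users who turned notifications off

    def _rate(self, count):
        # Guard against division by zero for campaigns with no deliveries.
        return count / self.delivered if self.delivered else 0.0

    @property
    def click_through_rate(self):
        return self._rate(self.clicks)

    @property
    def uninstall_rate(self):
        return self._rate(self.uninstalls)

    @property
    def disable_rate(self):
        return self._rate(self.push_disabled)


# Illustrative figures only.
stats = PushStats(delivered=10_000, clicks=800, uninstalls=5, push_disabled=20)
```

Tracking uninstall and disable rates alongside click-through rate makes it visible when a campaign gains clicks at the cost of losing users.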
Pushes are personalized by frequency, content, region, and interest; for example, a user may receive city-specific news or updates about a particular industry.
Push infrastructure must be fast, reliable, resource‑efficient, and provide detailed reporting, A/B testing, and easy integration for developers.
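Personalizing pushes by frequency, region, and interest can be sketched as a per-user eligibility check. The `PushMessage` and `Subscriber` types and the `DAILY_CAP` value are hypothetical, chosen only to show how the three filters combine.

```python
from dataclasses import dataclass


@dataclass
class PushMessage:
    title: str
    topic: str
    city: str = ""   # empty means the message is not city-specific


@dataclass
class Subscriber:
    user_id: str
    city: str
    interests: set
    sent_today: int = 0


DAILY_CAP = 3  # hypothetical per-user frequency limit


def should_push(subscriber, message, daily_cap=DAILY_CAP):
    if subscriber.sent_today >= daily_cap:                   # frequency
        return False
    if message.city and message.city != subscriber.city:     # region
        return False
    return message.topic in subscriber.interests             # interest


def select_recipients(subscribers, message):
    """Return the user IDs to push to and count the push against their cap."""
    recipients = []
    for subscriber in subscribers:
        if should_push(subscriber, message):
            subscriber.sent_today += 1
            recipients.append(subscriber.user_id)
    return recipients


subscribers = [
    Subscriber("u1", "Beijing", {"traffic"}),
    Subscriber("u2", "Shanghai", {"traffic"}),
    Subscriber("u3", "Beijing", {"traffic"}, sent_today=3),
    Subscriber("u4", "Beijing", {"finance"}),
]
message = PushMessage("Road closures downtown", topic="traffic", city="Beijing")
recipients = select_recipients(subscribers, message)
```

Only `u1` qualifies: `u2` is in another city, `u3` has reached the daily cap, and `u4` has no interest in the topic.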
Extended Thoughts
The techniques described are applicable to other domains such as e‑commerce, travel, entertainment, health, and sports, where user models, data pipelines, and recommendation logic share common foundations.
21CTO
