How Toutiao Scales Personalized News: Architecture and Recommendation Engine

This article outlines Toutiao's rapid growth, its data pipelines, user modeling, recommendation system, storage solutions, and push notification strategies that together enable personalized news delivery to hundreds of millions of users.

21CTO

Product Background

Toutiao, founded in March 2012, grew from a handful of engineers to over 200 staff within four years, expanding its product line from short jokes to news, special sales, and movies.

The app serves personalized news and, at the time of writing, reported 500 million registered users, 48 million daily active users, and 5 billion daily page views, with users spending over 65 minutes per session.

Technical and Architecture Evolution

Toutiao’s technology stack processes around 10,000 original news articles daily, along with content from various websites, novels, and blogs. Crawlers collect the data, which is then manually filtered for sensitive content.

Text analysis extracts categories, tags, topics, regional information, popularity, and weight for each article.

User Modeling

User actions are logged in real time with tools such as Scribe, Flume, and Kafka. Interest mining runs on Hadoop and Storm. The resulting models are stored in MySQL or MongoDB (with reads and writes split across separate instances) and cached in Memcached or Redis.
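To make the interest-mining step concrete, here is a minimal in-memory sketch that turns logged actions into a per-user tag profile. The action weights, the one-week half-life, and the event shape are assumptions chosen for illustration; the article does not describe how Toutiao actually weights or decays user actions.

```python
from collections import defaultdict

# Hypothetical action weights; the article does not say how actions are weighted.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}
# Assumed half-life: an action loses half its influence after one week.
HALF_LIFE_SECONDS = 7 * 24 * 3600


def build_interest_profile(events, now):
    """Aggregate (timestamp, action, tags) events into a tag -> score mapping."""
    profile = defaultdict(float)
    for timestamp, action, tags in events:
        age = max(0.0, now - timestamp)
        decay = 0.5 ** (age / HALF_LIFE_SECONDS)
        weight = ACTION_WEIGHTS.get(action, 0.0) * decay
        for tag in tags:
            profile[tag] += weight
    return dict(profile)


now = 1_000_000.0
events = [
    (now, "click", ["sports"]),
    (now - HALF_LIFE_SECONDS, "click", ["sports", "nba"]),
    (now, "share", ["tech"]),
]
profile = build_interest_profile(events, now)
```

In a real deployment this aggregation would run inside the batch and streaming jobs rather than in a single process; the sketch only shows the shape of the computation.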

By 2015 the user-modeling cluster comprised roughly 7,000 machines. The models cover dimensions such as subscriptions, tags, and article push strategies, and they feed the recommender continuously.

Cold Start for New Users

Toutiao identifies new users via device, OS, and app version, and enriches profiles using social logins (e.g., Weibo) to capture friends, followers, and activity.

Additional signals include installed apps, device models, browser bookmarks, and subscribed channels (movies, jokes, products, etc.).
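The cold-start signals above can be combined into a seed profile before any reading history exists. The following is a sketch under stated assumptions: the app-to-interest mapping, the weights, and the `local:` tag prefix are all hypothetical and not taken from the article.

```python
# Hypothetical mapping from installed apps to interest tags.
APP_INTERESTS = {"nba_live": ["sports"], "stock_tracker": ["finance"]}
SUBSCRIBED_WEIGHT = 1.0  # assumed: explicit subscriptions are trusted most
INFERRED_WEIGHT = 0.5    # assumed: inferred signals get a lower weight


def cold_start_profile(installed_apps, subscribed_channels, city=None):
    """Build an initial tag -> weight profile from signup-time signals."""
    profile = {}
    for app in installed_apps:
        for tag in APP_INTERESTS.get(app, []):
            profile[tag] = max(profile.get(tag, 0.0), INFERRED_WEIGHT)
    # Explicit channel subscriptions override inferred weights.
    for channel in subscribed_channels:
        profile[channel] = SUBSCRIBED_WEIGHT
    if city:
        profile[f"local:{city}"] = INFERRED_WEIGHT
    return profile


profile = cold_start_profile(
    ["nba_live", "unknown_app"], ["movies", "sports"], city="Shanghai"
)
```

Unknown apps are simply ignored, so the profile degrades gracefully when few signals are available.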

Recommendation System

The core recommendation engine includes automatic and semi‑automatic components:

Automatic: candidate generation, user matching (e.g., location, profile), and push task creation, requiring high‑throughput, massive‑scale delivery.

Semi‑automatic: candidate selection combined with user behavior signals.

Recommendation channels are organized by category, interest tag, keyword, and text analysis, backed by more than 300 classifiers that are continuously updated.
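As an illustration of candidate generation followed by user matching, the sketch below scores each candidate article against a user's interest weights and adds a bonus for local news. The `Article` type, the dot-product score, and the `LOCAL_BOOST` value are assumptions made for this example; the article does not describe Toutiao's actual ranking function.

```python
from dataclasses import dataclass
from typing import Dict, Optional

LOCAL_BOOST = 0.5  # assumed bonus for articles from the user's own city


@dataclass
class Article:
    article_id: str
    tags: Dict[str, float]      # tag -> weight produced by text analysis
    city: Optional[str] = None  # regional information, if any


def score(user_interests, user_city, article):
    """Dot product of user interests and article tags, plus a location bonus."""
    relevance = sum(
        user_interests.get(tag, 0.0) * weight
        for tag, weight in article.tags.items()
    )
    if article.city is not None and article.city == user_city:
        relevance += LOCAL_BOOST
    return relevance


def recommend(user_interests, user_city, candidates, limit=10):
    """Return the ids of the highest-scoring candidates, best first."""
    ranked = sorted(
        candidates,
        key=lambda article: score(user_interests, user_city, article),
        reverse=True,
    )
    return [article.article_id for article in ranked[:limit]]


candidates = [
    Article("a1", {"sports": 1.0}),
    Article("a2", {"tech": 1.0}),
    Article("a3", {"tech": 0.5}, city="Beijing"),
]
top = recommend({"tech": 0.8, "sports": 0.2}, "Beijing", candidates, limit=2)
```

Here the local article `a3` outranks the more on-topic `a2` because of the location bonus, which is the kind of trade-off a tunable boost makes explicit.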

Data Storage

Persistent storage uses MySQL or MongoDB together with Memcached/Redis, employing many databases and large‑memory instances, and experimenting with SSDs.
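A common way to pair a database with a cache layer is the cache-aside pattern, sketched below. Plain dictionaries stand in for MySQL/MongoDB and Memcached/Redis so the example runs on its own; the `ProfileStore` class and its methods are hypothetical, and the article does not say which caching pattern Toutiao uses.

```python
class ProfileStore:
    """Cache-aside reads and invalidate-on-write over two dict-like stores."""

    def __init__(self, database, cache):
        self.database = database  # stand-in for MySQL or MongoDB
        self.cache = cache        # stand-in for Memcached or Redis
        self.db_reads = 0         # counter to make cache hits observable

    def get(self, user_id):
        if user_id in self.cache:
            return self.cache[user_id]
        # Cache miss: fall back to the database and populate the cache.
        self.db_reads += 1
        profile = self.database.get(user_id)
        if profile is not None:
            self.cache[user_id] = profile
        return profile

    def put(self, user_id, profile):
        # Write to the database first, then drop any stale cached copy.
        self.database[user_id] = profile
        self.cache.pop(user_id, None)


store = ProfileStore(database={"u1": {"tech": 0.8}}, cache={})
first = store.get("u1")   # miss, reads the database
second = store.get("u1")  # hit, served from the cache
store.put("u1", {"tech": 0.9})
third = store.get("u1")   # miss again after invalidation
```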

Images are stored in the database and served via a CDN.

Message Push

According to 2015 data, push notifications raise daily active users (DAU) by about 20%, while turning them off can reduce DAU by roughly 10%. Key metrics include click-through rate, click count, app uninstalls, and the number of users who disable push.
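The metrics listed above reduce to simple ratios over delivered notifications. The sketch below shows one way to compute them; the function name, the field names, and the choice of delivered count as the denominator are assumptions for illustration.

```python
def push_report(delivered, clicks, uninstalls, disables):
    """Summarize one push campaign as counts and per-delivery rates."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    return {
        "click_count": clicks,
        "click_through_rate": clicks / delivered,
        "uninstall_rate": uninstalls / delivered,
        "disable_rate": disables / delivered,
    }


report = push_report(delivered=1000, clicks=50, uninstalls=2, disables=5)
```

Tracking uninstall and disable rates next to click-through rate helps show when a push that attracts clicks is also driving users away.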

Push content is personalized by frequency, content, region, and interest, e.g., delivering city‑specific news or industry‑specific updates.
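Targeting by frequency, region, and interest can be expressed as a chain of filters over the user base. This is a minimal sketch: the dictionary shapes, the `daily_cap` default of 3, and the function name are hypothetical and not drawn from the article.

```python
def select_recipients(users, push, sent_today, daily_cap=3):
    """Return ids of users who should receive this push.

    users: list of dicts with "id", optional "city", optional "interests"
    push: dict with optional "city" and "tag" targeting fields
    sent_today: user id -> number of pushes already sent today
    """
    recipients = []
    for user in users:
        # Frequency: skip users who have reached the assumed daily cap.
        if sent_today.get(user["id"], 0) >= daily_cap:
            continue
        # Region: city-specific pushes go only to users in that city.
        if push.get("city") and user.get("city") != push["city"]:
            continue
        # Interest: tag-specific pushes go only to users with that interest.
        if push.get("tag") and push["tag"] not in user.get("interests", ()):
            continue
        recipients.append(user["id"])
    return recipients


users = [
    {"id": "u1", "city": "Beijing", "interests": ["finance"]},
    {"id": "u2", "city": "Shanghai", "interests": ["finance"]},
    {"id": "u3", "city": "Beijing", "interests": ["sports"]},
    {"id": "u4", "city": "Beijing", "interests": ["finance"]},
]
push = {"city": "Beijing", "tag": "finance"}
recipients = select_recipients(users, push, sent_today={"u4": 3})
```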

Push infrastructure must be fast, reliable, resource‑efficient, and provide detailed reporting, A/B testing, and easy integration for developers.
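For the A/B testing requirement, one widely used approach is deterministic hash-based bucketing, sketched below. The article does not say how Toutiao assigns users to experiments, so the function name, the experiment key format, and the use of SHA-256 are assumptions.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Map a user to a variant deterministically, with no stored state."""
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


variant = assign_variant("u1", "push_title_test")
```

Including the experiment name in the hashed key keeps assignments independent across experiments, so a user in the treatment group of one test is not systematically in the treatment group of another.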

Extended Thoughts

The techniques described are applicable to other domains such as e‑commerce, travel, entertainment, health, and sports, where user models, data pipelines, and recommendation logic share common foundations.
