Inside Toutiao’s Massive Data Pipeline: Architecture, Recommendation & Scaling
This article details Toutiao’s rapid growth and its large‑scale data pipeline, covering article crawling, user modeling, recommendation engines, storage solutions, push notifications, micro‑service architecture, and the underlying virtualization PaaS platform that powers its personalized news service.
1. Product Background
Toutiao was founded in March 2012. In four years it grew from a dozen engineers to over 200 staff, expanding product lines from jokes to news, special sales, movies, and more.
2. Article Crawling and Analysis
Toutiao ingests about 10,000 original news articles per day from various news sites, along with novels and blogs. Crawlers collect the content, and sensitive articles are filtered manually. Text analysis covers classification, tagging, and topic extraction; articles are then weighted by region, popularity, and other factors.
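The weighting step can be illustrated with a minimal Python sketch. The article does not publish the real formula, so the Article fields, the REGION_BOOST and POPULARITY_WEIGHT constants, and the use of click counts as a stand-in for popularity are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical constants; the article does not give the real values.
REGION_BOOST = 1.5
POPULARITY_WEIGHT = 0.01


@dataclass
class Article:
    title: str
    region: str
    clicks: int              # assumed stand-in for "popularity"
    base_score: float = 1.0


def article_weight(article: Article, user_region: str) -> float:
    """Combine a base score with popularity and region signals."""
    score = article.base_score + POPULARITY_WEIGHT * article.clicks
    # Boost articles that come from the reader's own region.
    if article.region == user_region:
        score *= REGION_BOOST
    return score
```

Under these assumed constants, an article with 100 clicks scores higher for a reader in the same region than for a reader elsewhere, which is the behavior the paragraph describes.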
3. User Modeling
Real‑time logs of user actions are collected and transported with tools such as Scribe, Flume, and Kafka. User interests are learned using Hadoop and Storm. Model data is stored in MySQL/MongoDB (with read‑write separation) and cached in Memcached/Redis. By 2015 the user‑model cluster had about 7,000 machines.
User subscriptions
Tags
Partial article push
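As a rough illustration of learning interests from action logs, the sketch below updates a per-user tag profile from a batch of events. It is not Toutiao's algorithm: the action names, the ACTION_WEIGHTS values, and the DECAY factor are hypothetical, and a plain Python list stands in for events that would arrive through Kafka or a similar log pipeline.

```python
from collections import defaultdict

# Hypothetical action weights and decay factor, chosen only for illustration.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}
DECAY = 0.9


def update_profile(profile, events):
    """Apply one batch of (tag, action) events to a tag -> score profile."""
    updated = defaultdict(float)
    # Decay existing interests so that recent behavior dominates.
    for tag, score in profile.items():
        updated[tag] = score * DECAY
    # Add the weight of each new action; unknown actions contribute nothing.
    for tag, action in events:
        updated[tag] += ACTION_WEIGHTS.get(action, 0.0)
    return dict(updated)
```

Calling update_profile once per batch keeps the profile small (one score per tag), which suits a cache-backed store like the one described above.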
4. Cold‑Start for New Users
New users are identified by device, OS, version, and social‑login information (e.g., Weibo). Friend relationships, followers, and content interactions are used to build an initial profile.
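One simple way to turn friend relationships into an initial profile is to average the friends' interest scores and keep the strongest tags. The sketch below assumes that approach; the function name, the tag-to-score dictionary shape, and the top_n cutoff are illustrative choices, not details from the article.

```python
def cold_start_profile(friend_profiles, top_n=3):
    """Average friends' tag scores and keep the top_n strongest tags."""
    if not friend_profiles:
        return {}
    totals = {}
    for profile in friend_profiles:
        for tag, score in profile.items():
            totals[tag] = totals.get(tag, 0.0) + score
    # Divide by the number of friends so absent tags count as zero interest.
    averaged = {tag: total / len(friend_profiles) for tag, total in totals.items()}
    top = sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)[:top_n]
    return dict(top)
```

A user with no social-login data gets an empty profile here; in that case device, OS, and version signals would have to carry the initial guess.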
5. Recommendation System
Recommendation is the core of Toutiao’s architecture and comes in two forms: automatic and semi‑automatic.
Automatic recommendation
Candidate generation
Automatic matching based on location and extracted user info
Automatic push task creation
Semi‑automatic recommendation
Automatic candidate selection
Ranking based on user actions inside and outside the app
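The two stages named above, candidate generation followed by ranking, can be sketched as follows. This is a simplified illustration under stated assumptions: articles are plain dictionaries with hypothetical "region" and "tags" keys, a "national" region value marks articles shown everywhere, and ranking uses only the tag scores from a user profile.

```python
def generate_candidates(articles, user_region):
    """Keep articles that are national or match the user's region."""
    return [a for a in articles if a["region"] in (user_region, "national")]


def rank(candidates, profile):
    """Order candidates by the user's total interest in each article's tags."""
    def score(article):
        return sum(profile.get(tag, 0.0) for tag in article["tags"])
    # sorted() is stable, so ties keep their original order.
    return sorted(candidates, key=score, reverse=True)
```

Filtering before scoring keeps the expensive ranking step limited to a small candidate set, which is the usual reason for splitting the work into two stages.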
6. Data Storage
Persistent storage uses MySQL or MongoDB together with Memcached/Redis. Images are stored in the database and served via CDN.
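Pairing a database with a cache is commonly done with a cache-aside read path, sketched below. The article does not say which caching pattern Toutiao uses, so this is an assumption; plain dictionaries stand in for MySQL/MongoDB and Memcached/Redis, and the CachedStore class and its db_reads counter are hypothetical.

```python
class CachedStore:
    """Cache-aside reads: try the cache first, fall back to the database."""

    def __init__(self, database):
        self.database = database   # dict standing in for MySQL/MongoDB
        self.cache = {}            # dict standing in for Memcached/Redis
        self.db_reads = 0          # counts how often the database is hit

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value    # populate the cache on a miss
        return value

    def put(self, key, value):
        self.database[key] = value
        self.cache.pop(key, None)      # invalidate so the next read refreshes
```

Invalidating on write rather than updating the cache in place keeps the write path simple at the cost of one extra database read after each update.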
7. Message Push
Push notifications increase DAU by about 20%; without push, DAU drops by about 10% (2015 data). Tracked metrics include click‑through rate, click volume, app uninstalls, and push‑disable counts. Push is personalized by frequency, content, region, and interests.
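The metrics listed above reduce to simple ratios over the number of pushes sent. The sketch below shows one way to compute them; the function name, argument names, and the choice of "sent" as the denominator are assumptions, since the article does not define the metrics precisely.

```python
def push_metrics(sent, clicks, uninstalls, disables):
    """Return per-push rates for one notification campaign."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    return {
        "click_volume": clicks,
        "ctr": clicks / sent,                 # click-through rate
        "uninstall_rate": uninstalls / sent,  # uninstalls attributed to the push
        "disable_rate": disables / sent,      # users who turned push off
    }
```

Tracking uninstall and disable rates next to click-through rate makes the cost of an overly aggressive push frequency visible alongside its benefit.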
8. System Architecture
Toutiao splits monolithic applications into micro‑services, with a common abstraction layer for code reuse. The architecture consists of three layers: infrastructure, platform services, and business services.
9. Virtualization PaaS Platform
A three‑layer PaaS platform manages resources: a unified SaaS layer, a generic app execution engine, and an IaaS layer that aggregates public‑cloud resources. This allows high‑bandwidth events to be served efficiently.
10. Summary
Key points: data generation & collection, Kafka as the message bus, ETL pipelines, and three query engine modes (batch, MPP, cube) used for efficient data analysis.
21CTO
