Technical Architecture and Data Processing of Toutiao News Feed System
This article provides a comprehensive overview of Toutiao's rapid growth, massive user base, data collection pipelines, user modeling, recommendation engine, storage solutions, message push strategies, micro‑service architecture, and virtualization PaaS platform, illustrating how big‑data technologies enable personalized news delivery at scale.
1. Product Background
Toutiao, founded in 2012, grew from a few engineers to over 200 staff, offering products such as a news feed, e‑commerce, and video services.
2. Data Overview
At the time of writing, Toutiao had roughly 500 million registered users, 48 million daily active users, and 500 million page views per day, with average time spent per user exceeding 65 minutes.
3. Article Crawling and Analysis
About 10,000 original articles are produced each day. Crawlers also collect news, novels, and blogs. Collected content is manually screened for sensitive material, then processed for classification, tagging, and topic extraction.
4. User Modeling
User actions are logged in real time. Scribe, Flume, and Kafka collect and transport the logs, Hadoop and Storm process them, and the results are stored in MySQL/MongoDB behind Redis/Memcached caches. The user model covers subscriptions, tags, and partial article pushes, and is computed on a cluster of thousands of machines.
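As a rough illustration of how logged actions could feed a tag-based user model, the sketch below aggregates action events into per-user tag weights. The event shape, the `ACTION_WEIGHTS` values, and the function name `build_tag_profile` are assumptions made for this example; the article does not describe Toutiao's actual model at this level of detail.

```python
from collections import defaultdict

# Hypothetical action weights; the article does not specify any values.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}


def build_tag_profile(events):
    """Aggregate (user_id, action, tags) events into per-user tag weights."""
    profiles = defaultdict(lambda: defaultdict(float))
    for user_id, action, tags in events:
        # Unknown actions contribute nothing to the profile.
        weight = ACTION_WEIGHTS.get(action, 0.0)
        for tag in tags:
            profiles[user_id][tag] += weight
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {user: dict(tags) for user, tags in profiles.items()}


# Example events, as they might arrive from a log stream (illustrative data).
events = [
    ("u1", "click", ["tech", "ai"]),
    ("u1", "share", ["ai"]),
    ("u1", "dislike", ["sports"]),
    ("u2", "click", ["sports"]),
]
profiles = build_tag_profile(events)
```

In a production pipeline this aggregation would run in a stream or batch job rather than in memory; the in-memory version only shows the shape of the computation.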
5. Cold‑Start for New Users
New users are profiled using device information, OS version and social‑account data (followers, posts, comments) to generate initial interest vectors.
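One way to picture the cold-start step is as a blend of segment-level priors with signals from a linked social account. The sketch below is only an assumption about how such a blend might look: the `priors` table, the segment keys, and the equal weighting of followed topics are all hypothetical.

```python
def cold_start_vector(device_model, os_version, followed_topics, priors):
    """Build a normalized initial interest vector for a new user.

    priors maps hypothetical segment keys such as ("device", model) or
    ("os", version) to topic weights observed for similar users.
    """
    vector = {}
    for key in (("device", device_model), ("os", os_version)):
        for topic, weight in priors.get(key, {}).items():
            vector[topic] = vector.get(topic, 0.0) + weight
    # Each topic followed on the social account adds one unit of interest.
    for topic in followed_topics:
        vector[topic] = vector.get(topic, 0.0) + 1.0
    total = sum(vector.values())
    # With no signal at all, return an empty vector instead of dividing by zero.
    return {t: w / total for t, w in vector.items()} if total else {}


# Illustrative priors; real values would come from existing users' behavior.
priors = {
    ("device", "phone-x"): {"tech": 0.5, "finance": 0.5},
    ("os", "os-9"): {"games": 1.0},
}
vector = cold_start_vector("phone-x", "os-9", ["tech", "sports"], priors)
```

The vector is normalized so that it can be compared with profiles of established users; as real behavior accumulates, it would be replaced by the behavior-based model.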
6. Recommendation System
The core recommendation engine consists of automatic and semi‑automatic pipelines that generate candidates, match users, create push tasks and deliver personalized content at massive scale.
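The candidate-generation and user-matching steps can be sketched as a simple tag-overlap scorer. This is a minimal stand-in, not Toutiao's algorithm: the function name `recommend`, the scoring rule, and the sample data are assumptions for illustration.

```python
def recommend(user_profile, articles, top_k=2):
    """Rank candidate articles by the user's summed weight over each article's tags."""
    scored = []
    for article_id, tags in articles.items():
        score = sum(user_profile.get(tag, 0.0) for tag in tags)
        # Keep only candidates with a positive match for this user.
        if score > 0:
            scored.append((score, article_id))
    # Highest score first; ties are broken by article id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored[:top_k]]


# Illustrative user profile (tag -> weight) and candidate articles (id -> tags).
user_profile = {"ai": 4.0, "tech": 1.0, "sports": -2.0}
articles = {
    "a1": ["ai", "tech"],
    "a2": ["sports"],
    "a3": ["tech"],
    "a4": ["finance"],
}
picks = recommend(user_profile, articles)
```

The selected article ids would then be turned into push or feed tasks for that user.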
7. Data Storage
Persistent storage relies on MySQL or MongoDB with read/write separation, complemented by large‑memory Redis caches and CDN‑backed image storage.
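Read/write separation combined with a cache is commonly implemented as a cache-aside layer in front of one primary and several replicas. The sketch below uses plain dictionaries as stand-ins for MySQL/MongoDB and Redis; the class name `ReadWriteSplitStore` and the synchronous replication loop are simplifying assumptions.

```python
import itertools


class ReadWriteSplitStore:
    """Cache-aside store: writes go to the primary, reads go to replicas."""

    def __init__(self, primary, replicas, cache):
        self.primary = primary      # stand-in for the primary database
        self.replicas = replicas    # stand-ins for read replicas
        self.cache = cache          # stand-in for a Redis-like cache
        self._next_replica = itertools.cycle(range(len(replicas)))

    def write(self, key, value):
        self.primary[key] = value
        # Real replication is asynchronous; it is copied eagerly here for brevity.
        for replica in self.replicas:
            replica[key] = value
        # Invalidate the cached entry so the next read sees the new value.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key]
        # Spread cache misses across replicas in round-robin order.
        replica = self.replicas[next(self._next_replica)]
        value = replica.get(key)
        if value is not None:
            self.cache[key] = value
        return value


cache = {}
store = ReadWriteSplitStore(primary={}, replicas=[{}, {}], cache=cache)
store.write("article:1", "first draft")
first_read = store.read("article:1")
```

Invalidating on write rather than updating the cache keeps the write path simple, at the cost of one extra replica read after each update.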
8. Message Push
Push notifications raise DAU by about 20%. The system tracks ROI and click‑through rates, and supports personalization by frequency, content, region, and interest.
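The personalization dimensions listed above can be pictured as a chain of filters applied before a push is sent, with click-through rate computed afterwards. The following is a hedged sketch: the user record fields, the `daily_cap` default, and both function names are hypothetical.

```python
def click_through_rate(clicks, deliveries):
    """Fraction of delivered pushes that were clicked."""
    return clicks / deliveries if deliveries else 0.0


def select_push_recipients(users, article_region, article_tags, daily_cap=3):
    """Filter users by frequency cap, region, and interest overlap."""
    recipients = []
    for user in users:
        # Frequency: skip users who already reached today's push limit.
        if user["pushes_today"] >= daily_cap:
            continue
        # Region: a region of None means the article is not region-specific.
        if article_region is not None and user["region"] != article_region:
            continue
        # Interest: require at least one shared tag.
        if not set(article_tags) & set(user["interests"]):
            continue
        recipients.append(user["id"])
    return recipients


# Illustrative user records.
users = [
    {"id": "u1", "region": "beijing", "interests": ["tech"], "pushes_today": 0},
    {"id": "u2", "region": "beijing", "interests": ["tech"], "pushes_today": 3},
    {"id": "u3", "region": "shanghai", "interests": ["tech"], "pushes_today": 1},
    {"id": "u4", "region": "beijing", "interests": ["sports"], "pushes_today": 0},
]
recipients = select_push_recipients(users, "beijing", ["tech", "ai"])
```

Tracking `click_through_rate` per push campaign gives one of the inputs needed to judge whether a given push strategy is worth its cost.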
9. System Architecture
Toutiao adopts a layered micro‑service architecture, separating infrastructure, common services and business modules, and runs on a hybrid private‑cloud/IDC environment.
10. Virtualization PaaS Platform
A three‑layer PaaS abstracts IaaS resources, providing SaaS services and a generic app execution engine for rapid iteration and fault tolerance.
11. Summary
The key components are data generation & collection, Kafka‑based messaging, ETL into data warehouses, and batch/MPP/Cube query engines that enable efficient analytics and personalized recommendation.