Inside Toutiao’s Massive Big Data & Recommendation Architecture
This article examines Toutiao’s rapid growth from a small startup to a platform serving over 500 million users, detailing its data collection, user modeling, cold‑start handling, recommendation engines, storage solutions, messaging push system, micro‑service design, and virtualized PaaS infrastructure that enable high‑throughput, personalized news delivery.
Product Background
Toutiao, founded in March 2012, grew from a dozen engineers to over 200 staff within four years, launching products such as Jinri Toutiao, Jinri Teshou, Jinri Dianying, and others.
Key statistics (combined internal and public data): 500 million registered users in May 2016, up from 150 million in May 2014 and 300 million in May 2015; 48 million daily active users, up from 10 million in 2014 and 30 million in 2015; 5 billion page views per day (reported as 5 billion article views and 1 billion video views); over 30 billion page requests; and an average user session of more than 65 minutes.
Article Crawling and Analysis
Roughly 10,000 original news articles are produced each day by news sites, blogs, novel platforms, and other sources. Crawlers collect them, and sensitive content is filtered manually. Text analysis covers classification, tagging, topic extraction, and weighting by region and popularity.
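The tagging step can be pictured with a small keyword-matching sketch. This is only an illustration: the `TAG_KEYWORDS` table and the `tag_article` function are hypothetical names, and a production pipeline would presumably rely on trained classifiers rather than fixed word lists.

```python
import re
from collections import Counter

# Hypothetical keyword lists, for illustration only.
TAG_KEYWORDS = {
    "sports": {"match", "league", "goal", "coach"},
    "tech": {"phone", "chip", "software", "startup"},
    "finance": {"stock", "market", "bank", "fund"},
}


def tag_article(text, min_hits=1):
    """Return tags ordered by keyword hit count (descending), ties by tag name."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    # Count how many times any keyword of each tag occurs in the text.
    hits = {tag: sum(words[w] for w in kws) for tag, kws in TAG_KEYWORDS.items()}
    return sorted(
        (tag for tag, n in hits.items() if n >= min_hits),
        key=lambda tag: (-hits[tag], tag),
    )


if __name__ == "__main__":
    print(tag_article("The startup shipped a new phone chip before the stock market opened"))
```

Raising `min_hits` trades recall for precision: an article then needs several matching keywords before a tag is attached.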
User Modeling
User actions are logged and processed in real time with Scribe, Flume, Kafka, Hadoop, and Storm. The resulting models are stored in MySQL/MongoDB (with read-write splitting) and Memcached/Redis. By 2015 the cluster comprised about 7,000 machines. The models cover subscriptions, tags, and article shuffling, and recommendations must be refreshed continuously.
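One way to think about a tag-based user model that is updated from an action stream is a set of per-tag weights that decay over time. The sketch below is an assumption made for illustration, not Toutiao's actual model: the action weights, the one-week half-life, and the `UserProfile` class are all hypothetical, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict

# Assumed values, chosen only for the example.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}
HALF_LIFE_SECONDS = 7 * 24 * 3600


class UserProfile:
    """Per-tag interest weights with exponential time decay."""

    def __init__(self):
        self.weights = defaultdict(float)
        self.last_update = None

    def _decay(self, now):
        # Halve every weight for each half-life elapsed since the last event.
        if self.last_update is not None and now > self.last_update:
            factor = 0.5 ** ((now - self.last_update) / HALF_LIFE_SECONDS)
            for tag in self.weights:
                self.weights[tag] *= factor
        self.last_update = now

    def apply(self, action, tags, timestamp):
        """Fold one logged action on an article with the given tags into the profile."""
        self._decay(timestamp)
        weight = ACTION_WEIGHTS.get(action, 0.0)
        for tag in tags:
            self.weights[tag] += weight

    def top_tags(self, n=3):
        ranked = sorted(self.weights.items(), key=lambda kv: (-kv[1], kv[0]))
        return [tag for tag, _ in ranked[:n]]
```

Decay keeps the profile responsive: a click from a week ago counts half as much as a click made now, so recent behaviour dominates the ranking.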
Cold‑Start for New Users
New users are profiled based on device, OS, app version, and social login information (e.g., Weibo). Attributes such as follower relationships, user tags, and installed apps are used to build an initial portrait.
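A minimal sketch of such an initial portrait follows. The app-to-interest mapping, the two weights, and the `initial_profile` function are hypothetical and exist only to show how device attributes, installed apps, and social tags could be merged into one starting profile.

```python
# Hypothetical mapping from installed apps to interest tags.
APP_INTERESTS = {
    "stock_tracker": "finance",
    "football_live": "sports",
    "code_editor": "tech",
}
SOCIAL_WEIGHT = 1.0  # assumed: explicit social tags count more
APP_WEIGHT = 0.5     # assumed: installed apps are a weaker signal


def initial_profile(device, os_name, app_version, installed_apps=(), social_tags=()):
    """Build a starting profile for a user with no reading history."""
    interests = {}
    for tag in social_tags:
        interests[tag] = interests.get(tag, 0.0) + SOCIAL_WEIGHT
    for app in installed_apps:
        tag = APP_INTERESTS.get(app)
        if tag is not None:  # apps without a known mapping are ignored
            interests[tag] = interests.get(tag, 0.0) + APP_WEIGHT
    return {
        "device": device,
        "os": os_name,
        "app_version": app_version,
        "interests": interests,
    }
```

Once the user starts reading, these seed weights would be overwritten by behaviour-based signals.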
Recommendation System
The core recommendation engine consists of automatic and semi-automatic pipelines. Automatic recommendation generates candidates, matches them to users (taking location into account), and creates push tasks, which demands high-throughput delivery to hundreds of millions of users. Semi-automatic recommendation selects candidates based on in-app and out-of-app user actions. More than 300 classifiers and numerous user models are maintained.
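The candidate-matching step can be sketched as a scoring function over article tags with a bonus for local content. The `Article` type, the `rank` function, and the `local_bonus` value are assumptions for illustration; the article does not describe the actual scoring formula.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Article:
    id: str
    tags: frozenset
    city: Optional[str] = None  # set only for location-specific articles


def rank(user_interests, user_city, articles, k=3, local_bonus=0.5):
    """Return the ids of the top-k candidates for one user."""
    scored = []
    for article in articles:
        # Sum the user's interest weights over the article's tags.
        score = sum(user_interests.get(tag, 0.0) for tag in article.tags)
        if article.city is not None and article.city == user_city:
            score += local_bonus
        if score > 0:  # drop candidates with no positive signal
            scored.append((score, article.id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in scored[:k]]
```

The selected ids would then be handed to whatever component creates the push or feed tasks.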
Data Storage
Persistent storage uses MySQL or MongoDB together with Memcached/Redis, often with large in‑memory databases and SSDs. Images are stored in the database and distributed via CDN.
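A common way to combine a database with a cache is the cache-aside pattern, sketched below. The article does not say which caching pattern Toutiao uses, so this is an assumed example; plain dictionaries stand in for MySQL/MongoDB and Memcached/Redis, and `CacheAsideStore` is a hypothetical name.

```python
class CacheAsideStore:
    """Read through a cache, fall back to the database, invalidate on write."""

    def __init__(self, database):
        self.database = database  # dict standing in for MySQL/MongoDB
        self.cache = {}           # dict standing in for Memcached/Redis
        self.db_reads = 0         # counter used to observe cache hits

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        self.db_reads += 1
        value = self.database.get(key)
        if value is not None:
            self.cache[key] = value  # populate the cache for later reads
        return value

    def put(self, key, value):
        self.database[key] = value
        # Invalidate rather than update, so the next read reloads fresh data.
        self.cache.pop(key, None)
```

Repeated reads of a hot article are served from memory, and only the first read after a write touches the database.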
Message Push
According to 2015 data, push notifications raise DAU by about 20%, and turning them off lowers DAU by about 10%. Monitored metrics include click-through rate, click count, app uninstall rate, and push-disable rate. Push content is personalized by frequency, content, region, and interest, for example by targeting specific cities or interest groups.
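Targeting by region and interest with a frequency limit can be sketched as a simple filter. The `PushUser` record, the `select_recipients` function, and the default cap of three pushes per day are hypothetical choices for the example.

```python
from dataclasses import dataclass


@dataclass
class PushUser:
    user_id: str
    city: str
    interests: frozenset
    push_enabled: bool = True
    sent_today: int = 0  # pushes already delivered to this user today


def select_recipients(users, city=None, interest=None, daily_cap=3):
    """Return ids of users eligible for a push targeted by city and/or interest."""
    recipients = []
    for user in users:
        if not user.push_enabled or user.sent_today >= daily_cap:
            continue  # respect opt-outs and the per-day frequency cap
        if city is not None and user.city != city:
            continue
        if interest is not None and interest not in user.interests:
            continue
        recipients.append(user.user_id)
    return recipients
```

A frequency cap of this kind is one plausible lever for keeping uninstall and push-disable rates down.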
Push infrastructure requires fast, reliable, resource‑efficient channels, A/B testing support, and a backend that provides daily reports.
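A/B testing support usually starts with a stable assignment of users to variants. The sketch below hashes the user id together with an experiment name, which is a common technique but an assumption here; `assign_variant`, `click_through_rate`, and `daily_report` are hypothetical helpers.

```python
import hashlib


def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant of the given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def click_through_rate(clicks, delivered):
    return clicks / delivered if delivered else 0.0


def daily_report(experiment, deliveries):
    """Aggregate (user_id, clicked) pairs into per-variant delivery and click counts."""
    report = {}
    for user_id, clicked in deliveries:
        row = report.setdefault(
            assign_variant(user_id, experiment), {"delivered": 0, "clicks": 0}
        )
        row["delivered"] += 1
        row["clicks"] += 1 if clicked else 0
    for row in report.values():
        row["ctr"] = click_through_rate(row["clicks"], row["delivered"])
    return report
```

Because the assignment depends only on the user id and the experiment name, a user sees the same variant on every push, and different experiments split users independently.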
System Architecture
Micro‑service Architecture
Toutiao decomposes large applications into smaller services, reusing common layers. The layered architecture emphasizes infrastructure to enable rapid iteration, fault tolerance, and easier business‑level changes.
Virtualized PaaS Platform
A three‑layer PaaS platform manages resources uniformly, offering SaaS services and a generic app execution engine atop an IaaS layer. Public cloud resources are abstracted to handle high‑bandwidth events, while logging, monitoring, and other services are provided as infrastructure capabilities.
Summary
Key components of Toutiao’s platform include data generation and collection, Kafka‑based message bus linking online and offline systems, ETL pipelines, and data warehouses. Query engines span batch, MPP, and cube processing, all supporting efficient analytics for personalized news delivery.
