Inside Toutiao’s Massive Data Pipeline and Real‑Time Recommendation Engine
This article details how Toutiao processes billions of daily page views, builds user models with Hadoop and Storm, runs real‑time recommendation and cold‑start personalization, and scales its microservice‑based architecture using Kafka, MySQL, MongoDB, Redis and a high‑throughput push system.
Overview
Toutiao is a personalized news feed platform that processes millions of articles and billions of user actions daily. Its core infrastructure supports real‑time data collection, large‑scale user modeling, and high‑throughput recommendation and push services.
Article Crawling and Text Analysis
Engineers implement crawlers to ingest roughly 10,000 original news items per day from news sites, blogs, and other sources. After automated collection, a manual review step filters sensitive content. The text pipeline extracts categories, tags, topics, regional relevance, and popularity scores for each article.
User Modeling Pipeline
User actions are streamed in real time through Scribe, Flume, and Kafka. Batch learning runs on Hadoop and stream learning runs on Storm; together they generate per-user interest vectors. Model data is persisted in sharded MySQL or MongoDB clusters with read‑write separation and cached in Memcached or Redis. By 2015 the modeling cluster comprised ~7,000 machines and captured dimensions such as subscription preferences, tag interests, and partial article push signals.
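As a rough illustration of how a stream of actions could be folded into an interest vector, here is a minimal sketch. The action weights, the decay factor, and the modulo sharding rule are all assumptions made for the example; the source does not describe Toutiao's actual model or sharding scheme.

```python
from collections import defaultdict

# Hypothetical action weights; the article does not specify any.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}


def update_interests(vector, action, tags, decay=0.9):
    """Return a new interest vector after one user action.

    Existing tag scores are multiplied by `decay` so older interests fade,
    then the weight of the action is added to every tag on the article.
    """
    updated = defaultdict(float)
    for tag, score in vector.items():
        updated[tag] = score * decay
    weight = ACTION_WEIGHTS.get(action, 0.0)  # unknown actions contribute nothing
    for tag in tags:
        updated[tag] += weight
    return dict(updated)


def shard_for(user_id, num_shards=16):
    """Pick a storage shard for a user (simple modulo sharding, assumed)."""
    return user_id % num_shards


if __name__ == "__main__":
    vector = update_interests({}, "click", ["sports"])
    vector = update_interests(vector, "share", ["tech"])
    print(vector, "-> shard", shard_for(35))
```

In a setup like the one described, a batch job would recompute vectors from full history while a streaming job applies incremental updates of this kind between batch runs.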
Cold‑Start for New Users
When a user first opens the app, the system gathers device type, OS version, and social‑login information (e.g., Weibo). It builds an initial profile from the user’s friends, followers, and recent social activity, as well as installed apps, browser bookmarks, and channel subscriptions.
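One way to picture the cold-start step is as a weighted merge of tags derived from each available signal. The sketch below assumes that every signal source has already been mapped to a list of interest tags; the source names and weights are hypothetical and not taken from the article.

```python
from collections import Counter


def cold_start_profile(signals, source_weights):
    """Combine tag lists from several signal sources into a normalized profile.

    `signals` maps a source name to the tags inferred from it;
    `source_weights` maps a source name to how much that source is trusted.
    Sources without a weight are ignored.
    """
    scores = Counter()
    for source, tags in signals.items():
        weight = source_weights.get(source, 0.0)
        for tag in tags:
            scores[tag] += weight
    total = sum(scores.values())
    if total == 0:
        return {}  # nothing usable: caller falls back to a default feed
    return {tag: score / total for tag, score in scores.items()}


if __name__ == "__main__":
    signals = {
        "installed_apps": ["finance", "sports"],
        "social_activity": ["sports"],
    }
    weights = {"installed_apps": 1.0, "social_activity": 2.0}
    print(cold_start_profile(signals, weights))
```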
Recommendation Engine
The recommendation engine consists of two complementary subsystems:
Automatic recommendation – generates candidate articles, matches them to user attributes (e.g., location), and creates push tasks automatically.
Semi‑automatic recommendation – selects candidates based on explicit user actions inside and outside the platform.
Channels are organized into classification, interest‑tag, keyword, and text‑analysis groups, managed by >300 classifiers that evolve continuously.
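The candidate-matching step can be sketched as a filter on user attributes followed by a score against the interest vector. This is only an illustration: the dictionary fields (`region`, `interests`, `tags`, `id`) and the sum-of-tag-scores ranking are assumptions, not the engine's real data model or ranking function.

```python
def recommend(user, articles, top_k=3):
    """Return the ids of the best-matching articles for a user.

    Articles tied to a region are dropped unless it matches the user's
    region; the remaining ones are scored by summing the user's interest
    in each article tag.
    """
    candidates = []
    for article in articles:
        region = article.get("region")
        if region is not None and region != user["region"]:
            continue  # attribute match failed (e.g., local news elsewhere)
        score = sum(user["interests"].get(tag, 0.0) for tag in article["tags"])
        if score > 0:
            candidates.append((score, article["id"]))
    # Highest score first; ties broken by id for deterministic output.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    return [article_id for _, article_id in candidates[:top_k]]


if __name__ == "__main__":
    user = {"region": "beijing", "interests": {"sports": 0.75, "finance": 0.25}}
    articles = [
        {"id": "a1", "tags": ["sports"], "region": None},
        {"id": "a2", "tags": ["finance"], "region": "shanghai"},
        {"id": "a3", "tags": ["finance", "sports"], "region": "beijing"},
        {"id": "a4", "tags": ["travel"], "region": None},
    ]
    print(recommend(user, articles))
```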
Data Storage
Persistent storage uses a hybrid of MySQL and MongoDB with read‑write separation, complemented by in‑memory caches (Memcached, Redis). Images are stored in the database and served through a CDN. High‑performance workloads leverage SSD‑based storage.
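The combination of read‑write separation and caching is commonly implemented as a cache-aside pattern. The sketch below uses plain dictionaries as stand-ins for the primary database, a read replica, and the cache, and treats replication as instantaneous; the class and its methods are hypothetical and only show the access pattern.

```python
class ProfileStore:
    """Cache-aside reads over a primary/replica pair (in-memory stand-ins)."""

    def __init__(self):
        self.primary = {}       # stands in for the MySQL/MongoDB write node
        self.replica = {}       # stands in for a read replica
        self.cache = {}         # stands in for Memcached/Redis
        self.replica_reads = 0  # counts cache misses that hit the replica

    def write(self, user_id, profile):
        """Write to the primary and invalidate the cached copy."""
        self.primary[user_id] = profile
        self.replica[user_id] = profile  # replication modelled as instantaneous
        self.cache.pop(user_id, None)

    def read(self, user_id):
        """Serve from cache when possible, otherwise from the replica."""
        if user_id in self.cache:
            return self.cache[user_id]
        self.replica_reads += 1
        profile = self.replica.get(user_id)
        if profile is not None:
            self.cache[user_id] = profile
        return profile


if __name__ == "__main__":
    store = ProfileStore()
    store.write(1, {"sports": 1.0})
    store.read(1)
    store.read(1)
    print("replica reads:", store.replica_reads)
```

With real replicas, replication lag means a read immediately after a write can return stale data, so systems of this kind often route read-your-own-write requests to the primary.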
Message Push Service
Push notifications are delivered via a high‑concurrency pipeline capable of reaching billions of users. Metrics tracked include click‑through rate, click count, app uninstallations, and push disablements. Personalization dimensions cover frequency, content, geographic region, and interest categories.
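The tracked metrics can be turned into comparable rates per push campaign. The following sketch assumes each campaign reports four raw counters; the field names and the choice of delivered pushes as the denominator are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PushStats:
    """Raw counters for one push campaign (hypothetical field names)."""
    delivered: int
    clicks: int
    uninstalls: int
    disables: int

    def rate(self, count):
        """Fraction of delivered pushes that led to `count` events."""
        return count / self.delivered if self.delivered else 0.0

    def summary(self):
        """Return the three rates as a dictionary."""
        return {
            "click_through_rate": self.rate(self.clicks),
            "uninstall_rate": self.rate(self.uninstalls),
            "disable_rate": self.rate(self.disables),
        }


if __name__ == "__main__":
    print(PushStats(delivered=1000, clicks=50, uninstalls=2, disables=5).summary())
```

Tracking uninstall and disable rates alongside click-through rate makes it possible to spot campaigns that attract clicks but also push users away.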
System Architecture
The platform follows a layered design to enable rapid iteration, disaster recovery, and horizontal scalability.
Messaging bus: Kafka connects online services with offline batch pipelines.
ETL pipelines: Extract‑Transform‑Load jobs move raw logs into data warehouses.
Query engines: these support batch, MPP (Massively Parallel Processing), and cube‑style analytics for low‑latency reporting.
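To make the ETL step concrete, here is a minimal sketch that parses raw log lines, aggregates them, and emits warehouse-style rows. It assumes logs are JSON lines with `article_id` and `action` fields, which the article does not state; a real pipeline would read from Kafka and write to a warehouse rather than work on in-memory lists.

```python
import json
from collections import Counter


def extract(lines):
    """Parse JSON log lines, skipping any that are malformed."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def transform(events):
    """Count events per (article_id, action) pair."""
    counts = Counter()
    for event in events:
        if isinstance(event, dict) and "article_id" in event and "action" in event:
            counts[(event["article_id"], event["action"])] += 1
    return counts


def load(counts):
    """Turn the aggregated counts into sorted rows ready for insertion."""
    return [
        {"article_id": article_id, "action": action, "count": n}
        for (article_id, action), n in sorted(counts.items())
    ]


if __name__ == "__main__":
    raw = [
        '{"article_id": "a1", "action": "click"}',
        "not json",
        '{"article_id": "a1", "action": "click"}',
        '{"article_id": "a2", "action": "share"}',
    ]
    print(load(transform(extract(raw))))
```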
Key infrastructure components are illustrated below:
Microservice Architecture
Toutiao decomposes large monolithic services into fine‑grained microservices. A common abstraction layer provides reusable code and shared infrastructure (logging, monitoring, configuration). This enables independent team ownership and faster feature rollout.
Virtualization PaaS Platform
The platform is built on a three‑layer model:
IaaS layer – abstracts physical machines and public‑cloud resources, providing unified compute, storage, and network APIs.
PaaS layer – offers common platform services (e.g., logging, monitoring) and a generic application execution engine.
SaaS layer – hosts business‑level services such as the recommendation engine and push system.
This architecture allows Toutiao to scale bandwidth‑intensive push campaigns by leveraging both private data centers and public‑cloud capacity.
ITPUB
