
Inside Toutiao’s Massive Data Pipeline: Architecture, Recommendation & Scaling

This article details Toutiao’s rapid growth and its large‑scale data pipeline, covering article crawling, user modeling, recommendation engines, storage solutions, push notifications, micro‑service architecture, and the underlying virtualization PaaS platform that powers its personalized news service.

21CTO

1. Product Background

Toutiao was founded in March 2012. In four years it grew from a dozen engineers to over 200 staff, expanding product lines from jokes to news, special sales, movies, and more.

2. Article Crawling and Analysis

Toutiao collects about 10,000 original news articles daily from various sites, plus novels and blogs. Crawlers fetch the content, and sensitive articles are filtered manually. Text analysis covers classification, tagging, and topic extraction, and each article is weighted by region, popularity, and other factors.
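The weighting step can be illustrated with a minimal sketch. The article only says that region, popularity, and "other factors" are used; the formula below, the `region_boost` and `half_life_hours` parameters, and the use of freshness as the extra factor are all assumptions made for illustration, not Toutiao's actual scoring.

```python
import math
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    region: str       # region the article is about
    views: int        # popularity signal
    age_hours: float  # time since publication


def article_weight(article: Article, user_region: str,
                   region_boost: float = 1.5,
                   half_life_hours: float = 24.0) -> float:
    """Combine popularity, regional match and freshness into one weight.

    All three terms and their constants are hypothetical.
    """
    popularity = math.log1p(article.views)                    # dampen very large counts
    freshness = 0.5 ** (article.age_hours / half_life_hours)  # halves every half-life
    boost = region_boost if article.region == user_region else 1.0
    return popularity * freshness * boost


local = Article("Subway line opens", "beijing", views=1000, age_hours=0.0)
remote = Article("Harbour festival", "shanghai", views=1000, age_hours=0.0)
```

With equal views and age, `article_weight(local, "beijing")` comes out 1.5 times larger than `article_weight(remote, "beijing")`, which is the regional preference this sketch is meant to show.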

3. User Modeling

Real‑time logs of user actions are collected with tools such as Scribe, Flume, and Kafka. User interests are learned using Hadoop and Storm. Model data is stored in MySQL/MongoDB (with read‑write separation) and cached in Memcached/Redis. By 2015 the user‑model cluster had about 7,000 machines.

User subscriptions

Tags

Partial article push
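As a rough sketch of how action logs could feed an interest model, the function below updates a per‑user tag profile one event at a time. It runs in memory and does not use Kafka or Storm; the event shape, the `ACTION_WEIGHTS` table, and the decay factor are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical weights: how strongly each action signals interest.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}


def update_profile(profile, event, decay=0.9):
    """Decay existing interests, then add the weight of the new action."""
    for tag in list(profile):
        profile[tag] *= decay                       # older interests fade
    weight = ACTION_WEIGHTS.get(event["action"], 0.0)  # unknown actions count as 0
    for tag in event["tags"]:
        profile[tag] += weight
    return profile


profile = defaultdict(float)
events = [
    {"user": "u1", "action": "click", "tags": ["sports"]},
    {"user": "u1", "action": "share", "tags": ["tech"]},
]
for event in events:
    update_profile(profile, event)
```

In a streaming setup the same update would run inside the stream processor for each consumed log record, with the resulting profile written back to the model store.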

4. Cold‑Start for New Users

New users are identified by device, OS, version, and social‑login information (e.g., Weibo). Friend relationships, followers, and content interactions are used to build an initial profile.
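One simple way to turn social‑login data into an initial profile is to average the tag weights of the new user's friends. This is a sketch under that assumption; the function name, the profile format, and the `fallback` profile used when no social data exists are hypothetical.

```python
from collections import defaultdict


def initial_profile(friend_profiles, fallback):
    """Average friends' tag weights; use the fallback when there are no friends."""
    if not friend_profiles:
        return dict(fallback)
    totals = defaultdict(float)
    for friend in friend_profiles:
        for tag, weight in friend.items():
            totals[tag] += weight
    count = len(friend_profiles)
    # A tag missing from a friend's profile counts as weight 0 for that friend.
    return {tag: total / count for tag, total in totals.items()}


friends = [{"sports": 1.0}, {"sports": 0.5, "tech": 1.0}]
starter = initial_profile(friends, fallback={"news": 1.0})
```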

5. Recommendation System

The recommendation system is the core of Toutiao’s architecture. It includes automatic and semi‑automatic recommendation.

Automatic recommendation

Candidate generation

Automatic matching based on location and extracted user info

Automatic push task creation

Semi‑automatic recommendation

Automatic candidate selection

Ranking based on user actions inside and outside the app
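The two stages listed above, candidate generation followed by ranking, can be sketched as follows. The matching rule (same region or at least one shared tag) and the scoring rule (sum of the user's interest weights over an article's tags) are assumptions chosen to keep the example small; the article does not describe the actual algorithms.

```python
def generate_candidates(articles, user):
    """Keep articles that match the user's region or share at least one tag."""
    interests = set(user["interests"])
    return [a for a in articles
            if a["region"] == user["region"] or interests & set(a["tags"])]


def rank(candidates, user):
    """Order candidates by the sum of the user's interest weights over their tags."""
    def score(article):
        return sum(user["interests"].get(tag, 0.0) for tag in article["tags"])
    return sorted(candidates, key=score, reverse=True)


user = {"region": "beijing", "interests": {"tech": 0.9, "sports": 0.2}}
articles = [
    {"id": 1, "region": "shanghai", "tags": ["finance"]},
    {"id": 2, "region": "beijing", "tags": ["sports"]},
    {"id": 3, "region": "shanghai", "tags": ["tech"]},
]
ranked_ids = [a["id"] for a in rank(generate_candidates(articles, user), user)]
```

Article 1 is dropped at the candidate stage because it matches neither the region nor any interest; the remaining two are ordered by interest weight.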

6. Data Storage

Persistent storage uses MySQL or MongoDB together with Memcached/Redis. Images are stored in the database and served via CDN.
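A common way to pair a database with a cache is the cache‑aside pattern, sketched below with plain dictionaries standing in for MySQL/MongoDB and Memcached/Redis. The article does not say which caching pattern Toutiao uses, so the pattern and the `ArticleStore` class are assumptions for illustration.

```python
class ArticleStore:
    """Cache-aside reads over a persistent store (both simulated with dicts)."""

    def __init__(self, db):
        self.db = db        # stand-in for MySQL/MongoDB
        self.cache = {}     # stand-in for Memcached/Redis
        self.db_reads = 0   # counts how often the database is hit

    def get(self, article_id):
        if article_id in self.cache:
            return self.cache[article_id]       # cache hit
        self.db_reads += 1                      # cache miss: read from the database
        value = self.db.get(article_id)
        if value is not None:
            self.cache[article_id] = value      # populate the cache for later reads
        return value

    def put(self, article_id, value):
        self.db[article_id] = value
        self.cache.pop(article_id, None)        # invalidate the stale cache entry


store = ArticleStore({1: "first draft"})
store.get(1)
store.get(1)  # second read is served from the cache
```

Invalidating on write rather than updating the cache keeps the sketch simple: the next read repopulates the entry from the database.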

7. Message Push

Push notifications increase DAU by about 20%; without push, DAU drops by about 10% (2015 data). Tracked metrics include click‑through rate, click volume, and the numbers of app uninstalls and push disables. Pushes are personalized by frequency, content, region, and interests.
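The push metrics named above can be computed from raw counts as in this sketch. The function name, argument names, and the choice of delivered pushes as the denominator are assumptions; the article lists the metrics but not their exact definitions.

```python
def push_metrics(delivered, clicks, uninstalls, push_disabled):
    """Summarize one push campaign from raw counts."""
    if delivered <= 0:
        raise ValueError("delivered must be positive")
    return {
        "click_through_rate": clicks / delivered,
        "click_volume": clicks,
        "uninstall_rate": uninstalls / delivered,
        "push_disable_rate": push_disabled / delivered,
    }


report = push_metrics(delivered=1000, clicks=50, uninstalls=2, push_disabled=5)
```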

8. System Architecture

Toutiao splits monolithic applications into micro‑services, with a common abstraction layer for code reuse. The architecture consists of three layers: infrastructure, platform services, and business services. Diagrams illustrate the overall layout.

9. Virtualization PaaS Platform

A three‑layer PaaS platform manages resources: a unified SaaS layer, a generic application execution engine, and an IaaS layer that aggregates public‑cloud resources. This allows high‑bandwidth events to be served efficiently.

10. Summary

Key points: data generation & collection, Kafka as the message bus, ETL pipelines, and three query engine modes (batch, MPP, cube) used for efficient data analysis.
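Of the three query modes, the cube mode is the easiest to illustrate in a few lines: aggregates are precomputed for every combination of dimensions so that later queries become lookups. The sketch below counts rows per dimension combination in memory; the `build_cube` function and the sample rows are hypothetical and do not reflect any specific engine.

```python
from collections import Counter
from itertools import combinations


def build_cube(rows, dimensions):
    """Precompute row counts for every subset of the given dimensions."""
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            # Key each row by its values on the chosen dimensions.
            cube[dims] = Counter(tuple(row[d] for d in dims) for row in rows)
    return cube


rows = [
    {"region": "beijing", "category": "tech"},
    {"region": "beijing", "category": "sports"},
    {"region": "shanghai", "category": "tech"},
]
cube = build_cube(rows, ("region", "category"))
```

A query such as "reads in Beijing" then becomes `cube[("region",)][("beijing",)]`, at the cost of storing one table per dimension subset.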
