Inside Toutiao’s Architecture at 48 Million Daily Active Users: Data, Recommendations & Scaling

This article dissects Toutiao’s rapid growth from a small startup to a platform with over 500 million registered users, detailing its data collection pipeline, user‑modeling techniques, recommendation engine, micro‑service architecture, PaaS infrastructure, storage strategies, and push‑notification system.

IT Architects Alliance

Product Background

Founded in March 2012, Toutiao expanded from a handful of engineers to over 200 employees within four years, launching product lines such as Jinri Toutiao, Jinri TeMai, and Jinri Movies. The platform now serves more than 500 million registered users, with 48 million daily active users, 500 million daily page views, and an average user session exceeding 65 minutes.

Article Crawling and Analysis

Each day Toutiao ingests roughly 10,000 original news items from news sites, blogs, and novels. Engineers build crawlers to collect these articles, after which a manual review filters out sensitive content. Automated text analysis then extracts classifications, tags, topics, regional information, popularity scores, and weighting factors.
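To make the text-analysis step concrete, here is a minimal sketch of assigning a category and tags to a crawled article. The keyword lists, the `analyze_article` function, and the keyword-counting approach are all assumptions for illustration; the article does not describe which classifiers Toutiao actually uses.

```python
import re
from collections import Counter

# Hypothetical keyword lists per category; a production system would
# learn these from labeled data rather than hard-code them.
CATEGORY_KEYWORDS = {
    "sports": {"match", "league", "goal", "team"},
    "finance": {"stock", "market", "bank", "shares"},
    "tech": {"software", "chip", "startup", "app"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def analyze_article(text, top_n=3):
    """Return a category and up to top_n tags for one article."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    # Score each category by how often its keywords occur in the text.
    scores = {
        category: sum(counts[word] for word in words)
        for category, words in CATEGORY_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    category = best if scores[best] > 0 else "unknown"
    # Tags are the most frequent tokens that appear in any keyword list.
    known = set().union(*CATEGORY_KEYWORDS.values())
    tag_counts = Counter(t for t in tokens if t in known)
    tags = [word for word, _ in tag_counts.most_common(top_n)]
    return {"category": category, "tags": tags}
```

Calling `analyze_article` on a crawled article body returns a small dictionary that downstream components could attach to the article record before it reaches the recommendation stage.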

User Modeling

Real‑time user action logs are processed using Scribe, Flume, and Kafka. Interest mining leverages Hadoop and Storm. Model data are persisted in sharded MySQL/MongoDB clusters with read‑write separation, complemented by Memcached / Redis. Key model dimensions include user subscriptions, tags, and partial article push decisions.
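As a rough illustration of interest mining over action logs, the sketch below folds a batch of user events into per-user tag weights with exponential time decay. The event fields, the action weights, and the seven-day half-life are assumptions, and a plain Python iterable stands in for the Kafka or Storm stream described above.

```python
from collections import defaultdict

# Hypothetical weight per action type; negative feedback lowers interest.
ACTION_WEIGHTS = {"click": 1.0, "share": 3.0, "dislike": -2.0}

# Assumed half-life: an action loses half its influence after seven days.
HALF_LIFE_SECONDS = 7 * 24 * 3600


def update_interests(events, now):
    """Aggregate events into {user_id: {tag: weight}}.

    Each event is a dict with the keys user_id, action, tags, and ts
    (a Unix timestamp in seconds).
    """
    profiles = defaultdict(lambda: defaultdict(float))
    for event in events:
        age = max(0.0, now - event["ts"])
        decay = 0.5 ** (age / HALF_LIFE_SECONDS)
        weight = ACTION_WEIGHTS.get(event["action"], 0.0) * decay
        for tag in event["tags"]:
            profiles[event["user_id"]][tag] += weight
    return profiles
```

The resulting tag weights are the kind of model data that could be persisted in the sharded stores and served from cache.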

Cold‑Start for New Users

When a new user registers, Toutiao captures device type, operating system, app version, and social‑login information (e.g., Weibo). It builds an initial profile from friends, followers, posted content, and interactions, as well as from installed apps, device‑specific usage patterns, and bookmarked browser data.
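One way to picture the cold-start step is as a merge of several weak signals into an initial tag profile. The sketch below is an assumption-laden example: the `NewUserSignals` fields, the app-to-interest mapping, and the two weights are hypothetical and not taken from Toutiao's system.

```python
from dataclasses import dataclass, field

# Hypothetical mapping from installed app package names to interest tags.
APP_INTERESTS = {
    "com.example.stocks": ["finance"],
    "com.example.football": ["sports"],
}


@dataclass
class NewUserSignals:
    device_type: str
    os: str
    app_version: str
    social_tags: list = field(default_factory=list)     # mined from social login
    installed_apps: list = field(default_factory=list)  # package names


def build_initial_profile(signals, social_weight=1.0, app_weight=0.5):
    """Combine social and installed-app signals into {tag: weight}."""
    profile = {}
    for tag in signals.social_tags:
        profile[tag] = profile.get(tag, 0.0) + social_weight
    for app in signals.installed_apps:
        # Apps without a known mapping contribute nothing.
        for tag in APP_INTERESTS.get(app, []):
            profile[tag] = profile.get(tag, 0.0) + app_weight
    return profile
```

Social signals are weighted higher than installed apps here only as a design placeholder; in practice such weights would be tuned against observed engagement.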

Recommendation System

The core recommendation engine consists of two parts:

Automatic recommendation: candidate generation, user‑location matching, and automatic push‑task creation, requiring ultra‑high‑throughput delivery to hundreds of millions of users.

Semi‑automatic recommendation: candidate selection based on in‑app and out‑of‑app actions, with personalized channels (category, interest tags, keywords, text analysis) managed by separate development teams. Over 300 classifiers are in production, and legacy models continue to operate alongside newer ones.
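The two-stage flow described above, candidate generation followed by personalized ranking, can be sketched as follows. This is a simplified illustration: the article fields, the region filter, and the interest-times-popularity score are assumptions and do not represent any of the production classifiers.

```python
def generate_candidates(articles, user_region):
    """Keep articles that are global (region None) or match the user's region."""
    return [a for a in articles if a["region"] in (None, user_region)]


def score(article, interests):
    """Sum the user's interest weights over the article's tags, scaled by popularity."""
    relevance = sum(interests.get(tag, 0.0) for tag in article["tags"])
    return relevance * article.get("popularity", 1.0)


def recommend(articles, interests, user_region, k=2):
    """Return the ids of the top-k candidate articles for one user."""
    candidates = generate_candidates(articles, user_region)
    ranked = sorted(candidates, key=lambda a: score(a, interests), reverse=True)
    return [a["id"] for a in ranked[:k]]
```

Separating candidate generation from scoring keeps the expensive ranking step limited to a small, location-appropriate subset of the article pool.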

Data Storage

Persistent storage combines MySQL or MongoDB with Memcached / Redis, often using large in‑memory pools and SSDs. Images are stored directly in the database and served via a CDN.
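A common way to pair a database with Memcached or Redis is the cache-aside pattern, sketched below. The article does not state which caching pattern Toutiao uses, so this is an assumption; plain dictionaries stand in for both the database and the cache, and the `CacheAsideStore` class is hypothetical.

```python
import time


class CacheAsideStore:
    """Read through a TTL cache; invalidate the cache entry on write."""

    def __init__(self, database, ttl_seconds=60.0, clock=time.monotonic):
        self._db = database      # dict standing in for MySQL/MongoDB
        self._cache = {}         # key -> (value, expires_at); stands in for Redis
        self._ttl = ttl_seconds
        self._clock = clock
        self.db_reads = 0        # counter to make cache behavior observable

    def get(self, key):
        entry = self._cache.get(key)
        if entry is not None and entry[1] > self._clock():
            return entry[0]      # cache hit
        # Cache miss or expired entry: fall back to the database.
        self.db_reads += 1
        value = self._db.get(key)
        if value is not None:
            self._cache[key] = (value, self._clock() + self._ttl)
        return value

    def put(self, key, value):
        # Write to the database first, then drop the stale cache entry.
        self._db[key] = value
        self._cache.pop(key, None)
```

Invalidating on write rather than updating the cache in place keeps the logic simple, at the cost of one extra database read after each update.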

Message Push

Push notifications boost user activity; internal data shows a ~20% increase in DAU after push, while the absence of push can reduce DAU by ~10% (2015 data). ROI metrics include click‑through rate, click volume, app uninstall counts, and push‑disable rates. Personalization dimensions cover frequency, content, geographic location, and user interests. The push platform must be fast, reliable, resource‑efficient, and provide real‑time dashboards, A/B testing, and easy API integration. Large‑scale pushes sometimes leverage public‑cloud services to alleviate bandwidth pressure.
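The ROI metrics listed above reduce to simple ratios over delivery counts. The sketch below shows one way to compute them; the `PushCampaignStats` fields and the choice of delivered pushes as the denominator are assumptions, since the article does not define the formulas.

```python
from dataclasses import dataclass


@dataclass
class PushCampaignStats:
    delivered: int      # pushes that reached a device
    clicks: int         # pushes the user tapped
    uninstalls: int     # app uninstalls attributed to the campaign
    push_disabled: int  # users who turned notifications off afterwards


def push_roi(stats):
    """Return click-through, uninstall, and push-disable rates per delivered push."""
    if stats.delivered <= 0:
        raise ValueError("delivered must be positive")
    return {
        "click_through_rate": stats.clicks / stats.delivered,
        "click_volume": stats.clicks,
        "uninstall_rate": stats.uninstalls / stats.delivered,
        "disable_rate": stats.push_disabled / stats.delivered,
    }
```

Tracking uninstall and disable rates next to click-through rate makes the cost of overly aggressive pushing visible alongside its benefit.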

System Architecture Overview

Micro‑Service Architecture

Toutiao decomposes monolithic applications into smaller services, reusing common layers for code sharing. The layered design emphasizes infrastructure that enables rapid iteration, disaster recovery, and independent business team development.

Virtualized PaaS Platform

The platform adopts a three‑layer model: an IaaS layer managing physical machines and public‑cloud resources, a PaaS layer providing unified SaaS services and a generic app execution engine, and an upper layer offering domain‑specific services. Public‑cloud resources are leveraged for bandwidth‑intensive events, abstracted as unified compute resources.

Key Takeaways

Data generation and collection at massive scale.

Data transmission via Kafka as a message bus linking online and offline systems.

Data ingestion through ETL pipelines into data warehouses.

Data computation using three query engine patterns: batch, MPP, and cube, all employed by Toutiao.
