
Xiaomi Streaming Platform: Evolution, Architecture, and Flink‑Based Real‑Time Data Warehouse

The article details Xiaomi's unified streaming data platform, its three‑generation evolution from Scribe/Kafka/Storm to Talos and Flink, the current architecture supporting billions of records daily, and future plans to unify offline and real‑time warehousing with Flink SQL.

Big Data Technology Architecture

Xiaomi operates numerous business lines (information flow, e‑commerce, advertising, finance) and has built a unified streaming data platform providing data collection, integration, and real‑time computation. The platform handles about 1.2 trillion records per day across 15,000 real‑time sync tasks, with streaming computation covering roughly 1 trillion records daily.

The platform has undergone three major upgrades; the latest iteration is based on Apache Flink, replacing the previous Spark Streaming implementation.

Key functional modules include streaming data storage (a proprietary message queue called Talos, similar to Kafka), data ingestion and dumping, and data processing using Flink, Spark Streaming, and Storm.

The overall architecture consists of data sources (user logs and databases such as MySQL and HBase) feeding Talos, the central message queue, through its Producer and Consumer SDKs. Talos Source handles full‑scene data collection, while Talos Sink provides low‑latency data export; future work will rebuild Talos Sink with Flink SQL.

Business scale is massive: about 1.2 trillion messages per day, peak traffic of 43 million messages per second, 1,500 dump jobs moving 1.6 PB daily, and over 800 streaming compute jobs (200+ of them Flink jobs) processing more than 700 billion messages per day.

Platform history: Streaming Platform 1.0 (2010) used Scribe, Kafka, Storm; 2.0 introduced Talos, Spark Streaming, and a star‑topology with multi‑source/multi‑sink, configuration and package management, and end‑to‑end monitoring; 3.0 added full‑link schema support, Flink, and Stream SQL.

Limitations of 2.0 motivated the migration to Flink: the lack of schema management, a non‑customizable Talos Sink, and Spark Streaming's missing event‑time support and exactly‑once semantics.

The Flink‑based redesign focuses on full‑link schema validation, leveraging the Flink engine and community to migrate existing jobs, productizing streaming through job and SQL management, and rebuilding Talos Sink with Flink SQL so that custom business logic can be expressed per sink.

Job management now offers full lifecycle, permission, tagging, history, status, and latency monitoring, with automatic restart of failed jobs.

SQL management workflow: external tables → SQL DDL (Table Schema, Format, Connector) → SQL Config (DDL + DML) → Job Config (resource & Flink state settings) → JobGraph (submitted to Flink cluster). This includes automatic schema conversion, connector property handling, and job graph generation.
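To illustrate the first step of this workflow, an external table could be registered with a Flink SQL DDL statement that declares the Table Schema, Format, and Connector together. This is a minimal sketch; the Talos connector name and its property keys below are assumptions for illustration, not the platform's actual configuration:

```sql
-- Hypothetical external-table DDL: the platform converts a declaration
-- like this into a Table Schema plus connector properties, which the
-- SQL Config and Job Config stages then turn into a JobGraph.
CREATE TABLE user_events (
    user_id    BIGINT,
    event_type STRING,
    event_time TIMESTAMP(3),
    WATERMARK FOR event_time AS event_time - INTERVAL '5' SECOND
) WITH (
    'connector' = 'talos',          -- assumed connector identifier
    'topic'     = 'user_events',    -- assumed property key
    'format'    = 'json'
);
```

The SQL Config stage would then pair such DDL with DML, and the Job Config stage would add resource and Flink state settings before job graph generation.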

Talos Sink supports three modes: Row (direct write), ID mapping (field mapping), and SQL (logic expressed via SQL).
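A minimal sketch of the SQL mode, assuming a source table like the hypothetical `user_events` DDL above has been registered; the sink table and filter condition are illustrative only:

```sql
-- SQL mode: business logic is expressed as a query, in contrast to
-- Row mode (a direct byte-for-byte write) and ID mapping mode
-- (a field-to-field rename with no transformation logic).
INSERT INTO dwd_click_events
SELECT user_id, event_time
FROM user_events
WHERE event_type = 'click';
```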

Future plans include continued Flink streaming job and platform development, unifying offline and real‑time warehouses with Flink SQL, schema‑based data lineage and governance, and active participation in the Flink open‑source community.

Author: Xia Jun, head of Xiaomi Streaming Platform, responsible for streaming computation, message queues, and big‑data integration, working with Flink, Spark Streaming, Storm, Kafka, and related in‑house systems.

Tags: Big Data, Real-time Processing, Flink, data warehouse, Xiaomi, Streaming Platform
Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies)