Inside Baidu’s Real-Time Big Data Platforms: Dstream and TM Explained
This article examines Baidu’s home‑grown real‑time big‑data platforms Dstream and TM, detailing their architectures, performance metrics, key features, and practical use cases such as log ETL and real‑time bidding, while highlighting how they meet millisecond‑level processing demands.
Introduction
In the internet industry, daily data volumes have grown well past the terabyte scale, while many analytics workloads now need answers within seconds or even milliseconds. Traditional batch processing cannot meet these latency requirements, which has prompted the adoption of dedicated streaming systems such as Google MillWheel, Twitter Storm, and Spark Streaming.
Baidu’s Real‑Time Computing Platforms
Dstream
Dstream was built before open‑source equivalents such as Storm were available. Today its clusters exceed a thousand nodes, handle more than 50 TB of data per day, and reach a peak of 1.93 million QPS. It targets directed‑acyclic‑graph (DAG) workloads such as real‑time click‑through‑rate (CTR) calculation, delivering millisecond‑level responses.
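To make the DAG idea concrete, here is a minimal sketch of a CTR pipeline written as chained Python generators. It is a linear chain, the simplest possible DAG, and it is not Dstream's API: the stage names (`parse`, `count_events`, `ctr`, `latest_ctr`) and the `ad_id,event` line format are assumptions made for illustration.

```python
from collections import defaultdict


def parse(lines):
    """Source stage: turn 'ad_id,event' lines into (ad_id, event) tuples."""
    for line in lines:
        ad_id, event = line.strip().split(",")
        yield ad_id, event


def count_events(events):
    """Aggregation stage: emit running (impressions, clicks) per ad."""
    counts = defaultdict(lambda: [0, 0])
    for ad_id, event in events:
        if event == "impression":
            counts[ad_id][0] += 1
        elif event == "click":
            counts[ad_id][1] += 1
        yield ad_id, tuple(counts[ad_id])


def ctr(updates):
    """Sink stage: convert running counts into a click-through rate."""
    for ad_id, (impressions, clicks) in updates:
        if impressions:  # skip ads that have clicks but no impressions yet
            yield ad_id, clicks / impressions


def latest_ctr(lines):
    """Wire the stages together and keep the most recent CTR per ad."""
    result = {}
    for ad_id, rate in ctr(count_events(parse(lines))):
        result[ad_id] = rate
    return result
```

Each record flows through all three stages as soon as it arrives, which is what distinguishes this style from a batch job that waits for the full input. A production system would additionally partition the stages across machines and persist the counters.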
TM Platform
Started in 2013, TM uses a queue‑worker model for near‑real‑time workflows. It supports response times from seconds to minutes and provides transactional semantics that guarantee no data loss or duplication, even during failures. A single TM cluster processes more than 30 TB per day with a peak of 200,000 QPS. TM’s multi‑window join can span days, enabling large‑scale stream joins across Baidu’s advertising logs.
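One common way to get "no loss, no duplication" out of a queue‑worker design is to retry failed tasks and make the commit step idempotent. The sketch below illustrates that general technique only; the article does not describe TM's actual mechanism, and `TransactionalSink`, `run_worker`, and the `(task_id, payload, attempts)` tuple layout are hypothetical names chosen for this example.

```python
import queue


class TransactionalSink:
    """Stores each task's output at most once, keyed by task id."""

    def __init__(self):
        self.committed = {}

    def commit(self, task_id, output):
        # Idempotent commit: a replayed or duplicated task is ignored.
        if task_id in self.committed:
            return False
        self.committed[task_id] = output
        return True


def run_worker(tasks, sink, process, max_attempts=3):
    """Drain the queue, retrying failed tasks up to max_attempts times.

    Returns the ids of tasks that still failed after all attempts.
    """
    failed = []
    while True:
        try:
            task_id, payload, attempts = tasks.get_nowait()
        except queue.Empty:
            return failed
        try:
            sink.commit(task_id, process(payload))
        except Exception:
            if attempts + 1 < max_attempts:
                # No loss: put the task back so it is tried again.
                tasks.put((task_id, payload, attempts + 1))
            else:
                failed.append(task_id)
```

Retries protect against loss, while the keyed commit protects against duplication when the same task is delivered or processed more than once. In a real deployment both the queue and the sink would need to be durable for the guarantee to survive a process crash.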
Key Features of TM
Data integrity and timeliness : Guarantees no duplicate or lost records while delivering results as quickly as possible.
Long‑span data handling : Supports arbitrary input delays by persisting streams and performing joins over extended windows.
Generality : Supports configurable join windows and retry mechanisms, so that jobs with different latency and completeness requirements can run on the same platform.
High reliability and operability : Multi‑cluster and multi‑datacenter backups, dynamic configuration updates, and automatic fault detection enhance uptime.
Application Cases
Log Real‑Time ETL
Baidu’s unified data warehouse receives data via batch Hadoop ETL or the real‑time system UDW‑RT, which is built on top of the underlying streaming platform. UDW‑RT provides a SQL‑like, extensible stream processing engine that supplies real‑time data to downstream services.
The system consists of three layers:
RT‑importer : Cleans, merges, and structures incoming pipe data into infinite logical streams.
RT‑PE : Executes stream operators using a subset of SQL (e.g., UNION, FILTER, PROJECTION) to produce one or more logical streams for downstream subscription.
RT‑exporter : Applications mount this exporter to consume processed streams.
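The three operators named for RT‑PE can be sketched as lazy Python generators over streams of dictionaries. This is an illustration of the operator semantics only, not UDW‑RT's engine or query syntax; the function names and the sample record fields (`url`, `status`, `ms`) are assumptions.

```python
def union(*streams):
    """UNION: interleave records from several streams, round-robin."""
    iterators = [iter(s) for s in streams]
    while iterators:
        for it in list(iterators):
            try:
                yield next(it)
            except StopIteration:
                iterators.remove(it)  # this input is exhausted


def filter_(stream, predicate):
    """FILTER: keep only the records for which predicate(record) is true."""
    return (record for record in stream if predicate(record))


def projection(stream, fields):
    """PROJECTION: keep only the named fields of each record."""
    return ({f: record[f] for f in fields} for record in stream)


def ok_latencies(*streams):
    """Example pipeline: successful requests, reduced to url and latency."""
    merged = union(*streams)
    ok = filter_(merged, lambda r: r["status"] == 200)
    return projection(ok, ["url", "ms"])
```

Because every operator is a generator, nothing is materialized: a record is pulled through the whole pipeline only when a downstream consumer asks for it, which is what lets the same code run over unbounded input.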
Real‑Time Bidding (RTB)
TM joins two log streams generated by RTB auctions to identify winning bids, feeding results into anti‑fraud, CTR, and billing systems. The architecture decouples front‑end and back‑end services, improving robustness and scalability. Core components include:
Bigpipe : Baidu’s low‑latency distributed messaging system that guarantees no loss or duplication.
Bundler : Subscribes to Bigpipe streams; A_bundler and B_bundler handle different data flows.
Parser : Converts raw text logs to protobuf format.
Aggregator : Merges small files from the parser into larger ones to reduce file count.
Joiner : Performs sliding‑window joins, preserving order within windows and synchronizing timestamps across joiners.
Appender : Publishes joined and non‑joined results back to Bigpipe.
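The Joiner and Appender steps can be approximated with a small in‑memory window join. The sketch below is a simplification under stated assumptions: events arrive in timestamp order, each key appears at most once per stream, and the stream labels `"A"` and `"B"`, the `window_join` function, and the sample auction ids are all hypothetical. TM's real joiner, which spans days and coordinates multiple joiner instances, is not described in enough detail here to reproduce.

```python
def window_join(events, window):
    """Join two keyed streams within a time window.

    events: iterable of (timestamp, side, key, value), in timestamp
            order, where side is "A" or "B".
    Returns (joined, unjoined): joined holds (key, a_value, b_value);
    unjoined holds (side, key, value) for records that found no match.
    """
    pending = {"A": {}, "B": {}}
    joined, unjoined = [], []
    for ts, side, key, value in events:
        # Expire buffered records that have fallen out of the window.
        for s in ("A", "B"):
            expired = [k for k, (t, _) in pending[s].items() if ts - t > window]
            for k in expired:
                _, v = pending[s].pop(k)
                unjoined.append((s, k, v))
        other = "B" if side == "A" else "A"
        if key in pending[other]:
            _, other_value = pending[other].pop(key)
            if side == "A":
                joined.append((key, value, other_value))
            else:
                joined.append((key, other_value, value))
        else:
            pending[side][key] = (ts, value)
    # End of input: whatever is still buffered never found a partner.
    for s in ("A", "B"):
        for k, (_, v) in pending[s].items():
            unjoined.append((s, k, v))
    return joined, unjoined
```

For example, feeding it a bid on `auc1` at time 0 and a win notice for `auc1` at time 2 with `window=10` yields one joined record, while a bid that never receives a win notice is reported as unjoined once the window passes. The linear scan for expired keys is fine for a sketch; a window spanning days would need a time‑ordered index and persistent storage instead of in‑memory dictionaries.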
Conclusion
By leveraging Dstream and TM, Baidu has established a suite of high‑throughput, low‑latency data processing solutions that support a growing number of real‑time applications. Ongoing investment aims to deepen research on real‑time big‑data architectures, advancing theory, methods, and system implementations to meet broader market demands.
Baidu Tech Salon
Baidu Tech Salon, organized by Baidu's Technology Management Department, is a monthly offline event that shares cutting‑edge tech trends from Baidu and the industry, providing a free platform for mid‑to‑senior engineers to exchange ideas.
