Architecture Overview of Taobao, Meituan, and Didi Big Data Platforms
This article examines the big‑data architectures of three leading Chinese internet companies—Taobao, Meituan, and Didi—detailing their data sources, synchronization mechanisms, batch and streaming processing layers, and the common scheduling components that unify their Hadoop‑based ecosystems.
To learn from the design patterns of large-scale data platforms, this article reviews the architectures of Taobao, Meituan, and Didi. Although each platform reflects its own business scenarios and technology stack, all three follow a similar three-tier structure: data ingestion, a Hadoop-based processing cluster, and downstream applications.
Taobao Big Data Platform
Taobao was one of the earliest companies to build an internal big-data platform. Its architecture has three parts: data sources and synchronization, the Hadoop cluster, and downstream applications. The data sources are Oracle databases, MySQL replicas, log systems, and crawlers. A synchronization gateway moves their data into Hadoop using three components: DataExchange for full batch sync, DBSync for real-time incremental sync, and TimeTunnel for real-time log and crawler sync. All data is stored in HDFS. Jobs are scheduled by the proprietary "Tianwang" scheduler, which orders tasks by resource availability and priority. Job results are written back to HDFS and then exported to MySQL and Oracle via DataExchange. Downstream services such as recommendation engines read the results directly from those databases.
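The source does not describe how the Tianwang scheduler is implemented. As a rough illustration of ordering tasks by priority and resource availability, here is a minimal sketch; the job names, the single "slots" resource, and the `schedule` function are all hypothetical.

```python
import heapq


def schedule(jobs, free_slots):
    """Pick jobs for one scheduling round (illustrative only, not Tianwang).

    jobs: list of (name, priority, slots_needed); a higher priority launches first.
    free_slots: number of resource slots currently available.
    Returns (launched, deferred) lists of job names.
    """
    # heapq is a min-heap, so negate the priority to pop the highest first.
    heap = [(-priority, name, slots) for name, priority, slots in jobs]
    heapq.heapify(heap)

    launched, deferred = [], []
    while heap:
        _, name, slots = heapq.heappop(heap)
        if slots <= free_slots:
            free_slots -= slots
            launched.append(name)
        else:
            # Not enough resources right now; retry in a later round.
            deferred.append(name)
    return launched, deferred


if __name__ == "__main__":
    jobs = [("report", 1, 2), ("recommend", 5, 3), ("backfill", 3, 4)]
    print(schedule(jobs, free_slots=5))
```

In this sketch a lower-priority job may still launch when a higher-priority one does not fit. A real scheduler could instead hold resources for the blocked job; the source does not say which policy Tianwang uses.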
Meituan Big Data Platform
Meituan's data originates from MySQL, whose binlog is captured by Canal, and from logs collected by Flume; both are fed into a Kafka message queue. Two processing engines consume the data in Kafka: Storm handles real-time stream processing and writes its results to HBase or relational databases, while Hive handles batch processing, and its results feed BI dashboards. Users and executives access the processed data through BI tools and a custom "Tianji" reporting system. The entire workflow is orchestrated by an internal scheduling platform, and developers use an ETL development portal to build and submit jobs.
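The key property of this design is that one message queue feeds two independent consumers, each tracking its own read position. The following sketch models that fan-out in memory. It does not use the Kafka, Storm, or Hive APIs; `TopicLog`, the consumer group names, and the event fields are assumptions made for illustration.

```python
from collections import Counter, defaultdict


class TopicLog:
    """In-memory stand-in for a single-partition message topic."""

    def __init__(self):
        self._messages = []
        self._offsets = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, message):
        self._messages.append(message)

    def poll(self, group):
        """Return messages this group has not seen yet and advance its offset."""
        start = self._offsets[group]
        batch = self._messages[start:]
        self._offsets[group] = len(self._messages)
        return batch


def streaming_consumer(topic, live_counts):
    """Real-time path: update running counters as events arrive."""
    for event in topic.poll("streaming"):
        live_counts[event["city"]] += 1


def batch_consumer(topic, warehouse):
    """Batch path: land raw events in storage for later aggregation."""
    warehouse.extend(topic.poll("batch"))


def daily_report(warehouse):
    """Batch aggregation over everything landed so far."""
    return Counter(event["city"] for event in warehouse)


if __name__ == "__main__":
    topic = TopicLog()
    for city in ["Beijing", "Shanghai", "Beijing"]:
        topic.publish({"source": "binlog", "city": city})

    live, warehouse = Counter(), []
    streaming_consumer(topic, live)
    batch_consumer(topic, warehouse)
    print(live, daily_report(warehouse))
```

Because each group keeps its own offset, the batch path can lag behind or be replayed without affecting the real-time path.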
Didi Big Data Platform
Didi separates its platform into a real-time streaming layer and an offline batch layer. In the streaming layer, data is ingested into Kafka and then takes one of two paths: Spark Streaming or Flink performs ETL and persists the output to HDFS for later batch jobs, or Druid computes real-time metrics that feed alerting systems and dashboards. The offline layer is built on Hadoop 2 (HDFS, YARN, MapReduce) together with Spark and Hive, complemented by a custom scheduler and a visual SQL editor for job submission.
Didi also makes heavy use of HBase and Phoenix: results from both streaming and batch computations are stored in HBase, and applications query them through Phoenix, a SQL layer on top of HBase.
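The pattern here is a single store written by both computation paths and read through SQL. The sketch below imitates it with Python's built-in sqlite3 module as a stand-in; it is not Phoenix or HBase, and the table name, columns, and sample values are hypothetical. SQLite's `INSERT OR REPLACE` plays the role of an upsert keyed on the primary key.

```python
import sqlite3


def open_store():
    """Create an in-memory table standing in for the shared result store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE order_metrics (
               city TEXT, day TEXT, producer TEXT, orders INTEGER,
               PRIMARY KEY (city, day, producer))"""
    )
    return conn


def upsert(conn, city, day, producer, orders):
    """Write or overwrite one result row; producer is 'streaming' or 'batch'."""
    conn.execute(
        "INSERT OR REPLACE INTO order_metrics VALUES (?, ?, ?, ?)",
        (city, day, producer, orders),
    )


def results_for(conn, city, day):
    """Application-side read: fetch both producers' results with plain SQL."""
    rows = conn.execute(
        "SELECT producer, orders FROM order_metrics WHERE city = ? AND day = ?",
        (city, day),
    )
    return dict(rows.fetchall())


if __name__ == "__main__":
    conn = open_store()
    upsert(conn, "Beijing", "2024-01-01", "streaming", 120)  # early estimate
    upsert(conn, "Beijing", "2024-01-01", "batch", 118)      # offline recompute
    upsert(conn, "Beijing", "2024-01-01", "streaming", 125)  # refreshed estimate
    print(results_for(conn, "Beijing", "2024-01-01"))
```

Keeping the producer in the key lets the application decide whether to trust the fresher streaming figure or the recomputed batch figure. Whether Didi models its tables this way is not stated in the source.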
Conclusion
Despite minor differences in product selection and detailed configurations, the three platforms share a common architectural philosophy: a unified scheduling system coordinates data ingestion, processing (both real‑time and batch), and result delivery. Understanding these patterns provides deeper insight into the design of modern big‑data infrastructures.
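A unified scheduler of this kind is commonly modeled as a dependency graph whose tasks run in topological order. The sketch below shows that idea with the standard-library `graphlib` module (Python 3.9 or later). The task names and the `run_pipeline` helper are hypothetical and are not taken from any of the three platforms.

```python
from graphlib import TopologicalSorter


def run_pipeline(dependencies, tasks):
    """Run each task after all of its upstream tasks; return the run order.

    dependencies: maps a task name to the set of tasks it depends on.
    tasks: maps a task name to a zero-argument callable.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()
        order.append(name)
    return order


if __name__ == "__main__":
    log = []
    dependencies = {
        "sync": set(),                                # ingestion
        "stream_compute": {"sync"},                   # real-time processing
        "batch_compute": {"sync"},                    # offline processing
        "export": {"stream_compute", "batch_compute"},  # result delivery
    }
    # Each task just records its name; real tasks would launch actual jobs.
    tasks = {name: (lambda n=name: log.append(n)) for name in dependencies}
    print(run_pipeline(dependencies, tasks))
```

The relative order of the two compute tasks is not fixed, since neither depends on the other; only the constraint that ingestion runs first and delivery runs last is guaranteed.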
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.