Big Data Technology and Architecture: Case Studies of Taobao, Didi, and Meituan

This article reviews the evolution and key components of big data platforms at leading Chinese internet companies—Taobao, Didi, and Meituan—detailing their data sources, synchronization tools, storage layers, processing engines, and scheduling systems to provide practical guidance for building robust big data infrastructures.

Big Data Technology & Architecture

Sep 11, 2019

Big data platforms store, process, and present the ever-growing volume of data that modern society generates. They draw on technologies such as massively parallel processing (MPP) databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.

Big data platforms emerge as a company's business, data volume, and analytical needs keep growing. This article traces how three Chinese internet giants (Taobao, Didi, and Meituan) developed their platforms, to offer a starting point for building one.

Taobao

Taobao was one of the earliest companies to build its own big data platform. Its early Hadoop-based architecture is described below.

The platform consists of three layers: data sources and synchronization at the top, the "Yunti" Hadoop cluster in the middle, and downstream applications that consume the computation results.

Data sources include Oracle and MySQL replicas, log systems, and crawlers. A data-exchange gateway imports this data into Hadoop with three tools: DataExchange for full batch synchronization, DBSync for real-time incremental synchronization, and TimeTunnel for real-time log and crawler synchronization. All imported data lands in HDFS.
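The difference between full batch and incremental synchronization can be illustrated with a minimal sketch. This is not Taobao's code: the `SourceTable` class, its change log, and both sync functions are hypothetical stand-ins that only mimic the roles the text assigns to DataExchange and DBSync.

```python
from dataclasses import dataclass, field


@dataclass
class SourceTable:
    """Hypothetical stand-in for an Oracle/MySQL replica table."""
    rows: dict = field(default_factory=dict)        # primary key -> row value
    changelog: list = field(default_factory=list)   # ordered (op, key, value) events

    def upsert(self, key, value):
        self.rows[key] = value
        self.changelog.append(("upsert", key, value))

    def delete(self, key):
        self.rows.pop(key, None)
        self.changelog.append(("delete", key, None))


def full_batch_sync(source):
    """Copy a complete snapshot, in the spirit of a full batch job."""
    return dict(source.rows)


def incremental_sync(source, target, offset):
    """Replay only the change events recorded after `offset`.

    Returns the new offset so the next run can resume from there.
    """
    for op, key, value in source.changelog[offset:]:
        if op == "upsert":
            target[key] = value
        else:
            target.pop(key, None)
    return len(source.changelog)


source = SourceTable()
source.upsert(1, "order-a")
source.upsert(2, "order-b")

# Initial load: one full copy, then remember how far the change log has been read.
target = full_batch_sync(source)
offset = len(source.changelog)

# Later changes are shipped as deltas instead of re-copying the whole table.
source.upsert(2, "order-b-paid")
source.delete(1)
source.upsert(3, "order-c")
offset = incremental_sync(source, target, offset)
```

A full copy is simple but its cost grows with table size, while the incremental path moves only what changed at the price of tracking an offset. That trade-off is a plausible reason for a platform to keep both kinds of tool.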

Computation tasks are scheduled by the "Tianwang" scheduler, which manages job priority and resource allocation. Results are written back to HDFS and then synchronized to MySQL and Oracle for use by recommendation systems and other applications.
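How a scheduler can combine job priority with resource allocation is sketched below. The source does not describe Tianwang's internals, so everything here is an assumption made for illustration: the `PriorityScheduler` class, the notion of "slots" as the resource unit, and the rule that smaller jobs may run ahead of a blocked higher-priority job.

```python
import heapq
from itertools import count


class PriorityScheduler:
    """Hypothetical scheduler: lower priority number means more urgent."""

    def __init__(self, total_slots):
        self.free_slots = total_slots
        self._queue = []
        self._seq = count()  # tie-breaker that keeps submission order stable

    def submit(self, name, priority, slots):
        heapq.heappush(self._queue, (priority, next(self._seq), name, slots))

    def dispatch(self):
        """Start queued jobs in priority order while they fit in the free slots."""
        started, deferred = [], []
        while self._queue:
            priority, seq, name, slots = heapq.heappop(self._queue)
            if slots <= self.free_slots:
                self.free_slots -= slots
                started.append(name)
            else:
                # Not enough capacity right now; keep the job for the next round.
                deferred.append((priority, seq, name, slots))
        for job in deferred:
            heapq.heappush(self._queue, job)
        return started

    def release(self, slots):
        """Return slots to the pool when a job finishes."""
        self.free_slots += slots


scheduler = PriorityScheduler(total_slots=10)
scheduler.submit("ad_hoc_query", priority=5, slots=4)
scheduler.submit("daily_sales_report", priority=1, slots=8)
scheduler.submit("log_cleanup", priority=9, slots=2)

first_wave = scheduler.dispatch()   # the report runs first; cleanup fills the leftover slots
scheduler.release(8)                # the report finishes
second_wave = scheduler.dispatch()  # the deferred query can now start
```

Letting a small job fill leftover capacity keeps the cluster busy, but it can delay large high-priority jobs. A production scheduler would need an explicit policy for that case.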

Taobao developed its data-synchronization components (DBSync, TimeTunnel, and DataExchange) in-house and has open-sourced them, so they can serve as a reference for similar implementations.

Didi

Didi’s big data platform has evolved through three stages: self‑built small clusters, centralized large clusters with platformization, and finally a SQL‑centric architecture.

The offline platform is built on Hadoop 2 (HDFS, YARN, MapReduce), Spark, and Hive, with a custom scheduler and development environment that provides a visual SQL editor.

Didi also heavily uses HBase and Phoenix, developing custom extensions and maintaining a dedicated HBase platform for both real‑time and batch workloads.

Real‑time computation results are stored in HBase and accessed via Phoenix, while a StreamSQL IDE, monitoring, diagnostics, lineage, and task control features support streaming jobs.

Meituan

Meituan’s data platform ingests data from MySQL (via Canal) and logs (via Flume) into Kafka, which feeds both streaming (Storm) and batch (Hive) processing pipelines.

Streaming results are written to HBase or relational databases, while batch results are stored in ODPS and accessed through BI tools and internal reporting systems.
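The fan-out described above, where one message log feeds both a streaming and a batch consumer, can be sketched in memory. The `Topic` class is a hypothetical stand-in for a Kafka topic and does not use any Kafka client API. The event fields and consumer group names are likewise invented for illustration.

```python
from collections import Counter, defaultdict


class Topic:
    """Hypothetical in-memory stand-in for a Kafka topic (an append-only log)."""

    def __init__(self):
        self.log = []
        self.offsets = defaultdict(int)  # consumer group -> next offset to read

    def publish(self, event):
        self.log.append(event)

    def poll(self, group):
        """Return the events this group has not seen yet and advance its offset."""
        start = self.offsets[group]
        self.offsets[group] = len(self.log)
        return self.log[start:]


topic = Topic()
# Events as they might arrive from a database change collector or a log agent.
topic.publish({"source": "mysql", "city": "beijing", "amount": 30})
topic.publish({"source": "log", "city": "shanghai", "amount": 0})
topic.publish({"source": "mysql", "city": "beijing", "amount": 12})

# Streaming consumer: updates a running per-city order count event by event.
live_counts = Counter()
for event in topic.poll("streaming"):
    if event["source"] == "mysql":
        live_counts[event["city"]] += 1

# Batch consumer: reads the same events independently and aggregates them in bulk.
batch_events = topic.poll("batch")
revenue = sum(e["amount"] for e in batch_events if e["source"] == "mysql")
```

Because each consumer group tracks its own offset, the streaming and batch paths read the same data without interfering with each other. This is the property that lets a single ingestion layer serve both pipelines.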

The offline layer runs on YARN, HDFS, and HiveMeta, with Hive, Spark, and Presto providing data warehousing, mining, and ad‑hoc query capabilities. Meituan has migrated its real‑time warehouse from Storm to Flink, leveraging Flink’s SQL support, fault tolerance, and state management.
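State management and fault tolerance in a streaming job can be illustrated conceptually as follows. This sketch does not use Flink or reproduce its checkpointing protocol. The `KeyedCounter` class and its methods are hypothetical, and the input is a plain list that can be re-read from any position.

```python
import copy


class KeyedCounter:
    """Hypothetical stateful operator: counts events per key."""

    def __init__(self):
        self.state = {}   # key -> running count
        self.offset = 0   # number of input events already processed

    def process(self, events, upto):
        """Process the input from the current offset up to (not including) `upto`."""
        for key in events[self.offset:upto]:
            self.state[key] = self.state.get(key, 0) + 1
        self.offset = max(self.offset, upto)

    def checkpoint(self):
        """Snapshot the state together with the input position it corresponds to."""
        return copy.deepcopy({"state": self.state, "offset": self.offset})

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot["state"])
        self.offset = snapshot["offset"]


events = ["a", "b", "a", "c", "a"]

operator = KeyedCounter()
operator.process(events, 3)
snapshot = operator.checkpoint()
operator.process(events, 4)   # this progress is lost in the simulated crash below

# Simulated failure: a fresh operator restores the snapshot and replays the rest.
recovered = KeyedCounter()
recovered.restore(snapshot)
recovered.process(events, len(events))
```

Storing the input offset alongside the state is what makes the replay consistent. Events after the checkpoint are processed again, but each one is counted exactly once in the recovered state.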

Overall, the three case studies illustrate how data sources, synchronization mechanisms, storage, processing engines, and scheduling systems combine to form a cohesive big data platform, offering practical reference for engineers building similar infrastructures.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
