MySQL Binlog Real‑time Collection and Hive Ingestion at DiDi: Architecture and Practices
DiDi’s real‑time MySQL‑to‑Hive pipeline captures row‑mode binlog with a custom Canal component, converts it to JSON, streams it through Kafka to HDFS, restores it into Hive tables, and verifies integrity with the Dquality service, achieving millisecond‑level collection latency across more than 19,000 daily sync tasks and nearly 50 TB of data.
DiDi’s data platform treats its data infrastructure as a core strategic asset. This article shares the architecture and practical experience of ingesting MySQL data sources into Hive, focusing on real‑time Binlog collection and offline restoration.
Background
Relational databases are the foundation of data analysis, and synchronizing MySQL data to Hive is a prerequisite for many enterprises. Direct MySQL‑to‑Hive approaches (e.g., Sqoop) put heavy load on the source database, take a long time for full‑table extraction, tolerate schema changes poorly, and cannot track incremental changes.
To address these issues, DiDi adopts a Binlog‑based real‑time collection plus offline restoration solution.
Overall Data Flow
The pipeline consists of:
A custom Canal component (based on Alibaba’s open‑source project) captures the MySQL Binlog and converts it to JSON.
The JSON records are sent through a distributed message queue (e.g., Kafka, DDMQ) to HDFS.
On HDFS, Binlog files are cleaned and restored into Hive tables.
For incremental ingestion, the steps above form one complete load cycle. For full historical back‑fill, a one‑time offline extraction (using DataX) pulls a snapshot of the MySQL tables, restores it into a base Hive partition, and then merges the subsequent Binlog increments on top; a minimal sketch of that merge follows.
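The restore step is, at heart, a latest‑wins merge of the base snapshot with the ordered Binlog increments. Below is a minimal Python sketch of that idea, assuming events in the JSON shape shown in the Canal section later in this article (the production merge runs in Hive, not in Python):

# Latest-wins merge of a DataX base snapshot with row-mode binlog events
# ("i" = insert, "u" = update, "d" = delete), ordered by binlog position.
def restore(base_rows, binlog_events):
    """base_rows: {pk_tuple: row_dict} from the one-time snapshot."""
    merged = dict(base_rows)
    for event in binlog_events:
        row = {c["n"]: c["v"] for c in event["columns"]}
        pk = tuple(row[k] for k in event["keys"])
        if event["event"] == "d":
            merged.pop(pk, None)      # delete: drop the row
        else:
            merged[pk] = row          # insert/update: newest image wins
    return merged

Because every row‑mode event carries the full row image, replaying events in binlog order and keeping only the last image per key reproduces the table state at the end of the window.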
Advantages over traditional Sqoop‑style solutions
Binlog‑based restoration decouples data capture from online services, avoiding DB pressure.
Distributed queue ensures low latency and strong scalability.
Row‑level Binlog records preserve insert, update, and delete details, enabling fine‑grained ETL.
Binlog Details
MySQL Binlog is a binary log that records changes to table data. It supports three formats:
Statement mode: each data‑changing SQL statement (INSERT, UPDATE, DELETE) is logged as text.
Row mode: every row‑level change is logged, one record per affected row.
Mixed mode: MySQL automatically switches between Statement and Row depending on the statement.
DiDi uses Row mode, which records the before‑and‑after values of every column, even if a column is unchanged.
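Row mode must be enabled on the source instance before collection can start. A typical my.cnf fragment looks like the following (values are illustrative):

[mysqld]
server-id        = 1
log_bin          = mysql-bin
binlog_format    = ROW         # log row images rather than SQL statements
binlog_row_image = FULL        # keep values for every column, changed or not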
Canal Component
Canal simulates a MySQL slave: it receives the Binlog dump stream from the master, parses the binary events into JSON, and pushes the JSON to a message queue. A sample JSON message looks like:
{
"binlog": "[email protected]",
"time": 1450236307000,
"canalTime": 1450236308279,
"db": "TestCanal",
"table": "g_order_010",
"event": "u",
"columns": [
{"n": "order_id", "t": "bigint(20)", "v": "126", "null": false, "updated": false},
{"n": "driver_id", "t": "bigint(20)", "v": "123456", "null": false, "updated": false},
{"n": "passenger_id", "t": "bigint(20)", "v": "654321", "null": false, "updated": false},
{"n": "current_lng", "t": "decimal(10,6)", "v": "39.021400", "null": false, "updated": false},
{"n": "current_lat", "t": "decimal(10,6)", "v": "120.423300", "null": false, "updated": false},
{"n": "starting_lng", "t": "decimal(10,6)", "v": "38.128000", "null": false, "updated": false},
{"n": "starting_lat", "t": "decimal(10,6)", "v": "121.445000", "null": false, "updated": false},
{"n": "dest_name", "t": "varchar(100)", "v": "Renmin University", "origin_val": "知春路", "null": false, "updated": true}
],
"keys": ["order_id"]
}
{ ... second record ... }
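Canal itself is a Java component, and DiDi’s customized version is not public. As an illustrative stand‑in, the open‑source python-mysql-replication library can show what "act as a replica, parse row events, emit JSON" looks like (connection parameters are placeholders):

import json
from pymysqlreplication import BinLogStreamReader
from pymysqlreplication.row_event import (
    DeleteRowsEvent, UpdateRowsEvent, WriteRowsEvent)

EVENT_CODE = {WriteRowsEvent: "i", UpdateRowsEvent: "u", DeleteRowsEvent: "d"}

# Connect as a fake replica and stream row-mode binlog events.
stream = BinLogStreamReader(
    connection_settings={"host": "127.0.0.1", "port": 3306,
                         "user": "repl", "passwd": "secret"},
    server_id=100,                       # must be unique among replicas
    blocking=True, resume_stream=True,
    only_events=list(EVENT_CODE))

for event in stream:
    for row in event.rows:
        # Updates carry before/after images; inserts and deletes one image.
        values = row.get("after_values", row.get("values"))
        print(json.dumps({
            "db": event.schema, "table": event.table,
            "event": EVENT_CODE[type(event)],
            "columns": [{"n": k, "v": v} for k, v in values.items()],
        }, default=str))

In production the JSON would go to Kafka or DDMQ rather than stdout, with richer metadata (binlog position, timestamps, key list) as in the sample above.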
Dquality Service
To guarantee data integrity across the Binlog pipeline, DiDi introduced Dquality, which records metadata at each stage, detects data loss or delay, and exposes an API that downstream ETL jobs use to verify data readiness.
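Dquality’s API is internal and not documented here, but its downstream role looks roughly like the following hypothetical readiness poll (the endpoint and response shape are invented for illustration):

import time
import requests

def wait_until_ready(db, table, partition, timeout_s=3600):
    """Block until Dquality (hypothetical endpoint) declares the hourly
    partition complete, so ETL never reads a partially landed hour."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        resp = requests.get("http://dquality.internal/api/ready",  # invented URL
                            params={"db": db, "table": table,
                                    "partition": partition})
        if resp.json().get("ready"):
            return True
        time.sleep(30)                   # back off between polls
    raise TimeoutError(f"{db}.{table}/{partition} not ready")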
Handling Data Drift and Sharding
Two real‑world scenarios complicate ingestion: data drift, where events near a partition boundary land in the wrong hourly partition and call for a configurable offset window, and sharded tables, where one logical table is split across many physical databases and tables. DiDi’s solution includes:
Standardized MySQL naming conventions for automatic sharding detection.
Collecting all tables of a logical database into a single Kafka topic.
Storing Binlog files in HDFS using a path pattern /{db}/{table}/{year}/{month}/{day}/{hour}.
During ETL, adding three extra fields to each Hive table: system_rule_etl_update_field (the update timestamp), system_rule_etl_delete_flag (whether the row has been deleted), and system_rule_etl_uniq_key (a primary key that is unique across all shards); a sketch of the path routing and field stamping follows this list.
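As referenced in the list above, routing a message to its hourly HDFS path and deriving the three ETL fields can be sketched as follows (helper names are illustrative, not DiDi’s actual code):

from datetime import datetime, timezone

def hdfs_path(msg):
    """Map a binlog JSON message to /{db}/{table}/{year}/{month}/{day}/{hour}."""
    ts = datetime.fromtimestamp(msg["time"] / 1000, tz=timezone.utc)
    return f"/{msg['db']}/{msg['table']}/{ts:%Y/%m/%d/%H}"

def etl_fields(msg):
    """Derive the three extra Hive columns from a binlog JSON message."""
    return {
        "system_rule_etl_update_field": msg["time"],         # update timestamp
        "system_rule_etl_delete_flag": msg["event"] == "d",  # delete marker
        "system_rule_etl_uniq_key": "_".join(                # cross-shard key
            str(c["v"]) for c in msg["columns"] if c["n"] in msg["keys"]),
    }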
Summary
The Binlog‑based MySQL‑to‑Hive service now supports over 19,000 daily sync tasks, handling nearly 50 TB of data with millisecond‑level latency. While the platform meets most real‑time synchronization needs, further work is planned on custom ETL, performance optimization, and content enrichment.