Big Data 13 min read

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

This article explains Didi's end‑to‑end architecture for ingesting MySQL data into Hive using real‑time Binlog collection, a customized Canal component, message queues, HDFS storage, Dquality monitoring, and strategies for handling data drift and sharding in large‑scale big‑data environments.

dbaplus Community

Jul 4, 2021

How Didi Scales MySQL‑to‑Hive Sync with Real‑Time Binlog Capture

Background

Big data drives modern business strategies, and reliable data collection is the foundation of data‑driven operations. Didi needed a robust way to synchronize MySQL data with Hive for analytics, but traditional Sqoop‑style direct MySQL‑to‑HDFS loads caused database pressure, latency, poor scalability, and lacked change history.

Overall Data Flow

The pipeline consists of a customized Canal component (based on Alibaba's open‑source project) that captures MySQL binlog events, converts them to JSON, and pushes them through a distributed message queue (Kafka/DDMQ) into HDFS. The binlog data is then cleaned and restored into Hive tables.

For incremental ingestion, this process runs continuously. For full‑load or historical backfill, an offline one‑time extraction is performed because historical binlogs are unavailable.

Incremental Process

Collect binlog logs incrementally and store them in HDFS.

Perform an offline full‑load of historical data and restore it to Hive as the baseline.

Merge the baseline with the current incremental binlog to produce a complete dataset for the current period.

Advantages Over Traditional Approaches

Binlog‑based restoration decouples data capture from online services.

Distributed queue ensures real‑time delivery and high scalability.

Binlog records contain detailed DML actions, enabling custom ETL.

Binlog Details

MySQL binlog records every data change. It supports three logging formats:

Statement : logs the SQL statements that modify data.

Row : logs the actual row‑level changes.

Mixed : automatically switches between Statement and Row (default is Statement).

Didi uses the Row format, capturing before‑and‑after values for every column, even if only one column changed.

Canal Component

Canal simulates a MySQL slave, receives binlog streams, parses them into JSON, and forwards the JSON to a message queue.

Imitates MySQL slave protocol and sends a dump request.

MySQL master pushes binary logs to Canal.

Canal parses the logs and encodes them as JSON strings.

JSON messages are sent to Kafka/DDMQ and reported for monitoring.

Example of a formatted JSON record:

{"binlog":"[email protected]","time":1450236307000,"canalTime":1450236308279,"db":"TestCanal","table":"g_order_010","event":"u","columns":[{"n":"order_id","t":"bigint(20)","v":"126","null":false,"updated":false},{"n":"driver_id","t":"bigint(20)","v":"123456","null":false,"updated":false},{"n":"passenger_id","t":"bigint(20)","v":"654321","null":false,"updated":false},{"n":"current_lng","t":"decimal(10,6)","v":"39.021400","null":false,"updated":false},{"n":"current_lat","t":"decimal(10,6)","v":"120.423300","null":false,"updated":false},{"n":"starting_lng","t":"decimal(10,6)","v":"38.128000","null":false,"updated":false},{"n":"starting_lat","t":"decimal(10,6)","v":"121.445000","null":false,"updated":false},{"n":"dest_name","t":"varchar(100)","v":"Renmin University","origin_val":"知春路","null":false,"updated":true}],"keys":["order_id"]}

Dquality Monitoring

Dquality records metadata for each stage of the data channel, providing:

Metadata for data back‑tracking.

Detection of data loss and latency.

Verification of data integrity.

Each sender reports its delivery result and timestamp to Dquality, which aggregates the information, determines whether data has been transmitted timely and accurately, and stores the analysis results for downstream consumption via an API.

One‑Time Full Load & Initialization

Historical data is fetched using DataX to connect directly to the MySQL source, extracting a snapshot up to the current time and loading it into the first Hive partition. Subsequent partitions are built by merging the previous partition’s full data with the current period’s incremental binlog.

Scenario 1: Handling Data Drift

When an order’s update timestamp falls near the hour boundary, the record may be written to the next hour’s partition, causing gaps in hourly statistics. Similarly, payment settlement records may appear in the next day’s partition, breaking same‑day accounting. Didi’s platform allows configuring a time offset (e.g., extending the hour window by 5 minutes) to capture late‑arriving data without delaying downstream processing.

Scenario 2: Supporting Sharded Databases

For services that shard databases by city code, numeric ID, or date, Didi enforces a naming convention that can be automatically recognized. All shards of a logical database are collected into a single Kafka topic, stored in HDFS using a path pattern /{db}/{table}/{year}/{month}/{day}/{hour}, and later restored to Hive. Additional ETL fields are added to Hive tables to preserve update timestamps, delete flags, and a globally unique key composed of database name, table name, and primary key.

Conclusion

The Binlog‑based MySQL‑to‑Hive service underpins Didi’s data platform, supporting over 19,000 daily sync tasks and nearly 50 TB of data with millisecond‑level latency. While the solution meets most real‑time, accurate, and customizable sync needs, challenges remain in personalized ETL, performance tuning, and content governance, which will be addressed in future work.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data Hive mysql Canal

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.

Background

Overall Data Flow

Incremental Process

Advantages Over Traditional Approaches

Binlog Details

Canal Component

Dquality Monitoring

One‑Time Full Load & Initialization

Scenario 1: Handling Data Drift

Scenario 2: Supporting Sharded Databases

Conclusion

dbaplus Community

How this landed with the community

Was this worth your time?

0 Comments

Scenario 1: Handling Data Drift

Scenario 2: Supporting Sharded Databases