
MySQL Binlog Real‑time Collection and Hive Ingestion at DiDi: Architecture and Practices

DiDi’s real‑time MySQL‑to‑Hive pipeline captures row‑mode binlog with a custom Canal component, converts it to JSON, streams it via Kafka to HDFS, restores it into Hive tables, and uses Dquality for integrity, achieving millisecond latency for over 19,000 daily sync tasks handling roughly 50 TB of data.

Didi Tech

DiDi’s data platform treats its data infrastructure as a core strategic asset. This article shares the architecture and practical experience of ingesting MySQL data into Hive, focusing on real‑time Binlog collection and offline restoration.

Background

Relational databases are the foundation of data analysis. Synchronizing MySQL data to Hive is a prerequisite for many enterprises, but direct MySQL‑to‑Hive approaches (e.g., Sqoop) suffer from high DB load, long full‑table latency, poor scalability for schema changes, and lack of incremental change tracking.

To address these issues, DiDi adopts a Binlog‑based real‑time collection plus offline restoration solution.

Overall Data Flow

The pipeline consists of:

Custom Canal component (based on Alibaba’s open‑source project) captures MySQL Binlog and converts it to JSON.

The JSON records are sent through a distributed message queue (e.g., Kafka, DDMQ) to HDFS.

On HDFS, Binlog files are cleaned and restored into Hive tables.

For incremental ingestion, the above steps constitute one full load cycle. For full‑historical back‑fill, an offline one‑time extraction (using DataX) pulls a snapshot of MySQL tables, restores it to a base Hive partition, and then merges subsequent Binlog increments.
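The merge of a base snapshot with subsequent Binlog increments can be sketched as applying ordered events to a keyed snapshot. The function and event shapes below are illustrative, not DiDi’s actual code; the `event` codes (`i`/`u`/`d`) and `keys` field mirror the Canal JSON format described in this article.

```python
# Illustrative sketch: merge a base snapshot with ordered Binlog events.
# Assumed shapes: snapshot is {primary_key_tuple: row_dict}; each event
# carries "event" (i/u/d), "keys" (primary-key column names), and "row".

def apply_binlog_events(snapshot, events):
    for ev in events:
        key = tuple(ev["row"][k] for k in ev["keys"])
        if ev["event"] in ("i", "u"):   # insert/update: last write wins
            snapshot[key] = ev["row"]
        elif ev["event"] == "d":        # delete: drop the row
            snapshot.pop(key, None)
    return snapshot

base = {("126",): {"order_id": "126", "status": "created"}}
events = [
    {"event": "u", "keys": ["order_id"],
     "row": {"order_id": "126", "status": "finished"}},
    {"event": "i", "keys": ["order_id"],
     "row": {"order_id": "127", "status": "created"}},
]
merged = apply_binlog_events(base, events)
```

Because events are applied in Binlog order, replaying the same window is idempotent, which is what makes back‑fill plus incremental merge safe to re-run.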

Advantages over traditional Sqoop‑style solutions

Binlog‑based restoration decouples data capture from online services, avoiding DB pressure.

The distributed queue keeps end‑to‑end latency low and scales horizontally.

Row‑level Binlog records preserve insert, update, and delete details, enabling fine‑grained ETL.

Binlog Details

MySQL Binlog is a binary log that records data‑changing statements. It supports three formats:

Statement mode: the SQL text of each data‑changing statement (INSERT, UPDATE, DELETE) is logged and re‑executed on replicas.

Row mode: every row‑level change is logged with the affected column values.

Mixed mode: MySQL switches between Statement and Row logging automatically, per statement.

DiDi uses Row mode, which records the before‑and‑after values of every column, even if a column is unchanged.
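Row mode is enabled on the MySQL server side through configuration; a minimal my.cnf fragment (values here are illustrative, not DiDi’s settings) would look like:

```ini
[mysqld]
server-id        = 1
log-bin          = mysql-bin     ; enable the binary log
binlog_format    = ROW           ; log row-level changes, not SQL text
binlog_row_image = FULL          ; record every column, changed or not
expire_logs_days = 7             ; illustrative retention period
```

The `binlog_row_image = FULL` setting is what makes every column appear in each row event, matching the behavior described above.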

Canal Component

Canal simulates a MySQL replica: it requests a Binlog dump from the master, parses the binary log into JSON, and pushes the JSON messages to a message queue. A sample message looks like:

{
  "binlog": "[email protected]",
  "time": 1450236307000,
  "canalTime": 1450236308279,
  "db": "TestCanal",
  "table": "g_order_010",
  "event": "u",
  "columns": [
    {"n": "order_id", "t": "bigint(20)", "v": "126", "null": false, "updated": false},
    {"n": "driver_id", "t": "bigint(20)", "v": "123456", "null": false, "updated": false},
    {"n": "passenger_id", "t": "bigint(20)", "v": "654321", "null": false, "updated": false},
    {"n": "current_lng", "t": "decimal(10,6)", "v": "39.021400", "null": false, "updated": false},
    {"n": "current_lat", "t": "decimal(10,6)", "v": "120.423300", "null": false, "updated": false},
    {"n": "starting_lng", "t": "decimal(10,6)", "v": "38.128000", "null": false, "updated": false},
    {"n": "starting_lat", "t": "decimal(10,6)", "v": "121.445000", "null": false, "updated": false},
    {"n": "dest_name", "t": "varchar(100)", "v": "Renmin University", "origin_val": "Zhichun Road", "null": false, "updated": true}
  ],
  "keys": ["order_id"]
}
{ ... second record ... }
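A downstream consumer needs only standard JSON handling to work with these messages. A minimal sketch, using the field names from the sample above (a trimmed copy of the message is inlined for illustration):

```python
import json

# Trimmed copy of the Canal message above, for illustration.
msg = json.loads("""
{"db": "TestCanal", "table": "g_order_010", "event": "u",
 "keys": ["order_id"],
 "columns": [
   {"n": "order_id", "t": "bigint(20)", "v": "126",
    "null": false, "updated": false},
   {"n": "dest_name", "t": "varchar(100)", "v": "Renmin University",
    "origin_val": "Zhichun Road", "null": false, "updated": true}
 ]}
""")

# Rebuild the row and collect the columns this UPDATE actually touched.
row = {c["n"]: c["v"] for c in msg["columns"]}
changed = {c["n"]: (c.get("origin_val"), c["v"])
           for c in msg["columns"] if c["updated"]}
key = tuple(row[k] for k in msg["keys"])
```

Here `changed` maps each updated column to its (before, after) pair, and `key` gives the primary-key tuple declared in `keys`.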

Dquality Service

To guarantee data integrity across the Binlog pipeline, DiDi introduced Dquality, which records metadata for each stage, detects data loss or delay, and provides an API for downstream ETL to verify data readiness.
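The article does not show Dquality’s interface, so the sketch below is a hypothetical illustration of the idea — stage‑by‑stage record-count reconciliation per time window — not the real API. All names are invented.

```python
# Hypothetical sketch of Dquality-style reconciliation: each pipeline
# stage reports how many records it handled for a time window; the
# downstream ETL asks whether the window is complete before reading it.

class QualityTracker:
    def __init__(self):
        self.counts = {}  # (stage, window) -> record count

    def report(self, stage, window, count):
        key = (stage, window)
        self.counts[key] = self.counts.get(key, 0) + count

    def is_ready(self, window, stages):
        """Ready when every stage has reported the same count."""
        seen = [self.counts.get((s, window)) for s in stages]
        return None not in seen and len(set(seen)) == 1

dq = QualityTracker()
dq.report("canal", "2015121600", 1000)
dq.report("kafka", "2015121600", 1000)
dq.report("hdfs",  "2015121600", 1000)
ready = dq.is_ready("2015121600", ["canal", "kafka", "hdfs"])
```

A mismatch or missing report flags data loss or delay for that window, which is the signal downstream ETL waits on.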

Handling Data Drift and Sharding

Two real‑world complications shape the design: data drift, where events land outside their nominal time window and require configurable offset windows, and sharded tables spread across many physical databases. DiDi’s solution includes:

Standardized MySQL naming conventions for automatic sharding detection.

Collecting all tables of a logical database into a single Kafka topic.

Storing Binlog files in HDFS using a path pattern /{db}/{table}/{year}/{month}/{day}/{hour}.

During ETL, adding three extra fields to Hive tables: system_rule_etl_update_field, system_rule_etl_delete_flag, and system_rule_etl_uniq_key to preserve update timestamps, delete flags, and global primary keys.
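With those three fields in place, the restore step can deduplicate by global key, keep the latest update, and filter deleted rows. A hedged Python sketch — the field names come from the list above, but the logic is assumed, not DiDi’s actual implementation:

```python
# Sketch: restore the latest state of each row from accumulated Binlog
# records, using the three extra ETL fields described above.

def restore(records):
    latest = {}
    for r in records:
        k = r["system_rule_etl_uniq_key"]
        # Keep the record with the newest update timestamp per global key.
        if (k not in latest or
                r["system_rule_etl_update_field"]
                > latest[k]["system_rule_etl_update_field"]):
            latest[k] = r
    # Drop rows whose final state is a delete.
    return [r for r in latest.values()
            if not r["system_rule_etl_delete_flag"]]

rows = restore([
    {"system_rule_etl_uniq_key": "g_order_010:126",
     "system_rule_etl_update_field": 1450236307000,
     "system_rule_etl_delete_flag": False, "status": "created"},
    {"system_rule_etl_uniq_key": "g_order_010:126",
     "system_rule_etl_update_field": 1450236308000,
     "system_rule_etl_delete_flag": False, "status": "finished"},
    {"system_rule_etl_uniq_key": "g_order_010:127",
     "system_rule_etl_update_field": 1450236309000,
     "system_rule_etl_delete_flag": True, "status": "created"},
])
```

In Hive this same pattern is typically expressed as a window function over the uniq key ordered by the update field, followed by a filter on the delete flag.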

Summary

The Binlog‑based MySQL‑to‑Hive service now supports over 19,000 daily sync tasks, handling nearly 50 TB of data with millisecond‑level latency. While the platform meets most real‑time synchronization needs, further work is planned on custom ETL, performance optimization, and content enrichment.

Big Data · Hive · MySQL · Binlog · Canal · ETL
Written by Didi Tech

Official Didi technology account