Real‑time MySQL Binlog Capture and Offline Hive Restoration for Data Warehouse Production
This article describes a complete solution that uses Alibaba's Canal for real‑time MySQL binlog collection, Kafka for transport, and a customized Camus pipeline to load and merge binlog data into Hive, addressing performance, consistency, and delete‑event challenges in large‑scale data warehousing.
In data‑warehouse modeling, raw business‑layer data, known as Operational Data Store (ODS) data and typically drawn from business logs or database tables such as those in MySQL, must be synchronized to Hive for downstream analytics. Traditional batch extract‑and‑load approaches suffer from performance bottlenecks, place heavy load on MySQL, and cannot handle updates or deletes.
To overcome these issues, the authors adopt a Change Data Capture (CDC) + Merge architecture: real‑time binlog capture using Alibaba's open‑source Canal, temporary storage in Kafka, and offline processing that restores the data into Hive.
Overall Architecture : CanalManager assigns and monitors capture tasks, while Canal and CanalClient perform the actual binlog extraction. Captured binlog streams are written to Kafka topics (one per MySQL database). For offline processing, a customized version of LinkedIn's Camus pulls the Kafka data into Hive hourly. Each ODS table first gets an initial snapshot, and incremental binlog changes are then merged into it daily.
Binlog Real‑time Capture : CanalManager selects the optimal MySQL instance, creates a Canal instance, and registers it in ZooKeeper using persistent and ephemeral nodes for high availability. Two Canal servers, one running and one standby, provide failover. CanalClient connects to the running server, receives binlog events, and publishes them to the appropriate Kafka topic.
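To make the last step concrete, here is a minimal sketch of how a client might map a binlog event to a Kafka topic and message. It does not use the Canal or Kafka client libraries. The event dictionary layout, the `binlog_` topic prefix, and the choice of table plus primary key as the message key are all assumptions made for illustration, not details confirmed by the source.

```python
import json


def topic_for(event, prefix="binlog_"):
    """Map an event to a topic named after its MySQL database.

    The "binlog_" prefix is a hypothetical naming convention.
    """
    return prefix + event["database"]


def to_message(event):
    """Return (topic, key, value) for one binlog event.

    Keying by table and primary key is an assumption: it would keep all
    changes to the same row in one partition, and therefore in order.
    """
    key = f'{event["table"]}:{event["pk"]}'.encode("utf-8")
    value = json.dumps(event, sort_keys=True).encode("utf-8")
    return topic_for(event), key, value


# Hypothetical event shape, loosely modeled on a row-change record.
event = {
    "database": "shop",
    "table": "orders",
    "pk": 42,
    "type": "UPDATE",
    "row": {"id": 42, "status": "paid"},
}
topic, key, value = to_message(event)
print(topic, key)
```

A real client would hand the resulting topic, key, and value to a Kafka producer.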
Offline Restoration (Kafka → Hive) : The Camus‑based job runs under Meituan's ETL framework. It first parses raw binlog data into the target Hive schema, then writes the data to HDFS and loads Hive partitions. A daily Checkdone task verifies that the hourly Kafka‑to‑Hive jobs have completed before triggering the Merge job.
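The Checkdone gate can be pictured as a completeness check over the day's hourly loads. The sketch below assumes that completed loads are recorded as `(date, hour)` pairs. That bookkeeping format and the function names are hypothetical, and the real task presumably queries the ETL framework's own job state.

```python
from datetime import date


def missing_hours(day, completed):
    """Return the hours (0-23) of `day` that have no completed hourly load."""
    done = {hour for (d, hour) in completed if d == day}
    return [h for h in range(24) if h not in done]


def checkdone(day, completed):
    """True only when all 24 hourly loads for `day` have finished."""
    return not missing_hours(day, completed)


day = date(2024, 1, 1)
# Simulate a day on which the 17:00 load has not finished yet.
completed = {(day, h) for h in range(24) if h != 17}

print(missing_hours(day, completed))  # [17]
print(checkdone(day, completed))      # False, so Merge must wait
```

Only when `checkdone` returns true would the daily Merge job be triggered.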
Merge Process : The Merge job creates a Delta table containing the latest changes for the day, then performs a primary‑key‑based merge with the existing ODS table, inserting new rows, updating changed rows, and preserving unchanged rows. The result overwrites the original Hive table.
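The merge logic can be sketched in plain Python, with in-memory lists standing in for the Hive tables. The column names `id` (primary key) and `ts` (change order) are assumptions, and the production job would express the same idea as a Hive query rather than Python.

```python
def build_delta(changes, key="id", order="ts"):
    """Reduce the day's changes to the latest version of each primary key."""
    delta = {}
    for row in changes:
        current = delta.get(row[key])
        # On equal order values the later entry in the list wins.
        if current is None or row[order] >= current[order]:
            delta[row[key]] = row
    return delta


def merge(ods_rows, changes, key="id", order="ts"):
    """Primary-key merge: insert new rows, update changed rows, keep the rest."""
    merged = {row[key]: row for row in ods_rows}
    merged.update(build_delta(changes, key, order))
    # The result replaces the previous table contents.
    return sorted(merged.values(), key=lambda r: r[key])


ods = [
    {"id": 1, "name": "a", "ts": 1},
    {"id": 2, "name": "b", "ts": 1},
]
changes = [
    {"id": 2, "name": "b2", "ts": 5},
    {"id": 3, "name": "c", "ts": 6},
    {"id": 2, "name": "b3", "ts": 7},
]
result = merge(ods, changes)
print([(r["id"], r["name"]) for r in result])  # [(1, 'a'), (2, 'b3'), (3, 'c')]
```

Row 1 is preserved unchanged, row 2 takes its latest version, and row 3 is inserted.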
Practice 1 – Sharding Support : By allowing multiple MySQL databases to write to the same Kafka topic and using regular‑expression‑based configuration in the Merge job, the solution aggregates thousands of sharded tables into a single Hive table, reducing HDFS small‑file and partition overhead.
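A regular-expression-based mapping from sharded source tables to a single Hive table might look like the following sketch. The database names, table names, and rule format are invented for illustration and do not reflect the actual Merge job configuration.

```python
import re

# Hypothetical rules: each pattern matches "database.table" names of one
# sharded logical table and maps them to a single Hive target table.
SHARD_RULES = [
    (re.compile(r"^order_db_\d+\.order_\d+$"), "ods.orders"),
    (re.compile(r"^user_db\.user_\d+$"), "ods.users"),
]


def target_table(source):
    """Return the Hive table for a source 'database.table', or None if unmatched."""
    for pattern, target in SHARD_RULES:
        if pattern.match(source):
            return target
    return None


print(target_table("order_db_03.order_0127"))  # ods.orders
print(target_table("user_db.user_7"))          # ods.users
print(target_table("misc.audit_log"))          # None
```

Because every shard resolves to the same target, the merged output lands in one Hive table instead of thousands.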
Practice 2 – Delete Event Handling : Because Hive, as used in this pipeline, does not support row‑level deletes, the pipeline extracts delete events from the binlog, left‑outer‑joins the existing ODS data against them, and keeps only the rows that have no matching delete event before applying the standard Merge.
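The delete step amounts to an anti-join on the primary key. Here is a minimal in-memory sketch, assuming an `id` key column. It deliberately ignores the ordering between a delete and a later re-insert of the same key on the same day, which a real job would have to resolve.

```python
def apply_deletes(ods_rows, delete_events, key="id"):
    """Drop every ODS row whose primary key appears in a delete event.

    Equivalent in spirit to a LEFT OUTER JOIN from the ODS data to the
    delete events, keeping rows where the delete side is NULL.
    """
    deleted = {event[key] for event in delete_events}
    return [row for row in ods_rows if row[key] not in deleted]


ods = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b"},
    {"id": 3, "name": "c"},
]
deletes = [{"id": 2}]

remaining = apply_deletes(ods, deletes)
print([r["id"] for r in remaining])  # [1, 3]
```

The surviving rows then pass through the standard Merge as usual.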
Summary and Outlook : The Binlog‑driven MySQL‑to‑Hive service now powers most of Meituan's business lines, delivering accurate and efficient data synchronization. Future work will focus on eliminating the CanalManager single‑point‑of‑failure and building cross‑region disaster‑recovery capabilities.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
