Real-time Binlog Collection and Offline MySQL Data Restoration for Data Warehousing
The article presents a CDC solution that combines Alibaba’s Canal for real‑time MySQL binlog capture into Kafka with LinkedIn’s Camus for hourly Kafka‑to‑Hive loading, then merges snapshots and incremental binlog data to accurately and efficiently rebuild ODS tables, supporting sharding and delete events.
This article introduces a solution for accurately and efficiently loading database (DB) data into a data warehouse by combining real-time Binlog collection with offline Binlog restoration.
Background : In data warehouse modeling, raw business-layer data is called ODS (Operational Data Store). ODS typically includes business log data and business DB data. For DB data, extracting data from relational databases such as MySQL and loading it into Hive is a critical step. Traditional batch extraction‑load (Select → Save to local file → Load to Hive) is simple but suffers from performance bottlenecks, heavy load on MySQL, and inability to handle updates/deletes.
To solve these issues, the article proposes a CDC (Change Data Capture) + Merge approach: real-time Binlog collection followed by offline processing to restore business data.
Overall Architecture : Real-time Binlog collection uses Alibaba’s open‑source project Canal to pull Binlog from MySQL, parse it, and temporarily store it in Kafka. Offline processing uses LinkedIn’s open‑source project Camus to pull Kafka Binlog data into Hive, create snapshots, and merge incremental Binlog data with existing ODS tables.
Real-time Binlog Collection : The process consists of two modules: CanalManager (task assignment, monitoring, metadata management) and Canal/CanalClient (actual collection). CanalManager selects the most suitable MySQL instance, creates a Canal instance, and registers it in ZooKeeper with permanent and temporary nodes for high availability. CanalServer and CanalClient coordinate to deliver Binlog streams to Kafka, one Kafka topic per MySQL DB.
Offline Restoration (Kafka → Hive) :
Camus runs hourly to pull Binlog data from Kafka into Hive.
For each ODS table, a one‑time snapshot of the current MySQL data is taken and loaded into Hive via direct SELECT.
Daily Merge jobs combine the snapshot with the incremental Binlog to reconstruct the latest business data.
Camus Customizations :
Parsing raw Binlog (without schema) into Hive‑compatible schema.
Introducing a “Checkdone” task that verifies the successful completion of hourly Kafka‑to‑Hive jobs before allowing the daily Merge job to start.
Loading Hive partitions after each Kafka‑to‑Hive run.
Merge Process : The Merge job writes incremental Binlog data into a Delta table, then merges Delta with the existing snapshot using the primary key. Updated rows are replaced by Delta rows; unchanged rows are kept. The final result overwrites the original Hive table.
Practice 1 – Supporting Sharding : Binlog from multiple physical MySQL databases can be written to a single Kafka topic, allowing all shards to be merged into one Hive table. Merge jobs use regular‑expression‑based configuration to identify relevant shards, and logical table names are derived from physical names to reduce HDFS small‑file and partition overhead.
Practice 2 – Supporting Delete Events : Since Hive does not support DELETE, the solution extracts Delete events from Binlog, performs a left‑outer join between the existing table (A) and the Delete set (B) on the primary key, and retains rows that do not have a matching Delete record. The retained rows are then processed by the normal Merge workflow.
Summary & Outlook : The Binlog‑based MySQL‑to‑Hive service now covers most business lines at Meituan, providing accurate and efficient data synchronization. Future work includes eliminating the single‑point failure of CanalManager and building cross‑region disaster‑recovery architecture.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
