How Meituan Syncs MySQL to Hive in Real-Time Using Binlog, Canal, and Camus
This article explains Meituan's architecture for accurately and efficiently moving MySQL data into a Hive data warehouse by capturing binlog streams with Canal, transporting them via Kafka, and restoring them offline with Camus and a merge process that handles inserts, updates, and deletes.
Background
Traditional batch extraction of MySQL data to Hive (SELECT → local file → LOAD) becomes a performance bottleneck as data volume grows, overloads the MySQL source, and cannot handle UPDATE/DELETE because Hive lacks those primitives. To address these issues, Meituan built a CDC + Merge pipeline that captures MySQL binlog in real time and replays it offline to reconstruct the business data in Hive.
Overall Architecture
The pipeline consists of two main stages:
Real‑time binlog collection using Alibaba’s open‑source Canal. Binlog events are written to Kafka topics, one topic per MySQL database.
Offline restoration that pulls the Kafka streams into Hive via a customized version of LinkedIn’s Camus, followed by a daily Merge that applies the incremental changes to the existing Hive snapshot.
Binlog Real‑Time Collection
CanalManager receives a binlog‑capture request, queries the DBA platform for the optimal MySQL instance, creates a Canal Instance, and assigns it to a CanalServer based on load‑balancing and cross‑region considerations.
Each CanalServer registers the instance in ZooKeeper with a permanent node (instance name) and a temporary node (ip:port). This enables high availability: a standby server takes over if the running server fails, and CanalClient always connects to the active server.
Binlog subscription is at the database level; all databases on the same MySQL instance share a single Canal Instance, which filters out events for un‑subscribed databases before forwarding the rest to the corresponding Kafka topics.
Offline Restoration (Kafka → Hive)
The offline stage uses a customized Camus implementation to pull binlog data from Kafka into Hive on an hourly basis.
Kafka2Hive Tasks
Each hourly task reads a specific Kafka topic, parses the raw binlog into a Hive‑compatible schema, and writes the data into the original_binlog database. The table is partitioned by db_name and date, and the corresponding Hive partitions are loaded so downstream queries can see the data immediately.
Because the ETL framework separates queues by granularity, a Checkdone daily task scans the HDFS directories created by the hourly jobs. If the latest timestamp for the previous day exceeds midnight, the hourly jobs are considered successful and the daily Merge can be triggered.
Merge Process
After the binlog data lands in Hive, a Merge job performs two steps:
Create a Delta table that contains only the latest change for each primary key on that day. If a row is updated multiple times within the day, only the last version is kept.
Merge the Delta table with the existing snapshot using the primary key as the unique identifier. The result overwrites the original Hive table with INSERT OVERWRITE.
Merge Example
Assume a table with columns id (primary key) and value. The Delta table stores the most recent value for each id. After the Merge, new rows are inserted, rows with updated id are replaced, and unchanged rows remain. By default the MySQL primary key is used as the unique key, but this can be overridden in the job configuration.
Practice 1: Supporting Sharding
When many MySQL shards exist, binlog from multiple databases can be written to the same Kafka topic. During Merge, a regular‑expression configuration tells the job which logical tables to aggregate, allowing all shards to be merged into a single Hive table.
An additional optimization rewrites physical table names (e.g., userinfo123) to logical names ( userinfo) before writing to Hive. This reduces the number of small HDFS files and Hive partitions.
Practice 2: Supporting Delete Events
Hive does not support DELETE, so deletions are handled in two steps:
Extract delete events from the binlog (Table B) and left‑outer‑join the existing snapshot (Table A) on the primary key. Rows that have no match in Table B are retained; rows that match are considered deleted.
Run the regular Merge on the retained dataset to produce the final Hive table.
This effectively removes rows that were deleted in MySQL.
Conclusion and Outlook
The binlog‑based MySQL‑to‑Hive service now covers most internal business lines at Meituan, providing accurate and efficient data ingestion. Future work will eliminate the single‑point nature of CanalManager and add cross‑region disaster‑recovery capabilities to further improve stability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
