Big Data 15 min read

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

This article presents a systematic guide to real‑time MySQL binlog ingestion, outlining three core principles—decoupling from business data, handling schema changes, and ensuring traceability—followed by concrete Debezium‑Kafka‑Hive solutions, scenario‑specific tactics, and practical tips for reliable data pipelines.

dbaplus Community

Sep 1, 2020

Mastering Real‑Time MySQL Binlog Sync with Debezium, Kafka & Hive

1. Core Principles for Binlog Real‑Time Ingestion

Principle 1: Decouple from Business Data – Use a stable timestamp (e.g., ts_ms provided by Debezium) that is independent of any table‑specific time fields, ensuring a consistent partition key for downstream storage.

Principle 2: Decouple from Table Schema – Design the sink format to tolerate column additions, deletions, or renames. JSON is chosen because it does not rely on column order and can represent any schema change without breaking downstream reads.

Principle 3: Ensure Traceability – Both the ingestion side and the consumption side must be able to replay data. On the source side, the binlog position (file, offset) can be reset; on the sink side, offsets can be rewound in Kafka to re‑process records.

2. Technical Architecture and Implementation

The recommended stack is Debezium + Confluent Kafka + OSS/S3 + Hive . Key configuration points:

Enable Debezium’s New Record State Extraction to expose metadata fields such as table, version, connector, ts_ms, file, pos, etc.

Use JSON as the Kafka payload format; it aligns with Debezium’s default output and avoids the column‑order constraints of Parquet/ORC.

For Hive, store the JSON files directly; this sacrifices some read performance but guarantees compatibility with any schema evolution.

When using other capture tools like Canal, the same approach of extracting full metadata and emitting JSON should be applied.

3. Business Scenarios

3.1 Sharding / Multiple Tables – Debezium’s topic routing can direct tables from a single MySQL instance to separate Kafka topics based on regex patterns. A downstream RegexRouter transform merges these topics before Hive ingestion.

3.2 Incremental vs. Full‑Load – Incremental capture is the default. For full‑load, two options are provided:

Set snapshot.mode=when_needed in Debezium to trigger a one‑time snapshot of the whole table.

Combine Sqoop (to import existing data as Parquet) with Debezium incremental streams, then union the datasets in Hive. Note the type mismatches (e.g., datetime vs. bigint) that may require conversion.

3.3 De‑duplication of Late Arriving Records – Use a window function to pick the latest record per primary key based on a composite order of __ts_ms, __file, and __pos:

SELECT *
FROM (
  SELECT *,
    row_number() OVER (PARTITION BY t.id ORDER BY t.__ts_ms DESC, t.__file DESC, CAST(t.__pos AS INT) DESC) AS rn
  FROM test t
  WHERE dt = '{pt}' AND hour = '{now_hour}'
) t1
WHERE rn = 1;

Set the Debezium tombstones.on.delete (or equivalent) to filter out deleted rows, then add a delete = 'false' condition in the outer query.

4. Architectural Summary

The overall design emphasizes simplicity: a minimal number of moving parts reduces operational risk. While Flink was considered, the lack of a mature development‑and‑operations platform and the focus on data transport rather than stream processing led to the choice of Debezium‑Kafka‑Hive.

Key takeaways:

Prefer stable, source‑generated timestamps ( ts_ms) for partitioning.

Store raw change events as JSON to survive schema evolution.

Leverage Kafka’s offset management for both replay and de‑duplication.

Use topic routing and transform operators to handle sharding and merging.

5. Conclusion

The article consolidates practical lessons from real‑world binlog capture projects, offering a repeatable framework that can be adapted to various component choices while preserving the three guiding principles.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Kafka Hive MySQL data ingestion Debezium Real-time Sync

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.