Mastering Change Data Capture: Open‑Source Tools and How to Choose the Right One
This article explains the concept of Change Data Capture (CDC), outlines its common use cases, compares the main technical approaches—including timestamps, data diff, triggers, and log‑based methods—and reviews popular open‑source CDC solutions and their database‑specific configuration requirements.
What is CDC?
Change Data Capture (CDC) records every data modification—INSERT, UPDATE, DELETE—in a source system and streams those changes to downstream consumers for real‑time analytics, ETL, data synchronization, and related use cases.
Typical CDC use cases
Real‑time data warehousing
ETL pipelines
Read/write separation
Application data merging
One‑to‑many data distribution
Cross‑platform disaster‑recovery replication
Rolling upgrades of data platforms
Technical approaches to CDC
Table‑level timestamps or version columns: Some applications add columns such as last_update, date_modified, or a deleted flag, and a periodic query on these columns identifies changed rows. The approach requires schema changes, and because deletions must be recorded logically (a flag rather than a physical DELETE), tables can grow indefinitely.
Data‑diff comparison: Full‑table snapshots are compared to detect changes. This method is CPU‑ and I/O‑intensive and sees only the final state of each row within the comparison window, so it cannot emit intermediate change events.
Table‑level triggers: INSERT/UPDATE/DELETE triggers write each change to a history table, enabling precise incremental extraction. Triggers add overhead to every write and increase storage consumption.
Log‑based change extraction: Parsing the database’s transaction log (binlog, redo log, WAL, etc.) provides low‑latency capture without schema changes or extra queries against the source tables. The same approach applies to most databases, although each log format needs its own reader.
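The difference between polling a timestamp column and recording changes with triggers can be sketched with Python's built‑in sqlite3 module. This is a minimal illustration, not a production design: the users and users_history tables, the trigger names, and the integer last_update clock are all assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    last_update INTEGER NOT NULL  -- hypothetical logical clock set by the application
);
-- History table filled by triggers: one row per change, in execution order.
CREATE TABLE users_history (
    seq INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT NOT NULL,
    id INTEGER NOT NULL,
    name TEXT
);
CREATE TRIGGER users_ai AFTER INSERT ON users BEGIN
    INSERT INTO users_history (op, id, name) VALUES ('INSERT', NEW.id, NEW.name);
END;
CREATE TRIGGER users_au AFTER UPDATE ON users BEGIN
    INSERT INTO users_history (op, id, name) VALUES ('UPDATE', NEW.id, NEW.name);
END;
CREATE TRIGGER users_ad AFTER DELETE ON users BEGIN
    INSERT INTO users_history (op, id, name) VALUES ('DELETE', OLD.id, NULL);
END;
""")


def poll_by_timestamp(conn, since):
    """Timestamp approach: rows whose last_update is newer than the checkpoint."""
    return conn.execute(
        "SELECT id, name FROM users WHERE last_update > ? ORDER BY id", (since,)
    ).fetchall()


def read_history(conn, after_seq):
    """Trigger approach: every recorded change after the given sequence number."""
    return conn.execute(
        "SELECT op, id, name FROM users_history WHERE seq > ? ORDER BY seq",
        (after_seq,),
    ).fetchall()


# A small workload that happens after checkpoint 0.
conn.execute("INSERT INTO users VALUES (1, 'ann', 1)")
conn.execute("INSERT INTO users VALUES (2, 'bob', 2)")
conn.execute("UPDATE users SET name = 'anne', last_update = 3 WHERE id = 1")
conn.execute("UPDATE users SET name = 'anna', last_update = 4 WHERE id = 1")
conn.execute("DELETE FROM users WHERE id = 2")  # physical delete

polled = poll_by_timestamp(conn, 0)   # only the final state of surviving rows
history = read_history(conn, 0)       # all five changes, including the delete

print(polled)
for change in history:
    print(change)
```

In this sketch the polling query returns only the latest version of row 1 and never sees row 2, because the row was physically deleted and the intermediate update was overwritten. The history table keeps all five changes, at the cost of one extra write per change.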
Database‑specific log‑based CDC requirements
MySQL
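CDC relies on the binary log (binlog), which must be enabled on the server. Required MySQL version is 5.7 or higher. The binlog_format setting selects one of three formats:
Statement‑Based Replication (SBR, binlog_format=STATEMENT)
Row‑Based Replication (RBR, binlog_format=ROW) – recommended for reliable change capture, because each event carries the affected row values
Mixed‑Based Replication (MBR, binlog_format=MIXED)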
Note: Binary columns such as TINYBLOB, BLOB, MEDIUMBLOB, LONGBLOB are not captured by some CDC tools.
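A pre‑flight check for these settings might look like the sketch below. It assumes the server variables have already been fetched (for example with SHOW VARIABLES) into a plain dictionary; the function name binlog_problems is hypothetical, and the binlog_row_image=FULL expectation is an assumption that holds for some tools rather than a universal requirement.

```python
# Settings a log-based CDC tool typically expects on a MySQL source.
# binlog_row_image=FULL is an assumption: check your tool's documentation.
EXPECTED = {
    "log_bin": "ON",
    "binlog_format": "ROW",
    "binlog_row_image": "FULL",
}


def binlog_problems(variables):
    """Compare server variables with EXPECTED and describe each mismatch.

    `variables` maps variable names to values, e.g. the rows returned by
    SHOW VARIABLES collected into a dict. An empty result means no problems.
    """
    problems = []
    for name, expected in EXPECTED.items():
        actual = str(variables.get(name, "")).upper()
        if actual != expected:
            problems.append(f"{name} is {actual or 'unset'}, expected {expected}")
    return problems


if __name__ == "__main__":
    sample = {"log_bin": "ON", "binlog_format": "MIXED", "binlog_row_image": "FULL"}
    for problem in binlog_problems(sample):
        print(problem)
```

Running such a check before starting a connector turns a silent capture failure into an explicit configuration error.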
Oracle
Oracle CDC uses redo logs. The database must run in ARCHIVELOG mode, and supplemental logging must be enabled so that redo records carry enough column data to reconstruct each change. The LogMiner CONTINUOUS_MINE option was deprecated in Oracle 12.2 and is desupported in 19c; Oracle GoldenGate is the recommended alternative for continuous log mining.
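The two prerequisites can be verified from the LOG_MODE and SUPPLEMENTAL_LOG_DATA_MIN columns of the v$database view. The sketch below assumes those two values have already been queried and passed in as strings; the function name oracle_log_problems is hypothetical, and individual tools may require additional table‑level supplemental logging that this check does not cover.

```python
def oracle_log_problems(log_mode, supplemental_log_data_min):
    """Describe missing prerequisites for redo-log-based capture.

    Both arguments are assumed to come from:
        SELECT log_mode, supplemental_log_data_min FROM v$database
    An empty list means both prerequisites are met.
    """
    problems = []
    if log_mode.upper() != "ARCHIVELOG":
        problems.append("database is not in ARCHIVELOG mode")
    # IMPLICIT means minimal supplemental logging is on as a side effect
    # of another supplemental logging option.
    if supplemental_log_data_min.upper() not in ("YES", "IMPLICIT"):
        problems.append("minimal supplemental logging is not enabled")
    return problems


if __name__ == "__main__":
    print(oracle_log_problems("NOARCHIVELOG", "NO"))
```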
PostgreSQL
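Logical decoding (available since PostgreSQL 9.4; built‑in logical replication and the pgoutput plugin arrived in PostgreSQL 10) decodes WAL entries and streams changes through logical replication slots. It requires wal_level = logical on the source server. Consumers can read a slot with pg_recvlogical.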
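As a rough illustration of the pg_recvlogical workflow, the sketch below builds the two command lines involved: one to create a slot and one to stream its changes to standard output. The helper name pg_recvlogical_commands, the database name, and the slot name are assumptions for the example; the commands are only printed, not executed, and test_decoding is the sample output plugin that ships with PostgreSQL.

```python
import shlex


def pg_recvlogical_commands(dbname, slot, plugin="test_decoding"):
    """Build argument lists for creating a logical slot and streaming from it.

    Nothing is executed here. Pass a list to subprocess.run() on a host
    where pg_recvlogical is installed and wal_level is set to logical.
    """
    base = ["pg_recvlogical", "-d", dbname, f"--slot={slot}"]
    create = base + ["--create-slot", "-P", plugin]
    # "-f -" writes the decoded changes to standard output.
    start = base + ["--start", "-f", "-"]
    return create, start


if __name__ == "__main__":
    create_cmd, start_cmd = pg_recvlogical_commands("postgres", "cdc_demo")
    print(shlex.join(create_cmd))
    print(shlex.join(start_cmd))
```

A slot retains WAL until its consumer confirms it, so a slot that is created and then abandoned can fill the disk on the source server; dropping unused slots is part of operating this approach.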
MongoDB
Change Streams (available since MongoDB 3.6 for collections, and since 4.0 for whole databases and deployments) are built on the oplog and emit real‑time change events. They require a replica set or a sharded cluster; a standalone server does not support them.
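To show what a consumer does with these events, the sketch below applies a few hand‑written change events to an in‑memory copy of a collection. The event dictionaries follow the documented change event shape (operationType, documentKey, fullDocument, updateDescription), but the sample data and the apply_change helper are assumptions for illustration. With a driver such as PyMongo, real events would come from a call like collection.watch(); this example needs no server. It handles only top‑level fields, not dotted paths into nested documents.

```python
def apply_change(replica, event):
    """Apply one change event to `replica`, a dict keyed by document _id."""
    key = event["documentKey"]["_id"]
    op = event["operationType"]
    if op in ("insert", "replace"):
        replica[key] = dict(event["fullDocument"])
    elif op == "update":
        doc = replica.setdefault(key, {"_id": key})
        description = event["updateDescription"]
        doc.update(description.get("updatedFields", {}))
        for field in description.get("removedFields", []):
            doc.pop(field, None)
    elif op == "delete":
        replica.pop(key, None)
    # Other event types (drop, invalidate, ...) are ignored in this sketch.
    return replica


# Hand-written sample events; real ones also carry a resume token in "_id".
events = [
    {"operationType": "insert", "documentKey": {"_id": 1},
     "fullDocument": {"_id": 1, "name": "ann", "tmp": True}},
    {"operationType": "update", "documentKey": {"_id": 1},
     "updateDescription": {"updatedFields": {"name": "anna"},
                           "removedFields": ["tmp"]}},
    {"operationType": "insert", "documentKey": {"_id": 2},
     "fullDocument": {"_id": 2, "name": "bob"}},
    {"operationType": "delete", "documentKey": {"_id": 2}},
]

replica = {}
for event in events:
    apply_change(replica, event)

print(replica)
```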
Popular open‑source CDC tools
Debezium
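Debezium runs as a set of Kafka Connect source connectors and, by default, writes each table’s changes to a dedicated Kafka topic. Delivery is at‑least‑once by default, so consumers should tolerate duplicate events; exactly‑once delivery depends on Kafka Connect’s exactly‑once source support and on the connector version. Project URL: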
https://github.com/debezium/debezium
Canal
Canal, an Alibaba project, poses as a MySQL replica (slave) and pulls binlog events over the replication protocol. It supports MySQL 5.1 through 8.0 and MariaDB. Project URL:
https://github.com/alibaba/canal
Maxwell
Maxwell reads MySQL binlog and emits JSON messages to Kafka, Kinesis, RabbitMQ, Redis, Google Cloud Pub/Sub, files, etc. It supports full‑table initialization and automatic GTID recovery after failover. Project URL:
https://github.com/zendesk/maxwell
Flink CDC
Flink CDC provides source connectors for MySQL, PostgreSQL, Oracle, MongoDB and others, enabling both snapshot and incremental change ingestion within Flink jobs. It leverages Flink’s parallelism, state backends, and ecosystem. Project URL:
https://github.com/ververica/flink-cdc-connectors
TapData
TapData is an open‑source real‑time data service platform that offers non‑intrusive CDC‑based data collection, automatic schema inference, unified streaming‑batch processing, and model publishing. Documentation:
https://tapdata.github.io/
Conclusion
Log‑based CDC has become the de‑facto method for real‑time data pipelines because it imposes minimal load on source systems and works across diverse databases. Selecting a tool depends on the target ecosystem (Kafka, Flink, etc.), required delivery guarantees, and operational constraints such as supported database versions.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.