Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell

This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.

Big Data Technology & Architecture

Feb 10, 2020

Companies that need real‑time data synchronization often face the problem of incremental MySQL binlog capture; Canal is a popular open‑source solution for this scenario.

Canal Overview

Principle

Canal emulates the MySQL replication protocol. It presents itself as a MySQL slave, sends a dump request to the master, receives the binary log stream, parses the raw bytes into log entries, and forwards them to downstream components.

Architecture

The system consists of a Server (one JVM instance) that can host multiple Instance objects, each representing a data queue. An instance is composed of EventParser, EventSink, EventStore and MetaManager.

Component Details

Server: a Canal runtime instance (one JVM).

Instance: a logical data queue; a server may host many instances.

EventParser: reads the last processed binlog position, issues a dump request, and parses the incoming binlog.

EventSink: acts as a channel, performing filtering, routing (1:n), merging (n:1) and transformation before passing data to the store.

EventStore: an in‑memory circular queue (ring buffer) whose state is tracked by three pointers: Put, Get and Ack.

MetaManager: manages incremental subscription and consumption metadata, providing getWithoutAck, ack and rollback APIs.
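
The interaction between the Put, Get and Ack pointers and the getWithoutAck, ack and rollback operations can be sketched with a toy ring buffer. This is a minimal illustration, not Canal's actual implementation: the class name RingEventStore, its method names and the fixed capacity are all assumptions made for the example.

```python
class RingEventStore:
    """Toy fixed-size ring buffer with put/get/ack cursors (illustrative only)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.put_seq = 0  # next sequence number to be written
        self.get_seq = 0  # next sequence number to hand to a consumer
        self.ack_seq = 0  # everything below this is acknowledged and reusable

    def put(self, event):
        # A slot can be reused only after its previous occupant has been acked.
        if self.put_seq - self.ack_seq >= self.capacity:
            return False  # buffer full: the producer has to wait
        self.slots[self.put_seq % self.capacity] = event
        self.put_seq += 1
        return True

    def get(self, batch_size):
        # Hand out events without acknowledging them.
        end = min(self.get_seq + batch_size, self.put_seq)
        batch = [self.slots[s % self.capacity] for s in range(self.get_seq, end)]
        self.get_seq = end
        return batch

    def ack(self, count):
        # Confirm the oldest `count` delivered events, freeing their slots.
        self.ack_seq = min(self.ack_seq + count, self.get_seq)

    def rollback(self):
        # Re-deliver everything that was handed out but not acknowledged.
        self.get_seq = self.ack_seq


store = RingEventStore(capacity=4)
for i in range(5):
    accepted = store.put(f"event-{i}")  # the fifth put is rejected: buffer full

first = store.get(2)   # hands out event-0 and event-1
store.rollback()       # nothing was acked, so the get cursor moves back
again = store.get(2)   # the same two events are delivered again
store.ack(2)           # frees two slots
store.put("event-4")   # now there is room
```

The point of keeping three cursors is that delivery and confirmation are decoupled: a consumer that crashes after get but before ack loses nothing, because rollback rewinds the Get cursor to the Ack cursor.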

HA Mechanism

Canal achieves high availability via ZooKeeper. Server HA ensures that a given logical instance is active on only one server at a time; the copies of that instance on other servers stay in standby. Client HA guarantees that only one client at a time consumes a given instance, which preserves ordering.
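
The active/standby behaviour can be illustrated with the general ephemeral‑node pattern that ZooKeeper‑based coordination commonly uses. The sketch below is a toy, in‑memory simulation: FakeZooKeeper, the node path and the elect function are hypothetical names, no real ZooKeeper client is involved, and whether Canal's internals match this exactly is an assumption.

```python
class FakeZooKeeper:
    """In-memory stand-in for ZooKeeper ephemeral nodes (not a real client)."""

    def __init__(self):
        self.nodes = {}  # path -> owner

    def create_ephemeral(self, path, owner):
        # Succeeds only if nobody holds the node yet.
        if path in self.nodes:
            return False
        self.nodes[path] = owner
        return True

    def session_lost(self, owner):
        # When an owner's session ends, its ephemeral nodes disappear.
        self.nodes = {p: o for p, o in self.nodes.items() if o != owner}


def elect(zk, instance, servers):
    """Every server tries to claim the instance; return the one running it."""
    path = f"/toy/{instance}/running"  # hypothetical path, for illustration
    for server in servers:
        zk.create_ephemeral(path, server)
    return zk.nodes[path]


zk = FakeZooKeeper()
active = elect(zk, "example", ["server-a", "server-b"])  # server-a wins
zk.session_lost("server-a")                               # server-a crashes
takeover = elect(zk, "example", ["server-b"])             # standby takes over
```

Only the server that created the node runs the instance; the others keep retrying when the node disappears, which is what keeps a single active copy per instance.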

Canal Deployment and Usage

MySQL Configuration

Enable binlog and use row‑based replication:

[mysqld]
log-bin=mysql-bin # enable binlog
binlog-format=ROW # row mode
server_id=1 # unique server id; must not be the same as Canal's slaveId

Restart MySQL after editing the configuration.

Canal Configuration

Canal's configuration consists of canal.properties (server‑level) and instance.properties (instance‑level). The spring directory contains XML files that define the storage mode (memory, file, mixed, etc.).

Key instance types:

memory‑instance.xml – all components in memory, fastest but no persistence.

file‑instance.xml – file‑based persistence, no HA.

default‑instance.xml – ZooKeeper‑backed persistence, supports HA.

group‑instance.xml – logical grouping of multiple physical instances.

Example Deployment

Create a MySQL user for replication:

CREATE USER canal IDENTIFIED BY 'canal';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
FLUSH PRIVILEGES;

Use the default configuration files, then start Canal:

sh bin/startup.sh

HA Configuration Example

# canal.properties (server level)
# ZooKeeper cluster address
canal.zkServers=10.20.144.51:2181
# Choose the global instance configuration
canal.instance.global.spring.xml=classpath:spring/default-instance.xml

# instance.properties (instance level)
# slaveId must be unique per Canal server; another machine uses 1235
canal.instance.mysql.slaveId=1234
canal.instance.master.address=10.20.144.15:3306

Maxwell Overview

Maxwell also reads MySQL binlog but bundles a Kafka producer, so it can directly write JSON events to Kafka without a custom client.

Sample JSON format:

{"database":"test","table":"e","type":"update","ts":1488857869,"xid":8924,"commit":true,"data":{"id":1,"m":5.556666,"torvalds":null},"old":{"m":5.55}}

Advantages: bootstrap support, built‑in Kafka integration, schema‑aware JSON. Drawbacks: one Maxwell process per MySQL instance, bootstrap uses a full table scan.
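
To show how a consumer might work with this format, the sketch below parses the sample event from this section and rebuilds the row as it looked before the update. It relies on the "old" field holding the previous values of only the changed columns; the helper name before_and_after is an assumption made for the example.

```python
import json

raw = (
    '{"database":"test","table":"e","type":"update","ts":1488857869,'
    '"xid":8924,"commit":true,'
    '"data":{"id":1,"m":5.556666,"torvalds":null},"old":{"m":5.55}}'
)


def before_and_after(event):
    """Return (row_before, row_after) for an update event."""
    after = dict(event["data"])
    # "old" carries only the columns that changed, with their previous values,
    # so overlaying it on the new row reconstructs the old row.
    before = {**after, **event.get("old", {})}
    return before, after


event = json.loads(raw)
before, after = before_and_after(event)
changed_columns = sorted(event.get("old", {}))
```

Here changed_columns contains only "m", and before differs from after in that single column.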

Tool Comparison

Both Canal and Maxwell capture MySQL binlog changes, and in the designs discussed here those changes are delivered to Kafka; downstream processing (e.g., writing to HDFS, Hive, Elasticsearch, Redis) must be implemented separately.

Key differences:

Canal requires a custom client to write to Kafka; Maxwell ships with a built‑in Kafka producer.

Canal supports HA via ZooKeeper; Maxwell does not.

Canal does not handle historical data sync automatically; Maxwell offers a bootstrap mode.

Solution Design

Two possible solutions are discussed:

Solution 1: Use Canal for binlog extraction, develop a data‑conversion tool and a routing component to write to Kafka and then to target stores.

Solution 2: Use Maxwell for extraction and conversion, add a routing component, and optionally implement HA later.

Both solutions require a data‑routing tool that consumes Kafka topics and writes to downstream systems such as Hive/Parquet, Elasticsearch/HBase, Redis/Alluxio, or relational databases.
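
One possible shape for such a routing tool is sketched below. It is only an outline under stated assumptions: messages arrive as Maxwell‑style JSON strings, the Kafka consumer is left out and replaced by direct calls, and the Router class and the in‑memory sinks are hypothetical stand‑ins for real writers to a search index or cache.

```python
import json
from collections import defaultdict


class Router:
    """Dispatch change events to sinks registered per (database, table)."""

    def __init__(self):
        self.routes = defaultdict(list)

    def register(self, database, table, sink):
        # A sink is any callable that accepts one decoded event.
        self.routes[(database, table)].append(sink)

    def dispatch(self, message):
        event = json.loads(message)
        sinks = self.routes.get((event["database"], event["table"]), [])
        for sink in sinks:
            sink(event)
        return len(sinks)  # number of sinks that received the event


search_index = []  # stand-in for a search engine writer
cache = {}         # stand-in for a key-value cache


def write_cache(event):
    cache[event["data"]["id"]] = event["data"]


router = Router()
router.register("test", "e", search_index.append)
router.register("test", "e", write_cache)

delivered = router.dispatch(
    '{"database":"test","table":"e","type":"insert","data":{"id":1,"m":5.5}}'
)
ignored = router.dispatch(
    '{"database":"test","table":"other","type":"insert","data":{"id":2}}'
)
```

Because the router depends only on the message format and not on the tool that produced it, the same component could sit behind either extraction solution.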

Choosing Solution 2 can deliver faster proof‑of‑concept results, while keeping the routing component compatible with a future Canal‑based pipeline.

Overall, the incremental log serves as the backbone for downstream data platforms, enabling real‑time analytics, search, caching, and synchronization across heterogeneous systems.


Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
