Real‑time MySQL Binlog Capture with Canal: Principles, Architecture, Deployment and Comparison with Maxwell
This article explains how to use Alibaba's Canal to capture MySQL binlog changes in real time, covering its underlying protocol, component architecture, HA design with ZooKeeper, configuration steps, deployment examples, and a detailed comparison with alternative tools such as Maxwell and mysql_streamer.
Companies that need real‑time data synchronization often face the problem of incremental MySQL binlog capture; Canal is a popular open‑source solution for this scenario.
Canal Overview
Principle
Canal simulates a MySQL slave, sends a dump request to the master, receives binary log streams, parses the byte‑level log entries, and forwards them to downstream components.
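To make "parses the byte-level log entries" more concrete, here is a minimal sketch of decoding the fixed 19-byte header that starts every event in the MySQL binlog (format version 4). The names `EventHeader` and `parse_header` are illustrative and not taken from Canal's source, and the sample bytes are hand-built rather than captured from a server.

```python
import struct
from typing import NamedTuple

# Common event header in binlog format v4: 19 bytes, little-endian.
HEADER = struct.Struct("<IBIIIH")


class EventHeader(NamedTuple):
    timestamp: int      # seconds since the epoch
    type_code: int      # numeric event type
    server_id: int      # server_id of the server that wrote the event
    event_length: int   # total event size, header included
    next_position: int  # offset of the next event in the binlog file
    flags: int


def parse_header(buf: bytes) -> EventHeader:
    """Decode the fixed-size header at the start of one binlog event."""
    if len(buf) < HEADER.size:
        raise ValueError("need at least 19 bytes")
    return EventHeader(*HEADER.unpack_from(buf))


# Hand-built header purely for illustration.
raw = HEADER.pack(1488857869, 31, 1, 64, 4096, 0)
header = parse_header(raw)
print(header.type_code, header.next_position)
```

A real parser would go on to decode the event body according to `type_code`; that part is omitted here.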
Architecture
The system consists of a Server (one JVM instance) that can host multiple Instance objects, each representing a data queue. An instance is composed of EventParser, EventSink, EventStore and MetaManager.
Component Details
Server: a Canal runtime instance (one JVM).
Instance: a logical data queue; a server may host many instances.
EventParser: reads the last processed binlog position, issues a dump request, and parses the incoming binlog.
EventSink: acts as a channel, performing filtering, routing (1:n), merging (n:1) and transformation before passing data to the store.
EventStore: an in‑memory circular queue whose state is tracked by three pointers: Put, Get and Ack.
MetaManager: manages incremental subscription and consumption metadata, providing getWithoutAck, ack and rollback APIs.
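The interplay of the Put, Get and Ack pointers with getWithoutAck, ack and rollback can be sketched with a toy ring buffer. This is a simplified model under assumed semantics, not Canal's implementation; `RingEventStore` and the use of the batch end position as a batch id are hypothetical.

```python
class RingEventStore:
    """Toy fixed-size ring buffer with put/get/ack cursors."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.put_seq = 0   # next sequence number to write
        self.get_seq = 0   # next sequence number to hand to a consumer
        self.ack_seq = 0   # everything below this is confirmed and reusable

    def put(self, event) -> bool:
        # Refuse to overwrite slots that have not been acked yet.
        if self.put_seq - self.ack_seq >= self.capacity:
            return False
        self.slots[self.put_seq % self.capacity] = event
        self.put_seq += 1
        return True

    def get_without_ack(self, batch_size: int):
        end = min(self.get_seq + batch_size, self.put_seq)
        batch = [self.slots[i % self.capacity] for i in range(self.get_seq, end)]
        self.get_seq = end
        return end, batch   # `end` doubles as a hypothetical batch id

    def ack(self, batch_end: int):
        if not self.ack_seq <= batch_end <= self.get_seq:
            raise ValueError("ack outside the fetched range")
        self.ack_seq = batch_end

    def rollback(self):
        # Forget un-acked deliveries so they are handed out again.
        self.get_seq = self.ack_seq


store = RingEventStore(capacity=4)
for n in range(4):
    store.put(f"event-{n}")
full = store.put("event-4")            # rejected: no acked slot to reuse
batch_id, first = store.get_without_ack(2)
store.rollback()                        # consumer failed before acking
_, again = store.get_without_ack(2)     # the same two events come back
store.ack(batch_id)
accepted = store.put("event-4")        # acked slots are free again
```

Separating Get from Ack is what lets a consumer fetch ahead without losing data: a slot is only reused once its event has been acknowledged.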
HA Mechanism
Canal achieves high availability via ZooKeeper. Server HA ensures that a given logical instance runs on only one Canal server at a time; the copies of that instance on other servers stay in standby. Client HA guarantees that only one client at a time consumes a given instance, which preserves ordering.
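One way to picture server HA is the common ZooKeeper pattern in which servers race to create an ephemeral node and only the winner runs the instance. The sketch below assumes that pattern and replaces ZooKeeper with an in-memory dictionary; `FakeCoordinator`, the node path and the server names are all hypothetical, and this is not Canal's actual election code.

```python
class FakeCoordinator:
    """In-memory stand-in for ZooKeeper ephemeral nodes (illustration only)."""

    def __init__(self):
        self.owners = {}   # path -> server currently holding the node

    def try_acquire(self, path: str, server: str) -> bool:
        # Mirrors "create ephemeral node": only the first creator succeeds.
        return self.owners.setdefault(path, server) == server

    def session_lost(self, server: str):
        # Ephemeral nodes disappear when their owner's session ends.
        self.owners = {p: s for p, s in self.owners.items() if s != server}


def elect(coordinator, path, servers):
    """Return (running, standby) after every server tries to take the node."""
    running = [s for s in servers if coordinator.try_acquire(path, s)]
    standby = [s for s in servers if s not in running]
    return running, standby


zk = FakeCoordinator()
path = "/example/instance/running"        # hypothetical node path
before = elect(zk, path, ["canal-a", "canal-b"])   # canal-a wins the race
zk.session_lost("canal-a")                 # the active server crashes
after = elect(zk, path, ["canal-b"])       # the standby takes over
```

In a real deployment the standby servers would watch the node and retry when it disappears, rather than being polled as in this sketch.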
Canal Deployment and Usage
MySQL Configuration
Enable binlog and use row‑based replication:
[mysqld]
log-bin=mysql-bin # enable binlog
binlog-format=ROW # row mode
server_id=1 # unique server id
Restart MySQL after editing the configuration.
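Before restarting, the three settings can be sanity-checked offline. The following is a rough sketch that parses option-file text with Python's `configparser`; the helper `check_mysqld_options` and the `REQUIRED` table are hypothetical, and a real my.cnf containing directives such as `!includedir` would need extra handling.

```python
import configparser

# Options the article asks for; None means "any value is acceptable".
REQUIRED = {"log_bin": None, "binlog_format": "ROW", "server_id": None}


def check_mysqld_options(text: str) -> list:
    """Return a list of problems found in the [mysqld] section of `text`."""
    parser = configparser.ConfigParser(
        inline_comment_prefixes=("#",),  # strip trailing "# ..." comments
        allow_no_value=True,             # a bare "log-bin" line is legal
        interpolation=None,              # treat "%" literally
    )
    parser.read_string(text)
    if not parser.has_section("mysqld"):
        return ["missing [mysqld] section"]
    # MySQL accepts "-" and "_" interchangeably in option names.
    options = {k.replace("-", "_"): v for k, v in parser.items("mysqld")}
    problems = []
    for name, expected in REQUIRED.items():
        if name not in options:
            problems.append(f"{name} is not set")
        elif expected and (options[name] or "").upper() != expected:
            problems.append(f"{name} should be {expected}")
    return problems


GOOD = """[mysqld]
log-bin=mysql-bin # enable binlog
binlog-format=ROW # row mode
server_id=1 # unique server id
"""
BAD = "[mysqld]\nlog-bin=mysql-bin\nbinlog-format=STATEMENT\n"

print(check_mysqld_options(GOOD))
print(check_mysqld_options(BAD))
```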
Canal Configuration
Canal's configuration consists of canal.properties (server‑level) and instance.properties (instance‑level). The spring directory contains XML files that define the storage mode (memory, file, mixed, etc.).
Key instance types:
memory-instance.xml – all components in memory, fastest but no persistence.
file-instance.xml – file‑based persistence, no HA.
default-instance.xml – ZooKeeper‑backed persistence, supports HA.
group-instance.xml – logical grouping of multiple physical instances.
Example Deployment
Create a MySQL user for replication:
CREATE USER canal IDENTIFIED BY 'canal';
GRANT SELECT, REPLICATION SLAVE, REPLICATION CLIENT ON *.* TO 'canal'@'%';
FLUSH PRIVILEGES;
Use the default configuration files, then start Canal:
sh bin/startup.sh
HA Configuration Example
# ZooKeeper cluster address
canal.zkServers=10.20.144.51:2181
# Choose the global instance configuration
canal.instance.global.spring.xml=classpath:spring/default-instance.xml
# Must differ between Canal servers; another machine uses 1235
canal.instance.mysql.slaveId=1234
canal.instance.master.address=10.20.144.15:3306
Maxwell Overview
Maxwell also reads MySQL binlog but bundles a Kafka producer, so it can directly write JSON events to Kafka without a custom client.
Sample JSON format:
{"database":"test","table":"e","type":"update","ts":1488857869,"xid":8924,"commit":true,"data":{"id":1,"m":5.556666,"torvalds":null},"old":{"m":5.55}}
Advantages: bootstrap support, built‑in Kafka integration, schema‑aware JSON.
Drawbacks: one Maxwell process per MySQL instance, bootstrap uses a full table scan.
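A consumer of the sample event typically needs the row as it looked before and after the change. The sketch below rebuilds both images from the JSON shown in this section; `row_images` is a hypothetical helper, and the handling of insert and delete events assumes that `data` carries the inserted or deleted row, which the sample itself does not demonstrate.

```python
import json

SAMPLE = (
    '{"database":"test","table":"e","type":"update","ts":1488857869,'
    '"xid":8924,"commit":true,'
    '"data":{"id":1,"m":5.556666,"torvalds":null},"old":{"m":5.55}}'
)


def row_images(event: dict):
    """Return (before, after) row images for one Maxwell-style event."""
    after = dict(event.get("data", {}))
    kind = event["type"]
    if kind == "insert":
        return None, after
    if kind == "delete":          # assumption: "data" holds the deleted row
        return after, None
    # update: "old" lists only the columns whose values changed
    before = {**after, **event.get("old", {})}
    return before, after


event = json.loads(SAMPLE)
before, after = row_images(event)
changed = sorted(k for k in after if before[k] != after[k])
print(changed)
```

Because `old` contains only the changed columns, overlaying it on `data` is enough to recover the full previous row.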
Tool Comparison
Both Canal and Maxwell capture MySQL binlog changes that can then be delivered to Kafka; downstream processing (e.g., writing to HDFS, Hive, Elasticsearch, Redis) must be implemented separately.
Key differences:
Canal requires a custom client to write to Kafka; Maxwell provides it out‑of‑the‑box.
Canal supports HA via ZooKeeper; Maxwell does not.
Canal does not handle historical data sync automatically; Maxwell offers a bootstrap mode.
Solution Design
Two possible solutions are discussed:
Solution 1: Use Canal for binlog extraction, develop a data‑conversion tool and a routing component to write to Kafka and then to target stores.
Solution 2: Use Maxwell for extraction and conversion, add a routing component, and optionally implement HA later.
Both solutions require a data‑routing tool that consumes Kafka topics and writes to downstream systems such as Hive/Parquet, Elasticsearch/HBase, Redis/Alluxio, or relational databases.
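The core of such a routing tool is a lookup from (database, table) to one or more sinks. Here is a minimal in-process sketch; the `Router` class is hypothetical, the Kafka consumer is left out, the sinks are plain Python callables standing in for real writers, and the event shape follows the Maxwell sample.

```python
from collections import defaultdict


class Router:
    """Dispatch change events to the sinks registered for their table."""

    def __init__(self):
        self.routes = defaultdict(list)   # (database, table) -> [sink, ...]

    def register(self, database: str, table: str, sink):
        self.routes[(database, table)].append(sink)

    def dispatch(self, event: dict) -> int:
        # Look up without creating an entry for unknown tables.
        sinks = self.routes.get((event["database"], event["table"]), [])
        for sink in sinks:
            sink(event)
        return len(sinks)


search_index = []   # stand-in for a search or analytics writer
cache = {}          # stand-in for a key-value cache


def write_cache(event: dict):
    cache[event["data"]["id"]] = event["data"]


router = Router()
router.register("test", "e", search_index.append)
router.register("test", "e", write_cache)

delivered = router.dispatch(
    {"database": "test", "table": "e", "type": "update", "data": {"id": 1, "m": 5.556666}}
)
ignored = router.dispatch(
    {"database": "test", "table": "other", "type": "insert", "data": {"id": 7}}
)
```

Keeping the router keyed only on fields present in the event is one way to keep it independent of whether Canal or Maxwell produced the message.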
Choosing Solution 2 can deliver faster proof‑of‑concept results, while keeping the routing component compatible with a future Canal‑based pipeline.
Overall, the incremental log serves as the backbone for downstream data platforms, enabling real‑time analytics, search, caching, and synchronization across heterogeneous systems.
---
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
