Big Data 10 min read

Implementing MySQL Binlog Synchronization to HDFS Using Canal

This article details a step‑by‑step guide for deploying Canal to capture MySQL binlog events, configure HA with ZooKeeper, design a client that parses binlog into JSON, asynchronously acknowledges messages, archive data to local files for batch upload to HDFS, and monitor latency for alerts.

Big Data Technology & Architecture
Big Data Technology & Architecture
Big Data Technology & Architecture
Implementing MySQL Binlog Synchronization to HDFS Using Canal

The article explains how to use Canal to synchronize MySQL binlog data to HDFS, providing a practical solution for real‑time data ingestion in big‑data pipelines.

Canal server deployment – each MySQL instance corresponds to a configuration folder under conf. The folder name represents the instance. Example file listing:

-rwxr-xr-x 1 dc user 2645 Jul 18 14:25 canal.properties
-rwxr-xr-x 1 dc user 2521 Jul 17 18:31 canal.properties.bak
-rwxr-xr-x 1 dc user 3045 Jul 17 18:31 logback.xml
drwxr-xr-x 2 dc user 4096 Jul 17 18:38 spring
drwxr-xr-x 2 dc user 4096 Jul 19 11:55 trans1

The trans1 folder contains instance.properties where MySQL connection details are set, e.g.:

## mysql serverId deployment, slaveId must be unique
canal.instance.mysql.slaveId = 1235
canal.instance.master.address = 10.172.152.66:3306
# username/password
canal.instance.dbUsername = root
canal.instance.dbPassword = root
# table filter regex
canal.instance.filter.regex = .*\..*

HA deployment – Canal’s high‑availability relies on ZooKeeper. The main configuration ( canal.properties) includes:

# common arguments
canal.id= 1            # another node canal.id= 2
canal.ip= 10.172.152.66 # another node canal.ip= 10.172.152.124
canal.port= 11111
canal.zkServers= 10.172.152.66:2181,10.172.152.124:2181,10.172.152.125:2181
# destination name (MySQL instance folder)
canal.destinations= trans1
# HA requires default‑instance.xml
canal.instance.global.spring.xml = classpath:spring/default-instance.xml

Client design – The client connects to a Canal destination, consumes binlog messages, parses them into JSON, and writes them to local files for later HDFS upload. It uses an asynchronous ack/rollback mechanism to improve throughput.

Data format: DML events are converted to JSON with fields such as database, table, type, executeTime, consumerTime, and a data array. DDL events include an additional isDDL flag and the original SQL statement.

{
  "database": "test",
  "table": "e",
  "type": "insert",
  "executeTime": "1501645930000",
  "consumerTime": "1501645930000",
  "data": [{"id":"1","num":"1"},{"id":"2","num":"2"}]
}

{
  "database": "test",
  "table": "e",
  "type": "create",
  "executeTime": "1501645930000",
  "consumerTime": "1501645930000",
  "isDDL": "true",
  "sql": "create table e(id int)"
}

Configuration parameters are externalized for flexibility, e.g. ZooKeeper hosts, destination name, table filter regex, batch size, output paths, and delay thresholds for alerting.

Asynchronous processing – A pseudocode loop fetches messages without ack, puts the batch ID into an ackQueue, and processes each message in a separate thread. After processing, the thread acknowledges the batch in order; on failure, a rollback is performed.

try{
    while(true){
        Message message = connector.getWithoutAck(BATCH_SIZE);
        batchId = message.getId();
        ackQueue.put(batchId);
        executorService.submit(new Runnable(){
            public void run(){
                parseMessage(message);
                ackMessage(batchId);
            }
        });
    }
}catch(Exception e){
    connector.rollback(batchId);
}finally{ ... }

For data archiving, the pipeline writes to local files (splitting by execution time) and later batches uploads to HDFS, reducing small‑file problems. Cross‑region transfer can use either rsync or Avro RPC to a Flume agent; the article recommends rsync for simplicity.

Monitoring and alerting – The system monitors consumption latency by comparing the current processing time with the binlog executeTime. If the delay exceeds a configured threshold, an alert is triggered.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Big Datadata pipelinemysqlBinlogCanalHDFS
Big Data Technology & Architecture
Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.