
Using DataX for Efficient Data Synchronization Between MySQL Databases

This article explains how to employ Alibaba's open‑source DataX tool to perform fast, reliable full‑ and incremental data synchronization between MySQL instances, covering installation, framework design, job execution, and practical shell commands for Linux environments.

Architecture Digest

Introduction

Our company ran a project whose tables held roughly 50 million records. The reporting figures had become inaccurate, and because the business and reporting databases lived on different instances, cross-database SQL synchronization was not feasible. Attempts with mysqldump and file-based export proved too slow, which prompted us to evaluate DataX.

1. DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, primarily used for offline data synchronization. It supports heterogeneous data sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more, providing stable and high‑performance data transfer.

DataX transforms complex mesh-style synchronization links into a star topology, with DataX itself acting as the central transport carrier.

Adding a new data source only requires plugging it into DataX, enabling seamless synchronization with existing sources.

1.1 DataX 3.0 Framework Design

DataX adopts a Framework + Plugin architecture. Data source reading and writing are abstracted as Reader and Writer plugins, which are integrated into the overall synchronization framework.

Reader (collection module): collects data from the source and forwards it to the Framework.

Writer (write module): pulls data from the Framework and writes it to the destination.

Framework (intermediary): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.

1.2 DataX 3.0 Core Architecture

When a DataX Job starts, it splits the job into multiple Tasks based on the source's partitioning strategy, enabling concurrent execution. The Scheduler assembles Tasks into TaskGroups according to the configured concurrency. Each Task runs a fixed pipeline: Reader → Channel → Writer. The Job monitors all TaskGroups and exits successfully once every group finishes.

DataX Job splits into Tasks according to source split keys.

Scheduler groups Tasks into TaskGroups based on the user‑defined channel count.

Each Task executes the Reader → Channel → Writer thread chain.

The Job monitors TaskGroups and exits when all complete (non‑zero exit code on failure).
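The scheduling arithmetic behind these steps can be sketched as follows. This assumes DataX's default of 5 channels per TaskGroup (the `core.container.taskGroup.channel` setting); both numbers are configurable, so treat the values here as an illustration:

```shell
# Sketch of the TaskGroup arithmetic: how many TaskGroups a job spawns
# for a given channel count, assuming 5 channels per TaskGroup.
CHANNELS=20        # the job-level "speed.channel" setting
PER_GROUP=5        # assumed default channels per TaskGroup
TASK_GROUPS=$(( (CHANNELS + PER_GROUP - 1) / PER_GROUP ))
echo "channel=$CHANNELS -> $TASK_GROUPS TaskGroups, $PER_GROUP Tasks concurrent in each"
```

With channel=20, the job runs 4 TaskGroups, each executing 5 Tasks concurrently.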

2. Using DataX for Data Synchronization

Preparation:

JDK 1.8+ (recommended 1.8)

Python 2 or 3 (CentOS 7 already includes Python 2.7)

Apache Maven 3.x (only needed if compiling DataX; the tar package can be used directly)
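A quick pre-flight check (a sketch; it only confirms the commands are on PATH, without parsing versions) saves debugging later. Maven is deliberately omitted because it is only needed when building DataX from source:

```shell
# Check that the prerequisites above are reachable from the shell.
for cmd in java python; do
    if command -v "$cmd" >/dev/null 2>&1; then
        echo "$cmd: found ($(command -v "$cmd"))"
    else
        echo "$cmd: MISSING"
    fi
done
```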

1. Install JDK on Linux

[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk1.8.0_181  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat >> /etc/profile << 'END'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version

Python 2.7 is pre‑installed on CentOS 7, so no additional installation is required.

2. Install DataX Software

[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files (important)

If hidden files are not removed, you may see errors such as:

[/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.
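A slightly more defensive variant of the cleanup (a sketch; `PLUGIN_DIR` is parameterized here so it can point at a non-standard install path) lists each macOS resource-fork entry as it removes it:

```shell
# Remove macOS "._*" resource-fork entries from the DataX plugin tree
# and print what was deleted; prints nothing extra if already clean.
PLUGIN_DIR="${PLUGIN_DIR:-/usr/local/datax/plugin}"
if [ -d "$PLUGIN_DIR" ]; then
    find "$PLUGIN_DIR" -name '._*' -prune -print -exec rm -rf {} +
    echo "plugin cleanup finished"
else
    echo "plugin dir not found: $PLUGIN_DIR"
fi
```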

3. Verify Installation

[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 bin]# python datax.py ../job/job.json   # test run

Typical output shows job statistics and confirms successful execution.

4. Install MySQL (MariaDB) on Both Hosts

[root@MySQL-1 ~]# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
[root@MySQL-1 ~]# systemctl start mariadb
[root@MySQL-1 ~]# mysql_secure_installation
# (follow prompts to set root password, remove anonymous users, keep remote root login, remove test database, reload privileges)

5. Prepare Data for Synchronization

Create the database and table on both the source and target instances (the writer's preSql assumes the target table already exists):

MariaDB [(none)]> create database `course-study`;
MariaDB [(none)]> create table `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));

Grant full privileges to allow DataX to access both databases:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

6. Create a Stored Procedure to Generate Test Data

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A int default 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, concat('LiSa',A), concat('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;

CALL test();

7. Build MySQL‑to‑MySQL Sync Job

Generate a template:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

Save the following JSON as install.json (replace placeholders with actual IPs, usernames, and passwords):

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json

The output confirms that ~3 million records were transferred at ~2.57 MB/s with zero errors.
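After the job completes, a consistency check is worth scripting. This is a sketch: the hosts and credentials mirror the job JSON above, and the `MYSQL_SRC`/`MYSQL_DST` wrapper variables are just illustrative shorthand; it skips gracefully when no mysql client is available:

```shell
# Compare source and target row counts after a full sync;
# exits non-zero on mismatch.
SQL='select count(*) from `course-study`.t_member'
MYSQL_SRC="${MYSQL_SRC:-mysql -h192.168.1.1 -uroot -p123123 -N -B -e}"
MYSQL_DST="${MYSQL_DST:-mysql -h192.168.1.2 -uroot -p123123 -N -B -e}"
if ! command -v mysql >/dev/null 2>&1; then
    echo "mysql client not available; skipping check"
    exit 0
fi
src_count=$($MYSQL_SRC "$SQL")
dst_count=$($MYSQL_DST "$SQL")
if [ "$src_count" = "$dst_count" ]; then
    echo "row counts match: $src_count"
else
    echo "MISMATCH: source=$src_count target=$dst_count" >&2
    exit 1
fi
```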

8. Incremental Synchronization

For incremental sync, add a where clause to the reader configuration. Example where.json:

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running this job syncs only the rows that satisfy the where condition (here, the 1,888 rows with ID <= 1888). Subsequent incremental runs adjust the where clause (e.g., ID > 1888 AND ID <= 2888) and omit the preSql step so previously synced rows are kept.
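Adjusting the where clause by hand gets tedious, so batches can be driven from a small wrapper. This is a sketch under two assumptions: that where.json's reader uses a parameterized clause such as "where": "ID > ${startId} AND ID <= ${endId}" (DataX substitutes ${...} placeholders from values passed with -p), and that the job file lives at /root/where.json:

```shell
# Drive one incremental batch by ID range; assumes the truncate preSql
# has been removed from the writer so earlier batches survive.
START_ID=1888
END_ID=2888
PARAMS="-DstartId=${START_ID} -DendId=${END_ID}"
DATAX=/usr/local/datax/bin/datax.py
if [ -f "$DATAX" ]; then
    python "$DATAX" -p "$PARAMS" /root/where.json
else
    echo "datax.py not found; would run with: $PARAMS"
fi
```

A cron job or loop can then advance START_ID/END_ID between runs.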

In summary, DataX provides a fast, extensible framework for both full‑ and incremental data synchronization across heterogeneous databases, making it suitable for large‑scale data migration and ongoing ETL pipelines.

Tags: Big Data, Linux, MySQL, Data Synchronization, DataX, ETL, Shell
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.