
How to Sync Massive MySQL Datasets Efficiently with DataX

This guide walks through the challenges of synchronizing tens of millions of records between heterogeneous MySQL databases, explains why traditional mysqldump or file‑based methods fail, and provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool for both full and incremental data synchronization.


Preface

Our project needed to synchronize roughly fifty million records, but the reporting database had fallen out of date, and because the source and target are separate databases the problem could not be fixed with plain SQL. Our first ideas, mysqldump and file-based export, proved impractical because backup and restore take too long while new data keeps arriving:

mysqldump: both the backup and the restore are time-consuming, and rows written during the backup are missed, so the two databases never fully converge.

File-based export: extremely slow; in one test only about two thousand rows were synchronized in three hours.

After researching, we discovered DataX, a tool that offers fast and reliable data synchronization.

1. DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline data sync between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc.

It transforms complex mesh sync topologies into a star‑shaped data flow, acting as a middle‑layer that connects various data sources.

1. DataX 3.0 Framework Design

DataX follows a Framework + Plugin architecture. Data source reading and writing are abstracted as Reader/Writer plugins that plug into the overall sync framework.

2. DataX 3.0 Core Architecture

A DataX job (Job) is the unit of work. When a Job starts, it spawns a process that handles data cleaning, task splitting, and TaskGroup management.

The Job splits into multiple small Tasks based on the source's split strategy, enabling concurrent execution.

Each Task is managed by a TaskGroup, which launches a pipeline: Reader → Channel → Writer.
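Conceptually, each Task's pipeline is a bounded buffer sitting between the Reader and Writer plugins. A simplified Python sketch of that idea (an illustration only, not DataX's actual implementation):

```python
import threading
from queue import Queue

SENTINEL = object()  # end-of-stream marker pushed by the reader

def run_task(read_records, write_record, channel_size=32):
    """Simplified Reader -> Channel -> Writer pipeline for one Task.

    read_records: iterable producing source records
    write_record: callable that persists one record at the target
    """
    channel = Queue(maxsize=channel_size)  # the bounded Channel

    def reader():
        for record in read_records:
            channel.put(record)   # blocks when the channel is full
        channel.put(SENTINEL)

    t = threading.Thread(target=reader)
    t.start()

    # Writer side: drain the channel until the sentinel arrives.
    written = 0
    while True:
        record = channel.get()
        if record is SENTINEL:
            break
        write_record(record)
        written += 1
    t.join()
    return written
```

The bounded channel is what gives DataX back-pressure: a fast Reader cannot run arbitrarily far ahead of a slow Writer.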

The Job monitors all TaskGroups; when they all finish, the Job exits successfully (non‑zero exit code indicates failure).
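To make the scheduling concrete: with DataX's commonly cited default of 5 channels per TaskGroup, a job requesting 20 channels is packed into ceil(20 / 5) = 4 TaskGroups. A sketch of that arithmetic (the per-group count of 5 mirrors the core.container.taskGroup.channel default and is treated here as an assumption):

```python
import math

def plan_task_groups(channel_count, channels_per_group=5):
    """Estimate how many TaskGroups a Job needs for a given channel count."""
    return math.ceil(channel_count / channels_per_group)
```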

2. Using DataX for Data Synchronization

Preparation

JDK 1.8+

Python 2 or 3

Apache Maven 3.x is required only when compiling DataX from source.

[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk1.8.0_181  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat <<'END' >> /etc/profile   # quote END so $PATH is written literally, not expanded now
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version

CentOS 7 already includes Python 2.7, so no additional installation is needed.

1. Install DataX on Linux

[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files (important)

If hidden files are not removed, you may see errors like

[/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.

Verification:

[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 bin]# python datax.py ../job/job.json   # verify installation

Typical output shows records processed and speed metrics.

2. Basic DataX Usage

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

Running this prints a sample job configuration template (plus documentation links) for the specified reader and writer plugins.

3. Install MySQL Databases

[root@MySQL-1 ~]# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
[root@MySQL-1 ~]# systemctl start mariadb
[root@MySQL-1 ~]# mysql_secure_installation

Set root password, remove anonymous users, allow remote root login, and delete test databases as needed.

1) Prepare Data for Sync (create identical tables on both hosts; note that the hyphenated database name course-study must be wrapped in backticks in SQL)

MariaDB [(none)]> create database `course-study`;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> create table `course-study`.t_member(ID int,Name varchar(20),Email varchar(30));
Query OK, 0 rows affected (0.00 sec)

Grant privileges for DataX to access both databases:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

2) Create a stored procedure to generate test data

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A int default 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES (A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();
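The same test data can also be generated from a client script in batches, which is usually much faster than a row-by-row stored procedure. A sketch of the row and batch generators (the actual insert would use a MySQL driver of your choice, e.g. PyMySQL):

```python
def make_rows(start, end):
    """Yield (ID, Name, Email) tuples matching the t_member schema."""
    for i in range(start, end + 1):
        yield (i, f"LiSa{i}", f"LiSa{i}@163.com")

def batches(rows, size=5000):
    """Group rows into executemany-sized batches."""
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

With a driver such as PyMySQL, each batch can then be passed to cursor.executemany("INSERT INTO `course-study`.t_member VALUES (%s, %s, %s)", batch).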

3) Sync MySQL to MySQL with DataX

1) Generate a template JSON for MySQL→MySQL

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "column": [],
            "connection": [{"jdbcUrl": [], "table": []}],
            "username": "",
            "password": "",
            "where": ""
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": [],
            "connection": [{"jdbcUrl": "", "table": []}],
            "username": "",
            "password": "",
            "preSql": [],
            "session": [],
            "writeMode": ""
          }
        }
      }
    ],
    "setting": {"speed": {"channel": ""}}
  }
}

2) Write the actual JSON configuration

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "connection": [{
              "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
              "table": ["t_member"]
            }]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["*"],
            "connection": [{
              "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
              "table": ["t_member"]
            }],
            "username": "root",
            "password": "123123",
            "preSql": ["truncate t_member"],
            "session": ["set session sql_mode='ANSI'"],
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}

Save the configuration as install.json and run the job:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json

The output shows total records processed, speed, and confirms successful synchronization.
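If you sync many tables, hand-editing such JSON files gets error-prone; a job file like the one above can be generated from a small template function. A sketch (the hosts, credentials, and defaults mirror the example above and are assumptions, not requirements):

```python
import json

def build_job(src_url, dst_url, table, user, password,
              channels=5, where=None):
    """Build a minimal mysqlreader -> mysqlwriter DataX job dict."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": user, "password": password,
            "column": ["*"], "splitPk": "ID",
            "connection": [{"jdbcUrl": [src_url], "table": [table]}],
        },
    }
    if where:
        reader["parameter"]["where"] = where  # incremental filter
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": user, "password": password,
            "column": ["*"],
            "connection": [{"jdbcUrl": dst_url, "table": [table]}],
            "preSql": [f"truncate {table}"],
            "session": ["set session sql_mode='ANSI'"],
            "writeMode": "insert",
        },
    }
    return {"job": {"content": [{"reader": reader, "writer": writer}],
                    "setting": {"speed": {"channel": str(channels)}}}}
```

Writing json.dumps(build_job(...), indent=2) to a file yields a job you can pass to datax.py; supplying where produces the incremental variant covered in the next section.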

4. Incremental Synchronization with DataX

The only difference between the full-sync and incremental job files is the reader's where parameter, which restricts which rows are read from the source.

1) Create a JSON for incremental sync

{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "where": "ID <= 1888",
            "connection": [{
              "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
              "table": ["t_member"]
            }]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["*"],
            "connection": [{
              "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
              "table": ["t_member"]
            }],
            "username": "root",
            "password": "123123",
            "preSql": ["truncate t_member"],
            "session": ["set session sql_mode='ANSI'"],
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the incremental job and verify that the target table contains only the rows matching the where condition. Note that this example still truncates the target via preSql, so it is really a filtered full load; for a true incremental append into an already-populated target, remove the preSql step and use a lower-bound filter instead.
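In ongoing use, the where value usually comes from a stored checkpoint, such as the largest ID already synchronized, rather than a hard-coded bound like the one above. A minimal sketch (checkpoint storage is left to you):

```python
def incremental_where(last_id, pk="ID"):
    """Where clause selecting only rows newer than the last checkpoint."""
    return f"{pk} > {int(last_id)}"
```

After each successful run, persist the source's MAX(ID) and feed it into the next job file.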

Big Data · MySQL · Data Synchronization · DataX · ETL
Written by Selected Java Interview Questions, a professional Java tech channel sharing common knowledge to help developers fill gaps.
