
Master Data Synchronization with Alibaba DataX: From Installation to Incremental Sync

This guide explains how to use Alibaba's open‑source DataX tool to synchronize large MySQL datasets, covering the tool’s architecture, installation on Linux, job configuration with JSON, full‑load and incremental sync examples, and performance results, all without relying on mysqldump or manual storage methods.


Background

The company needed to synchronize 50 million rows between a business database and a reporting database; traditional approaches such as mysqldump or file-based exports were too slow and risked producing inconsistent data.

What is DataX?

DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It provides high‑performance offline synchronization for heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP and others.

Architecture

DataX follows a Framework + Plugin model. Reader plugins collect data from a source, Writer plugins write data to a target, and the central Framework coordinates the flow, buffering, flow‑control, concurrency and data conversion.

Reader – collects data from the source and pushes it to the Framework.

Writer – pulls data from the Framework and writes it to the destination.

Framework – connects Reader and Writer, handling buffering, flow control, concurrency and data conversion.

Job Scheduling

When a job starts, DataX splits it into multiple Task units according to the source’s split strategy.

Tasks are regrouped into TaskGroup objects based on the user-defined concurrency level; by default a single TaskGroup runs five concurrent channels. For example, a job split into 100 Tasks with a requested concurrency of 20 is allocated 4 TaskGroups, each running its 25 Tasks five at a time.

Each Task launches a Reader thread that streams data through a Channel to a Writer thread.

The Job monitors all TaskGroups; it exits with code 0 on success, non‑zero on failure.

Installation on Linux

Prerequisites:

JDK 1.8 or higher (1.8 recommended)

Python 2 or 3 (CentOS 7 includes Python 2.7)

Apache Maven 3.x (only needed to compile DataX; the binary tarball does not require Maven)
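
A quick way to verify the prerequisites before proceeding (assuming a CentOS 7 host, as used below):

# java -version   # should report 1.8 or higher
# python -V       # CentOS 7 ships Python 2.7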

Download and extract DataX:

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files to avoid errors

Verify the installation:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # runs the bundled self-test job and prints transfer statistics

Preparing MySQL Databases

Install MariaDB (a MySQL-compatible drop-in replacement) on both hosts, start the service, and secure the installation.

# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation   # set the root password; answer 'n' to "Disallow root login remotely?" so DataX can connect
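
If the hosts may be rebooted during testing, also enable the service at boot (assuming a systemd-based distribution such as CentOS 7):

# systemctl enable mariadb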

Create the test database and table on each host:

CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member (ID INT, Name VARCHAR(20), Email VARCHAR(30));
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;
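
Because DataX connects to both instances over the network, confirm remote access before writing any job. A quick check from the source host (the IP matches the job examples below; substitute your own):

# mysql -h 192.168.1.2 -uroot -p123123 -e "SHOW DATABASES;"   # course-study should appear in the list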

Optional stored procedure to generate millions of rows for testing:

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A INT DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES (A, CONCAT('LiSa', A), CONCAT('LiSa', A, '@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();
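
Be aware that three million single-row inserts with autocommit enabled can take a very long time. A common workaround (a sketch, not part of the original procedure) is to run the call inside one transaction:

SET autocommit = 0;
CALL test();
COMMIT;
SET autocommit = 1;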

DataX Job Configuration

Jobs are defined in JSON files. Below are minimal templates.

Stream Reader / Writer Template

{
  "job": {
    "content": [{
      "reader": {"name": "streamreader", "parameter": {}},
      "writer": {"name": "streamwriter", "parameter": {"encoding": "", "print": true}}
    }],
    "setting": {"speed": {"channel": ""}}
  }
}
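
A template like the one above does not have to be written by hand: the -r and -w flags print a skeleton for any installed reader/writer pair. Copy the printed JSON into a file (the stream2stream.json name below is our own) and run it:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter   # prints the skeleton
# python /usr/local/datax/bin/datax.py /usr/local/datax/job/stream2stream.json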

MySQL Full‑Load Template

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "preSql": ["TRUNCATE t_member"],
          "session": ["SET SESSION sql_mode='ANSI'"],
          "writeMode": "insert",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job (note that the -r/-w flags only print configuration templates; to execute a job, pass the JSON file directly):

# python /usr/local/datax/bin/datax.py /path/to/install.json

Sample output shows processing of 2,999,999 records at ~2.57 MB/s with zero errors.
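
The splitPk setting is what lets the five channels work in parallel: DataX reads the column's minimum and maximum values and cuts the table into ranges, so each channel ends up issuing a range query roughly like the following (illustrative only; the actual boundaries are computed at runtime):

SELECT * FROM t_member WHERE (ID >= 1 AND ID < 600000);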

Incremental Synchronization

The only difference from the full-load job is the where clause in the Reader; everything else, including the Writer's preSql, is unchanged.

Incremental Job Example

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "preSql": ["TRUNCATE t_member"],
          "session": ["SET SESSION sql_mode='ANSI'"],
          "writeMode": "insert",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running this job synchronizes only the 1,888 rows matching the condition, completing in ~32 seconds at 62 records/s.
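
In practice the boundary value changes on every run, so hard-coding it in the JSON is awkward. datax.py supports runtime variables passed with -p and referenced as ${name} in the job file; a sketch (the variable name and the incr.json filename are our own). In the Reader, write the filter as:

"where": "ID > ${startId}",

then supply the value at launch:

# python /usr/local/datax/bin/datax.py -p "-DstartId=1888" /usr/local/datax/job/incr.json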

Key Takeaways

DataX provides fast, reliable offline synchronization for large data volumes across heterogeneous sources.

Full‑load jobs require only basic reader/writer configuration; incremental jobs add a where filter and optional pre‑SQL handling.

Correct installation, environment preparation and database permissions are essential for successful execution.

[Figure: DataX architecture diagram]
Tags: Big Data, JSON, Linux, MySQL, Data Synchronization, DataX, ETL, Incremental Sync
Written by Java Architect Handbook

Focused on Java interview questions and practical article sharing, covering algorithms, databases, Spring Boot, microservices, high concurrency, JVM, Docker containers, and ELK-related knowledge. Looking forward to progressing together with you.