
Alibaba’s Open‑Source DataX: Fast, Easy Offline Data Synchronization

This article introduces Alibaba’s open‑source DataX tool, explains its framework‑plugin architecture for heterogeneous database sync, walks through Linux installation, job configuration, full‑ and incremental MySQL synchronization, and shares performance results and practical tips.


Architecture Overview

DataX follows a Framework + Plugin design. Data source read and write operations are abstracted as Reader and Writer plugins. A central Framework connects the plugins, handling buffering, flow control, concurrency, and data conversion.

Roles:

Reader – collects data from the source and sends it to the Framework.

Writer – pulls data from the Framework and writes it to the destination.

Framework – acts as the transmission channel between Reader and Writer, managing core technical concerns such as buffering and flow control.

Job Execution Model

When a Job is submitted, DataX splits it into multiple Task instances according to the source’s split strategy. Tasks are grouped into TaskGroups based on the configured concurrency (the channel setting). Each TaskGroup launches its Tasks in parallel. The job finishes when all TaskGroups complete; a non‑zero exit code indicates failure.
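The grouping arithmetic can be illustrated with a short Python sketch. It assumes DataX's commonly cited default of 5 channels per TaskGroup (adjustable in the core configuration); the helper function is my own, not part of DataX:

```python
import math

def plan_task_groups(num_tasks: int, channels: int, channels_per_group: int = 5) -> dict:
    """Rough sketch of DataX's split arithmetic (assumes 5 channels per TaskGroup)."""
    # Effective concurrency cannot exceed the number of tasks produced by the split.
    effective_channels = min(channels, num_tasks)
    # Channels are distributed across ceil(channels / channels_per_group) TaskGroups.
    groups = math.ceil(effective_channels / channels_per_group)
    return {"effective_channels": effective_channels, "task_groups": groups}

print(plan_task_groups(num_tasks=100, channels=20))
# → {'effective_channels': 20, 'task_groups': 4}
```

With channel set to 5, as in the examples below, all tasks land in a single TaskGroup.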

Installation on Linux

JDK 1.8+ (recommended 1.8)

Python 2 or 3 (pre‑installed on CentOS 7)

Apache Maven 3.x (only required to compile DataX; the binary tar package can be used directly)

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # remove hidden files generated by the archive

Verify the installation:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # test run

Sample output shows processing of 100 000 records at ~254 KB/s with zero errors.

Basic Usage

Generate a default job skeleton using the built‑in streamreader/streamwriter template:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

The command prints a JSON job skeleton that can be saved and edited. Documentation links are printed in the output:

https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md

https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
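Based on those plugin docs, a minimal stream‑to‑stream job might look like the following sketch; the column values and record count are illustrative:

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "sliceRecordCount": 10,
          "column": [
            {"type": "long", "value": "10"},
            {"type": "string", "value": "hello, DataX"}
          ]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {"encoding": "UTF-8", "print": true}
      }
    }],
    "setting": {"speed": {"channel": 1}}
  }
}
```

Saved as, say, stream.json, it can be run with `python datax.py stream.json` to print the generated records to stdout.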

Full‑Copy MySQL Synchronization

Prepare two MySQL hosts (CentOS 7.4, IP 192.168.1.1 and 192.168.1.2) and install MariaDB:

# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation   # set root password to 123123

Create identical schema on both hosts:

CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));
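To have rows to synchronize, seed the source table on 192.168.1.1 with some sample data (the values here are illustrative):

```sql
USE `course-study`;
INSERT INTO t_member (ID, Name, Email) VALUES
  (1, 'alice', 'alice@example.com'),
  (2, 'bob',   'bob@example.com');
```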

Grant full privileges to the root user on each host:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;
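Granting everything to root is fine for a lab, but in practice you might prefer a dedicated account with only the privileges DataX needs. A sketch (the user name and scope are illustrative; note that TRUNCATE in MySQL/MariaDB requires the DROP privilege):

```sql
CREATE USER 'datax'@'%' IDENTIFIED BY '123123';
-- reader side: read-only access
GRANT SELECT ON `course-study`.* TO 'datax'@'%';
-- writer side: insert plus DROP, which the preSql truncate needs
GRANT SELECT, INSERT, DROP ON `course-study`.* TO 'datax'@'%';
FLUSH PRIVILEGES;
```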

Generate a MySQL‑to‑MySQL job template:

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

Fill in the generated JSON (saved here as install.json) with the source and target JDBC URLs, tables, and credentials. The preSql step truncates the target table before loading.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}
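Hand‑editing these JSON files for every table is error‑prone; a small script can stamp them out from parameters instead. A sketch in Python (the helper name is my own convention, not part of DataX; the job structure mirrors the example above):

```python
import json

def build_mysql_job(src_url, dst_url, table, user, password, where=None, channel=5):
    """Build a DataX mysqlreader -> mysqlwriter job dict."""
    reader_param = {
        "username": user, "password": password, "column": ["*"], "splitPk": "ID",
        "connection": [{"jdbcUrl": [src_url], "table": [table]}],
    }
    if where:
        # Incremental runs add a row filter on the reader side.
        reader_param["where"] = where
    writer_param = {
        "username": user, "password": password, "column": ["*"],
        "connection": [{"jdbcUrl": [dst_url], "table": [table]}],
        "writeMode": "insert",
    }
    if where is None:
        # Full copy: empty the target table before loading.
        writer_param["preSql"] = [f"truncate {table}"]
    return {"job": {"content": [{"reader": {"name": "mysqlreader", "parameter": reader_param},
                                 "writer": {"name": "mysqlwriter", "parameter": writer_param}}],
                    "setting": {"speed": {"channel": str(channel)}}}}

job = build_mysql_job("jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
                      "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
                      "t_member", "root", "123123")
with open("install.json", "w") as f:
    json.dump(job, f, indent=2)
```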

Run the job:

# python /usr/local/datax/bin/datax.py install.json

Result: 2 999 999 records transferred in 42 s at 2.57 MB/s (≈75 k records/s) with zero errors.

Incremental Synchronization

The only difference from full copy is the addition of a where clause to filter rows. Example where.json:

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running the job yields 1 888 records transferred in 32 s at 1.61 KB/s (≈62 records/s) with no errors.

For subsequent incremental runs, adjust the where clause, e.g. "ID > 1888 AND ID <= 2888", and remove the preSql truncate step.
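One way to automate this is to persist the last synchronized ID and derive each run's where clause from it. A sketch (the checkpoint file and helpers are my own convention, not a DataX feature):

```python
import os

CHECKPOINT = "t_member.last_id"  # stores the high-water mark between runs

def next_where(batch_size: int) -> tuple[str, int]:
    """Read the last synced ID from the checkpoint and build the next filter."""
    last_id = int(open(CHECKPOINT).read()) if os.path.exists(CHECKPOINT) else 0
    upper = last_id + batch_size
    return f"ID > {last_id} AND ID <= {upper}", upper

def commit(upper: int) -> None:
    """Record the new high-water mark after a successful DataX run."""
    with open(CHECKPOINT, "w") as f:
        f.write(str(upper))

where, upper = next_where(1000)
print(where)  # on the first run: ID > 0 AND ID <= 1000
# ... patch the job JSON with this where clause, run DataX, then on success:
commit(upper)
```

Only commit the new high‑water mark when DataX exits with code 0, so a failed run is simply retried over the same ID range.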

Key Observations

DataX can move millions of rows in tens of seconds; the full‑copy example above averaged ~75 k records/s.

Incremental sync is achieved solely by adding a where filter; the rest of the job definition remains unchanged.

Concurrency is controlled by the channel parameter inside setting.speed. In the examples, a channel value of 5 was used.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: MySQL, open source, data synchronization, DataX, ETL, database migration, incremental sync
Written by Architect's Guide

Dedicated to sharing programmer-architect skills—Java backend, system, microservice, and distributed architectures—to help you become a senior architect.
