
Using Alibaba DataX for Offline Data Synchronization and Incremental Sync

This article introduces Alibaba DataX, explains its architecture and role in offline heterogeneous data synchronization, provides step‑by‑step Linux installation, demonstrates full‑load and incremental MySQL‑to‑MySQL sync with JSON job templates, and shares practical tips for handling large data volumes.

Code Ape Tech Column

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks data integration, primarily used for offline synchronization of heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, FTP, etc.

Framework Design

DataX follows a Framework + Plugin architecture. Data sources are accessed via Reader plugins, destinations via Writer plugins, and the Framework connects Readers and Writers, handling buffering, flow control, concurrency, and data conversion.
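The Reader/Framework/Writer split can be pictured as a producer-consumer pipeline. The sketch below is illustrative only (real DataX plugins are Java classes implementing the plugin SPI); the `stream_reader`/`stream_writer` names and the queue-based channel are assumptions made for the example:

```python
# Minimal sketch of DataX's Reader -> Framework -> Writer pipeline.
# Illustrative only: real plugins are Java classes, and the Framework
# also handles flow control, concurrency, and type conversion.
from queue import Queue

def stream_reader(records, channel):
    """Reader plugin: pushes source records into the framework channel."""
    for record in records:
        channel.put(record)
    channel.put(None)  # end-of-data marker

def stream_writer(channel, sink):
    """Writer plugin: pulls records from the channel and writes them out."""
    while (record := channel.get()) is not None:
        sink.append(record)

channel = Queue(maxsize=32)  # the Framework buffers and flow-controls here
sink = []
stream_reader([{"id": i} for i in range(5)], channel)
stream_writer(channel, sink)
print(len(sink))  # 5 records transferred
```

Swapping in a different Reader or Writer changes only the endpoints; the channel in the middle stays the same, which is the point of the plugin design.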

| Role | Function |
| --- | --- |
| Reader | Collects data from the source and sends it to the Framework. |
| Writer | Pulls data from the Framework and writes it to the target. |
| Framework | Links Reader and Writer, acting as the data transmission channel. |

Core Architecture

A DataX job is divided into Tasks, which are grouped into TaskGroups for concurrent execution. The Job monitors TaskGroups and exits successfully when all complete.
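The Job-to-TaskGroup fan-out is essentially a splitting calculation. The following sketch only illustrates the hierarchy; the `channels_per_group` default and the round-robin assignment are simplifying assumptions, not DataX's exact scheduling logic:

```python
# Rough sketch of how a Job could fan Tasks out into TaskGroups.
# Assumption: a fixed number of channels per TaskGroup and round-robin
# task assignment; DataX's real scheduler is more involved.
import math

def plan_task_groups(task_count, channels, channels_per_group=5):
    group_count = max(1, math.ceil(channels / channels_per_group))
    groups = [[] for _ in range(group_count)]
    for i in range(task_count):
        groups[i % group_count].append(f"task-{i}")  # round-robin
    return groups

groups = plan_task_groups(task_count=20, channels=10)
print(len(groups), [len(g) for g in groups])  # 2 groups, 10 tasks each
```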

Preparing the Environment

Required software:

JDK 1.8+

Python 2 or 3

Apache Maven 3.x (only for compiling DataX; the tar package does not need Maven)

Example host configuration:

| Hostname | OS | IP | Package |
| --- | --- | --- | --- |
| MySQL-1 | CentOS 7.4 | 192.168.1.1 | jdk-8u181-linux-x64.tar.gz, datax.tar.gz |
| MySQL-2 | CentOS 7.4 | 192.168.1.2 | |

Installing JDK

[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg  jdk1.8.0_181  jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat << 'END' >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version

Installing DataX

[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files

If hidden files are not removed, you may see errors like [/usr/local/datax/plugin/reader/._drdsreader/plugin.json] not found. Please check your configuration file.

Verification

[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 bin]# python datax.py ../job/job.json

Sample output shows job statistics such as total records, speed, and duration.

Basic DataX Usage

Run a simple stream reader/writer test:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

The command prints usage instructions and a JSON job template.
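Filling in that template gives a self-contained smoke test. A minimal example along these lines (values are placeholders chosen for illustration) generates ten identical rows and prints them to stdout:

```json
{
    "job": {
        "content": [{
            "reader": {
                "name": "streamreader",
                "parameter": {
                    "column": [
                        {"type": "string", "value": "hello, DataX"},
                        {"type": "long", "value": "10"}
                    ],
                    "sliceRecordCount": 10
                }
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {"encoding": "UTF-8", "print": true}
            }
        }],
        "setting": {"speed": {"channel": 1}}
    }
}
```

Save it as, say, stream.json and run it with `python /usr/local/datax/bin/datax.py stream.json` to confirm the installation works end to end before touching any real database.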

Creating a MySQL‑to‑MySQL Sync Job

Generate a template:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{ ... JSON template ... }

Fill in a concrete JSON configuration (for example, install.json) with source/target JDBC URLs, credentials, column selection, pre-SQL, session settings, and concurrency:

[root@MySQL-1 ~]# vim install.json
{
    "job": {
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "root",
                    "password": "123123",
                    "column": ["*"],
                    "splitPk": "ID",
                    "connection": [{
                        "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
                        "table": ["t_member"]
                    }]
                }
            },
            "writer": {
                "name": "mysqlwriter",
                "parameter": {
                    "column": ["*"],
                    "connection": [{
                        "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
                        "table": ["t_member"]
                    }],
                    "password": "123123",
                    "preSql": ["truncate t_member"],
                    "session": ["set session sql_mode='ANSI'"],
                    "username": "root",
                    "writeMode": "insert"
                }
            }
        }],
        "setting": {"speed": {"channel": "5"}}
    }
}

Execute the job:

[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json

Output shows successful synchronization of ~3 million records.

Incremental Synchronization

The only difference from full‑load is adding a where clause for filtering and adjusting preSql as needed.

An example incremental job (where.json) filters records with ID <= 1888 and truncates the target table before loading:

{
    "job": {
        "content": [{
            "reader": {
                "name": "mysqlreader",
                "parameter": {
                    "username": "root",
                    "password": "123123",
                    "column": ["*"],
                    "splitPk": "ID",
                    "where": "ID <= 1888",
                    "connection": [{
                        "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
                        "table": ["t_member"]
                    }]
                }
            },
            "writer": {
                "name": "mysqlwriter",
                "parameter": {
                    "column": ["*"],
                    "connection": [{
                        "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
                        "table": ["t_member"]
                    }],
                    "password": "123123",
                    "preSql": ["truncate t_member"],
                    "session": ["set session sql_mode='ANSI'"],
                    "username": "root",
                    "writeMode": "insert"
                }
            }
        }],
        "setting": {"speed": {"channel": "5"}}
    }
}

Running the job yields statistics for the 1,888 filtered records.

For subsequent increments, modify the where clause (e.g., "ID > 1888 AND ID <= 2888" ) and remove the preSql truncation.
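Advancing the window by hand gets tedious once there are many batches. A small hypothetical helper (the `id_windows` name and batch arithmetic are my own, not part of DataX) can generate the successive `where` clauses to splice into the job JSON:

```python
def id_windows(start, end, batch_size):
    """Yield successive DataX `where` clauses for incremental ID windows.

    Hypothetical helper: automates the manual window advancement the
    article describes (e.g. "ID > 1888 AND ID <= 2888")."""
    lo = start
    while lo < end:
        hi = min(lo + batch_size, end)
        yield f"ID > {lo} AND ID <= {hi}"
        lo = hi

for clause in id_windows(1888, 4888, 1000):
    print(clause)
# ID > 1888 AND ID <= 2888
# ID > 2888 AND ID <= 3888
# ID > 3888 AND ID <= 4888
```

In practice you would record the high-water mark (the last synced ID or timestamp) somewhere durable after each successful run, then feed the next window into the reader's `where` parameter.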

Final Note

The author encourages readers to like, share, and follow the public account for more technical content.

Tags: linux, mysql, open-source, DataX, ETL, data integration, incremental sync
Written by Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full-stack Java, job interview and career advice through a column. Site: java-family.cn