Using Alibaba DataX for Offline Data Synchronization and Incremental Sync
This article introduces Alibaba DataX, explains its architecture and role in offline heterogeneous data synchronization, provides step‑by‑step Linux installation, demonstrates full‑load and incremental MySQL‑to‑MySQL sync with JSON job templates, and shares practical tips for handling large data volumes.
DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks data integration, primarily used for offline synchronization of heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, FTP, etc.
Framework Design
DataX follows a Framework + Plugin architecture. Data sources are accessed via Reader plugins, destinations via Writer plugins, and the Framework connects Readers and Writers, handling buffering, flow control, concurrency, and data conversion.
Role        Function
Reader      Collects data from the source and sends it to the Framework.
Writer      Pulls data from the Framework and writes it to the target.
Framework   Links Reader and Writer, acting as the data transmission channel.
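The division of labor above can be sketched as a bounded producer/consumer pipeline. This is a simplification for illustration, not DataX's actual implementation; the queue stands in for the Framework's channel, giving buffering and simple flow control:

```python
import queue
import threading

def reader(rows, channel):
    """Reader side: push each source row into the channel."""
    for row in rows:
        channel.put(row)
    channel.put(None)  # end-of-data marker

def writer(channel, sink):
    """Writer side: drain the channel into the target until end-of-data."""
    while (row := channel.get()) is not None:
        sink.append(row)

channel = queue.Queue(maxsize=32)  # bounded buffer = crude flow control
sink = []
rows = [("id", i) for i in range(5)]

t = threading.Thread(target=writer, args=(channel, sink))
t.start()
reader(rows, channel)
t.join()
print(sink)
```

In DataX proper, the channel additionally enforces byte/record rate limits and performs type conversion between the Reader's and Writer's internal record formats.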
Core Architecture
A DataX job is divided into Tasks, which are grouped into TaskGroups for concurrent execution. The Job monitors TaskGroups and exits successfully when all complete.
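As a rough illustration of the grouping, assuming the default of five concurrent channels per TaskGroup (DataX's `core.container.taskGroup.channel` default; verify against your build):

```python
import math

def task_group_count(channels: int, channels_per_group: int = 5) -> int:
    """Number of TaskGroups needed to run `channels` concurrent channels,
    assuming DataX's default of 5 channels per TaskGroup."""
    return math.ceil(channels / channels_per_group)

print(task_group_count(20))  # a 20-channel job runs as 4 TaskGroups
```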
Preparing the Environment
Required software:
JDK 1.8+
Python 2 or 3
Apache Maven 3.x (only for compiling DataX; the tar package does not need Maven)
Example host configuration:
Hostname   OS           IP             Packages
MySQL-1    CentOS 7.4   192.168.1.1    jdk-8u181-linux-x64.tar.gz, datax.tar.gz
MySQL-2    CentOS 7.4   192.168.1.2
Installing JDK
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat >> /etc/profile << END
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version
Installing DataX
[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*  # delete hidden files
If the hidden files are not removed, you may see errors such as: [/usr/local/datax/plugin/reader/._drdsreader/plugin.json] not found. Please check your configuration file.
Verification
[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 ~]# python datax.py ../job/job.json
The sample output shows job statistics such as total records, speed, and duration.
Basic DataX Usage
Run a simple stream reader/writer test:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
The command prints usage instructions and a JSON job template.
Creating a MySQL‑to‑MySQL Sync Job
Generate a template:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{ ... JSON template ... }
Fill in a concrete JSON configuration (for example, install.json) with source/target JDBC URLs, credentials, column selection, pre-SQL, session settings, and concurrency:
[root@MySQL-1 ~]# vim install.json
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"password": "123123",
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"username": "root",
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Execute the job:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json
The output shows successful synchronization of roughly 3 million records.
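The splitPk setting in the reader is what lets DataX parallelize the read: the primary-key range is divided into contiguous chunks, each handled by its own task. A simplified sketch of that splitting (a hypothetical helper, not DataX's exact algorithm):

```python
def split_pk_ranges(min_id: int, max_id: int, tasks: int) -> list:
    """Divide the key range [min_id, max_id] into `tasks` contiguous
    (start, end] ranges, distributing any remainder across the first tasks."""
    step, rem = divmod(max_id - min_id + 1, tasks)
    ranges, lo = [], min_id - 1
    for i in range(tasks):
        hi = lo + step + (1 if i < rem else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges

# Each range becomes a per-task predicate such as "ID > 0 AND ID <= 750000".
print(split_pk_ranges(1, 3_000_000, 4))
```

This is why splitPk should point at an evenly distributed integer key; a skewed key leaves some tasks with far more rows than others.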
Incremental Synchronization
The only difference from a full load is adding a where clause for filtering and adjusting preSql as needed.
The example incremental job (where.json) filters records with ID <= 1888 and truncates the target table before loading:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"password": "123123",
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"username": "root",
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Running the job yields statistics for the 1,888 filtered records.
For subsequent increments, modify the where clause (e.g., "ID > 1888 AND ID <= 2888") and remove the preSql truncation.
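Rather than editing the JSON by hand for every window, the rewrite can be scripted. A minimal sketch, where the helper name is an assumption and the job layout matches the examples above:

```python
import copy

def make_incremental(job: dict, start_id: int, end_id: int) -> dict:
    """Derive an incremental DataX job from a full-load job dict:
    set an ID-window `where` on the reader and drop the truncating preSql."""
    job = copy.deepcopy(job)  # leave the base job untouched
    reader = job["job"]["content"][0]["reader"]["parameter"]
    writer = job["job"]["content"][0]["writer"]["parameter"]
    reader["where"] = "ID > {} AND ID <= {}".format(start_id, end_id)
    writer.pop("preSql", None)  # increments must not truncate the target
    return job

# Minimal base job mirroring the install.json structure above.
base = {"job": {"content": [{
    "reader": {"name": "mysqlreader", "parameter": {"column": ["*"]}},
    "writer": {"name": "mysqlwriter",
               "parameter": {"preSql": ["truncate t_member"]}},
}]}}

inc = make_incremental(base, 1888, 2888)
print(inc["job"]["content"][0]["reader"]["parameter"]["where"])
```

The resulting dict can be dumped with json.dump and handed to datax.py; a scheduler (e.g., cron) can then advance the ID window on each run.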
Final Note
The author encourages readers to like, share, and follow the public account for more technical content.
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn