Alibaba’s Open‑Source DataX: Fast, Easy Offline Data Synchronization
This article introduces Alibaba’s open‑source DataX tool, explains its framework‑plugin architecture for heterogeneous database sync, walks through Linux installation, job configuration, full‑ and incremental MySQL synchronization, and shares performance results and practical tips.
Architecture Overview
DataX follows a Framework + Plugin design. Data source read and write operations are abstracted as Reader and Writer plugins. A central Framework connects the plugins, handling buffering, flow control, concurrency, and data conversion.
Roles:
Reader – collects data from the source and sends it to the Framework.
Writer – pulls data from the Framework and writes it to the destination.
Framework – acts as the transmission channel between Reader and Writer, managing core technical concerns such as buffering and flow control.
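The sketch below is a simplified model of this pipeline in Python, purely for illustration (DataX itself is implemented in Java; the names and queue size here are assumptions, not DataX source):

import queue
import threading

# The bounded queue plays the Framework's role: it buffers records and
# applies back-pressure (flow control) when the Writer falls behind.
channel = queue.Queue(maxsize=1024)
DONE = object()  # end-of-stream sentinel

def reader(source_rows):
    """Reader plugin: collect rows from the source, hand them to the Framework."""
    for row in source_rows:
        channel.put(row)  # blocks when the buffer is full
    channel.put(DONE)

def writer(sink):
    """Writer plugin: pull rows from the Framework, write them to the destination."""
    while True:
        row = channel.get()
        if row is DONE:
            break
        sink.append(row)  # stand-in for a real database write

source = [(i, "name-%d" % i, "user%d@example.com" % i) for i in range(10)]
target = []
t = threading.Thread(target=writer, args=(target,))
t.start()
reader(source)
t.join()
print("moved %d records" % len(target))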
Job Execution Model
When a Job is submitted, DataX splits it into multiple Task instances according to the source's split strategy. Tasks are then grouped into TaskGroups based on the configured concurrency (the channel setting), and each TaskGroup launches its Tasks in parallel. The job finishes when all TaskGroups complete; a non-zero exit code indicates failure.
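A rough sketch of the grouping arithmetic (hedged: the five-channels-per-group default mirrors DataX's core.container.taskGroup.channel setting, and the real scheduler also rebalances Tasks across groups):

import math

def plan(task_count, channel, channels_per_group=5):
    """Rough model of the split-and-group step."""
    groups = max(1, math.ceil(channel / channels_per_group))
    tasks_per_group = math.ceil(task_count / groups)
    return groups, tasks_per_group

# e.g. a table split into 20 Tasks with "channel": 5
groups, per_group = plan(task_count=20, channel=5)
print("%d TaskGroup(s), up to %d Tasks each" % (groups, per_group))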
Installation on Linux
Prerequisites:
JDK 1.8+ (1.8 recommended)
Python 2 or 3 (pre-installed on CentOS 7)
Apache Maven 3.x (only required to compile DataX from source; the binary tar package can be used directly)
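A quick sanity check of the first two prerequisites (a convenience sketch, assuming Python 3.7+ for subprocess.run's capture_output flag):

import subprocess

for cmd in (["java", "-version"], ["python", "--version"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    # `java -version` historically prints to stderr, so fall back to it
    banner = (result.stdout or result.stderr).strip().splitlines()[0]
    print(" ".join(cmd), "->", banner)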
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # remove hidden files generated by the archive

Verify the installation:
# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # test run

Sample output shows processing of 100 000 records at ~254 KB/s with zero errors.
Basic Usage
Generate a default job skeleton using the built‑in streamreader → streamwriter template:
# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

The command prints a JSON job skeleton that can be saved and edited. Documentation links are included in the output:
https://github.com/alibaba/DataX/blob/master/streamreader/doc/streamreader.md
https://github.com/alibaba/DataX/blob/master/streamwriter/doc/streamwriter.md
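For reference, a filled-in version of that skeleton might look like the sketch below, expressed here as a Python dict and dumped to a job file; the sliceRecordCount and column values are illustrative, so check the linked docs for the authoritative parameter list:

import json

job = {
    "job": {
        "content": [{
            "reader": {
                "name": "streamreader",
                "parameter": {
                    "sliceRecordCount": 10,
                    "column": [
                        {"type": "string", "value": "hello, DataX"},
                        {"type": "long", "value": "10086"},
                    ],
                },
            },
            "writer": {
                "name": "streamwriter",
                "parameter": {"encoding": "UTF-8", "print": True},
            },
        }],
        "setting": {"speed": {"channel": 1}},
    }
}

with open("stream.json", "w") as f:
    json.dump(job, f, indent=2)
# run it: python /usr/local/datax/bin/datax.py stream.json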
Full‑Copy MySQL Synchronization
Prepare two MySQL hosts (CentOS 7.4, IPs 192.168.1.1 and 192.168.1.2) and install MariaDB:
# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation   # set root password to 123123

Create an identical schema on both hosts:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));

Grant full privileges to the root user on each host:
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

Generate a MySQL-to-MySQL job template:
# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

Fill in the generated JSON (example install.json) with source and target JDBC URLs, tables, and credentials. The preSql step truncates the target table before loading.
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job:
# python /usr/local/datax/bin/datax.py install.json

Result: 2 999 999 records transferred in 42 s at 2.57 MB/s (≈75 k records/s) with zero errors.
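A quick way to double-check the copy is to compare row counts on both hosts. A hedged verification sketch, assuming the pymysql package is installed and the grants created earlier are in place:

import pymysql

def count_rows(host):
    conn = pymysql.connect(host=host, user="root",
                           password="123123", database="course-study")
    try:
        with conn.cursor() as cur:
            cur.execute("SELECT COUNT(*) FROM t_member")
            return cur.fetchone()[0]
    finally:
        conn.close()

src, dst = count_rows("192.168.1.1"), count_rows("192.168.1.2")
print("source=%d target=%d match=%s" % (src, dst, src == dst))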
Incremental Synchronization
The only difference from full copy is the addition of a where clause to filter rows. Example where.json:
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running the job yields 1 888 records transferred in 32 s at 1.61 KB/s (≈62 records/s) with no errors.
For subsequent incremental runs, adjust the where clause, e.g. "ID > 1888 AND ID <= 2888", and remove the preSql truncate step.
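To avoid hand-editing the job file for every window, the where clause can reference DataX job variables, e.g. "where": "ID > ${startId} AND ID <= ${endId}", with the bounds supplied at run time through datax.py's -p option. A hedged driver sketch (paths and window bounds are illustrative):

import subprocess

def run_window(start_id, end_id, job="where.json"):
    cmd = [
        "python", "/usr/local/datax/bin/datax.py",
        "-p", "-DstartId=%d -DendId=%d" % (start_id, end_id),
        job,
    ]
    subprocess.run(cmd, check=True)  # non-zero exit code means the job failed

# sync IDs 1889-2888, then 2889-3888
run_window(1888, 2888)
run_window(2888, 3888)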
Key Observations
DataX moves millions of rows in well under a minute: the full-copy example transferred ~3 million records in 42 s (≈75 k records/s).
Incremental sync is achieved solely by adding a where filter; the rest of the job definition remains unchanged.
Concurrency is controlled by the channel parameter inside setting.speed. In the examples, a channel value of 5 was used.
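If throughput needs a push, channel can be raised without rewriting the file by hand; note that for mysqlreader, extra channels only pay off when splitPk is set, since that is what lets the source table be split into parallel Tasks. A small hedged sketch:

import json

with open("install.json") as f:
    job = json.load(f)

job["job"]["setting"]["speed"]["channel"] = "10"  # raised from "5"

with open("install.json", "w") as f:
    json.dump(job, f, indent=2)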