Using DataX for MySQL Data Synchronization: Full and Incremental Sync Guide
This article explains how to install DataX, configure MySQL readers and writers, and execute both full and incremental data synchronization jobs between two MySQL instances, providing step‑by‑step commands, JSON job templates, and troubleshooting tips for large‑scale data transfers.
Introduction
The author needed to synchronize 50 million rows between two MySQL databases where traditional SQL sync was impractical, and discovered that DataX offers fast, reliable offline data transfer.
What is DataX?
DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, a tool for offline data synchronization between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It abstracts each source and target connection as a Reader or Writer plugin within a framework‑plus‑plugin architecture.
DataX 3.0 Framework Design
DataX uses a Framework + Plugin model where each data source is implemented as a Reader plugin and each destination as a Writer plugin.
DataX Core Architecture
A Job represents a single synchronization task. The Job creates multiple Task instances, which are grouped into TaskGroups for concurrent execution. The scheduler distributes tasks based on the configured channel count.
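As a rough sketch of that scheduling arithmetic (the five-channels-per-TaskGroup figure is DataX's documented default, `core.container.taskGroup.channel`; treat the numbers as illustrative, not as DataX's actual scheduler code):

```python
import math

# Illustrative only: how DataX-style scheduling packs channels into TaskGroups.
# By default each TaskGroup runs 5 concurrent channels
# (core.container.taskGroup.channel).
CHANNELS_PER_TASKGROUP = 5

def task_groups_needed(channel_count: int) -> int:
    """Number of TaskGroups the scheduler allocates for a given channel count."""
    return math.ceil(channel_count / CHANNELS_PER_TASKGROUP)

print(task_groups_needed(20))  # a job with 20 channels runs in 4 TaskGroups
```

So the job template below, with `"channel": "5"`, would run in a single TaskGroup.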
Installation Prerequisites
JDK 1.8+
Python 2 or 3
Apache Maven 3.x (only for compiling DataX from source)
Installing JDK and DataX
# Install JDK
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
# Set environment variables
[root@MySQL-1 ~]# cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
EOF
[root@MySQL-1 ~]# source /etc/profile
# Install DataX
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
Using DataX for Data Synchronization
1. Full Synchronization
Create a JSON job file that defines a mysqlreader and a mysqlwriter. Example snippet:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}

Run the job with:
# python /usr/local/datax/bin/datax.py full_sync.json

The output shows the total record count, average speed, and completion time.
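Hand-editing the job JSON is error-prone when you have many tables to sync. A small generator (a hypothetical helper of my own, not part of DataX) can build the same structure as the template above from a few parameters and guarantees the file is valid JSON before you submit it:

```python
import json

def build_full_sync_job(src_url, dst_url, table, user, password, channels=5):
    """Build a DataX mysqlreader -> mysqlwriter full-sync job dict.

    Hypothetical helper; it mirrors the JSON template shown in this article.
    """
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": user, "password": password,
                        "column": ["*"], "splitPk": "ID",
                        "connection": [{"jdbcUrl": [src_url], "table": [table]}],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": user, "password": password,
                        "column": ["*"],
                        "connection": [{"jdbcUrl": dst_url, "table": [table]}],
                        "preSql": [f"truncate {table}"],
                        "writeMode": "insert",
                    },
                },
            }],
            # DataX expects the channel count as a string in this template.
            "setting": {"speed": {"channel": str(channels)}},
        }
    }

job = build_full_sync_job(
    "jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "t_member", "root", "123123",
)
with open("full_sync.json", "w") as f:
    json.dump(job, f, indent=2)
```

The generated `full_sync.json` can then be passed to `datax.py` exactly as above.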
2. Incremental Synchronization
Incremental sync differs only by adding a where clause to the reader configuration. Example:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}

Execute the job the same way; the log reports the number of records transferred under the filter. Note that for repeated incremental runs you should adjust the where range to cover only newly added rows and drop the truncate preSql, otherwise each run wipes the target table before loading.
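One way to make incremental runs repeatable (my own sketch, not something DataX provides out of the box) is to persist the last synced ID in a checkpoint file and regenerate the reader's where clause from it on each run:

```python
from pathlib import Path

# Hypothetical checkpoint file; one per synced table.
CHECKPOINT = Path("t_member.last_id")

def incremental_where(new_max_id: int) -> str:
    """Return a where clause covering rows added since the last run,
    then advance the checkpoint to new_max_id."""
    last_id = int(CHECKPOINT.read_text()) if CHECKPOINT.exists() else 0
    CHECKPOINT.write_text(str(new_max_id))
    return f"ID > {last_id} and ID <= {new_max_id}"

# On a fresh checkpoint, the first run covers everything up to 1888,
# and the next run picks up from there.
print(incremental_where(1888))  # "ID > 0 and ID <= 1888"
print(incremental_where(2500))  # "ID > 1888 and ID <= 2500"
```

The returned string is what you would place in the reader's "where" field before each run; the upper bound should come from a `SELECT MAX(ID)` against the source.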
Key Takeaways
DataX replaces point‑to‑point pipelines between many sources and targets with a simple star topology: each system only needs a Reader or Writer plugin connecting to the central framework.
Full sync is suitable for initial loads; incremental sync uses where clauses to transfer only changed data.
Properly configure MySQL privileges and remove hidden plugin files to avoid runtime errors.
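The "hidden plugin files" mentioned above refers to the macOS metadata files (`._*`) that ship inside the plugin directories of some DataX release tarballs and cause plugin-configuration errors at startup. A sketch of the cleanup (the `/usr/local/datax` default follows this article's install path):

```python
import os

def remove_hidden_plugin_files(plugin_dir="/usr/local/datax/plugin"):
    """Delete macOS '._*' metadata files under the DataX plugin tree.

    These stray files make DataX treat them as plugin configs and fail.
    Returns the list of paths removed.
    """
    removed = []
    for root, _dirs, files in os.walk(plugin_dir):
        for name in files:
            if name.startswith("._"):
                path = os.path.join(root, name)
                os.remove(path)
                removed.append(path)
    return removed

# Usage: remove_hidden_plugin_files() before the first job run.
```

This is equivalent to `find /usr/local/datax/plugin -name "._*" -delete` from the shell.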