Using DataX for Efficient Data Synchronization Between MySQL Databases
This article explains how to employ Alibaba's open‑source DataX tool to perform fast, reliable full‑ and incremental data synchronization between MySQL instances, covering installation, framework design, job execution, and practical shell commands for Linux environments.
Introduction
Our company had a project with roughly 50 million records whose reporting data had become inaccurate. Because the business and reporting databases run on different instances, cross-database SQL synchronization was not feasible, and attempts with mysqldump and file-based transfer proved too slow, prompting us to evaluate DataX.
1. DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, primarily used for offline data synchronization. It supports heterogeneous data sources such as relational databases (MySQL, Oracle, etc.), HDFS, Hive, ODPS, HBase, FTP, and more, providing stable and high‑performance data transfer.
DataX transforms complex mesh-type synchronization links into a star topology, with itself acting as the central transport carrier.
Adding a new data source only requires writing a plugin for DataX, which then enables seamless synchronization with every existing source.
1. DataX 3.0 Framework Design
DataX adopts a Framework + Plugin architecture. Data source reading and writing are abstracted as Reader and Writer plugins, which are integrated into the overall synchronization framework.
Reader (collection module): collects data from the source and forwards it to the Framework.
Writer (write module): pulls data from the Framework and writes it to the destination.
Framework (intermediary): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.
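The Reader → Framework → Writer hand-off can be pictured as a bounded queue between two threads. This is only a conceptual sketch in Python, not DataX's actual Java implementation:

```python
import threading
from queue import Queue

def reader(rows, channel):
    """Source side: push each record into the channel, then an end marker."""
    for row in rows:
        channel.put(row)
    channel.put(None)  # sentinel: no more data

def writer(channel, sink):
    """Destination side: drain the channel until the sentinel arrives."""
    while True:
        row = channel.get()
        if row is None:
            break
        sink.append(row)

# A bounded queue models the buffering/flow control the Framework provides:
# the reader blocks when the writer falls behind.
channel = Queue(maxsize=2)
sink = []
t = threading.Thread(target=writer, args=(channel, sink))
t.start()
reader(["row1", "row2", "row3"], channel)
t.join()
print(sink)  # ['row1', 'row2', 'row3']
```

The bounded `maxsize` is what gives back-pressure: a fast reader cannot outrun a slow writer indefinitely.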
2. DataX 3.0 Core Architecture
When a DataX Job starts, it splits the job into multiple Tasks based on the source's partitioning strategy, enabling concurrent execution. The Scheduler assembles Tasks into TaskGroups according to the configured concurrency. Each Task runs a fixed pipeline: Reader → Channel → Writer. The Job monitors all TaskGroups and exits successfully once every group finishes.
DataX Job splits into Tasks according to source split keys.
Scheduler groups Tasks into TaskGroups based on the user‑defined channel count.
Each Task executes the Reader → Channel → Writer thread chain.
The Job monitors TaskGroups and exits when all complete (non‑zero exit code on failure).
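As an illustration of the split step, here is a minimal Python sketch of how a numeric splitPk range might be divided into per-channel slices. This is an approximation for intuition only; the real splitting logic lives inside the reader plugins and also handles string keys, NULLs, and data skew:

```python
def split_pk_ranges(min_id, max_id, channels):
    """Divide the inclusive key range [min_id, max_id] into up to
    `channels` contiguous slices, one per concurrent Task."""
    step = -(-(max_id - min_id + 1) // channels)  # ceiling division
    ranges = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + step - 1, max_id)
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges

# Each slice becomes one Task's range predicate, e.g. ID >= 1 AND ID <= 4
print(split_pk_ranges(1, 10, 3))  # [(1, 4), (5, 8), (9, 10)]
```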
2. Using DataX for Data Synchronization
Preparation:
JDK 1.8+ (recommended 1.8)
Python 2 or 3 (CentOS 7 already includes Python 2.7)
Apache Maven 3.x (only needed if compiling DataX; the tar package can be used directly)
1. Install JDK on Linux
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat >> /etc/profile << 'END'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version
Python 2.7 is pre-installed on CentOS 7, so no additional installation is required.
2. Install DataX Software
[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*    # delete hidden files (important)
If the hidden files are not removed, you may see errors such as:
[/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.
3. Verify Installation
[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 ~]# python datax.py ../job/job.json    # test run
Typical output shows job statistics and confirms successful execution.
4. Install MySQL (MariaDB) on Both Hosts
[root@MySQL-1 ~]# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
[root@MySQL-1 ~]# systemctl start mariadb
[root@MySQL-1 ~]# mysql_secure_installation
# (follow prompts to set root password, remove anonymous users, keep remote root login, remove the test database, and reload privileges)
5. Prepare Data for Synchronization
MariaDB [(none)]> create database `course-study`;
MariaDB [(none)]> create table `course-study`.t_member(ID int,Name varchar(20),Email varchar(30));
Grant full privileges so that DataX can access both databases:
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;
6. Create a Stored Procedure to Generate Test Data
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A int default 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES(A, concat('LiSa',A), concat('LiSa',A,'@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;
CALL test();
7. Build the MySQL-to-MySQL Sync Job
Generate a template:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
Save the following JSON as install.json (replace placeholders with actual IPs, usernames, and passwords):
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Run the job:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json
The output confirms that ~3 million records were transferred at ~2.57 MB/s with zero errors.
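When many tables need syncing, writing each job JSON by hand gets tedious. A small Python helper can generate job files from the same structure as install.json; the field names mirror the config shown above, while the connection details passed in are placeholders:

```python
import json

def build_job(src_url, dst_url, table, user, password, split_pk="ID", channels=5):
    """Build a mysqlreader -> mysqlwriter job dict matching the install.json template."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": user,
                        "password": password,
                        "column": ["*"],
                        "splitPk": split_pk,
                        "connection": [{"jdbcUrl": [src_url], "table": [table]}],
                    },
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "username": user,
                        "password": password,
                        "column": ["*"],
                        "connection": [{"jdbcUrl": dst_url, "table": [table]}],
                        "preSql": [f"truncate {table}"],
                        "writeMode": "insert",
                    },
                },
            }],
            "setting": {"speed": {"channel": str(channels)}},
        }
    }

job = build_job(
    "jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "t_member", "root", "123123",
)
with open("install.json", "w") as f:
    json.dump(job, f, indent=2)
```

The generated file is then passed to datax.py exactly as in the command above.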
8. Incremental Synchronization
For incremental sync, add a where clause to the reader configuration. Note that the example below still truncates the target table via preSql before loading; for true append-only incremental runs, remove that preSql entry. Example where.json:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Running this job syncs only the rows that satisfy the where condition (here, 1888 records). Subsequent incremental runs can adjust the where clause (e.g., ID > 1888 AND ID <= 2888) and omit the preSql step.
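The rolling-window pattern described above can be automated. The sketch below keeps the last synced ID in a local state file and emits the next where clause; the state file and batch size are illustrative choices of this sketch, not DataX features:

```python
import os

STATE_FILE = "last_id.txt"  # hypothetical local bookmark, not part of DataX

def load_last_id():
    """Read the high-water mark left by the previous run (0 if none)."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    return 0

def next_where(last_id, batch=1000):
    """WHERE clause covering the next window of rows after last_id."""
    return f"ID > {last_id} AND ID <= {last_id + batch}"

def save_last_id(last_id):
    """Persist the new high-water mark for the next run."""
    with open(STATE_FILE, "w") as f:
        f.write(str(last_id))

last = load_last_id()
clause = next_where(last, batch=1000)
print(clause)  # e.g. ID > 0 AND ID <= 1000 on the first run
save_last_id(last + 1000)
```

Each run, the emitted clause is substituted into the "where" field of where.json before invoking datax.py.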
In summary, DataX provides a fast, extensible framework for both full‑ and incremental data synchronization across heterogeneous databases, making it suitable for large‑scale data migration and ongoing ETL pipelines.