How to Sync Massive MySQL Datasets Efficiently with DataX
This guide walks through the challenges of synchronizing tens of millions of records between heterogeneous MySQL databases, explains why traditional mysqldump or file‑based methods fail, and provides a step‑by‑step tutorial on installing, configuring, and using Alibaba's open‑source DataX tool for both full and incremental data synchronization.
Preface
Our project needed to synchronize fifty million records, but the data in the reporting database was unreliable, and with the source and target in separate databases, plain SQL across them was not an option. Two initial ideas proved impractical:
mysqldump: backup and restore are both time-consuming, and rows written during the backup are missed, so the copy is stale by the time it lands.
File-based transfer: extremely slow; in one test only about two thousand rows synced in three hours.
After further research we settled on DataX, which offers fast and reliable data synchronization.
1. DataX Overview
DataX is the open-source version of Alibaba Cloud DataWorks Data Integration, designed for offline data synchronization between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP.
It turns a complex mesh of point-to-point sync links into a star-shaped topology, acting as the hub that connects the various data sources.
1. DataX 3.0 Framework Design
DataX follows a Framework + Plugin architecture. Data source reading and writing are abstracted as Reader/Writer plugins that plug into the overall sync framework.
2. DataX 3.0 Core Architecture
A DataX job (Job) is the unit of work. When a Job starts, it spawns a process that handles data cleaning, task splitting, and TaskGroup management.
The Job splits into multiple small Tasks based on the source's split strategy, enabling concurrent execution.
Each Task is managed by a TaskGroup, which launches a pipeline: Reader → Channel → Writer.
The Job monitors all TaskGroups; when they all finish, the Job exits successfully (non‑zero exit code indicates failure).
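The smallest runnable illustration of this pipeline is the stream-to-stream job bundled with DataX: a streamreader generates synthetic rows, a channel carries them, and a streamwriter prints them. A minimal sketch (the column value and record count are illustrative):
{
    "job": {
        "setting": {"speed": {"channel": 1}},
        "content": [
            {
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        "column": [{"type": "string", "value": "hello DataX"}],
                        "sliceRecordCount": 10
                    }
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"print": true}
                }
            }
        ]
    }
}
With "channel": 1 a single Task runs the whole pipeline; raising the channel count lets the framework run several Reader → Channel → Writer pipelines concurrently, as described above.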
2. Using DataX for Data Synchronization
Preparation
JDK 1.8+
Python 2 or 3
Apache Maven 3.x is required only when compiling DataX from source.
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# ls
anaconda-ks.cfg jdk1.8.0_181 jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat <<'END' >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version
CentOS 7 already includes Python 2.7, so no additional Python installation is needed.
1. Install DataX on Linux
[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*    # delete hidden files (important)
If these hidden files are not removed, jobs fail with errors like:
[/usr/local/datax/plugin/reader/._drdsreader/plugin.json] does not exist. Please check your configuration file.
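The glob above only clears the top level of each plugin directory; a recursive alternative with the same effect (a hedged variant using find):
[root@MySQL-1 ~]# find /usr/local/datax/plugin -name "._*" -print -delete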
Verification:
[root@MySQL-1 ~]# cd /usr/local/datax/bin
[root@MySQL-1 ~]# python datax.py ../job/job.json    # verify installation
Typical output ends with the number of records processed and speed metrics.
2. Basic DataX Usage
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
Running this prints the DataX version along with a configuration template and usage notes for the chosen reader/writer pair.
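Filling in that template and saving it (the path below is our choice) gives an end-to-end smoke test of the Reader → Channel → Writer pipeline without touching a database; using the stream job sketched in section 1:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py /usr/local/datax/job/stream2stream.json
The streamwriter prints each generated row to the console, followed by the usual record-count and speed summary.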
3. Install MySQL Databases
[root@MySQL-1 ~]# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
[root@MySQL-1 ~]# systemctl start mariadb
[root@MySQL-1 ~]# mysql_secure_installation
Set a root password, remove anonymous users, allow remote root login, and delete the test databases as needed.
1) Prepare Data for Sync (create identical tables on both hosts)
MariaDB [(none)]> create database `course-study`;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> create table `course-study`.t_member(ID int,Name varchar(20),Email varchar(30));
Query OK, 0 rows affected (0.00 sec)
Grant privileges so DataX can access both databases:
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;
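Granting root@'%' everything is the quick path this walkthrough takes; a narrower sketch for a dedicated sync account (the account name and subnet are assumptions on our part):
CREATE USER 'datax'@'192.168.1.%' IDENTIFIED BY '123123';
GRANT SELECT ON `course-study`.* TO 'datax'@'192.168.1.%';                          -- reader side
GRANT SELECT, INSERT, DELETE, DROP ON `course-study`.* TO 'datax'@'192.168.1.%';    -- writer side; DROP covers the preSql truncate used later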
2) Create a stored procedure to generate test data
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A int default 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES (A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;
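With autocommit on, the bare CALL below commits each of the three million inserts individually, which is slow. An optional speed-up (our suggestion, not part of the original walkthrough) is to wrap the call in a single transaction:
SET autocommit = 0;
CALL test();
COMMIT;
SET autocommit = 1;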
CALL test();
3) Sync MySQL to MySQL with DataX
1) Generate a template JSON for MySQL→MySQL
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "column": [],
                        "connection": [{"jdbcUrl": [], "table": []}],
                        "username": "",
                        "password": "",
                        "where": ""
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": [],
                        "connection": [{"jdbcUrl": "", "table": []}],
                        "username": "",
                        "password": "",
                        "preSql": [],
                        "session": [],
                        "writeMode": ""
                    }
                }
            }
        ],
        "setting": {"speed": {"channel": ""}}
    }
}
2) Write the actual JSON configuration
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123123",
                        "column": ["*"],
                        "splitPk": "ID",
                        "connection": [{
                            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
                            "table": ["t_member"]
                        }]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [{
                            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
                            "table": ["t_member"]
                        }],
                        "username": "root",
                        "password": "123123",
                        "preSql": ["truncate t_member"],
                        "session": ["set session sql_mode='ANSI'"],
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {"speed": {"channel": "5"}}
    }
}
Run the job:
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py install.json
The output reports total records processed and throughput, confirming the synchronization succeeded.
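Because the Job exits non-zero on failure (see section 1), the run is easy to script and schedule; a minimal wrapper sketch (the job path and log file are assumptions):
#!/bin/bash
# run-datax.sh: run a DataX job and surface failures via the exit code
JOB=/usr/local/datax/job/install.json
if ! python /usr/local/datax/bin/datax.py "$JOB" >> /var/log/datax-sync.log 2>&1; then
    echo "DataX job failed: $JOB" >&2
    exit 1
fi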
4. Incremental Synchronization with DataX
The only difference between full and incremental sync is the use of the where clause to filter records.
1) Create a JSON for incremental sync
{
    "job": {
        "content": [
            {
                "reader": {
                    "name": "mysqlreader",
                    "parameter": {
                        "username": "root",
                        "password": "123123",
                        "column": ["*"],
                        "splitPk": "ID",
                        "where": "ID <= 1888",
                        "connection": [{
                            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
                            "table": ["t_member"]
                        }]
                    }
                },
                "writer": {
                    "name": "mysqlwriter",
                    "parameter": {
                        "column": ["*"],
                        "connection": [{
                            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
                            "table": ["t_member"]
                        }],
                        "username": "root",
                        "password": "123123",
                        "preSql": ["truncate t_member"],
                        "session": ["set session sql_mode='ANSI'"],
                        "writeMode": "insert"
                    }
                }
            }
        ],
        "setting": {"speed": {"channel": "5"}}
    }
}
Run the incremental job and verify that the target database contains only the filtered rows. Note that this example keeps the preSql truncate, so each run rebuilds the target from the filtered set; for append-style increments, drop preSql and advance the where lower bound on each run.
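Hard-coding ID <= 1888 means editing the JSON before every run. datax.py also accepts runtime variables through its -p flag, which are substituted into ${...} placeholders in the job file; a sketch (the variable name maxid and file name incr.json are our choices):
"where": "ID <= ${maxid}",
[root@MySQL-1 ~]# python /usr/local/datax/bin/datax.py -p "-Dmaxid=1888" incr.json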