Master Data Synchronization with Alibaba DataX: From Installation to Incremental Sync
This guide explains how to use Alibaba's open-source DataX tool to synchronize large MySQL datasets. It covers the tool's architecture, installation on Linux, JSON job configuration, full-load and incremental sync examples, and performance results, all without relying on mysqldump or manual file-based transfers.
Background
The company needed to synchronize 50 million rows between a business database and a reporting database; traditional approaches such as mysqldump or file-based transfer were too slow and risked producing inconsistent data.
What is DataX?
DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It provides high‑performance offline synchronization for heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP and others.
Architecture
DataX follows a Framework + Plugin model. Reader plugins collect data from a source, Writer plugins write data to a target, and the central Framework coordinates the flow, buffering, flow‑control, concurrency and data conversion.
Reader – collects data from the source and pushes it to the Framework.
Writer – pulls data from the Framework and writes it to the destination.
Framework – connects Reader and Writer, handling buffering, flow control, concurrency and data conversion.
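The Reader/Framework/Writer split can be pictured as a bounded producer-consumer pipeline. Below is a minimal Python sketch of that idea; it is an analogy only, not DataX's actual Java internals, where the Channel also handles statistics and record conversion.

```python
import queue
import threading

# A bounded buffer plays the role of the Channel: it applies back-pressure
# (flow control) when the Writer falls behind the Reader.
channel = queue.Queue(maxsize=1024)
source_rows = [(i, f"LiSa{i}") for i in range(5)]
written = []

def reader():
    for row in source_rows:
        channel.put(row)   # Reader pushes records into the Channel
    channel.put(None)      # end-of-stream marker

def writer():
    while True:
        row = channel.get()  # Writer pulls records from the Channel
        if row is None:
            break
        written.append(row)

threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(written == source_rows)  # True: every record flowed through the Channel
```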
Job Scheduling
When a job starts, DataX splits it into multiple Task units according to the source’s split strategy.
Tasks are grouped into TaskGroup objects based on the user‑defined concurrency level.
Each Task launches a Reader thread that streams data through a Channel to a Writer thread.
The Job monitors all TaskGroups; it exits with code 0 on success, non‑zero on failure.
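As a rough illustration of the grouping step: DataX defaults to five channels per TaskGroup (the core.container.taskGroup.channel setting), so the number of TaskGroups is the configured channel count divided by five, rounded up. A small sketch of that arithmetic:

```python
def task_groups(channels: int, channels_per_group: int = 5) -> int:
    """TaskGroups needed for a given channel concurrency (ceiling division)."""
    return -(-channels // channels_per_group)

# With the default of 5 channels per group, a 20-channel job
# is scheduled as 4 TaskGroups:
print(task_groups(20))  # 4
```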
Installation on Linux
JDK 1.8 or higher (recommended 1.8)
Python 2 or 3 (CentOS 7 includes Python 2.7)
Apache Maven 3.x (only needed to compile DataX; the binary tarball does not require Maven)
Download and extract DataX:
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files to avoid errors
Verify the installation:
# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # displays version info and usage
Preparing MySQL Databases
Install MariaDB (compatible with MySQL) on both hosts, start the service and secure the installation.
# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation   # set root password, keep remote root login enabled
Create the test database and table on each host:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member (ID INT, Name VARCHAR(20), Email VARCHAR(30));
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;
(The GRANT ... IDENTIFIED BY syntax works on MariaDB and MySQL 5.x; on MySQL 8.0 create the account first with CREATE USER, then grant privileges. Remote root access with a trivial password is for lab environments only.)
Optional stored procedure to generate millions of rows for testing:
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A INT DEFAULT 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES (A, CONCAT('LiSa', A), CONCAT('LiSa', A, '@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;
CALL test();
DataX Job Configuration
Jobs are defined in JSON files. Below are minimal templates.
Stream Reader / Writer Template
{
"job": {
"content": [{
"reader": {"name": "streamreader", "parameter": {}},
"writer": {"name": "streamwriter", "parameter": {"encoding": "", "print": true}}
}],
"setting": {"speed": {"channel": ""}}
}
}
MySQL Full-Load Template
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"preSql": ["TRUNCATE t_member"],
"session": ["SET SESSION sql_mode='ANSI'"],
"writeMode": "insert",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Run the job:
# python /usr/local/datax/bin/datax.py /path/to/install.json
(The -r and -w flags are only for printing a job template, e.g. python datax.py -r mysqlreader -w mysqlwriter; they are not passed when running a job file.)
Sample output shows processing of 2,999,999 records at ~2.57 MB/s with zero errors.
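When the same reader/writer pair is reused across many tables, the job file can be generated rather than hand-edited. A minimal sketch follows; the mysql_sync_job helper is hypothetical, not part of DataX, but the parameter names mirror the template above.

```python
import json

def mysql_sync_job(src_url, dst_url, table, user, password, channels=5):
    """Build a full-load mysqlreader -> mysqlwriter job dict."""
    conn = lambda url: [{"jdbcUrl": [url], "table": [table]}]
    return {"job": {
        "content": [{
            "reader": {"name": "mysqlreader", "parameter": {
                "username": user, "password": password,
                "column": ["*"], "splitPk": "ID",
                "connection": conn(src_url)}},
            "writer": {"name": "mysqlwriter", "parameter": {
                "username": user, "password": password,
                "column": ["*"], "preSql": [f"TRUNCATE {table}"],
                "writeMode": "insert", "connection": conn(dst_url)}},
        }],
        "setting": {"speed": {"channel": str(channels)}},
    }}

# Serialize the job for datax.py to consume:
job = mysql_sync_job(
    "jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "t_member", "root", "123123")
print(json.dumps(job, indent=2))
```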
Incremental Synchronization
The only difference from a full load is the addition of a where clause in the Reader (and, optionally, adjusted preSql in the Writer). Note that the example below keeps the TRUNCATE preSql, so the target is emptied and reloaded with just the matching rows; for a true append-only incremental sync, drop the TRUNCATE and rely on the where filter alone.
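In practice the where cutoff is usually not hard-coded into the job file; a common pattern is to persist the last synced key and rebuild the clause on each run. A sketch of that pattern follows; the checkpoint file and both helpers are assumptions of this example, not DataX features.

```python
from pathlib import Path

def incremental_where(checkpoint: Path, default_start: int = 0) -> str:
    """Build the reader's `where` clause from the last synced ID."""
    last_id = int(checkpoint.read_text()) if checkpoint.exists() else default_start
    return f"ID > {last_id}"

def save_checkpoint(checkpoint: Path, last_id: int) -> None:
    """After a successful run, record the highest ID that was copied."""
    checkpoint.write_text(str(last_id))
```

After each successful DataX run, store the maximum synced ID; the next run's where clause then picks up only newer rows.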
Incremental Job Example
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"preSql": ["TRUNCATE t_member"],
"session": ["SET SESSION sql_mode='ANSI'"],
"writeMode": "insert",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Running this job synchronizes only the 1,888 rows matching the condition, completing in ~32 seconds at 62 records/s.
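For scheduled runs (cron, Airflow, and the like), the exit-code convention noted earlier (0 on success, non-zero on failure) makes DataX easy to wrap. A minimal launcher sketch, where the default path is an assumption matching the install location used in this guide:

```python
import subprocess
import sys

def run_datax(job_json: str,
              datax_py: str = "/usr/local/datax/bin/datax.py") -> None:
    """Run a DataX job and raise if it reports failure (non-zero exit)."""
    result = subprocess.run([sys.executable, datax_py, job_json])
    if result.returncode != 0:
        raise RuntimeError(
            f"DataX job {job_json} failed with exit code {result.returncode}")
```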
Key Takeaways
DataX provides fast, reliable offline synchronization for large data volumes across heterogeneous sources.
Full‑load jobs require only basic reader/writer configuration; incremental jobs add a where filter and optional pre‑SQL handling.
Correct installation, environment preparation and database permissions are essential for successful execution.
Java Architect Handbook
Focused on Java interview questions and practical article sharing, covering algorithms, databases, Spring Boot, microservices, high concurrency, JVM, Docker containers, and ELK-related knowledge. Looking forward to progressing together with you.