Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync
This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.
Introduction
A project with 50 million rows required reliable data synchronization between a business database and a reporting database, but traditional SQL or mysqldump approaches were too slow or inaccurate, prompting the use of DataX.
DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks data integration, primarily used for offline synchronization of heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and others.
It solves heterogeneous data source sync problems by converting complex mesh topologies into a star‑shaped data flow, where DataX acts as the middle‑man connecting various sources.
DataX 3.0 Architecture
DataX adopts a Framework + Plugin architecture. The data source reading and writing are abstracted as Reader/Writer plugins, which are managed by the central Framework.
Role
Function
Reader(采集模块)
Collects data from the source and sends it to the
Framework.
Writer(写入模块)
Pulls data from the
Frameworkand writes it to the destination.
Framework(中间商)
Connects
Readerand
Writer, handling buffering, flow‑control, concurrency, and data conversion.
Core Scheduling Process
DataX Job splits the whole job into multiple Task units according to the split strategy.
Scheduler groups Tasks into TaskGroup based on the configured concurrency.
Each Task runs a pipeline: Reader → Channel → Writer .
The Job monitors all TaskGroups and exits successfully when every group finishes (non‑zero exit code on error).
Installation and Preparation
JDK 1.8+ (recommended 1.8)
Python 2 or 3
Apache Maven 3.x (only needed for compiling DataX; binary tar package can be used without Maven)
Example of installing JDK on CentOS:
# tar zxf jdk-8u181-linux-x64.tar.gz
mv jdk1.8.0_181 /usr/local/java
cat >> /etc/profile <MySQL Setup
Install MariaDB on two hosts (CentOS 7.4) and start the service.
Create database and table: create database `course-study`; create table `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));
Grant privileges for remote access: grant all privileges on *.* to root@'%' identified by '123123'; flush privileges;
Running a Full‑Load Job
Generate a template JSON with:
# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriterFill in the connection details (example shown below) and execute:
# python /usr/local/datax/bin/datax.py install.jsonSample output indicates 2,999,999 records synchronized at ~2.57 MB/s.
Incremental Synchronization
The only difference from full‑load is adding a where clause to the reader configuration and optionally removing preSql statements.
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"username": "root",
"password": "123123",
"preSql": ["truncate t_member"],
"session": ["set session sql_mode='ANSI'"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Running the job produces 1,888 records (the defined incremental range) with the expected speed.
Key Takeaways
DataX provides a reliable way to sync large volumes of data between heterogeneous sources.
Full‑load jobs are straightforward; incremental jobs require only a where clause.
The Framework + Plugin model makes it easy to add new readers or writers without changing core logic.
Java Tech Enthusiast
Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.