Master DataX: Efficient Offline Data Sync for Heterogeneous Sources
This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.
A project with 50 million records required reliable cross‑database synchronization, but traditional approaches such as mysqldump or file‑based export/import proved too slow and error‑prone.
DataX Overview
DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline data synchronization between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It abstracts source and target connections into a star‑shaped data pipeline, simplifying the addition of new data sources.
DataX 3.0 Framework Design
DataX follows a Framework + Plugin model where data readers and writers are implemented as plugins. The core components are:
Reader : collects data from the source and sends it to the Framework.
Writer : receives data from the Framework and writes it to the destination.
Framework : connects Reader and Writer, handling buffering, flow control, concurrency, and data transformation.
DataX 3.0 Core Architecture
A synchronization job (Job) is split into multiple parallel tasks (Task) based on the source’s split strategy. The Scheduler groups tasks into TaskGroups, each launched by the Job. Execution proceeds as Reader → Channel → Writer, and the Job monitors TaskGroups until completion.
Job divides work into Tasks.
Scheduler assembles Tasks into TaskGroups according to the configured concurrency.
Each Task starts a Reader, passes data through a Channel, and invokes a Writer.
Job exits successfully when all TaskGroups finish (non‑zero exit code indicates failure).
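As a concrete example of the split: per the DataX documentation, each TaskGroup runs up to five concurrent channels by default (core.container.taskGroup.channel), so a job configured for 20 channels is scheduled into ceil(20 / 5) = 4 TaskGroups. The arithmetic, sketched in shell:

```shell
# TaskGroup count = ceil(channels / channels_per_group)
channels=20
per_group=5   # DataX default: core.container.taskGroup.channel
echo $(( (channels + per_group - 1) / per_group ))   # prints 4
```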
Preparing the Environment
JDK 1.8 or higher (recommended 1.8)
Python 2 or 3
Apache Maven 3.x (only needed for compiling DataX from source)
Example host table:
| Hostname | OS | IP | Package |
|----------|----|----|---------|
| MySQL-1 | CentOS 7.4 | 192.168.1.1 | jdk-8u181-linux-x64.tar.gz, datax.tar.gz |
| MySQL-2 | CentOS 7.4 | 192.168.1.2 | – |
Installing JDK
<code># ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<'END' >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
# source /etc/profile
# java -version</code>
CentOS 7 already includes Python 2.7, so no additional installation is required.
Installing DataX on Linux
<code># wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*
# cd /usr/local/datax/bin
# python datax.py ../job/job.json # verify installation</code>
A successful verification run prints job statistics such as total records, speed, and duration.
Using DataX for Data Synchronization
DataX jobs are defined in JSON. A basic template for a stream reader/writer looks like:
<code>{
  "job": {
    "content": [
      {
        "reader": {"name": "streamreader", "parameter": {}},
        "writer": {"name": "streamwriter", "parameter": {"print": true}}
      }
    ],
    "setting": {"speed": {"channel": ""}}
  }
}</code>
For MySQL‑to‑MySQL full synchronization, create a JSON job specifying the source and target connections, columns, and optional pre‑SQL statements (e.g., truncate t_member). Example execution:
<code># python /usr/local/datax/bin/datax.py install.json</code>
The job log reports total records (e.g., 2,999,999), throughput (≈2.57 MB/s), and completion time.
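The full‑synchronization job could be sketched as follows. This is a minimal example assuming the hosts from the table above (reading from 192.168.1.1, writing to 192.168.1.2), the sample table t_member, and placeholder credentials; adjust columns, credentials, and connection details for your schema.

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "connection": [{
              "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
              "table": ["t_member"]
            }]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "preSql": ["truncate t_member"],
            "connection": [{
              "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
              "table": ["t_member"]
            }],
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}
```

The preSql truncate empties the target table before loading, which is what makes this a full rather than an incremental sync.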
Incremental Synchronization
Incremental sync is achieved by adding a where clause to the reader configuration and optionally removing any preSql statement that would truncate the target table.
<code>{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "where": "ID <= 1888",
            "connection": [{
              "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
              "table": ["t_member"]
            }]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["*"],
            "connection": [{
              "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
              "table": ["t_member"]
            }],
            "password": "123123",
            "preSql": ["truncate t_member"],
            "session": ["set session sql_mode='ANSI'"],
            "username": "root",
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}</code>
Running this job synchronizes only the rows matching the where condition, allowing efficient incremental updates.
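To avoid hard‑coding the boundary value, datax.py accepts runtime parameters via -p "-Dkey=value", which are substituted into ${key} placeholders in the job JSON. A sketch of a parameterized reader fragment (the start_id name is illustrative, not a built‑in):

```json
"reader": {
  "name": "mysqlreader",
  "parameter": {
    "where": "ID > ${start_id}"
  }
}
```

Assuming the job is saved as incr.json, it could then be run as `python /usr/local/datax/bin/datax.py -p "-Dstart_id=1888" incr.json`, with the caller tracking the last synchronized ID between runs.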
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.