Master DataX: Fast Offline Data Sync for MySQL without mysqldump
This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.
Preface
A project with 50 million rows required reliable data synchronization between a business database and a reporting database, but cross‑database SQL queries were impractical.
Initial ideas such as mysqldump or simple file storage proved too slow or unreliable.
What Is DataX?
DataX is the open‑source version of Alibaba Cloud DataWorks data integration, designed for offline data synchronization across heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc.
It transforms complex mesh‑like sync topologies into a star‑shaped architecture, where DataX acts as the middle‑man connecting various data sources.
DataX 3.0 Framework Design
DataX follows a Framework + Plugin model. Readers and Writers are implemented as plugins that plug into the central Framework.
The core components are:
Reader (collection module): collects data from a source and sends it to the Framework.
Writer (write module): pulls data from the Framework and writes it to the target.
Framework (middleman): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.
DataX 3.0 Core Architecture
A DataX job is the unit of work. When a job starts, it spawns a process that manages the whole sync lifecycle, including data cleaning, task splitting, and TaskGroup management.
DataX splits the job into multiple Task instances based on the source’s split strategy.
The Scheduler assembles Tasks into TaskGroup according to the configured concurrency.
Each Task launches a Reader → Channel → Writer pipeline.
The Job monitors all TaskGroups and exits with code 0 once every group finishes successfully; a non‑zero exit code indicates failure.
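As a back‑of‑the‑envelope illustration of this scheduling model, the number of TaskGroups follows from the configured channel count. The five‑channels‑per‑group figure is the default shipped in DataX's core configuration and is an assumption here; your build may differ.

```python
import math

def task_groups(channels: int, channels_per_group: int = 5) -> int:
    """Estimate how many TaskGroups the Scheduler creates for a given
    channel (concurrency) setting; DataX defaults to 5 channels per group."""
    return math.ceil(channels / channels_per_group)

# A job configured with 20 channels is split into 4 TaskGroups,
# each running its Tasks as Reader -> Channel -> Writer pipelines.
print(task_groups(20))  # 4
```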
Using DataX for Data Synchronization
Prerequisites
JDK 1.8 or higher (recommended 1.8)
Python 2 or 3 (CentOS 7 already includes Python 2.7)
Apache Maven 3.x (only needed for compiling DataX; binary tarball can be used without Maven)
Install JDK
# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# echo "export JAVA_HOME=/usr/local/java" >> /etc/profile
# echo "export PATH=\$PATH:\"\$JAVA_HOME/bin\"" >> /etc/profile
# source /etc/profile
# java -version
CentOS 7 already provides Python 2.7, so no extra installation is required.
Install DataX
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._* # delete hidden files
Verify the installation:
# cd /usr/local/datax/bin
# python datax.py ../job/job.json # should print job info
Basic DataX Usage
Show the template for a stream reader/writer:
# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
The command prints usage instructions and a JSON skeleton.
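For a quick smoke test, the printed skeleton can be filled in as below (values are illustrative); save it as, say, stream.json and run it with datax.py to see the rows printed to the console:

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "sliceRecordCount": 10,
          "column": [
            {"type": "string", "value": "hello DataX"},
            {"type": "long", "value": "1024"}
          ]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": {"encoding": "UTF-8", "print": true}
      }
    }],
    "setting": {"speed": {"channel": "1"}}
  }
}
```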
Installing MySQL on Two Hosts
# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation # set root password, remove anonymous users, etc.
Create identical tables on both hosts:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID INT, Name VARCHAR(20), Email VARCHAR(30));
Grant full privileges for DataX:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;
Full‑Volume Sync with DataX
Generate a MySQL‑to‑MySQL job template:
# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
"job": {
"content": [{
"reader": {"name": "mysqlreader", "parameter": {"column": [], "connection": [{"jdbcUrl": [], "table": []}], "username": "", "password": "", "where": ""}},
"writer": {"name": "mysqlwriter", "parameter": {"column": [], "connection": [{"jdbcUrl": "", "table": []}], "username": "", "password": "", "preSql": [], "session": [], "writeMode": ""}}
}],
"setting": {"speed": {"channel": ""}}
}
}
Fill in the JSON with actual connection strings, credentials, and optional pre‑SQL (e.g., truncate t_member).
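As an illustration, a filled‑in full‑volume job might look like the following (the host addresses and the 123123 password follow the earlier examples; adjust them to your environment):

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["ID", "Name", "Email"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["ID", "Name", "Email"],
          "preSql": ["truncate t_member"],
          "writeMode": "insert",
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }]
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}
```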
Run the job:
# python /usr/local/datax/bin/datax.py install.json
Sample output shows processing of ~3 million records at ~2.5 MB/s.
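A simple sanity check after the run is to compare row counts on the two hosts, using the table created above:

```sql
-- run on both the source (192.168.1.1) and the target (192.168.1.2):
SELECT COUNT(*) FROM `course-study`.t_member;
-- the two counts should match once the job exits with code 0
```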
Incremental Sync
Incremental sync only differs by adding a where clause to filter rows.
{
"job": {
"content": [{
"reader": {"name": "mysqlreader", "parameter": {"username": "root", "password": "123123", "column": ["*"], "splitPk": "ID", "where": "ID <= 1888", "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}] }},
"writer": {"name": "mysqlwriter", "parameter": {"column": ["*"], "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}], "username": "root", "password": "123123", "writeMode": "insert"}}
}],
"setting": {"speed": {"channel": "5"}}
}
}
Run the incremental job similarly; the output confirms only the filtered rows were transferred.
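In practice the where condition is usually generated from a checkpoint (the last synced primary key or timestamp) rather than hard‑coded. A minimal sketch, assuming the checkpoint is tracked outside DataX and the primary key is ID as above:

```python
import json

def with_incremental_where(job: dict, last_id: int) -> dict:
    """Return a copy of a DataX job dict whose mysqlreader only
    reads rows newer than the last synced primary key."""
    patched = json.loads(json.dumps(job))  # cheap deep copy
    reader = patched["job"]["content"][0]["reader"]["parameter"]
    reader["where"] = f"ID > {last_id}"
    return patched

# Patch a skeleton job so the next run picks up where the last one stopped.
job = {"job": {"content": [{"reader": {"name": "mysqlreader",
                                       "parameter": {}}}]}}
patched = with_incremental_where(job, 1888)
print(patched["job"]["content"][0]["reader"]["parameter"]["where"])  # ID > 1888
```

Writing the patched dict back out with json.dump gives a job file ready for datax.py.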
Key Takeaways
DataX provides a robust, star‑topology solution for large‑scale offline data sync.
Full sync works well for initial loads; incremental sync requires proper where conditions and may need to remove destructive preSql statements.
The framework’s plug‑in architecture makes it easy to extend to other data sources.
Author: 愿许浪尽天涯 (source: blog.csdn.net/weixin_46902396/article/details/121904705)