How to Sync Massive MySQL Data with Alibaba DataX – Step‑by‑Step Guide
Facing a 50‑million‑row project with inaccurate reports and cross‑database operations, this guide explains why mysqldump and simple storage methods fail, introduces Alibaba’s open‑source DataX middleware, details its architecture, installation, and step‑by‑step configurations for full and incremental MySQL data synchronization.
When a project has 50 million rows and the reporting data is inaccurate, traditional mysqldump or simple storage methods are too slow and often ineffective.
DataX, the open‑source version of Alibaba DataWorks Data Integration, provides efficient offline synchronization for heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and more. It transforms complex mesh synchronization topologies into a star‑shaped data flow, acting as a middle‑man between readers and writers.
DataX 3.0 Framework Design
DataX follows a Framework + Plugin architecture. The Reader and Writer plugins abstract data source reading and writing, while the Framework handles buffering, flow control, concurrency, and data conversion.
The core components are:
Reader (采集模块) : collects data from a source and sends it to the Framework.
Writer (写入模块) : pulls data from the Framework and writes it to the target.
Framework (中间商) : connects Reader and Writer, managing channels, flow control, and conversion.
DataX Job Execution
A DataX job is split into multiple Tasks, grouped into TaskGroups according to the configured concurrency. Each Task runs a Reader → Channel → Writer pipeline. The Job monitors TaskGroups and exits with a non‑zero value on failure.
DataX Job splits the work into Tasks based on source split strategy.
Scheduler assembles Tasks into TaskGroups according to the user‑defined concurrency.
Each Task launches a Reader → Channel → Writer thread chain.
The Job waits for all TaskGroups to finish; a non‑zero exit code indicates errors.
Using DataX for MySQL Synchronization
Prerequisites: JDK 1.8+, Python 2/3, Apache Maven 3.x (for compiling DataX). Install JDK, extract DataX, and verify the installation.
# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# echo "export JAVA_HOME=/usr/local/java" >> /etc/profile
# source /etc/profile
# java -versionOn CentOS 7 the system already provides Python 2.7, so no extra installation is needed.
Install DataX on Linux
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*Validate the installation with:
# cd /usr/local/datax/bin
# python datax.py ../job/job.jsonPrepare MySQL Databases
Create the same database and table on both hosts:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));Grant privileges for DataX to access the databases:
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;Full‑Volume Synchronization
Generate a MySQL‑to‑MySQL template, edit the JSON job file with connection details, and run:
# python /usr/local/datax/bin/datax.py install.jsonThe job logs show total records, speed, and success status.
Incremental Synchronization
Incremental sync is achieved by adding a where clause to the reader configuration and optionally a preSql statement for the target.
"where": "ID <= 1888",
"preSql": ["truncate t_member"]Running the job with this configuration syncs only the filtered rows.
DataX thus provides a flexible, high‑performance solution for both full and incremental data synchronization across heterogeneous databases.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
