Big Data 14 min read

How to Sync Massive MySQL Data with Alibaba DataX – Step‑by‑Step Guide

Facing a 50‑million‑row project with inaccurate reports and cross‑database operations, this guide explains why mysqldump and simple storage methods fail, introduces Alibaba’s open‑source DataX middleware, details its architecture, installation, and step‑by‑step configurations for full and incremental MySQL data synchronization.

Su San Talks Tech
Su San Talks Tech
Su San Talks Tech
How to Sync Massive MySQL Data with Alibaba DataX – Step‑by‑Step Guide

When a project has 50 million rows and the reporting data is inaccurate, traditional mysqldump or simple storage methods are too slow and often ineffective.

DataX, the open‑source version of Alibaba DataWorks Data Integration, provides efficient offline synchronization for heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and more. It transforms complex mesh synchronization topologies into a star‑shaped data flow, acting as a middle‑man between readers and writers.

DataX 3.0 Framework Design

DataX follows a Framework + Plugin architecture. The Reader and Writer plugins abstract data source reading and writing, while the Framework handles buffering, flow control, concurrency, and data conversion.

The core components are:

Reader (采集模块) : collects data from a source and sends it to the Framework.

Writer (写入模块) : pulls data from the Framework and writes it to the target.

Framework (中间商) : connects Reader and Writer, managing channels, flow control, and conversion.

DataX Job Execution

A DataX job is split into multiple Tasks, grouped into TaskGroups according to the configured concurrency. Each Task runs a Reader → Channel → Writer pipeline. The Job monitors TaskGroups and exits with a non‑zero value on failure.

DataX Job splits the work into Tasks based on source split strategy.

Scheduler assembles Tasks into TaskGroups according to the user‑defined concurrency.

Each Task launches a Reader → Channel → Writer thread chain.

The Job waits for all TaskGroups to finish; a non‑zero exit code indicates errors.

Using DataX for MySQL Synchronization

Prerequisites: JDK 1.8+, Python 2/3, Apache Maven 3.x (for compiling DataX). Install JDK, extract DataX, and verify the installation.

# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# echo "export JAVA_HOME=/usr/local/java" >> /etc/profile
# source /etc/profile
# java -version

On CentOS 7 the system already provides Python 2.7, so no extra installation is needed.

Install DataX on Linux

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*

Validate the installation with:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json

Prepare MySQL Databases

Create the same database and table on both hosts:

CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));

Grant privileges for DataX to access the databases:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

Full‑Volume Synchronization

Generate a MySQL‑to‑MySQL template, edit the JSON job file with connection details, and run:

# python /usr/local/datax/bin/datax.py install.json

The job logs show total records, speed, and success status.

Incremental Synchronization

Incremental sync is achieved by adding a where clause to the reader configuration and optionally a preSql statement for the target.

"where": "ID <= 1888",
"preSql": ["truncate t_member"]

Running the job with this configuration syncs only the filtered rows.

DataX thus provides a flexible, high‑performance solution for both full and incremental data synchronization across heterogeneous databases.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

mysqldata synchronizationDataXETLIncremental Sync
Su San Talks Tech
Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.