
Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

macrozheng

Problem Overview

The project's table holds up to 50 million records, yet reporting data is inaccurate, and the business and reporting databases run on different instances, which rules out direct cross-database SQL synchronization. Initial attempts with <code>mysqldump</code> or file-based export proved impractical: backups and restores were time-consuming, and new data kept arriving while they ran.

What is DataX

DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It is designed primarily for offline synchronization of heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, FTP, etc. By turning complex mesh sync topologies into a star‑shaped data link, DataX acts as a middle‑transport layer that connects various sources seamlessly.

DataX Overview

DataX Architecture

DataX follows a Framework + Plugin model. Data sources are abstracted as Reader plugins, while destinations are Writer plugins. Both are managed by a central Framework that handles buffering, flow control, concurrency, and data conversion.

DataX Core Architecture

Reader (collect module): collects data from the source and forwards it to the Framework.

Writer (write module): pulls data from the Framework and writes it to the destination.

Framework (mediator): connects Reader and Writer, providing buffering, flow control, concurrency, and data conversion.

Job Scheduling

When a DataX Job starts, it splits the work into multiple Tasks according to the source's sharding strategy. The Scheduler then groups Tasks into TaskGroups based on the configured concurrency, and each Task runs the pipeline Reader → Channel → Writer. The Job process monitors all TaskGroups and exits successfully once every group finishes; a non-zero exit code indicates failure.

DataX scheduling process:

Job splits data into Tasks and calculates the required number of TaskGroups based on the user‑defined concurrency.

Each TaskGroup runs its assigned Tasks concurrently.
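The TaskGroup calculation above can be sketched in a few lines of Python. The figure of 5 channels per TaskGroup is DataX's documented default, assumed here rather than read from a live config; the worked example (concurrency 20 → 4 TaskGroups) follows the one in the DataX project README.

```python
import math

# DataX's documented default: each TaskGroup drives 5 channels
# (Tasks running concurrently). Assumed constant, not read from config.
CHANNELS_PER_TASK_GROUP = 5

def task_group_count(channel_concurrency: int,
                     channels_per_group: int = CHANNELS_PER_TASK_GROUP) -> int:
    """Number of TaskGroups the Scheduler allocates for a given
    user-configured channel concurrency (job.setting.speed.channel)."""
    return math.ceil(channel_concurrency / channels_per_group)

# A job sharded into 100 Tasks with a concurrency of 20 channels is run
# by 20 / 5 = 4 TaskGroups, each pushing 5 Tasks at a time through the
# Reader -> Channel -> Writer pipeline.
print(task_group_count(20))  # 4
```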

Installation Steps

Prerequisites: JDK 1.8+, Python (2 or 3), and Apache Maven 3.x (Maven is only needed when compiling DataX from source; the pre-built <code>tar.gz</code> package can be used directly).

Install JDK:

<code>tar zxf jdk-8u181-linux-x64.tar.gz
mv jdk1.8.0_181 /usr/local/java
# Persist the environment variables so they survive a new login shell
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
EOF
source /etc/profile
java -version</code>

Install DataX on Linux:

<code>wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar zxf datax.tar.gz -C /usr/local/
# Remove macOS metadata files shipped in the archive; they break plugin loading
rm -rf /usr/local/datax/plugin/*/._*
# Verify the installation by running the built-in self-test job
cd /usr/local/datax/bin
python datax.py ../job/job.json</code>

MySQL Setup

Install MariaDB (a drop-in MySQL replacement) on both hosts, start the service, and secure the installation. Create a database and table, then grant privileges to the <code>root</code> user (a blanket grant like this is fine for a test environment, but avoid it in production):

<code># On both hosts:
yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
systemctl start mariadb
mysql_secure_installation</code>

<code>-- In the mysql client, create the database and table:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member (ID int, Name varchar(20), Email varchar(30));
GRANT ALL PRIVILEGES ON *.* TO root@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;</code>

Optionally create a stored procedure to generate test data:

<code>DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A int DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();</code>

Using DataX for Synchronization

Full Synchronization:

Generate a template for MySQL‑to‑MySQL sync:

<code># python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter</code>

Write a JSON job configuration (e.g., <code>install.json</code>) specifying the source and target JDBC URLs, tables, credentials, and optional pre-SQL statements such as <code>TRUNCATE t_member</code>. Then run:

<code># python /usr/local/datax/bin/datax.py install.json</code>

Sample output shows processing of ~3 million records at ~2.5 MB/s with zero errors.
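For reference, a minimal <code>install.json</code> might look like the sketch below. The IPs, passwords, and column list are placeholders for this article's test setup; the field names match the template generated by the command above (note that mysqlreader takes <code>jdbcUrl</code> as an array, while mysqlwriter takes a plain string).

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 5 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://192.168.3.110:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "preSql": ["TRUNCATE t_member"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": "jdbc:mysql://192.168.3.111:3306/course-study"
              }
            ]
          }
        }
      }
    ]
  }
}
```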

Incremental Synchronization:

The only difference from a full sync is adding a <code>where</code> clause to filter records and adjusting <code>preSql</code> as needed. For example, a <code>where.json</code> job that includes <code>"where": "ID <= 1888"</code> syncs only that ID range; running it yields the expected subset of records.

Subsequent incremental runs can adjust the <code>where</code> condition (e.g., <code>"ID > 1888 AND ID <= 2888"</code>) and remove the <code>preSql</code> statement that truncates the target table, so previously synced rows are preserved.

Key Takeaways

DataX provides a robust, plugin‑based framework for both full‑ and incremental data synchronization across heterogeneous sources, handling task parallelism, fault tolerance, and data conversion without requiring custom scripts or heavyweight ETL platforms.

Big Data · MySQL · Data Synchronization · DataX · ETL
Written by macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
