
Master DataX: Efficient Data Synchronization for Massive MySQL Datasets

Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full- and incremental sync, and practical command-line examples.

macrozheng

Problem Overview

The project's table holds up to 50 million records, yet reporting data is inaccurate, and the business and reporting databases run on different instances, which rules out direct cross-database SQL synchronization. Initial attempts with <code>mysqldump</code> or file-based export proved impractical: backups and restores were time-consuming, and new data kept arriving while they ran.

What is DataX

DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It is designed primarily for offline synchronization of heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, FTP, etc. By turning complex mesh sync topologies into a star‑shaped data link, DataX acts as a middle‑transport layer that connects various sources seamlessly.

DataX Overview

DataX Architecture

DataX follows a Framework + Plugin model. Data sources are abstracted as Reader plugins, while destinations are Writer plugins. Both are managed by a central Framework that handles buffering, flow control, concurrency, and data conversion.

DataX Core Architecture

Reader (collect module): collects data from the source and forwards it to the Framework.

Writer (write module): pulls data from the Framework and writes it to the destination.

Framework (mediator): connects Reader and Writer, providing buffering, flow control, concurrency, and data conversion.

Job Scheduling

When a DataX Job starts, it splits the work into multiple Tasks according to the source's sharding strategy. The Scheduler then groups Tasks into TaskGroups based on the configured concurrency, and each Task runs the pipeline Reader → Channel → Writer. The Job process monitors all TaskGroups and exits successfully once every group finishes; a non-zero exit code indicates failure.

DataX scheduling process:

Job splits data into Tasks and calculates the required number of TaskGroups based on the user‑defined concurrency.

Each TaskGroup runs its assigned Tasks concurrently.
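The TaskGroup calculation above can be sketched in a few lines of Python. The figure of 5 channels per TaskGroup is DataX's documented default, assumed here rather than read from a live config; the worked example (concurrency 20 → 4 TaskGroups) follows the one in the DataX project README.

```python
import math

# DataX's documented default: each TaskGroup drives 5 channels
# (Tasks running concurrently). Assumed constant, not read from config.
CHANNELS_PER_TASK_GROUP = 5

def task_group_count(channel_concurrency: int,
                     channels_per_group: int = CHANNELS_PER_TASK_GROUP) -> int:
    """Number of TaskGroups the Scheduler allocates for a given
    user-configured channel concurrency (job.setting.speed.channel)."""
    return math.ceil(channel_concurrency / channels_per_group)

# A job sharded into 100 Tasks with a concurrency of 20 channels is run
# by 20 / 5 = 4 TaskGroups, each pushing 5 Tasks at a time through the
# Reader -> Channel -> Writer pipeline.
print(task_group_count(20))  # 4
```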

Installation Steps

Prerequisites: JDK 1.8+, Python (2 or 3), and Apache Maven 3.x (Maven is only needed when compiling DataX from source; the pre-built <code>tar.gz</code> package can be used directly).

Install JDK:

<code>tar zxf jdk-8u181-linux-x64.tar.gz
mv jdk1.8.0_181 /usr/local/java
# Persist the environment variables so they survive a new login shell
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
EOF
source /etc/profile
java -version</code>

Install DataX on Linux:

<code>wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar zxf datax.tar.gz -C /usr/local/
# Remove macOS metadata files shipped in the archive; they break plugin loading
rm -rf /usr/local/datax/plugin/*/._*
# Verify the installation by running the built-in self-test job
cd /usr/local/datax/bin
python datax.py ../job/job.json</code>

MySQL Setup

Install MariaDB (a drop-in MySQL replacement) on both hosts, start the service, and secure the installation. Create a database and table, then grant privileges to the <code>root</code> user (a blanket grant like this is fine for a test environment, but avoid it in production):

<code># On both hosts:
yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
systemctl start mariadb
mysql_secure_installation</code>

<code>-- In the mysql client, create the database and table:
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member (ID int, Name varchar(20), Email varchar(30));
GRANT ALL PRIVILEGES ON *.* TO root@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;</code>

Optionally create a stored procedure to generate test data:

<code>DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A int DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();</code>

Using DataX for Synchronization

Full Synchronization:

Generate a template for MySQL‑to‑MySQL sync:

<code># python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter</code>

Write a JSON job configuration (e.g., <code>install.json</code>) specifying the source and target JDBC URLs, tables, credentials, and optional pre-SQL statements such as <code>TRUNCATE t_member</code>. Then run:

<code># python /usr/local/datax/bin/datax.py install.json</code>

Sample output shows processing of ~3 million records at ~2.5 MB/s with zero errors.
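For reference, a minimal <code>install.json</code> might look like the sketch below. The IPs, passwords, and column list are placeholders for this article's test setup; the field names match the template generated by the command above (note that mysqlreader takes <code>jdbcUrl</code> as an array, while mysqlwriter takes a plain string).

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 5 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://192.168.3.110:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "preSql": ["TRUNCATE t_member"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": "jdbc:mysql://192.168.3.111:3306/course-study"
              }
            ]
          }
        }
      }
    ]
  }
}
```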

Incremental Synchronization:

The only difference from a full sync is adding a <code>where</code> clause to filter records and adjusting <code>preSql</code> as needed. For example, a <code>where.json</code> job that includes <code>"where": "ID <= 1888"</code> syncs only that ID range; running it yields the expected subset of records.

Subsequent incremental runs can adjust the <code>where</code> condition (e.g., <code>"ID > 1888 AND ID <= 2888"</code>) and remove the <code>preSql</code> statement that truncates the target table, so previously synced rows are preserved.

Key Takeaways

DataX provides a robust, plugin‑based framework for both full‑ and incremental data synchronization across heterogeneous sources, handling task parallelism, fault tolerance, and data conversion without requiring custom scripts or heavyweight ETL platforms.

Big Data · MySQL · Data Synchronization · DataX · ETL
Written by macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
