
Using DataX for Efficient MySQL Data Synchronization

This article provides a comprehensive guide to using Alibaba's open-source DataX tool for efficient offline synchronization between heterogeneous databases such as MySQL, covering its architecture, installation on Linux, job configuration, full and incremental data transfer, and practical code examples.


Introduction

Our company has a project whose table holds 50 million records, and the reporting data was inaccurate. Because the business database and the reporting database sit on separate servers, cross-database SQL synchronization was impossible. Initial attempts with mysqldump and file-based storage proved inefficient:

mysqldump: both the backup and the restore take time, and new records written during the backup window are missed, leaving the two sides out of sync.

File-based storage: far too slow; three hours synced only two thousand rows.

After some research, we discovered DataX, an open-source data integration tool that delivers fast, reliable synchronization.

1. DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks data integration, designed for offline data synchronization across heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc.

DataX turns a complex mesh of point-to-point sync links into a star-shaped data flow, acting as the central transport layer: each source and destination only needs to talk to DataX.

Adding a new data source only requires plugging it into DataX, enabling seamless integration.

1.1 DataX 3.0 Framework Design

DataX follows a Framework + Plugin architecture, abstracting data source reading and writing as Reader/Writer plugins.

Component roles:

Reader (collector): collects data from the source and sends it to the Framework.

Writer: pulls data from the Framework and writes it to the destination.

Framework (transport channel): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.

1.2 DataX 3.0 Core Architecture

A DataX job represents a single synchronization task. Upon receiving a job, DataX launches a process that manages data cleaning, task splitting, and TaskGroup management.

Job splits into multiple Tasks based on source split strategy for concurrent execution.

Scheduler assembles Tasks into TaskGroups according to configured concurrency.

Each Task runs Reader → Channel → Writer threads.

Job monitors TaskGroups and exits when all complete.

DataX scheduling process:

The Job splits into Tasks and computes the number of TaskGroups needed from the configured channel concurrency.

Number of TaskGroups = total channel concurrency ÷ channels per TaskGroup (5 by default); each TaskGroup then runs its share of Tasks.
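As a rough illustration of the split arithmetic above (assuming DataX's documented default of 5 channels per TaskGroup; the function name is ours, not part of DataX):

```python
import math

CHANNELS_PER_TASK_GROUP = 5  # DataX default


def plan_task_groups(task_count: int, channel_count: int) -> dict:
    """Sketch of how a Job's Tasks are packed into TaskGroups."""
    group_count = math.ceil(channel_count / CHANNELS_PER_TASK_GROUP)
    tasks_per_group = math.ceil(task_count / group_count)
    return {"task_groups": group_count, "tasks_per_group": tasks_per_group}


# e.g. 100 Tasks at a channel concurrency of 20
print(plan_task_groups(100, 20))  # {'task_groups': 4, 'tasks_per_group': 25}
```

With 100 Tasks and 20 channels, DataX would form 4 TaskGroups, each responsible for 25 Tasks run 5 at a time.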

2. Using DataX for Data Synchronization

Prerequisites: JDK 1.8+, Python 2/3, Apache Maven 3.x (for compiling DataX).

Installation steps for JDK, DataX, MySQL, and creating test tables are provided, followed by granting privileges and creating a stored procedure to generate test data.

[root@MySQL-1 ~]# tar zxf jdk-8u181-linux-x64.tar.gz
[root@MySQL-1 ~]# mv jdk1.8.0_181 /usr/local/java
[root@MySQL-1 ~]# cat << END >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
[root@MySQL-1 ~]# source /etc/profile
[root@MySQL-1 ~]# java -version
[root@MySQL-1 ~]# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
[root@MySQL-1 ~]# tar zxf datax.tar.gz -C /usr/local/
[root@MySQL-1 ~]# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden macOS files
MariaDB [(none)]> create database `course-study`;
Query OK, 1 row affected (0.00 sec)
MariaDB [(none)]> create table `course-study`.t_member(ID int,Name varchar(20),Email varchar(30));
Query OK, 0 rows affected (0.00 sec)
grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  declare A int default 1;
  while (A < 3000000) do
    insert into `course-study`.t_member values(A,concat('LiSa',A),concat('LiSa',A,'@163.com'));
    set A = A + 1;
  END while;
END $$
DELIMITER ;
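Remember to populate the table with call test(); after creating the procedure. For a quick sanity check of the row format outside MySQL, this small Python sketch (a stand-in for the SQL loop, not part of DataX) reproduces the first few rows:

```python
def make_member_row(i: int) -> tuple:
    """Mirror of the stored procedure's values: (ID, Name, Email)."""
    return (i, f"LiSa{i}", f"LiSa{i}@163.com")


rows = [make_member_row(i) for i in range(1, 4)]
print(rows)
# [(1, 'LiSa1', 'LiSa1@163.com'), (2, 'LiSa2', 'LiSa2@163.com'), (3, 'LiSa3', 'LiSa3@163.com')]
```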

2.1 Full Synchronization

Generate a MySQL-to-MySQL job template, edit the resulting JSON file, and run the job:

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
# python /usr/local/datax/bin/datax.py install.json

Sample output shows 2.57 MB/s throughput and successful completion.
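For reference, a filled-in MySQL-to-MySQL job file might look like the sketch below. The hosts, credentials, and column list are placeholders matching the test table from this article; adapt them to your environment.

```json
{
  "job": {
    "setting": {
      "speed": { "channel": 4 }
    },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "preSql": ["truncate table t_member"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study"
              }
            ]
          }
        }
      }
    ]
  }
}
```

The preSql truncate empties the destination table before a full load; drop it when running incremental jobs.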

2.2 Incremental Synchronization

Incremental sync is achieved by adding a where clause to the reader so that only the desired records are selected. The example JSON adds the where condition and removes the preSql truncation used for full loads:

"where": "ID <= 1888"

Running the job yields 62 records/s for the filtered dataset.
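One common way to drive incremental runs is to patch the job file's where clause from a stored checkpoint before each run. The helper below is a hypothetical sketch, not part of DataX itself; note the demo above filters ID <= 1888, whereas a production checkpoint would typically filter ID > the last synced ID:

```python
import json


def set_incremental_where(job: dict, last_id: int) -> dict:
    """Inject a where filter into every mysqlreader in a DataX job dict."""
    for content in job["job"]["content"]:
        reader = content["reader"]
        if reader["name"] == "mysqlreader":
            reader["parameter"]["where"] = f"ID > {last_id}"
    return job


# Minimal job skeleton for illustration
job = {"job": {"content": [{"reader": {"name": "mysqlreader", "parameter": {}}}]}}
patched = set_incremental_where(job, 1888)
print(json.dumps(patched["job"]["content"][0]["reader"]["parameter"]))
# {"where": "ID > 1888"}
```

After patching, write the dict back to the job file with json.dump and invoke datax.py as usual.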

The article concludes with a reminder that full‑sync may be interrupted for very large datasets, making incremental sync essential.


Tags: Big Data, MySQL, Data Synchronization, DataX, ETL
Written by

Top Architect

Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
