Master DataX: Efficient Data Synchronization for Massive MySQL Datasets
Learn how to overcome inaccurate reporting and cross-database challenges by using Alibaba’s open-source DataX tool to efficiently synchronize massive MySQL datasets, covering its architecture, job scheduling, installation, configuration, full and incremental synchronization, and practical command-line examples.
Problem Overview
A project contains up to 50 million records, but reporting data is inaccurate, and the business and reporting databases run on different instances, making direct SQL synchronization impossible. Initial attempts using mysqldump or file-based export proved impractical: backups and syncs were time-consuming, and new data could arrive while they ran.
What is DataX
DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It is designed primarily for offline synchronization of heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, FTP, etc. By turning complex mesh sync topologies into a star‑shaped data link, DataX acts as a middle‑transport layer that connects various sources seamlessly.
DataX Architecture
DataX follows a Framework + Plugin model. Data sources are abstracted as Reader plugins, while destinations are Writer plugins. Both are managed by a central Framework that handles buffering, flow control, concurrency, and data conversion.
Reader (Collect Module): collects data from the source and forwards it to the Framework.
Writer (Write Module): pulls data from the Framework and writes it to the target.
Framework (Mediator): connects Reader and Writer, providing buffering, flow control, concurrency, and data conversion.
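The Reader → Framework → Writer flow can be sketched as a conceptual analogy (illustrative Python using a bounded queue as the channel; DataX itself is implemented in Java, so none of these names come from its actual API):

```python
import queue
import threading

def reader(channel, rows):
    # "Reader plugin": pull rows from the source and push them into the channel
    for row in rows:
        channel.put(row)
    channel.put(None)  # sentinel: signals end of data

def writer(channel, sink):
    # "Writer plugin": drain the channel and write each row to the target
    while True:
        row = channel.get()
        if row is None:
            break
        sink.append(row)

# "Framework": a bounded channel gives buffering and back-pressure between plugins
channel = queue.Queue(maxsize=1024)
sink = []
source = [(i, "LiSa%d" % i) for i in range(1, 6)]

t_reader = threading.Thread(target=reader, args=(channel, source))
t_writer = threading.Thread(target=writer, args=(channel, sink))
t_reader.start(); t_writer.start()
t_reader.join(); t_writer.join()

print(sink)
```

With a single reader and writer the rows arrive in order; DataX runs many such pipelines (Tasks) in parallel inside each TaskGroup.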
Job Scheduling
When a DataX Job starts, it splits the work into multiple Task units based on the source's split strategy. The Scheduler then groups Tasks into TaskGroups according to the configured concurrency. Each Task runs the pipeline Reader → Channel → Writer. The Job monitors all TaskGroups and exits successfully once every group finishes (a non-zero exit code indicates failure).
DataX scheduling process:
Job splits data into Tasks and calculates the required number of TaskGroups based on the user‑defined concurrency.
Each TaskGroup runs its assigned Tasks concurrently.
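As a worked example (the figure of five concurrent Tasks per TaskGroup is the convention described in DataX's documentation; treat the exact number as an assumption for other versions): a job split into 100 Tasks with a configured concurrency of 20 channels would be allocated 20 / 5 = 4 TaskGroups, each running its 25 Tasks five at a time. Concurrency is set in the job JSON:

```json
{
  "job": {
    "setting": {
      "speed": {
        "channel": 20
      }
    }
  }
}
```

The `content` section holding the reader and writer is omitted here for brevity.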
Installation Steps
Prerequisites: JDK 1.8+, Python (2 or 3), and Apache Maven 3.x (Maven is only needed when compiling DataX from source; the prebuilt tar package can be used directly).
Install JDK:
<code># tar zxf jdk-8u181-linux-x64.tar.gz
mv jdk1.8.0_181 /usr/local/java
# Persist the environment variables, then reload the profile
cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
EOF
source /etc/profile
java -version</code>
Install DataX on Linux:
<code># wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
tar zxf datax.tar.gz -C /usr/local/
rm -rf /usr/local/datax/plugin/*/._*
# Verify installation
cd /usr/local/datax/bin
python datax.py ../job/job.json</code>
MySQL Setup
On both hosts install MariaDB (MySQL-compatible), start the service, and secure the installation. Create a database and table, then grant full privileges to the root user:
<code># yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
systemctl start mariadb
mysql_secure_installation
# Create database and table
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));
GRANT ALL PRIVILEGES ON *.* TO root@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;</code>
Optionally, create a stored procedure to generate test data:
<code>DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A int DEFAULT 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;
CALL test();</code>
Using DataX for Synchronization
Full Synchronization:
Generate a template for MySQL‑to‑MySQL sync:
<code># python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter</code>
Write a JSON job configuration (e.g., install.json) specifying the source and target JDBC URLs, tables, credentials, and optional pre-SQL statements such as TRUNCATE t_member. Then run:
<code># python /usr/local/datax/bin/datax.py install.json</code>
Sample output shows ~3 million records processed at ~2.5 MB/s with zero errors.
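For reference, a minimal install.json along these lines might look like the following sketch (hostnames, password, and column list are placeholders to adapt; the parameter names follow the mysqlreader/mysqlwriter template generated above):

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": ["jdbc:mysql://<source-host>:3306/course-study"]
              }
            ]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["ID", "Name", "Email"],
            "preSql": ["TRUNCATE t_member"],
            "connection": [
              {
                "table": ["t_member"],
                "jdbcUrl": "jdbc:mysql://<target-host>:3306/course-study"
              }
            ]
          }
        }
      }
    ]
  }
}
```

The preSql here empties the target table before a full load, which is why it must be removed for incremental runs.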
Incremental Synchronization:
The only difference from a full sync is adding a where clause to filter records and adjusting preSql as needed. For example, a where.json job that includes "where": "ID <= 1888" syncs only that subset; running it yields the expected records.
Further incremental runs adjust the where condition (e.g., "ID > 1888 AND ID <= 2888") and remove the preSql that truncates the target table.
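The reader half of such a where.json might look like the following sketch (connection details are placeholders; where is a standard mysqlreader parameter):

```json
"reader": {
  "name": "mysqlreader",
  "parameter": {
    "username": "root",
    "password": "123123",
    "column": ["ID", "Name", "Email"],
    "where": "ID > 1888 AND ID <= 2888",
    "connection": [
      {
        "table": ["t_member"],
        "jdbcUrl": ["jdbc:mysql://<source-host>:3306/course-study"]
      }
    ]
  }
}
```

Dropping the TRUNCATE preSql from the writer ensures previously synced rows are preserved.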
Key Takeaways
DataX provides a robust, plugin-based framework for both full and incremental data synchronization across heterogeneous sources, handling task parallelism, buffering, flow control, and data conversion without requiring custom scripts or heavyweight ETL platforms.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.