How to Use Alibaba DataX for Efficient MySQL Data Synchronization
This guide explains how to install DataX, set up MySQL environments, configure JSON job files, and run both full and incremental data synchronization between heterogeneous databases using DataX's Reader/Writer framework and job scheduling features.
DataX Overview
DataX is Alibaba Cloud DataWorks' open‑source data integration solution, primarily designed for offline synchronization across heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and others.
DataX 3.0 Framework Design
DataX adopts a Framework + Plugin architecture, abstracting data source reading and writing as Reader and Writer plugins that plug into the overall synchronization framework.
When a new data source needs to be added, simply develop or configure a corresponding Reader/Writer plugin and integrate it without affecting existing pipelines.
DataX Core Architecture
A DataX job (Job) is the basic execution unit. After receiving a Job, DataX launches a process that performs the following steps:
Split the Job into multiple small Tasks based on the source's split strategy.
The Scheduler groups Tasks into TaskGroups according to the configured concurrency.
Each Task starts a Reader, passes data through a Channel, and finally invokes a Writer thread.
The Job monitors all TaskGroups and exits successfully when every TaskGroup finishes (non‑zero exit code indicates failure).
Using DataX for Data Synchronization
Preparation
JDK 1.8 or higher (recommended 1.8)
Python 2 or 3
Apache Maven 3.x (only needed for compiling DataX; the tar package can be used directly)
Install JDK
# ls
anaconda-ks.cfg jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<END >>/etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
END
# source /etc/profile
# java -versionCentOS 7 already includes Python 2.7, so no additional installation is required.
Install DataX
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._* # delete hidden filesInstall MySQL (MariaDB) on Both Hosts
# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installationDuring mysql_secure_installation, set the root password (e.g., 123123) and keep remote root login enabled.
Create Database and Table
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(
ID INT,
Name VARCHAR(20),
Email VARCHAR(30)
);Grant Permissions
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;Create a Stored Procedure for Test Data
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A INT DEFAULT 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;Execute CALL test(); to generate test records.
Full‑Volume Synchronization with DataX
Generate a MySQL‑to‑MySQL template:
# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": [],
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}],
"username": "root",
"password": "123123"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": [],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"username": "root",
"password": "123123",
"preSql": ["truncate t_member"],
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Save the JSON as install.json and run:
# python /usr/local/datax/bin/datax.py install.jsonDataX will report total records, throughput, and execution time.
Incremental Synchronization
The only difference from full‑volume sync is the use of the where clause for filtering and optionally removing the preSql step.
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": ["*"],
"splitPk": "ID",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}],
"username": "root",
"password": "123123"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
"table": ["t_member"]
}],
"username": "root",
"password": "123123",
"writeMode": "insert"
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Run the job and verify the filtered records in the target database.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Java High-Performance Architecture
Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
