Master DataX: Fast MySQL‑to‑MySQL Data Synchronization and Incremental Updates
This guide walks you through installing JDK, Python and DataX on Linux, configuring MySQL sources, creating the necessary tables and stored procedures, and using DataX's JSON job definitions to perform both full‑load and incremental data synchronization between two MySQL instances, complete with performance metrics and troubleshooting tips.
Introduction
A project required synchronizing 50 million rows between a business database and a reporting database, where traditional SQL scripts or mysqldump were too slow or produced inconsistent results. DataX, an open‑source data integration tool from Alibaba, was chosen for its high‑speed, reliable offline data transfer.
What Is DataX?
DataX is the open‑source version of Alibaba Cloud DataWorks data integration. It provides a plug‑in architecture (Reader/Writer) that abstracts data sources such as MySQL, Oracle, HDFS, Hive, HBase, FTP, etc., and connects them through a central Framework component.
Framework Design (DataX 3.0)
DataX follows a Framework + Plugin model. The Reader plugin pulls data from a source, the Writer plugin pushes data to a destination, and the Framework mediates between them, handling buffering, flow control, concurrency, and data conversion.
Reader (采集模块) → Framework → Writer (写入模块)Core Architecture
A DataX Job is split into multiple Task instances, which are grouped into TaskGroup s based on the configured concurrency. Each Task runs a Reader, passes data through a Channel, and finally invokes a Writer. The job monitors all TaskGroup s and exits with a non‑zero status on failure.
Task / Channel = TaskGroupUsing DataX for Data Synchronization
Preparation
JDK 1.8 or higher (recommended 1.8)
Python 2 or 3 (CentOS 7 already includes Python 2.7)
Apache Maven 3.x (only needed for compiling DataX; the binary tarball can be used directly)
Install JDK
# wget https://download.oracle.com/java/…/jdk-8u181-linux-x64.tar.gz
# tar -zxf jdk-8u181-linux-x64.tar.gz -C /usr/local/
# mv jdk1.8.0_181 /usr/local/java
# echo "export JAVA_HOME=/usr/local/java" >> /etc/profile
# echo "export PATH=\$PATH:\$JAVA_HOME/bin" >> /etc/profile
# source /etc/profile
# java -versionInstall DataX
# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar -zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._* # remove hidden filesIf hidden files are not removed, DataX may report missing plugin configuration files.
Install MySQL (MariaDB) on Both Hosts
# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation # set root password, remove test DB, etc.Create Test Database and Table
CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member (ID INT, Name VARCHAR(20), Email VARCHAR(30));Grant replication privileges to the root user on both hosts:
GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;Create a Stored Procedure for Bulk Data
DELIMITER $$
CREATE PROCEDURE test()
BEGIN
DECLARE A INT DEFAULT 1;
WHILE (A < 3000000) DO
INSERT INTO `course-study`.t_member VALUES (A, CONCAT('LiSa', A), CONCAT('LiSa', A, '@163.com'));
SET A = A + 1;
END WHILE;
END $$
DELIMITER ;Execute the procedure on the source database to generate millions of rows.
Full‑Load Synchronization
Generate a JSON job template for MySQL‑to‑MySQL sync, then fill in connection details, credentials, and table names. Example snippet:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}],
"preSql": ["TRUNCATE t_member"]
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Run the job:
# python /usr/local/datax/bin/datax.py install.jsonSample output shows 2.99 million records transferred in ~42 seconds (≈2.57 MB/s, 74 999 rec/s) with zero errors.
Incremental Synchronization
Incremental sync is achieved by adding a where clause to the reader configuration and optionally removing the preSql step (since the table is no longer truncated). Example:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "123123",
"where": "ID <= 1888",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"username": "root",
"password": "123123",
"connection": [{
"jdbcUrl": ["jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8"],
"table": ["t_member"]
}]
}
}
}],
"setting": {"speed": {"channel": "5"}}
}
}Running this job transfers only the rows matching the condition (e.g., 1 888 records) and reports the elapsed time and throughput.
Key Takeaways
DataX converts complex, multi‑source data pipelines into a simple star‑topology, making it easy to add new sources.
Full‑load jobs are fast but may be interrupted on very large datasets; incremental jobs using where filters mitigate this risk.
Proper preparation—installing JDK, Python, DataX, and configuring MySQL permissions—is essential for a smooth run.
After the synchronization, verify the target database to ensure data consistency.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
