Big Data 14 min read

How to Use Alibaba DataX for Efficient MySQL Data Synchronization

This guide explains how to install DataX, set up MySQL environments, configure JSON job files, and run both full and incremental data synchronization between heterogeneous databases using DataX's Reader/Writer framework and job scheduling features.

Java High-Performance Architecture

Feb 3, 2023

How to Use Alibaba DataX for Efficient MySQL Data Synchronization

DataX Overview

DataX is Alibaba Cloud DataWorks' open‑source data integration solution, primarily designed for offline synchronization across heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and others.

DataX 3.0 Framework Design

DataX adopts a Framework + Plugin architecture, abstracting data source reading and writing as Reader and Writer plugins that plug into the overall synchronization framework.

When a new data source needs to be added, simply develop or configure a corresponding Reader/Writer plugin and integrate it without affecting existing pipelines.

DataX Core Architecture

A DataX job (Job) is the basic execution unit. After receiving a Job, DataX launches a process that performs the following steps:

Split the Job into multiple small Tasks based on the source's split strategy.

The Scheduler groups Tasks into TaskGroups according to the configured concurrency.

Each Task starts a Reader, passes data through a Channel, and finally invokes a Writer thread.

The Job monitors all TaskGroups and exits successfully when every TaskGroup finishes (non‑zero exit code indicates failure).

Using DataX for Data Synchronization

Preparation

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed for compiling DataX; the tar package can be used directly)

Install JDK

# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<END >>/etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
END
# source /etc/profile
# java -version

CentOS 7 already includes Python 2.7, so no additional installation is required.

Install DataX

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files

Install MySQL (MariaDB) on Both Hosts

# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation

During mysql_secure_installation, set the root password (e.g., 123123) and keep remote root login enabled.

Create Database and Table

CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(
  ID INT,
  Name VARCHAR(20),
  Email VARCHAR(30)
);

Grant Permissions

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;

Create a Stored Procedure for Test Data

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A INT DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;

Execute CALL test(); to generate test records.

Full‑Volume Synchronization with DataX

Generate a MySQL‑to‑MySQL template:

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "column": [],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123"
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": [],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123",
          "preSql": ["truncate t_member"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Save the JSON as install.json and run:

# python /usr/local/datax/bin/datax.py install.json

DataX will report total records, throughput, and execution time.

Incremental Synchronization

The only difference from full‑volume sync is the use of the where clause for filtering and optionally removing the preSql step.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123"
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123",
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job and verify the filtered records in the target database.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Big Data MySQL Data synchronization DataX ETL

Written by

Java High-Performance Architecture

Sharing Java development articles and resources, including SSM architecture and the Spring ecosystem (Spring Boot, Spring Cloud, MyBatis, Dubbo, Docker), Zookeeper, Redis, architecture design, microservices, message queues, Git, etc.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.