Big Data 14 min read

Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.

Java Tech Enthusiast
Java Tech Enthusiast
Java Tech Enthusiast
Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

Introduction

A project with 50 million rows required reliable data synchronization between a business database and a reporting database, but traditional SQL or mysqldump approaches were too slow or inaccurate, prompting the use of DataX.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks data integration, primarily used for offline synchronization of heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and others.

It solves heterogeneous data source sync problems by converting complex mesh topologies into a star‑shaped data flow, where DataX acts as the middle‑man connecting various sources.

DataX 3.0 Architecture

DataX adopts a Framework + Plugin architecture. The data source reading and writing are abstracted as Reader/Writer plugins, which are managed by the central Framework.

Role

Function Reader (采集模块)

Collects data from the source and sends it to the Framework. Writer (写入模块)

Pulls data from the Framework and writes it to the destination. Framework (中间商)

Connects Reader and Writer, handling buffering, flow‑control, concurrency, and data conversion.

Core Scheduling Process

DataX Job splits the whole job into multiple Task units according to the split strategy.

Scheduler groups Tasks into TaskGroup based on the configured concurrency.

Each Task runs a pipeline: Reader → Channel → Writer.

The Job monitors all TaskGroups and exits successfully when every group finishes (non‑zero exit code on error).

Installation and Preparation

JDK 1.8+ (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed for compiling DataX; binary tar package can be used without Maven)

Example of installing JDK on CentOS:

# tar zxf jdk-8u181-linux-x64.tar.gz
mv jdk1.8.0_181 /usr/local/java
cat >> /etc/profile <<END
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
source /etc/profile
java -version

MySQL Setup

Install MariaDB on two hosts (CentOS 7.4) and start the service.

Create database and table:

create database `course-study`;
create table `course-study`.t_member(ID int, Name varchar(20), Email varchar(30));

Grant privileges for remote access:

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

Running a Full‑Load Job

Generate a template JSON with:

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

Fill in the connection details (example shown below) and execute:

# python /usr/local/datax/bin/datax.py install.json

Sample output indicates 2,999,999 records synchronized at ~2.57 MB/s.

Incremental Synchronization

The only difference from full‑load is adding a where clause to the reader configuration and optionally removing preSql statements.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123",
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running the job produces 1,888 records (the defined incremental range) with the expected speed.

Key Takeaways

DataX provides a reliable way to sync large volumes of data between heterogeneous sources.

Full‑load jobs are straightforward; incremental jobs require only a where clause.

The Framework + Plugin model makes it easy to add new readers or writers without changing core logic.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

linuxmysqldata synchronizationDataXETLIncremental Sync
Java Tech Enthusiast
Written by

Java Tech Enthusiast

Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.