
Using Alibaba DataX 3.0 for MySQL Data Synchronization: Installation, Configuration, and Incremental Sync

This article introduces Alibaba DataX 3.0, explains its architecture and role‑based design, walks through Linux installation, JDK setup, MySQL preparation, and provides step‑by‑step examples of full‑load and incremental data synchronization between two MySQL instances using JSON job configurations and command‑line execution.

Java Tech Enthusiast

Introduction

A project with 50 million rows required reliable synchronization between the business database and a reporting database. Traditional approaches such as hand-written SQL scripts or mysqldump proved too slow or error-prone, which prompted the move to DataX.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks data integration, primarily used for offline synchronization of heterogeneous data sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, and others.

It solves heterogeneous data source sync problems by converting complex mesh topologies into a star‑shaped data flow, where DataX acts as the middle‑man connecting various sources.

DataX 3.0 Architecture

DataX adopts a Framework + Plugin architecture. The data source reading and writing are abstracted as Reader/Writer plugins, which are managed by the central Framework.

Reader (collection module): collects data from the source and sends it to the Framework.

Writer (write module): pulls data from the Framework and writes it to the destination.

Framework (the "middleman"): connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.

Core Scheduling Process

1. The DataX Job splits the whole job into multiple Task units according to the split strategy.

2. The Scheduler groups Tasks into TaskGroups based on the configured concurrency.

3. Each Task runs a pipeline: Reader → Channel → Writer.

4. The Job monitors all TaskGroups and exits successfully when every group finishes (non-zero exit code on error).
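For intuition, the Scheduler's grouping can be sketched as ceiling division, assuming DataX's default of 5 channels per TaskGroup (the core.container.taskGroup.channel setting); this is an illustration, not the actual scheduler code:

```shell
# How many TaskGroups a job needs, assuming the default of 5 channels
# per TaskGroup (core.container.taskGroup.channel).
channels=20
per_group=5
groups=$(( (channels + per_group - 1) / per_group ))   # ceiling division
echo "${channels} channels -> ${groups} TaskGroups"    # 20 channels -> 4 TaskGroups
```

With job.setting.speed.channel set to 20, DataX would therefore run 4 TaskGroups of 5 concurrent channels each.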

Installation and Preparation

JDK 1.8+ (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed for compiling DataX; binary tar package can be used without Maven)

Example of installing JDK on CentOS:

# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat >> /etc/profile <<'EOF'
export JAVA_HOME=/usr/local/java
export PATH=$JAVA_HOME/bin:$PATH
EOF
# source /etc/profile

MySQL Setup

Install MariaDB on two hosts (CentOS 7.4) and start the service.

Create the database and table:

create database `course-study`;
create table `course-study`.t_member (ID int, Name varchar(20), Email varchar(30));

Grant privileges for remote access (granting everything to root@'%' is fine for a lab setup, but scope this down in production):

grant all privileges on *.* to root@'%' identified by '123123';
flush privileges;

Running a Full‑Load Job

Generate a template JSON with the following command, then save the printed template to a job file (this article uses install.json):

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

Fill in the connection details (example shown below) and execute:
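Filled in for the two hosts from the MySQL setup above and saved as install.json, a minimal full-load job might look like the sketch below; it mirrors the incremental job shown later, just without the where clause, and the addresses and credentials are this article's example values:

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}
```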

# python /usr/local/datax/bin/datax.py install.json

Sample output indicates 2,999,999 records synchronized at ~2.57 MB/s.
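Before kicking off a long run, it can help to confirm the job file is well-formed JSON, since DataX only reports a parse failure after it has started up. A minimal preflight, assuming python is on the PATH (which DataX already requires); /tmp/job.json here stands in for your real job file:

```shell
# Validate that a DataX job file parses as JSON before submitting it.
cat > /tmp/job.json <<'EOF'
{"job": {"setting": {"speed": {"channel": "5"}}, "content": []}}
EOF
python -m json.tool /tmp/job.json > /dev/null && echo "job.json OK"
```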

Incremental Synchronization

The only difference from the full-load job is adding a where clause to the reader configuration. Note that the sample below still carries a preSql that truncates the target table, so each run reloads the selected range; for append-only incremental runs, drop the preSql and switch the where to an open lower bound such as ID > <last synced ID>.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "username": "root",
          "password": "123123",
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Running the job produces 1,888 records (the defined incremental range) with the expected speed.
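In practice the boundary value should not be hard-coded into the job file. One simple pattern, sketched below with a made-up @LAST_ID@ placeholder and file paths, is to keep a job template and substitute the current high-water mark before each run:

```shell
# Illustrative high-water-mark substitution (placeholder and paths are
# invented for this sketch). In a real pipeline, last_id would come from
# something like SELECT MAX(ID) on the destination table.
last_id=1888
cat > /tmp/incr.template.json <<'EOF'
{ "where": "ID > @LAST_ID@", "splitPk": "ID" }
EOF
sed "s/@LAST_ID@/${last_id}/" /tmp/incr.template.json > /tmp/incr.json
cat /tmp/incr.json   # { "where": "ID > 1888", "splitPk": "ID" }
```

DataX itself also offers command-line variable substitution in recent builds (python datax.py -p "-DlastId=1888" with ${lastId} placeholders in the JSON); check the documentation for your version before relying on it.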

Key Takeaways

DataX provides a reliable way to sync large volumes of data between heterogeneous sources.

Full‑load jobs are straightforward; incremental jobs require only a where clause.

The Framework + Plugin model makes it easy to add new readers or writers without changing core logic.

Tags: Linux, MySQL, Data Synchronization, DataX, ETL, Incremental Sync