
Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

macrozheng

A project with 50 million records required reliable cross‑database synchronization, but traditional approaches such as mysqldump or file‑based export/import proved too slow and error‑prone.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline data synchronization between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It abstracts source and target connections into a star‑shaped data pipeline, simplifying the addition of new data sources.

DataX 3.0 Framework Design

DataX follows a Framework + Plugin model where data readers and writers are implemented as plugins. The core components are:

Reader : collects data from the source and sends it to the Framework.

Writer : receives data from the Framework and writes it to the destination.

Framework : connects Reader and Writer, handling buffering, flow control, concurrency, and data transformation.
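Each Reader and Writer ships as its own directory under the installation's plugin/ tree, which is what makes new sources easy to add. Once DataX is installed (the /usr/local/datax path used later in this guide; the check below degrades gracefully if it is absent), the available plugins can be listed:

```shell
# List available reader/writer plugins (assumes the /usr/local/datax
# install location used later in this guide).
if [ -d /usr/local/datax/plugin ]; then
  ls /usr/local/datax/plugin/reader
  ls /usr/local/datax/plugin/writer
else
  echo "DataX not installed at /usr/local/datax"
fi
```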

DataX 3.0 Core Architecture

A synchronization job (Job) is split into multiple parallel tasks (Task) based on the source’s split strategy. The Scheduler groups tasks into TaskGroups, each launched by the Job. Execution proceeds as Reader → Channel → Writer, and the Job monitors TaskGroups until completion.

Job divides work into Tasks.

Scheduler assembles Tasks into TaskGroups according to the configured concurrency.

Each Task starts a Reader, passes data through a Channel, and invokes a Writer.

Job exits successfully when all TaskGroups finish (non‑zero exit code indicates failure).
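The Scheduler's grouping step can be sketched numerically. Assuming the DataX 3.0 documentation's default of 5 Channels per TaskGroup (an assumption; the value is configurable), the TaskGroup count for a given channel setting is just a ceiling division:

```shell
# TaskGroups needed = ceil(channels / channels_per_group), using DataX's
# documented default of 5 channels per TaskGroup.
channels=20
per_group=5
echo $(( (channels + per_group - 1) / per_group ))   # 4 TaskGroups
```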

Preparing the Environment

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed for compiling DataX from source)
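A quick way to confirm these prerequisites are on the PATH (a convenience sketch, not from the original guide):

```shell
# Report which prerequisites are already available; mvn is only
# needed when compiling DataX from source.
for cmd in java python mvn; do
  command -v "$cmd" >/dev/null 2>&1 && echo "$cmd: found" || echo "$cmd: missing"
done
```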

Example host table:

Hostname | OS         | IP          | Package
MySQL-1  | CentOS 7.4 | 192.168.1.1 | jdk-8u181-linux-x64.tar.gz, datax.tar.gz
MySQL-2  | CentOS 7.4 | 192.168.1.2 |

Installing JDK

<code># ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<'END' >> /etc/profile   # quote END so $PATH expands at login, not now
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
END
# source /etc/profile
# java -version</code>

CentOS 7 already includes Python 2.7, so no additional installation is required.

Installing DataX on Linux

<code># wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # remove stray macOS metadata files that break plugin loading
# cd /usr/local/datax/bin
# python datax.py ../job/job.json        # verify installation with the bundled self-test job</code>

Successful verification prints job statistics such as total records, speed, and duration.

Using DataX for Data Synchronization

DataX jobs are defined in JSON. A minimal runnable job for the built‑in stream reader/writer (the empty template DataX generates has been filled in here so it can be executed directly) looks like:

<code>{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "column": [{"type": "string", "value": "hello, DataX"}],
            "sliceRecordCount": 10
          }
        },
        "writer": {"name": "streamwriter", "parameter": {"print": true}}
      }
    ],
    "setting": {"speed": {"channel": 1}}
  }
}</code>
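A skeleton like this does not have to be written by hand; datax.py can emit an empty template for any installed reader/writer pair (guarded here in case DataX is not yet installed):

```shell
# Print an empty job template for the given reader/writer pair.
if [ -x /usr/local/datax/bin/datax.py ]; then
  python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
else
  echo "DataX not installed at /usr/local/datax"
fi
```

Swap in mysqlreader/mysqlwriter to get the starting point for the MySQL jobs below.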

For MySQL‑to‑MySQL full synchronization, create a JSON job specifying the source and target connections, the columns to copy, and optional pre‑SQL statements (e.g., truncate t_member to empty the target table first). Example execution:

<code># python /usr/local/datax/bin/datax.py install.json</code>

The job logs show total records (e.g., 2,999,999), throughput (≈2.57 MB/s), and completion time.
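The original install.json is not reproduced in this article; a minimal sketch of what it might contain, reusing the hosts, credentials, and t_member table from the incremental example below (essentially that configuration without a where filter):

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "preSql": ["truncate t_member"],
            "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}
```

The truncate in preSql makes the full reload idempotent: rerunning the job replaces the target's contents rather than duplicating rows.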

Incremental Synchronization

Incremental sync is achieved by adding a where clause to the reader configuration and, when the target should keep its existing rows, removing any preSql that would truncate the target table.

<code>{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "where": "ID <= 1888",
            "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["*"],
            "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
            "password": "123123",
            "preSql": ["truncate t_member"],
            "session": ["set session sql_mode='ANSI'"],
            "username": "root",
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}</code>

Running this job synchronizes only the rows matching the where condition, allowing efficient incremental updates.
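Rather than editing the where clause for every run, datax.py's -p flag can substitute variables into the job JSON; the variable name lastId and the file name incr.json below are illustrative, and the job would then reference the value as "where": "ID > ${lastId}":

```shell
# Pass the incremental watermark on the command line instead of hard-coding it.
# lastId and incr.json are illustrative names, not from the original article.
if [ -x /usr/local/datax/bin/datax.py ]; then
  python /usr/local/datax/bin/datax.py -p "-DlastId=1888" incr.json
else
  echo "DataX not installed at /usr/local/datax"
fi
```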

Tags: Big Data, MySQL, DataX, ETL, data integration, Incremental Sync
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
