
Master DataX: Efficient Offline Data Sync for Heterogeneous Sources

This guide walks through the challenges of synchronizing massive datasets across heterogeneous databases, introduces Alibaba's open‑source DataX tool, explains its framework‑plugin architecture, and provides step‑by‑step instructions—including environment setup, installation, job configuration, and both full and incremental MySQL synchronization—complete with code examples and performance metrics.

macrozheng

A project with 50 million records required reliable cross‑database synchronization, but traditional approaches such as mysqldump or file‑based export/import proved too slow and error‑prone.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration, designed for offline data synchronization between heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP. It abstracts source and target connections into a star‑shaped data pipeline, simplifying the addition of new data sources.

DataX 3.0 Framework Design

DataX follows a Framework + Plugin model where data readers and writers are implemented as plugins. The core components are:

Reader : collects data from the source and sends it to the Framework.

Writer : receives data from the Framework and writes it to the destination.

Framework : connects Reader and Writer, handling buffering, flow control, concurrency, and data transformation.
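Each Reader and Writer ships as its own directory under the installation's plugin/ tree, which is what makes new sources easy to add. Once DataX is installed (the /usr/local/datax path used later in this guide; the check below degrades gracefully if it is absent), the available plugins can be listed:

```shell
# List available reader/writer plugins (assumes the /usr/local/datax
# install location used later in this guide).
if [ -d /usr/local/datax/plugin ]; then
  ls /usr/local/datax/plugin/reader
  ls /usr/local/datax/plugin/writer
else
  echo "DataX not installed at /usr/local/datax"
fi
```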

DataX 3.0 Core Architecture

A synchronization job (Job) is split into multiple parallel tasks (Task) based on the source’s split strategy. The Scheduler groups tasks into TaskGroups, each launched by the Job. Execution proceeds as Reader → Channel → Writer, and the Job monitors TaskGroups until completion.

Job divides work into Tasks.

Scheduler assembles Tasks into TaskGroups according to the configured concurrency.

Each Task starts a Reader, passes data through a Channel, and invokes a Writer.

Job exits successfully when all TaskGroups finish (non‑zero exit code indicates failure).
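The Scheduler's grouping step can be sketched numerically. Assuming the DataX 3.0 documentation's default of 5 Channels per TaskGroup (an assumption; the value is configurable), the TaskGroup count for a given channel setting is just a ceiling division:

```shell
# TaskGroups needed = ceil(channels / channels_per_group), using DataX's
# documented default of 5 channels per TaskGroup.
channels=20
per_group=5
echo $(( (channels + per_group - 1) / per_group ))   # 4 TaskGroups
```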

Preparing the Environment

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed for compiling DataX from source)
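A quick way to confirm these prerequisites are on the PATH (a convenience sketch, not from the original guide):

```shell
# Report which prerequisites are already available; mvn is only
# needed when compiling DataX from source.
for cmd in java python mvn; do
  command -v "$cmd" >/dev/null 2>&1 && echo "$cmd: found" || echo "$cmd: missing"
done
```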

Example host table:

Hostname | OS         | IP          | Package
MySQL-1  | CentOS 7.4 | 192.168.1.1 | jdk-8u181-linux-x64.tar.gz, datax.tar.gz
MySQL-2  | CentOS 7.4 | 192.168.1.2 |

Installing JDK

<code># ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<'END' >> /etc/profile   # quote END so $PATH expands at login, not now
export JAVA_HOME=/usr/local/java
export PATH=$PATH:$JAVA_HOME/bin
END
# source /etc/profile
# java -version</code>

CentOS 7 already includes Python 2.7, so no additional installation is required.

Installing DataX on Linux

<code># wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # remove stray macOS metadata files that break plugin loading
# cd /usr/local/datax/bin
# python datax.py ../job/job.json        # verify installation with the bundled self-test job</code>

Successful verification prints job statistics such as total records, speed, and duration.

Using DataX for Data Synchronization

DataX jobs are defined in JSON. A minimal runnable job for the built‑in stream reader/writer (the empty template DataX generates has been filled in here so it can be executed directly) looks like:

<code>{
  "job": {
    "content": [
      {
        "reader": {
          "name": "streamreader",
          "parameter": {
            "column": [{"type": "string", "value": "hello, DataX"}],
            "sliceRecordCount": 10
          }
        },
        "writer": {"name": "streamwriter", "parameter": {"print": true}}
      }
    ],
    "setting": {"speed": {"channel": 1}}
  }
}</code>
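A skeleton like this does not have to be written by hand; datax.py can emit an empty template for any installed reader/writer pair (guarded here in case DataX is not yet installed):

```shell
# Print an empty job template for the given reader/writer pair.
if [ -x /usr/local/datax/bin/datax.py ]; then
  python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter
else
  echo "DataX not installed at /usr/local/datax"
fi
```

Swap in mysqlreader/mysqlwriter to get the starting point for the MySQL jobs below.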

For MySQL‑to‑MySQL full synchronization, create a JSON job specifying the source and target connections, the columns to copy, and optional pre‑SQL statements (e.g., truncate t_member to empty the target table first). Example execution:

<code># python /usr/local/datax/bin/datax.py install.json</code>

The job logs show total records (e.g., 2,999,999), throughput (≈2.57 MB/s), and completion time.
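The original install.json is not reproduced in this article; a minimal sketch of what it might contain, reusing the hosts, credentials, and t_member table from the incremental example below (essentially that configuration without a where filter):

```json
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "preSql": ["truncate t_member"],
            "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}
```

The truncate in preSql makes the full reload idempotent: rerunning the job replaces the target's contents rather than duplicating rows.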

Incremental Synchronization

Incremental sync is achieved by adding a where clause to the reader configuration and, when the target should keep its existing rows, removing any preSql that would truncate the target table.

<code>{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "username": "root",
            "password": "123123",
            "column": ["*"],
            "splitPk": "ID",
            "where": "ID <= 1888",
            "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["*"],
            "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
            "password": "123123",
            "preSql": ["truncate t_member"],
            "session": ["set session sql_mode='ANSI'"],
            "username": "root",
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {"speed": {"channel": "5"}}
  }
}</code>

Running this job synchronizes only the rows matching the where condition, allowing efficient incremental updates.
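Rather than editing the where clause for every run, datax.py's -p flag can substitute variables into the job JSON; the variable name lastId and the file name incr.json below are illustrative, and the job would then reference the value as "where": "ID > ${lastId}":

```shell
# Pass the incremental watermark on the command line instead of hard-coding it.
# lastId and incr.json are illustrative names, not from the original article.
if [ -x /usr/local/datax/bin/datax.py ]; then
  python /usr/local/datax/bin/datax.py -p "-DlastId=1888" incr.json
else
  echo "DataX not installed at /usr/local/datax"
fi
```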

Tags: Big Data, MySQL, DataX, ETL, data integration, Incremental Sync
Written by

macrozheng

Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
