Master Fast Data Synchronization with Alibaba DataX: A Step‑by‑Step Guide

This article explains why traditional mysqldump and file‑based methods struggle with massive tables, introduces Alibaba DataX as a high‑performance offline data integration tool, details its architecture, and provides comprehensive installation and configuration steps for full and incremental MySQL‑to‑MySQL synchronization using JSON job files.

Programmer DD

Jul 14, 2022

Our project needed to sync 50 million rows between a business database and a reporting database. Neither of the two approaches we tried first was workable:

mysqldump: both the dump and the restore take a long time, and the source keeps receiving new data while the dump runs, so the copy is already out of date when it finishes.

File‑based storage: extremely slow; syncing only two thousand rows took three hours.

After researching, we found DataX, the open‑source version of the data integration component of Alibaba Cloud DataWorks. It offers fast, reliable offline data synchronization across heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, and FTP.

1. DataX Overview

DataX transforms complex mesh‑like sync pipelines into a star‑shaped architecture, acting as a middle‑layer carrier that connects various data sources. Adding a new source only requires plugging it into DataX, enabling seamless synchronization.

1.1 DataX 3.0 Framework Design

DataX uses a Framework + Plugin architecture, abstracting data source reading and writing as Reader/Writer plugins.
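To make the Framework + Plugin idea concrete, here is a minimal sketch in Python. It is not DataX's actual Java plugin API: the `Reader` and `Writer` classes, the `READERS`/`WRITERS` registries, and the in‑memory `ListReader`/`ListWriter` plugins are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, Iterator, List, Type


class Reader(ABC):
    """Hypothetical reader plugin interface: yields records from a source."""

    @abstractmethod
    def read(self) -> Iterator[dict]:
        ...


class Writer(ABC):
    """Hypothetical writer plugin interface: stores records in a target."""

    @abstractmethod
    def write(self, records: Iterable[dict]) -> int:
        ...


class ListReader(Reader):
    """Toy source: an in-memory list standing in for a database table."""

    def __init__(self, rows: List[dict]):
        self.rows = rows

    def read(self) -> Iterator[dict]:
        yield from self.rows


class ListWriter(Writer):
    """Toy target: collects records in memory."""

    def __init__(self) -> None:
        self.stored: List[dict] = []

    def write(self, records: Iterable[dict]) -> int:
        before = len(self.stored)
        self.stored.extend(records)
        return len(self.stored) - before


# The framework looks plugins up by name and never touches a concrete source.
READERS: Dict[str, Type[Reader]] = {"listreader": ListReader}
WRITERS: Dict[str, Type[Writer]] = {"listwriter": ListWriter}


def run_sync(reader: Reader, writer: Writer) -> int:
    """Framework side: move every record from the reader to the writer."""
    return writer.write(reader.read())
```

Supporting a new source in this sketch means adding one class and one registry entry; `run_sync` stays unchanged, which is the point of the star‑shaped design.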

1.2 DataX 3.0 Core Architecture

A DataX job is the central management node for a single sync task, handling data cleaning, task splitting, and TaskGroup management.

After a job starts, it is split into multiple small Tasks, based on the split strategy of the source, so that they can run concurrently.

The Scheduler module assembles Tasks into TaskGroups according to the configured concurrency.

Each Task launches a fixed Reader → Channel → Writer thread chain to perform the sync.

The job monitors TaskGroups and exits successfully once all TaskGroups finish.
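The flow above can be imitated in a few lines of Python. This is only a toy model under stated assumptions: rows are plain integers, the channel is a `queue.Queue`, and `split_into_tasks`, `run_task`, and `run_job` are hypothetical names, not DataX functions.

```python
import queue
import threading
from typing import List

_DONE = object()  # sentinel marking the end of one task's data


def split_into_tasks(rows: List[int], task_count: int) -> List[List[int]]:
    """Split the source rows into roughly equal slices, one per task."""
    size = max(1, -(-len(rows) // task_count))  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def run_task(rows: List[int], target: List[int], lock: threading.Lock) -> None:
    """Run one task: a reader thread and a writer thread joined by a channel."""
    channel: queue.Queue = queue.Queue(maxsize=100)

    def reader() -> None:
        for row in rows:
            channel.put(row)
        channel.put(_DONE)

    def writer() -> None:
        while True:
            row = channel.get()
            if row is _DONE:
                break
            with lock:
                target.append(row)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def run_job(rows: List[int], task_count: int) -> List[int]:
    """Split the job, run all tasks concurrently, return what was written."""
    target: List[int] = []
    lock = threading.Lock()
    workers = [
        threading.Thread(target=run_task, args=(task_rows, target, lock))
        for task_rows in split_into_tasks(rows, task_count)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()  # the job is done only when every task has finished
    return target
```

Because the tasks run concurrently, the order of rows in the target is not guaranteed; only the set of rows is.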

DataX scheduling process:

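The Job module splits the job into several Tasks according to the sharding rules, then calculates how many TaskGroups are needed from the user‑defined concurrency (the channel setting).

TaskGroups = configured channels / channels per TaskGroup (5 by default). The Tasks are divided among the TaskGroups, and each TaskGroup runs its Tasks with its allocated concurrency.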
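A small worked calculation may help here. The sketch below assumes that the number of TaskGroups is derived from the configured channel count and a fixed number of channels per TaskGroup (5 in this sketch, which I believe is the DataX default, but treat it as an assumption), and that Tasks are spread evenly across the groups. `plan_task_groups` is a hypothetical helper, not part of DataX.

```python
import math
from typing import List, Tuple


def plan_task_groups(
    task_count: int, channels: int, channels_per_group: int = 5
) -> Tuple[int, List[int]]:
    """Return (number of TaskGroups, number of Tasks assigned to each group)."""
    if task_count <= 0 or channels <= 0 or channels_per_group <= 0:
        raise ValueError("all arguments must be positive")
    # Assumption: there is no point in more channels than Tasks to run.
    channels = min(channels, task_count)
    group_count = math.ceil(channels / channels_per_group)
    # Spread the Tasks as evenly as possible over the groups.
    base, extra = divmod(task_count, group_count)
    return group_count, [base + (1 if i < extra else 0) for i in range(group_count)]
```

For example, `plan_task_groups(100, 20)` gives 4 TaskGroups with 25 Tasks each, so each group works through its 25 Tasks five at a time.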

2. Using DataX for Data Synchronization

Prerequisites

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (only needed to compile DataX from source; not required when using the prebuilt tar package)
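Before installing anything, it can be handy to check which of these tools are already on the PATH. The helper names `find_tools` and `missing_tools` below are hypothetical, and the default command names (`java`, `python`) are assumptions; on some systems the interpreter is only available as `python3`.

```python
import shutil
from typing import Dict, Iterable, List, Optional


def find_tools(names: Iterable[str] = ("java", "python")) -> Dict[str, Optional[str]]:
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


def missing_tools(names: Iterable[str] = ("java", "python")) -> List[str]:
    """Return the command names that could not be found on PATH."""
    return [name for name, path in find_tools(names).items() if path is None]


if __name__ == "__main__":
    for name, path in find_tools(("java", "python", "mvn")).items():
        print(f"{name}: {path or 'not found'}")
```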

Install JDK (example for CentOS):

# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# cat <<'END' >> /etc/profile
export JAVA_HOME=/usr/local/java
export PATH=$PATH:"$JAVA_HOME/bin"
END
# source /etc/profile
# java -version

DataX installation on Linux:

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*

Verify installation:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json

Output shows job start/end timestamps, total records, speed, and success status.

2.1 Basic DataX Usage

View the template for streamreader → streamwriter:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

Resulting JSON template (simplified):

{
  "job": {
    "content": [{
      "reader": {"name": "streamreader", "parameter": {"column": [], "sliceRecordCount": ""}},
      "writer": {"name": "streamwriter", "parameter": {"encoding": "", "print": true}}
    }],
    "setting": {"speed": {"channel": ""}}
  }
}
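The template is empty, so it has to be filled in before it can run. As a hedged sketch, the script below builds a filled‑in stream job as a Python dict and prints it as JSON. The `sliceRecordCount` parameter and the `{"type": ..., "value": ...}` column form are taken from my understanding of the streamreader plugin and are assumptions as far as this article goes; `build_stream_job` is a hypothetical helper name.

```python
import json


def build_stream_job(value: str, record_count: int, channel: int = 1) -> dict:
    """Build a streamreader -> streamwriter job that emits a constant string."""
    return {
        "job": {
            "content": [{
                "reader": {
                    "name": "streamreader",
                    "parameter": {
                        # Assumed column form: one constant string column.
                        "column": [{"type": "string", "value": value}],
                        # Assumed parameter: records generated per channel.
                        "sliceRecordCount": record_count,
                    },
                },
                "writer": {
                    "name": "streamwriter",
                    "parameter": {"encoding": "utf-8", "print": True},
                },
            }],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_stream_job("hello DataX", 10), indent=2))
```

Redirecting the output to a file such as stream.json (a name chosen here for illustration) gives a job file that can be passed to datax.py in the same way as job.json above.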

2.2 Full‑Load MySQL‑to‑MySQL Sync

Create a job JSON file (install.json) specifying source and target MySQL connections, credentials, and optional pre‑SQL (e.g., truncate target table) and session settings.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["*"],
          "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
          "password": "123123",
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "username": "root",
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the job:

# python /usr/local/datax/bin/datax.py install.json

Output shows total records (≈3 million), speed (~2.57 MB/s), and successful completion.
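When several tables need the same treatment, writing each job file by hand gets tedious. The sketch below generates a job with the same fields as install.json; `build_mysql_job` and its parameter names are hypothetical, and the structure simply mirrors the JSON shown above rather than any official generator.

```python
import json
from typing import List, Optional


def build_mysql_job(
    src_url: str,
    dst_url: str,
    table: str,
    username: str,
    password: str,
    split_pk: str = "ID",
    where: Optional[str] = None,
    pre_sql: Optional[List[str]] = None,
    channel: int = 5,
) -> dict:
    """Build a mysqlreader -> mysqlwriter job dict shaped like install.json."""
    reader_param = {
        "username": username,
        "password": password,
        "column": ["*"],
        "splitPk": split_pk,
        # The reader takes jdbcUrl as a list, the writer as a single string.
        "connection": [{"jdbcUrl": [src_url], "table": [table]}],
    }
    if where:
        reader_param["where"] = where
    writer_param = {
        "username": username,
        "password": password,
        "column": ["*"],
        "connection": [{"jdbcUrl": dst_url, "table": [table]}],
        "preSql": list(pre_sql or []),
        "session": ["set session sql_mode='ANSI'"],
        "writeMode": "insert",
    }
    return {
        "job": {
            "content": [{
                "reader": {"name": "mysqlreader", "parameter": reader_param},
                "writer": {"name": "mysqlwriter", "parameter": writer_param},
            }],
            "setting": {"speed": {"channel": channel}},
        }
    }


if __name__ == "__main__":
    job = build_mysql_job(
        "jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
        "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
        table="t_member",
        username="root",
        password="123123",
        pre_sql=["truncate t_member"],
    )
    print(json.dumps(job, indent=2))
```

Note that this sketch uses one set of credentials for both sides, as the example does; real deployments would normally pass separate accounts for source and target.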

2.3 Incremental Synchronization

The only difference between full and incremental sync is adding a where clause to filter records.

Example incremental job (where.json) with "where": "ID <= 1888" in the reader and an empty preSql in the writer, so the target table is not truncated:

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["*"],
          "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}],
          "password": "123123",
          "preSql": [],
          "session": ["set session sql_mode='ANSI'"],
          "username": "root",
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the incremental job and verify that only the filtered records (e.g., 1,888 rows) are transferred.

For subsequent incremental runs, move the window in the where clause, e.g., "where": "ID > 1888 AND ID <= 2888", and keep preSql empty so that the target table is not truncated.
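Moving the window by hand is easy to get wrong, so here is a small sketch that derives the where clause from the last synced ID. It assumes a numeric, monotonically increasing ID column; `id_window_where` is a hypothetical helper name, and how the last ID is stored between runs is left open.

```python
def id_window_where(last_id: int, batch_size: int, column: str = "ID") -> str:
    """Return the where clause for the next batch after last_id."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    upper = last_id + batch_size
    if last_id <= 0:
        # First run: no lower bound is needed.
        return f"{column} <= {upper}"
    return f"{column} > {last_id} AND {column} <= {upper}"


if __name__ == "__main__":
    print(id_window_where(0, 1888))
    print(id_window_where(1888, 1000))
```

The lower bound is exclusive and the upper bound inclusive, so consecutive windows neither overlap nor leave gaps.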
