How DataX Boosts Data‑Sync Speed by 200% Across Heterogeneous Sources

This article walks through the challenges of synchronizing 50 million rows between disparate MySQL databases, explains why traditional mysqldump or file‑based methods fail, and then details how the open‑source DataX tool—its 3.0 framework, installation steps, job architecture, and JSON‑based configurations—enables fast full and incremental data transfers with concrete performance metrics.

Programmer XiaoFu

Jun 18, 2025

Synchronizing 50 million rows between a business database and a reporting database is a poor fit for mysqldump or file-based transfers: both are slow at this volume, and the copy can become inconsistent with the source while the backup is still running.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration. It provides offline data synchronization for heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase and FTP, converting a mesh‑topology sync into a star topology so that adding a new source only requires connecting it to DataX.
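The benefit of the star topology can be illustrated with simple counting. The sketch below compares the number of directed point-to-point sync paths in a full mesh with the number of plugins needed when every source attaches to a central hub through one Reader and one Writer. The function names are illustrative only and are not part of DataX.

```python
def mesh_links(n_sources: int) -> int:
    """Directed sync paths when every source must sync to every other source."""
    return n_sources * (n_sources - 1)


def star_plugins(n_sources: int) -> int:
    """Plugins needed when each source gets one Reader and one Writer on a hub."""
    return 2 * n_sources


# Print: number of sources, mesh paths, hub plugins
for n in (3, 7, 10):
    print(n, mesh_links(n), star_plugins(n))
```

The mesh count grows quadratically while the hub count grows linearly, which is why adding one more source only costs one more Reader/Writer pair.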

Framework Design (DataX 3.0)

DataX follows a Framework + Plugin architecture:

Reader (collection module) – collects data from a source and sends it to the Framework.

Writer (write module) – pulls data from the Framework and writes it to the destination.

Framework (the intermediary) – connects Reader and Writer, handling buffering, flow control, concurrency and data conversion.
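The division of labour between the three roles can be sketched in a few lines. This is a toy model, not DataX code: a bounded queue stands in for the Framework's channel, and the names reader, writer and run_sync are invented for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the stream


def reader(source_rows, channel):
    """Reader role: pull rows from the source and hand them to the framework."""
    for row in source_rows:
        channel.put(row)  # blocks when the buffer is full (simple flow control)
    channel.put(_DONE)


def writer(channel, target_rows):
    """Writer role: take rows from the framework and write them to the target."""
    while True:
        row = channel.get()
        if row is _DONE:
            break
        target_rows.append(row)


def run_sync(source_rows, buffer_size=2):
    """Framework role: connect reader and writer through a bounded buffer."""
    channel = queue.Queue(maxsize=buffer_size)
    target_rows = []
    threads = [
        threading.Thread(target=reader, args=(source_rows, channel)),
        threading.Thread(target=writer, args=(channel, target_rows)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return target_rows


print(run_sync([(1, "LiSa1"), (2, "LiSa2"), (3, "LiSa3")]))
```

Because the reader and writer only talk to the queue, either side could be swapped for a different source or destination without touching the other, which is the point of the plugin design.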

Core Architecture of a DataX Job

When a Job starts, it splits the workload into multiple Task units according to the source’s split strategy.

The Scheduler groups Tasks into TaskGroup based on the configured concurrency.

Each Task launches a Reader → Channel → Writer thread chain to perform the actual synchronization.

The Job monitors all TaskGroups. It exits successfully once every TaskGroup has finished and returns a non-zero exit code if any of them fails.
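The splitting and grouping steps above can be approximated as follows. This is a simplified sketch under stated assumptions: the real split strategy is plugin-specific, the range split by primary key is only one possibility, and the value of five channels per TaskGroup is assumed here rather than taken from this article. All function names are hypothetical.

```python
import math


def split_into_tasks(min_id, max_id, n_tasks):
    """Split the inclusive key range [min_id, max_id] into contiguous sub-ranges."""
    total = max_id - min_id + 1
    step = math.ceil(total / n_tasks)
    tasks = []
    start = min_id
    while start <= max_id:
        end = min(start + step - 1, max_id)
        tasks.append((start, end))
        start = end + 1
    return tasks


def group_tasks(tasks, channels, channels_per_group=5):
    """Spread tasks round-robin over ceil(channels / channels_per_group) groups."""
    n_groups = max(1, math.ceil(channels / channels_per_group))
    groups = [[] for _ in range(n_groups)]
    for i, task in enumerate(tasks):
        groups[i % n_groups].append(task)
    return groups


tasks = split_into_tasks(1, 2_999_999, n_tasks=10)
groups = group_tasks(tasks, channels=10)
print(len(tasks), [len(g) for g in groups])
```

Each tuple represents one Task's slice of the key range. In the real tool, every such Task would then run its own Reader → Channel → Writer chain.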

Installation Prerequisites

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (required only when compiling DataX; the binary tar package does not need Maven)

Installing DataX on Linux

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files that may cause errors

Verify the installation with:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

Preparing the MySQL Test Environment

Create identical MySQL databases on two hosts, grant all privileges to root@'%' with password 123123, and generate roughly three million rows (2,999,999, because the loop stops before A reaches 3,000,000) using the following stored procedure:

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A INT DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();
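To make the loop bounds concrete, the same rows can be reproduced outside the database. The generator below mirrors the WHILE (A < 3000000) loop. The tuple layout (id, name, email) is an assumption inferred from the INSERT values, since the article does not show the table definition.

```python
def member_rows(limit=3_000_000):
    """Yield the same value tuples that the WHILE (A < limit) loop inserts."""
    a = 1
    while a < limit:
        yield (a, f"LiSa{a}", f"LiSa{a}@163.com")
        a += 1


# A small limit keeps the demo fast; the row count is always limit - 1.
sample = list(member_rows(limit=5))
print(len(sample))
print(sample[0], sample[-1])
```

With the default limit the loop produces IDs 1 through 2,999,999, which matches the record count reported later by the full sync.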

Full‑Data Synchronization

Generate a template with

python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

and edit the JSON job file (example below) to specify source and target JDBC URLs, tables, credentials, and optional preSql and session settings.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}
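When several tables need the same treatment, a job file like this can be generated instead of edited by hand. The sketch below is one possible approach, not part of DataX: build_mysql_job is a hypothetical helper that assembles the same keys used in the example, and it emits the reader URL as a list and the writer URL as a single string, following the mysqlreader and mysqlwriter templates.

```python
import json


def build_mysql_job(src_url, dst_url, table, username, password,
                    split_pk="ID", channel=5):
    """Assemble a full-sync job dict with the same keys as the example job file."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": username,
            "password": password,
            "column": ["*"],
            "splitPk": split_pk,
            # mysqlreader takes a list of JDBC URLs per connection
            "connection": [{"jdbcUrl": [src_url], "table": [table]}],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": username,
            "password": password,
            "column": ["*"],
            # mysqlwriter takes a single JDBC URL string per connection
            "connection": [{"jdbcUrl": dst_url, "table": [table]}],
            "preSql": [f"truncate {table}"],  # empties the target before loading
            "session": ["set session sql_mode='ANSI'"],
            "writeMode": "insert",
        },
    }
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": channel}},
        }
    }


job = build_mysql_job(
    src_url="jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
    dst_url="jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
    table="t_member",
    username="root",
    password="123123",
)
print(json.dumps(job, indent=2))
```

Redirecting the printed JSON to a file such as install.json gives a job file equivalent to the hand-written one.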

Run the job:

# python /usr/local/datax/bin/datax.py install.json
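For scheduled runs it can be convenient to launch the job from a script and act on the exit code, since a failed job returns a non-zero code. The wrapper below is a minimal sketch that assumes the install path used in this article. datax_command and run_job are hypothetical helper names, and nothing is executed until run_job is called.

```python
import subprocess
import sys

DATAX_PY = "/usr/local/datax/bin/datax.py"  # install path used in this article


def datax_command(job_file, python_exe=sys.executable, datax_py=DATAX_PY):
    """Build the argument list for launching one DataX job."""
    return [python_exe, datax_py, job_file]


def run_job(job_file):
    """Run the job and return True when DataX exits with code 0."""
    result = subprocess.run(datax_command(job_file))
    return result.returncode == 0


# Show the command that would be executed, without running it.
print(datax_command("install.json", python_exe="python"))
```

Calling run_job("install.json") on a host with DataX installed would start the sync and report success or failure as a boolean.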

Sample output shows processing of 2,999,999 records at 2.57 MB/s (≈ 75 k records/s) with a total runtime of 42 seconds.

Incremental Synchronization

Incremental sync is achieved by adding a where clause to the Reader configuration. The JSON below filters rows with ID <= 1888 and removes the preSql that truncates the target table.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

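Running this job transfers 1,888 records at 1.61 KB/s (≈ 62 records/s) in 32 seconds. The low throughput reflects how little data was moved: only the rows matched by the where clause are read and written, so the target is updated without re-copying the full table.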
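The change from a full job to an incremental one is small enough to automate. The sketch below takes a job dict, adds the where filter to the reader, and drops the truncating preSql from the writer. Both helpers are hypothetical, and the watermark form "ID > last value" is an assumption about how a recurring run might pick up new rows; the article's own example uses the fixed filter ID <= 1888.

```python
import copy


def make_incremental(job, where):
    """Return a copy of a full-sync job dict turned into an incremental one."""
    inc = copy.deepcopy(job)
    content = inc["job"]["content"][0]
    content["reader"]["parameter"]["where"] = where
    # Drop the truncate so rows already in the target survive.
    content["writer"]["parameter"].pop("preSql", None)
    return inc


def watermark_where(last_synced_id, key="ID"):
    """Hypothetical helper: select only rows above the last synced key value."""
    return f"{key} > {int(last_synced_id)}"


# Trimmed-down job dict; a real one would also carry credentials and connections.
full_job = {
    "job": {
        "content": [{
            "reader": {"name": "mysqlreader", "parameter": {"splitPk": "ID"}},
            "writer": {"name": "mysqlwriter",
                       "parameter": {"preSql": ["truncate t_member"],
                                     "writeMode": "insert"}},
        }],
        "setting": {"speed": {"channel": "5"}},
    }
}

inc_job = make_incremental(full_job, "ID <= 1888")
print(inc_job["job"]["content"][0]["reader"]["parameter"]["where"])
print("preSql" in inc_job["job"]["content"][0]["writer"]["parameter"])
print(watermark_where(1888))
```

The original job dict is left untouched, so the same template can serve both the initial full load and later incremental runs.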

Key Takeaways

Full synchronization is straightforward, but a long-running transfer of a very large dataset may be interrupted partway through.

Incremental sync, driven by the where parameter, efficiently handles ongoing data changes.

DataX's plug-in architecture, job/task grouping, and configurable concurrency make it suitable for both full and incremental synchronization.
