
How DataX Boosts Data‑Sync Speed by 200% Across Heterogeneous Sources

This article walks through the challenges of synchronizing 50 million rows between disparate MySQL databases, explains why traditional mysqldump or file‑based methods fail, and then details how the open‑source DataX tool—its 3.0 framework, installation steps, job architecture, and JSON‑based configurations—enables fast full and incremental data transfers with concrete performance metrics.

Programmer XiaoFu

Synchronizing 50 million rows between a business database and a reporting database cannot rely on mysqldump or file‑based transfers because they are slow and may produce inconsistent data during backup.

DataX Overview

DataX is the open‑source version of Alibaba Cloud DataWorks Data Integration. It provides offline data synchronization for heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase and FTP, converting a mesh‑topology sync into a star topology so that adding a new source only requires connecting it to DataX.
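The mesh-versus-star point can be checked with a little arithmetic. The sketch below assumes every source may need to sync to every other source and counts one connector per direction; the function names are illustrative and not part of DataX.

```python
def mesh_connectors(n_sources: int) -> int:
    """Point-to-point sync: one dedicated connector per ordered (source, target) pair."""
    return n_sources * (n_sources - 1)


def star_connectors(n_sources: int) -> int:
    """Hub-based sync: one reader plus one writer per source, all attached to the hub."""
    return 2 * n_sources


for n in (3, 7, 10):
    print(n, mesh_connectors(n), star_connectors(n))
```

Under these assumptions the mesh count grows quadratically while the star count grows linearly, which is why adding one more source to a hub costs only one reader and one writer.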

Framework Design (DataX 3.0)

DataX follows a Framework + Plugin architecture:

Reader (collection module) – collects data from a source and sends it to the Framework.

Writer (write module) – pulls data from the Framework and writes it to the destination.

Framework (the intermediary) – connects Reader and Writer, handling buffering, flow control, concurrency and data conversion.
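The division of labour among the three parts can be sketched in a few lines of Python. This is a toy model, not DataX code: `run_pipeline`, the sentinel object, and the queue capacity are all assumptions made for illustration, with a bounded queue standing in for the Framework's buffering and flow control.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def run_pipeline(read_records, write_record, capacity=4):
    """Move records from a reader to a writer through a bounded channel."""
    channel = queue.Queue(maxsize=capacity)  # bounded: a slow writer throttles the reader

    def reader():
        for record in read_records():
            channel.put(record)  # blocks while the channel is full
        channel.put(_DONE)

    def writer():
        while True:
            record = channel.get()
            if record is _DONE:
                break
            write_record(record)

    threads = [threading.Thread(target=reader), threading.Thread(target=writer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


target = []
run_pipeline(lambda: iter(range(10)), target.append)
print(target)
```

Because the reader and writer only ever talk to the channel, either side could be swapped for a different source or destination without touching the other, which is the idea behind the plugin model.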

Core Architecture of a DataX Job

When a Job starts, it splits the workload into multiple Task units according to the source’s split strategy.

The Scheduler groups Tasks into TaskGroups based on the configured concurrency.

Each Task launches a Reader → Channel → Writer thread chain to perform the actual synchronization.

The Job monitors all TaskGroups and exits successfully once every TaskGroup has finished; if any of them fails, the process exits with a non‑zero code.
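The split-then-group steps above can be sketched as follows. The function name is hypothetical, the figure of five concurrent channels per TaskGroup is an assumption used for the example, and the round-robin assignment is a simplification rather than DataX's actual scheduler.

```python
import math


def plan_task_groups(num_tasks, channels, channels_per_group=5):
    """Return a list of task-id lists, one per TaskGroup."""
    # Assumption: each TaskGroup runs up to `channels_per_group` Tasks at once.
    num_groups = max(1, math.ceil(channels / channels_per_group))
    groups = [[] for _ in range(num_groups)]
    for task_id in range(num_tasks):
        groups[task_id % num_groups].append(task_id)  # round-robin assignment
    return groups


groups = plan_task_groups(num_tasks=100, channels=20)
print(len(groups), [len(g) for g in groups])
```

With 100 Tasks and a concurrency of 20, this plan yields four groups of 25 Tasks each; with the concurrency of 5 used later in this article it yields a single group.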

Installation Prerequisites

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3

Apache Maven 3.x (required only when compiling DataX; the binary tar package does not need Maven)
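A quick way to check these prerequisites before installing is to look for the tools on PATH. The helper below is a hypothetical convenience script, not something shipped with DataX, and it only checks that the commands exist, not their versions.

```python
import shutil
import sys


def missing_tools(tools, which=shutil.which):
    """Return the names in `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    print("Python interpreter:", sys.version.split()[0])
    # Maven ("mvn") only matters when building DataX from source.
    for name in missing_tools(["java", "mvn"]):
        print("not found on PATH:", name)
```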

Installing DataX on Linux

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files that may cause errors

Verify the installation with:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

Preparing the MySQL Test Environment

Create identical MySQL databases on two hosts, grant all privileges to root@'%' with password 123123, and generate about three million rows (2,999,999, because the loop stops before 3,000,000) using the following stored procedure:

DELIMITER $$
CREATE PROCEDURE test()
BEGIN
  DECLARE A INT DEFAULT 1;
  WHILE (A < 3000000) DO
    INSERT INTO `course-study`.t_member VALUES(A, CONCAT('LiSa',A), CONCAT('LiSa',A,'@163.com'));
    SET A = A + 1;
  END WHILE;
END $$
DELIMITER ;
CALL test();  -- run the procedure to populate t_member
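To see exactly which rows the procedure produces, its loop can be mirrored in Python. This is only an illustration of the loop bounds and values; the tuple field names (id, name, email) are assumptions, since the table definition is not shown.

```python
def member_rows(limit=3_000_000):
    """Yield the value tuples that the stored procedure inserts, in order."""
    a = 1
    while a < limit:  # same bound as the WHILE loop: stops before `limit`
        yield (a, f"LiSa{a}", f"LiSa{a}@163.com")
        a += 1


first = next(member_rows())
total = sum(1 for _ in member_rows())
print(first, total)
```

The count comes out one short of three million because the loop condition is a strict less-than.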

Full‑Data Synchronization

Generate a template with

python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter

and edit the JSON job file (example below) to specify source and target JDBC URLs, tables, credentials, and optional preSql and session settings.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "preSql": ["truncate t_member"],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}
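When several tables need the same kind of job, the file can be generated rather than edited by hand. The sketch below builds a job with the same keys as the example above; `build_mysql_job` and its parameters are hypothetical names, and the resulting text would still need to be saved to a job file such as install.json.

```python
import json


def build_mysql_job(src_url, dst_url, table, username, password,
                    split_pk="ID", channels=5, truncate_first=True):
    """Return a job dict shaped like the mysqlreader/mysqlwriter example."""
    reader = {
        "name": "mysqlreader",
        "parameter": {
            "username": username,
            "password": password,
            "column": ["*"],
            "splitPk": split_pk,
            # mysqlreader accepts a list of JDBC URLs
            "connection": [{"jdbcUrl": [src_url], "table": [table]}],
        },
    }
    writer = {
        "name": "mysqlwriter",
        "parameter": {
            "username": username,
            "password": password,
            "column": ["*"],
            # mysqlwriter takes a single JDBC URL string
            "connection": [{"jdbcUrl": dst_url, "table": [table]}],
            "session": ["set session sql_mode='ANSI'"],
            "writeMode": "insert",
        },
    }
    if truncate_first:
        # Empty the target table before a full load
        writer["parameter"]["preSql"] = [f"truncate {table}"]
    return {
        "job": {
            "content": [{"reader": reader, "writer": writer}],
            "setting": {"speed": {"channel": channels}},
        }
    }


job = build_mysql_job(
    "jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8",
    "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
    table="t_member", username="root", password="123123",
)
job_text = json.dumps(job, indent=2)  # text to save as the job file
print(job_text)
```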

Run the job:

# python /usr/local/datax/bin/datax.py install.json
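For scheduled runs, the same command can be launched from a script that checks the exit code. This is a minimal sketch: the helper names are hypothetical, and the install path and the `python` launcher are taken from the commands in this article.

```python
import subprocess


def datax_command(job_file, datax_home="/usr/local/datax"):
    """Build the argument list for launching a DataX job file."""
    return ["python", f"{datax_home}/bin/datax.py", job_file]


def run_job(job_file, datax_home="/usr/local/datax"):
    """Run the job and return the process exit code (0 means success)."""
    return subprocess.run(datax_command(job_file, datax_home)).returncode


print(" ".join(datax_command("install.json")))
```

A caller such as a cron wrapper could treat any non-zero return value from `run_job` as a failed sync and alert or retry.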

Sample output shows processing of 2,999,999 records at 2.57 MB/s (≈ 75 k records/s) with a total runtime of 42 seconds.

Incremental Synchronization

Incremental sync is achieved by adding a where clause to the Reader configuration. The JSON below filters rows with ID <= 1888 and removes the preSql that truncates the target table.

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "splitPk": "ID",
          "where": "ID <= 1888",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"],
            "table": ["t_member"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "root",
          "password": "123123",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8",
            "table": ["t_member"]
          }],
          "session": ["set session sql_mode='ANSI'"],
          "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

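Running this job processes 1,888 records at 1.61 KB/s (≈ 62 records/s) in 32 seconds. With so few rows, the run time is likely dominated by fixed job overhead rather than data transfer, so the per‑record rate should not be compared directly with the full‑sync figure.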
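The two differences between the full and incremental jobs (add a where filter, drop the truncating preSql) can be applied programmatically. The sketch below is illustrative only: `to_incremental` is a hypothetical helper, and `full_job` is a trimmed stand-in that shows just the keys being changed.

```python
import copy


def to_incremental(full_job, where):
    """Copy a full-sync job, add a reader filter, and drop the writer's preSql."""
    job = copy.deepcopy(full_job)  # leave the full-sync job untouched
    step = job["job"]["content"][0]
    step["reader"]["parameter"]["where"] = where
    step["writer"]["parameter"].pop("preSql", None)  # do not truncate on incremental runs
    return job


# Trimmed stand-in for the full-sync job; only the keys touched here are shown.
full_job = {
    "job": {
        "content": [{
            "reader": {"name": "mysqlreader", "parameter": {"splitPk": "ID"}},
            "writer": {"name": "mysqlwriter",
                       "parameter": {"preSql": ["truncate t_member"], "writeMode": "insert"}},
        }]
    }
}

incremental = to_incremental(full_job, "ID <= 1888")
print(incremental["job"]["content"][0]["reader"]["parameter"])
```

Working on a deep copy keeps the full-sync definition reusable, so both jobs can be derived from one template.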

Key Takeaways

Full synchronization is straightforward but may be interrupted on very large datasets.

Incremental sync, driven by the where parameter, efficiently handles ongoing data changes.

DataX’s plug‑in architecture, job/task grouping, and configurable concurrency make it suitable for both batch and incremental scenarios.

[Figure: Comparison of middleware for heterogeneous data sources]
[Figure: DataX architecture diagram]
[Figure: DataX framework components]
[Figure: Job, Task, TaskGroup workflow]
[Figure: Stored procedure for generating test data]
[Figure: Result of incremental sync]