Big Data 15 min read

Master DataX: Fast Offline Data Sync for MySQL without mysqldump

This guide explains how to use Alibaba's open‑source DataX tool to perform high‑performance offline synchronization between heterogeneous MySQL databases, covering installation, framework design, job configuration, full‑ and incremental sync, and practical command‑line examples.

Java Backend Technology
Java Backend Technology
Java Backend Technology
Master DataX: Fast Offline Data Sync for MySQL without mysqldump

Preface

A project with 50 million rows required reliable data synchronization between a business database and a reporting database, but cross‑database SQL queries were impractical.

Initial ideas such as mysqldump or simple file storage proved too slow or unreliable.

What Is DataX?

DataX is the open‑source version of Alibaba Cloud DataWorks data integration, designed for offline data synchronization across heterogeneous sources such as MySQL, Oracle, HDFS, Hive, ODPS, HBase, FTP, etc.

It transforms complex mesh‑like sync topologies into a star‑shaped architecture, where DataX acts as the middle‑man connecting various data sources.

DataX 3.0 Framework Design

DataX follows a Framework + Plugin model. Readers and Writers are implemented as plugins that plug into the central Framework.

The core components are:

Reader (采集模块) : collects data from a source and sends it to the Framework.

Writer (写入模块) : pulls data from the Framework and writes it to the target.

Framework (中间商) : connects Reader and Writer, handling buffering, flow control, concurrency, and data conversion.

DataX 3.0 Core Architecture

A DataX job is the unit of work. When a job starts, it spawns a process that manages the whole sync lifecycle, including data cleaning, task splitting, and TaskGroup management.

DataX splits the job into multiple Task instances based on the source’s split strategy.

The Scheduler assembles Tasks into TaskGroup according to the configured concurrency.

Each Task launches a Reader → Channel → Writer pipeline.

The Job monitors all TaskGroups and exits successfully when every group finishes (non‑zero exit code indicates failure).

Using DataX for Data Synchronization

Prerequisites

JDK 1.8 or higher (recommended 1.8)

Python 2 or 3 (CentOS 7 already includes Python 2.7)

Apache Maven 3.x (only needed for compiling DataX; binary tarball can be used without Maven)

Install JDK

# ls
anaconda-ks.cfg  jdk-8u181-linux-x64.tar.gz
# tar zxf jdk-8u181-linux-x64.tar.gz
# mv jdk1.8.0_181 /usr/local/java
# echo "export JAVA_HOME=/usr/local/java" >> /etc/profile
# echo "export PATH=\$PATH:\"\$JAVA_HOME/bin\"" >> /etc/profile
# source /etc/profile
# java -version

CentOS 7 already provides Python 2.7, so no extra installation is required.

Install DataX

# wget http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz
# tar zxf datax.tar.gz -C /usr/local/
# rm -rf /usr/local/datax/plugin/*/._*   # delete hidden files

Verify installation:

# cd /usr/local/datax/bin
# python datax.py ../job/job.json   # should print job info

Basic DataX Usage

Show the template for a stream reader/writer:

# python /usr/local/datax/bin/datax.py -r streamreader -w streamwriter

The command prints usage instructions and a JSON skeleton.

Installing MySQL on Two Hosts

# yum -y install mariadb mariadb-server mariadb-libs mariadb-devel
# systemctl start mariadb
# mysql_secure_installation   # set root password, remove anonymous users, etc.

Create identical tables on both hosts:

CREATE DATABASE `course-study`;
CREATE TABLE `course-study`.t_member(ID INT, Name VARCHAR(20), Email VARCHAR(30));

Grant full privileges for DataX:

GRANT ALL PRIVILEGES ON *.* TO 'root'@'%' IDENTIFIED BY '123123';
FLUSH PRIVILEGES;

Full‑Volume Sync with DataX

Generate a MySQL‑to‑MySQL job template:

# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter
{
  "job": {
    "content": [{
      "reader": {"name": "mysqlreader", "parameter": {"column": [], "connection": [{"jdbcUrl": [], "table": []}], "username": "", "password": "", "where": ""}},
      "writer": {"name": "mysqlwriter", "parameter": {"column": [], "connection": [{"jdbcUrl": "", "table": []}], "username": "", "password": "", "preSql": [], "session": [], "writeMode": ""}}
    }],
    "setting": {"speed": {"channel": ""}}
  }
}

Fill in the JSON with actual connection strings, credentials, and optional pre‑SQL (e.g., truncate t_member).

Run the job:

# python /usr/local/datax/bin/datax.py install.json

Sample output shows processing of ~3 million records at ~2.5 MB/s.

Incremental Sync

Incremental sync only differs by adding a where clause to filter rows.

{
  "job": {
    "content": [{
      "reader": {"name": "mysqlreader", "parameter": {"username": "root", "password": "123123", "column": ["*"], "splitPk": "ID", "where": "ID <= 1888", "connection": [{"jdbcUrl": ["jdbc:mysql://192.168.1.1:3306/course-study?useUnicode=true&characterEncoding=utf8"], "table": ["t_member"]}] }},
      "writer": {"name": "mysqlwriter", "parameter": {"column": ["*"], "connection": [{"jdbcUrl": "jdbc:mysql://192.168.1.2:3306/course-study?useUnicode=true&characterEncoding=utf8", "table": ["t_member"]}], "username": "root", "password": "123123", "writeMode": "insert"}}
    }],
    "setting": {"speed": {"channel": "5"}}
  }
}

Run the incremental job similarly; the output confirms only the filtered rows were transferred.

Key Takeaways

DataX provides a robust, star‑topology solution for large‑scale offline data sync.

Full sync works well for initial loads; incremental sync requires proper where conditions and may need to remove destructive preSql statements.

The framework’s plug‑in architecture makes it easy to extend to other data sources.

Author: 愿许浪尽天涯 (source: blog.csdn.net/weixin_46902396/article/details/121904705)

big dataLinuxMySQLData SynchronizationDataXETLIncremental Sync
Java Backend Technology
Written by

Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.