Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment

This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.

Big Data Technology & Architecture

Aug 4, 2022

DataX is the open‑source version of the Data Integration component of Alibaba Cloud DataWorks. It is designed for offline synchronization between heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, and FTP.

It simplifies complex mesh synchronization topologies into a star‑shaped data link, acting as a middle‑layer carrier that connects various data sources; adding a new source only requires integrating it with DataX.
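To make the mesh-versus-star argument concrete, the following minimal sketch counts what each topology requires. It assumes every source may need to sync to every other source; the function names are illustrative and not part of DataX.

```python
def mesh_links(n: int) -> int:
    """Directed point-to-point sync links when every source talks to every other."""
    return n * (n - 1)


def star_plugins(n: int) -> int:
    """Plugins needed with a central hub: one reader and one writer per source."""
    return 2 * n


for n in (3, 5, 10):
    print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

The mesh grows quadratically while the star grows linearly, which is why adding a new source only means writing its own reader and writer.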

DataX 3.0 Framework Design

DataX follows a Framework + Plugin architecture, abstracting data source reading and writing as Reader/Writer plugins.

DataX 3.0 Core Architecture

A single synchronization job is called a Job. Upon receiving a Job, DataX launches a process that handles data cleaning, task splitting, and TaskGroup management. The Job is divided into multiple Tasks for concurrent execution, which are then assembled into TaskGroups based on the configured concurrency.

Each Task follows a fixed execution flow: Reader → Channel → Writer.
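The Reader → Channel → Writer flow can be pictured as a producer and a consumer joined by a bounded buffer. The sketch below is a simplified Python analogy, not DataX's Java implementation; `reader`, `writer`, `run_task`, and the sentinel are hypothetical names chosen for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel that marks the end of the record stream


def reader(rows, channel):
    """Plays the Reader plugin: pull records from the source into the channel."""
    for row in rows:
        channel.put(row)  # blocks when the channel is full (back-pressure)
    channel.put(_DONE)


def writer(channel, sink):
    """Plays the Writer plugin: drain the channel into the target."""
    while True:
        row = channel.get()
        if row is _DONE:
            break
        sink.append(row)


def run_task(rows, capacity=2):
    """Run one Task: a reader thread and a writer thread joined by a bounded channel."""
    channel = queue.Queue(maxsize=capacity)
    sink = []
    threads = [
        threading.Thread(target=reader, args=(rows, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


print(run_task([(1, "alice"), (2, "bob"), (3, "carol")]))
```

Because the framework owns the channel, a reader for one source can be paired with a writer for any other, which is the point of the Framework + Plugin design.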

DataX Scheduling Process

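The Job first splits the data into Tasks according to the source's sharding rules. It then derives the number of TaskGroups from the configured concurrency: the channel count divided by the number of channels each TaskGroup runs (5 by default). For example, 100 Tasks with 20 channels yield 4 TaskGroups of 25 Tasks each, and each TaskGroup runs its Tasks 5 at a time.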
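The scheduling arithmetic can be sketched as follows. The default of 5 channels per TaskGroup and the round-robin assignment of Tasks are assumptions made for this illustration; DataX's actual assignment logic is more involved, and `plan_task_groups` is a hypothetical helper.

```python
import math


def plan_task_groups(task_count, channel, channels_per_group=5):
    """Return a list of TaskGroups, each a list of Task ids.

    channels_per_group=5 is assumed here as the per-TaskGroup concurrency.
    """
    group_count = math.ceil(channel / channels_per_group)
    groups = [[] for _ in range(group_count)]
    for task_id in range(task_count):
        # Simplified round-robin distribution of Tasks across TaskGroups.
        groups[task_id % group_count].append(task_id)
    return groups


groups = plan_task_groups(task_count=100, channel=20)
print(len(groups), [len(g) for g in groups])
```

With 100 Tasks and 20 channels this yields 4 groups of 25 Tasks, so each group works through its Tasks 5 at a time.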

Using DataX for Data Synchronization

Example: generating a MySQL‑to‑MySQL synchronization template.

# Output MySQL configuration template
[root@192 bin]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter > /usr/local/datax/job/mysql2mysql.json

# Edit the generated mysql2mysql.json file
{
  "job": {
    "content": [
      {
        "reader": {
          "name": "mysqlreader",
          "parameter": {
            "column": ["id","name"],
            "connection": [{
              "jdbcUrl": ["jdbc:mysql://x.x.x.210:3306/mytest"],
              "table": ["user"]
            }],
            "password": "root",
            "username": "root"
          }
        },
        "writer": {
          "name": "mysqlwriter",
          "parameter": {
            "column": ["id","name"],
            "connection": [{
              "jdbcUrl": "jdbc:mysql://192.168.88.192:3306/mytest",
              "table": ["user"]
            }],
            "password": "root",
            "username": "root",
            "writeMode": "insert"
          }
        }
      }
    ],
    "setting": {
      "speed": {"channel": "6"}
    }
  }
}
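Before running a job, it can help to sanity-check the edited file. The sketch below is a hypothetical pre-flight check, not a DataX feature: it verifies that reader and writer column counts match and that `jdbcUrl` has the shape used in the template above (a list for the reader, a string for the writer). The embedded JSON is a trimmed copy of the job above.

```python
import json

# Trimmed copy of the job above; hosts and credentials are the article's placeholders.
JOB_JSON = """
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "column": ["id", "name"],
          "connection": [{"jdbcUrl": ["jdbc:mysql://x.x.x.210:3306/mytest"], "table": ["user"]}],
          "username": "root", "password": "root"
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "column": ["id", "name"],
          "connection": [{"jdbcUrl": "jdbc:mysql://192.168.88.192:3306/mytest", "table": ["user"]}],
          "username": "root", "password": "root", "writeMode": "insert"
        }
      }
    }],
    "setting": {"speed": {"channel": "6"}}
  }
}
"""


def check_job(job):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for i, item in enumerate(job["job"]["content"]):
        reader = item["reader"]["parameter"]
        writer = item["writer"]["parameter"]
        uses_wildcard = "*" in reader["column"] or "*" in writer["column"]
        if not uses_wildcard and len(reader["column"]) != len(writer["column"]):
            problems.append(f"content[{i}]: reader and writer column counts differ")
        for conn in reader["connection"]:
            if not isinstance(conn["jdbcUrl"], list):
                problems.append(f"content[{i}]: reader jdbcUrl should be a list")
        for conn in writer["connection"]:
            if not isinstance(conn["jdbcUrl"], str):
                problems.append(f"content[{i}]: writer jdbcUrl should be a string")
    return problems


print(check_job(json.loads(JOB_JSON)) or "no problems found")
```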

Validation command and sample output:

# python /usr/local/datax/bin/datax.py mysql2mysql.json
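2022-04-24 17:39:03.445 [job-0] INFO  JobContainer - 
任务启动时刻 (job start time): 2022-04-24 17:38:49
任务结束时刻 (job end time): 2022-04-24 17:39:03
任务总计耗时 (total elapsed time): 14s
记录写入速度 (record write speed): 0rec/s
读出记录总数 (total records read): 3
读写失败总数 (total read/write failures): 0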

DataX‑WEB Installation and Deployment

Repository: https://github.com/WeiYe-Jing/datax-web

Steps:

Extract the installation package to the desired directory.

Create a MySQL database (e.g., create database dataxweb;).

Run the one‑click install script:

# cd bin/
# ./install.sh

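If the mysql client is not available on the install host, the script cannot initialize the database for you. In that case, manually execute modules/datax-admin/bin/db/datax-web.sql against the database, then edit modules/datax-admin/conf/bootstrap.properties to set DB_HOST, DB_PORT, DB_USERNAME, DB_PASSWORD, and DB_DATABASE.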

Configure optional settings such as email service in modules/datax-admin/bin/env.properties and Python path in modules/datax-executor/bin/env.properties.

Start all services with # ./bin/start-all.sh (or start individual modules with # ./bin/start.sh -m {module_name}).

Stop services with # ./bin/stop-all.sh (or # ./bin/stop.sh -m {module_name}).

Verify the running processes with the jps command; the output should list DataXAdminApplication and DataXExecutorApplication. If either is missing, check modules/datax-admin/bin/console.out or modules/datax-executor/bin/console.out for errors.

DataX‑WEB Operation

Access the UI at http://ip:port/index.html (default port 9527) and log in with username admin and password 123456. API documentation is available at http://ip:port/doc.html.

Routing Strategies

When deploying executor clusters, DataX‑WEB provides various routing strategies:

FIRST (first machine)
LAST (last machine)
ROUND (round‑robin)
RANDOM (random online machine)
CONSISTENT_HASH (hash‑based fixed selection)
LEAST_FREQUENTLY_USED (least used)
LEAST_RECENTLY_USED (longest idle)
FAILOVER (first successful heartbeat)
BUSYOVER (first idle machine)
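A few of these strategies can be sketched in a handful of lines. This is an illustrative Python model only, not the DataX‑WEB implementation; the `Router` class, the MD5-based hash, and the virtual-node count are assumptions. FAILOVER, BUSYOVER, and the usage-based strategies need live executor state and are omitted.

```python
import bisect
import hashlib
import itertools
import random


def _hash(key):
    """Stable integer hash so the same key maps to the same ring position across runs."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Router:
    """Picks an executor address for a job according to a named strategy."""

    def __init__(self, executors, virtual_nodes=100):
        self.executors = list(executors)
        self._round = itertools.count()
        # Hash ring with virtual nodes, used by CONSISTENT_HASH.
        ring = sorted(
            (_hash(f"{addr}#{i}"), addr)
            for addr in self.executors
            for i in range(virtual_nodes)
        )
        self._ring_keys = [h for h, _ in ring]
        self._ring_addrs = [a for _, a in ring]

    def route(self, strategy, job_id):
        if strategy == "FIRST":
            return self.executors[0]
        if strategy == "LAST":
            return self.executors[-1]
        if strategy == "ROUND":
            return self.executors[next(self._round) % len(self.executors)]
        if strategy == "RANDOM":
            return random.choice(self.executors)
        if strategy == "CONSISTENT_HASH":
            # First ring position at or after the job's hash, wrapping around.
            idx = bisect.bisect_left(self._ring_keys, _hash(str(job_id)))
            return self._ring_addrs[idx % len(self._ring_addrs)]
        raise ValueError(f"strategy not modelled in this sketch: {strategy}")


router = Router(["10.0.0.1:9999", "10.0.0.2:9999", "10.0.0.3:9999"])
print([router.route("ROUND", job_id=7) for _ in range(4)])
print(router.route("CONSISTENT_HASH", job_id=7))
```

CONSISTENT_HASH keeps a given job on the same executor from run to run, which is useful when the executor holds local state for that job; ROUND and RANDOM spread load instead.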

Blocking handling strategies include:

Single‑machine serial: FIFO queue, tasks run serially.

Discard subsequent schedules: new tasks are dropped if a task is already running.

Overwrite previous schedule: running task is terminated and queue cleared before new task runs.

For incremental sync, it is recommended to use “discard subsequent schedules” or “single‑machine serial”.
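The three blocking strategies differ only in what happens when a trigger arrives while a run is still in progress. The sketch below models that decision in Python; the class name and the strategy labels (`SERIAL`, `DISCARD_LATER`, `COVER_EARLY`) are hypothetical names for the three options described above.

```python
from collections import deque


class ExecutorSlot:
    """One executor's view of a single job: the current run and its waiting queue."""

    def __init__(self):
        self.running = None
        self.pending = deque()

    def trigger(self, run_id, strategy):
        """Handle a new schedule and report what happened to it."""
        if self.running is None:
            self.running = run_id
            return "started"
        if strategy == "SERIAL":         # single-machine serial: queue in FIFO order
            self.pending.append(run_id)
            return "queued"
        if strategy == "DISCARD_LATER":  # discard subsequent schedules
            return "discarded"
        if strategy == "COVER_EARLY":    # overwrite previous schedule
            self.pending.clear()
            self.running = run_id        # the earlier run is considered terminated
            return "replaced"
        raise ValueError(f"unknown strategy: {strategy}")

    def finish(self):
        """Mark the current run done and start the next queued run, if any."""
        self.running = self.pending.popleft() if self.pending else None


slot = ExecutorSlot()
print(slot.trigger("run-1", "SERIAL"), slot.trigger("run-2", "SERIAL"))
slot.finish()
print(slot.running)
```

The recommendation for incremental sync follows from this model: overwriting a run that is part-way through a time window would leave that window half-synced, whereas queuing or discarding never interrupts a window in progress.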

Task Types and Configuration

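Select the DataX task type, configure the source and target parameters, and optionally edit the JSON directly for advanced settings. The example below is a MySQL source with an incremental condition: the where clause selects rows whose save_time falls in the half-open window from ${lastTime} (inclusive) to ${currentTime} (exclusive), and both placeholders are filled in at run time.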

{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "root",
          "password": "root",
          "column": ["*"],
          "where": " save_time >= FROM_UNIXTIME(${lastTime}) and save_time < FROM_UNIXTIME(${currentTime})",
          "splitPk": "id",
          "connection": [{
            "table": ["uc_op_amazon_api_store_download"],
            "jdbcUrl": ["jdbc:mysql://x.x.x.210:3306/test_system"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "writeMode": "insert",
          "username": "root",
          "password": "root",
          "column": ["*"],
          "connection": [{
            "jdbcUrl": "jdbc:mysql://192.168.88.192:3306/mytest?useUnicode=true&characterEncoding=utf8",
            "table": ["uc_op_amazon_api_store_download"]
          }]
        }
      }
    }],
    "setting": {"speed": {"channel": 6}}
  }
}
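The `${lastTime}` and `${currentTime}` placeholders are resolved when the job is launched. The sketch below shows one way this could look, assuming the values are passed through `datax.py`'s `-p` option as `-D` pairs and that the job file path is `incremental.json` (a hypothetical name); `string.Template` is used only to preview the resulting where clause.

```python
from string import Template

WHERE = (" save_time >= FROM_UNIXTIME(${lastTime})"
         " and save_time < FROM_UNIXTIME(${currentTime})")


def render_where(last_time, current_time):
    """Preview the where clause after placeholder substitution."""
    return Template(WHERE).substitute(lastTime=last_time, currentTime=current_time)


def build_command(job_path, last_time, current_time,
                  datax_py="/usr/local/datax/bin/datax.py"):
    """Build the argv list for one incremental run (suitable for subprocess.run)."""
    params = f"-DlastTime={last_time} -DcurrentTime={current_time}"
    return ["python", datax_py, "-p", params, job_path]


# Epoch seconds; in practice last_time is the previous run's current_time.
last_time, current_time = 1650000000, 1650003600
print(render_where(last_time, current_time))
print(build_command("incremental.json", last_time, current_time))
```

Because the window is closed on the left and open on the right, feeding each run's `currentTime` in as the next run's `lastTime` covers every row exactly once, with no overlap and no gap.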

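For incremental sync by ID range instead of by time, the task takes the following command‑line parameters, together with the table name and primary key used to track the range: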

-DstartId='${startId}' -DendId='${endId}'
# Table name
uc_op_business_reports
# Primary key
id

The article concludes with screenshots of the UI for executor monitoring, project creation, routing configuration, and task building.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
