Comprehensive Guide to DataX: Introduction, Architecture, Usage, and Deployment
This article provides a detailed overview of DataX, covering its purpose, framework design, core architecture, scheduling process, practical examples of MySQL-to-MySQL synchronization, step‑by‑step installation and configuration of DataX‑WEB, UI usage, routing strategies, task types, and advanced task building techniques.
DataX is the open‑source version of Alibaba Cloud DataWorks data integration, designed for offline synchronization between heterogeneous data sources such as relational databases (MySQL, Oracle), HDFS, Hive, ODPS, HBase, and FTP.
It simplifies complex mesh synchronization topologies into a star‑shaped data link, acting as a middle‑layer carrier that connects various data sources; adding a new source only requires integrating it with DataX.
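A rough count shows why the star shape matters. The sketch below assumes that a mesh needs one dedicated one-way link per ordered pair of sources, and that the star needs one Reader and one Writer plugin per source; the function names are illustrative and not part of DataX.

```python
def mesh_links(n_sources: int) -> int:
    """Point-to-point sync: one dedicated link per ordered (source, target) pair."""
    return n_sources * (n_sources - 1)


def star_plugins(n_sources: int) -> int:
    """Star topology: each source needs one Reader and one Writer plugin."""
    return 2 * n_sources


if __name__ == "__main__":
    for n in (3, 6, 10):
        print(f"{n} sources: mesh={mesh_links(n)} links, star={star_plugins(n)} plugins")
```

Under these assumptions the mesh grows quadratically while the star grows linearly, which is why adding a source only requires integrating it with DataX once.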
DataX 3.0 Framework Design
DataX follows a Framework + Plugin architecture, abstracting data source reading and writing as Reader/Writer plugins.
DataX 3.0 Core Architecture
A single synchronization job is called a Job. Upon receiving a Job, DataX launches a process that handles data cleaning, task splitting, and TaskGroup management. The Job is divided into multiple Tasks for concurrent execution, which are then assembled into TaskGroups based on the configured concurrency.
Each Task follows a fixed execution flow: Reader → Channel → Writer.
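The Reader → Channel → Writer flow can be pictured as a producer and a consumer joined by a bounded buffer. The following is a minimal Python sketch of that idea, not DataX's Java implementation; the names reader, writer, and run_task are hypothetical, and a standard-library queue stands in for the Channel.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def reader(rows, channel):
    """Reader side: push each source record into the channel."""
    for row in rows:
        channel.put(row)
    channel.put(_DONE)


def writer(channel, sink):
    """Writer side: drain the channel until the sentinel arrives."""
    while True:
        record = channel.get()
        if record is _DONE:
            break
        sink.append(record)


def run_task(rows, capacity=2):
    """Run one Task: a reader thread and a writer thread joined by a bounded channel."""
    channel = queue.Queue(maxsize=capacity)  # bounded buffer applies back-pressure
    sink = []
    threads = [
        threading.Thread(target=reader, args=(rows, channel)),
        threading.Thread(target=writer, args=(channel, sink)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink


if __name__ == "__main__":
    print(run_task([(1, "alice"), (2, "bob"), (3, "carol")]))
```

Because the queue is bounded, a slow writer eventually blocks the reader, so a fast source cannot flood memory.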
DataX Scheduling Process
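The Job splits the data into Tasks according to its sharding rules, then derives the number of TaskGroups from the configured concurrency: the channel count divided by the number of channels each TaskGroup runs (5 by default in DataX 3.0). Each TaskGroup then runs its assigned Tasks at that concurrency.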
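As a worked example of the scheduling arithmetic, the sketch below assumes the DataX 3.0 default of five channels per TaskGroup and rounds up when the channel count is not a multiple of five. The function name plan_task_groups and the round-robin assignment of Tasks to groups are simplifications for illustration, not DataX's actual assignment logic.

```python
import math


def plan_task_groups(task_count, channel_count, channels_per_group=5):
    """Return a list of task-id lists, one per TaskGroup.

    channels_per_group=5 mirrors the DataX 3.0 default (an assumption here);
    tasks are dealt out round-robin purely for illustration.
    """
    group_count = max(1, math.ceil(channel_count / channels_per_group))
    groups = [[] for _ in range(group_count)]
    for task_id in range(task_count):
        groups[task_id % group_count].append(task_id)
    return groups


if __name__ == "__main__":
    groups = plan_task_groups(task_count=100, channel_count=20)
    print(len(groups), [len(g) for g in groups])
```

With 100 Tasks and 20 channels this yields 4 TaskGroups of 25 Tasks each, and each group works through its 25 Tasks five at a time.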
Using DataX for Data Synchronization
Example: generating a MySQL‑to‑MySQL synchronization template.
# Output MySQL configuration template
[root@192 bin]# python /usr/local/datax/bin/datax.py -r mysqlreader -w mysqlwriter > /usr/local/datax/job/mysql2mysql.json
# Edit the generated mysql2mysql.json file
{
"job": {
"content": [
{
"reader": {
"name": "mysqlreader",
"parameter": {
"column": ["id","name"],
"connection": [{
"jdbcUrl": ["jdbc:mysql://x.x.x.210:3306/mytest"],
"table": ["user"]
}],
"password": "root",
"username": "root"
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"column": ["id","name"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.88.192:3306/mytest",
"table": ["user"]
}],
"password": "root",
"username": "root",
"writeMode": "insert"
}
}
}
],
"setting": {
"speed": {"channel": "6"}
}
}
}
Validation command and sample output:
# python /usr/local/datax/bin/datax.py mysql2mysql.json
2022-04-24 17:39:03.445 [job-0] INFO JobContainer -
任务启动时刻: 2022-04-24 17:38:49
任务结束时刻: 2022-04-24 17:39:03
任务总计耗时: 14s
记录写入速度: 0rec/s
读出记录总数: 3
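读写失败总数: 0
The summary fields are, in order: job start time, job end time, total elapsed time, record write speed, total records read, and total read/write failures.
DataX‑WEB Installation and Deployment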
Repository: https://github.com/WeiYe-Jing/datax-web
Steps:
Extract the installation package to the desired directory.
Create a MySQL database (e.g., create database dataxweb;).
Run the one‑click install script:
# cd bin/
# ./install.sh
If MySQL is not installed, manually execute modules/datax-admin/bin/db/datax-web.sql and edit modules/datax-admin/conf/bootstrap.properties to set DB_HOST, DB_PORT, DB_USERNAME, DB_PASSWORD, and DB_DATABASE.
Configure optional settings such as email service in modules/datax-admin/bin/env.properties and Python path in modules/datax-executor/bin/env.properties.
Start all services with ./bin/start-all.sh (or start an individual module with ./bin/start.sh -m {module_name}).
Stop all services with ./bin/stop-all.sh (or stop an individual module with ./bin/stop.sh -m {module_name}).
Verify the running processes with the jps command; both DataXAdminApplication and DataXExecutorApplication should be listed. For troubleshooting, check the logs in modules/datax-admin/bin/console.out or modules/datax-executor/bin/console.out.
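The jps check can be scripted. The sketch below is one possible approach: it parses jps output and reports which of the two expected main classes are absent. The helper names and the sample PIDs are made up for illustration; check_local_jvm assumes the JDK's jps binary is on the PATH.

```python
import subprocess

EXPECTED = ("DataXAdminApplication", "DataXExecutorApplication")


def missing_services(jps_output, expected=EXPECTED):
    """Return the expected main-class names that do not appear in jps output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # "<pid> <main class>"; jps may print a bare pid
            running.add(parts[1])
    return [name for name in expected if name not in running]


def check_local_jvm():
    """Run jps (ships with the JDK) and report which services are missing."""
    result = subprocess.run(["jps"], capture_output=True, text=True, check=True)
    return missing_services(result.stdout)


if __name__ == "__main__":
    # Illustrative sample only: the admin module is up, the executor is not.
    sample = "2101 DataXAdminApplication\n2345 Jps\n"
    print(missing_services(sample))
```

An empty list means both services are running; anything else names the module whose console.out is worth inspecting first.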
DataX‑WEB Operation
Access the UI at http://ip:port/index.html (default port 9527) and log in with username admin and password 123456. API documentation is available at http://ip:port/doc.html.
Routing Strategies
When deploying executor clusters, DataX‑WEB provides various routing strategies:
FIRST (first machine)
LAST (last machine)
ROUND (round‑robin)
RANDOM (random online machine)
CONSISTENT_HASH (hash‑based fixed selection)
LEAST_FREQUENTLY_USED (least used)
LEAST_RECENTLY_USED (longest idle)
FAILOVER (first successful heartbeat)
BUSYOVER (first idle machine)
Blocking handling strategies include:
Single‑machine serial: FIFO queue, tasks run serially.
Discard subsequent schedules: new tasks are dropped if a task is already running.
Overwrite previous schedule: running task is terminated and queue cleared before new task runs.
For incremental sync, it is recommended to use “discard subsequent schedules” or “single‑machine serial”.
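To make the three blocking strategies concrete, the sketch below models what happens when a new trigger arrives while an earlier one is still running. It is a simplified model rather than DataX‑WEB's implementation; the JobSlot class and the strategy constants are hypothetical names.

```python
from collections import deque

SERIAL, DISCARD_LATER, COVER_EARLY = "serial", "discard_later", "cover_early"


class JobSlot:
    """Per-job state on one executor: the running trigger plus a waiting queue."""

    def __init__(self):
        self.running = None
        self.waiting = deque()
        self.dropped = []  # triggers that never ran
        self.killed = []   # triggers terminated mid-run

    def trigger(self, trigger_id, strategy):
        if self.running is None:
            self.running = trigger_id
        elif strategy == SERIAL:
            self.waiting.append(trigger_id)      # FIFO: runs after the current one
        elif strategy == DISCARD_LATER:
            self.dropped.append(trigger_id)      # the new trigger is dropped
        elif strategy == COVER_EARLY:
            self.killed.append(self.running)     # terminate the running task
            self.dropped.extend(self.waiting)    # clear the queue
            self.waiting.clear()
            self.running = trigger_id
        else:
            raise ValueError(f"unknown strategy: {strategy}")

    def finish(self):
        """Current trigger completed; start the next queued one, if any."""
        self.running = self.waiting.popleft() if self.waiting else None


if __name__ == "__main__":
    for strategy in (SERIAL, DISCARD_LATER, COVER_EARLY):
        slot = JobSlot()
        for t in ("t1", "t2", "t3"):
            slot.trigger(t, strategy)
        print(strategy, slot.running, list(slot.waiting), slot.dropped, slot.killed)
```

In this model only the overwrite strategy kills a run that is already in progress, which may leave an incremental window partially written; that is consistent with the recommendation to prefer the other two strategies for incremental sync.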
Task Types and Configuration
Select DataX task, configure source and target parameters, and optionally edit JSON directly for advanced settings. Example JSON for a MySQL source with incremental condition:
{
"job": {
"content": [{
"reader": {
"name": "mysqlreader",
"parameter": {
"username": "root",
"password": "root",
"column": ["*"],
"where": " save_time >= FROM_UNIXTIME(${lastTime}) and save_time < FROM_UNIXTIME(${currentTime})",
"splitPk": "id",
"connection": [{
"table": ["uc_op_amazon_api_store_download"],
"jdbcUrl": ["jdbc:mysql://x.x.x.210:3306/test_system"]
}]
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "insert",
"username": "root",
"password": "root",
"column": ["*"],
"connection": [{
"jdbcUrl": "jdbc:mysql://192.168.88.192:3306/mytest?useUnicode=true&characterEncoding=utf8",
"table": ["uc_op_amazon_api_store_download"]
}]
}
}
}],
"setting": {"speed": {"channel": 6}}
}
}
Additional command‑line parameters for incremental ID range:
-DstartId='${startId}' -DendId='${endId}'
# Table name
uc_op_business_reports
# Primary key
id
The article concludes with screenshots of the UI for executor monitoring, project creation, routing configuration, and task building.
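Returning to the incremental where clause in the task JSON: the ${lastTime} and ${currentTime} placeholders are filled in at run time with epoch seconds, which is why the clause wraps them in FROM_UNIXTIME. The sketch below emulates that substitution with Python's string.Template, whose ${name} syntax happens to match. It is an illustration only: the helper names are hypothetical, the -DlastTime/-DcurrentTime parameter string is modeled on the -DstartId/-DendId form shown above, and DataX performs the real substitution internally.

```python
import time
from string import Template

WHERE = ("save_time >= FROM_UNIXTIME(${lastTime}) "
         "and save_time < FROM_UNIXTIME(${currentTime})")


def build_params(last_time: int, current_time: int) -> str:
    """Build a -D style parameter string for one incremental window (epoch seconds)."""
    return f"-DlastTime={last_time} -DcurrentTime={current_time}"


def render_where(where_template: str, last_time: int, current_time: int) -> str:
    """Emulate placeholder substitution to preview the SQL condition."""
    return Template(where_template).substitute(
        lastTime=last_time, currentTime=current_time
    )


if __name__ == "__main__":
    now = int(time.time())
    last = now - 3600  # assumption: the previous run covered up to one hour ago
    print(build_params(last, now))
    print(render_where(WHERE, last, now))
```

Because the window is half-open (>= lastTime, < currentTime), consecutive runs neither overlap nor skip rows as long as each run's lastTime equals the previous run's currentTime.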
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
