Step-by-Step Guide to Deploying and Using DataX‑web for Data Synchronization
This article provides a comprehensive tutorial on preparing the environment, installing DataX and DataX‑web, configuring MySQL, JDK, Maven, and Python, deploying the services on Linux, and using the web UI to create data sources, build JSON jobs, monitor execution, and manage users.
Background: Because the built-in DataWorks sync module cannot operate across mixed (external and internal) network environments, the author evaluates two alternatives for synchronizing production data, DataX-web and DolphinScheduler, and focuses here on the DataX-web deployment process.
1. Environment preparation
Install required software: MySQL (5.5+), JDK 1.8, Maven 3.6.1+, DataX, and Python 2.x (or replace the three Python scripts under datax/bin for Python 3 support).
2. Install DataX
Download the DataX tarball from http://datax-opensource.oss-cn-hangzhou.aliyuncs.com/datax.tar.gz, extract it, and run a sync job with:
$ cd {YOUR_DATAX_HOME}/bin
$ python datax.py {YOUR_JOB.json}
Verify the installation via the provided self-check script.
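For a quick smoke test before wiring up real databases, a minimal stream-to-stream job exercises the DataX engine with no external dependencies. The file path and field values below are illustrative; DataX also ships an equivalent sample job under its job/ directory.

```shell
# Write a minimal stream-to-stream DataX job (reads generated rows, prints them).
cat > /tmp/stream2stream.json <<'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 1 } },
    "content": [{
      "reader": {
        "name": "streamreader",
        "parameter": {
          "sliceRecordCount": 10,
          "column": [{ "type": "string", "value": "hello-datax" }]
        }
      },
      "writer": {
        "name": "streamwriter",
        "parameter": { "print": true }
      }
    }]
  }
}
EOF
```

Run it with `python datax.py /tmp/stream2stream.json`; a successful run ends with a task summary showing zero failed records.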
3. Install DataX‑web
Obtain the official tar package (e.g., from Baidu Cloud) or clone the source from Git and run mvn clean install to generate build/datax-web-{VERSION}.tar.gz. Extract the package:
tar -zxvf datax-web-{VERSION}.tar.gz
mv datax-web-2.1.2 datax-web
Run the one-click install script sh install.sh --force, or execute the interactive install.sh to configure the database connection, mail service, and other properties.
4. Database initialization
If MySQL is available, the installer will prompt for host, port, username, password, and database name; otherwise, manually execute /bin/db/datax-web.sql and edit modules/datax-admin/bin/env.properties accordingly.
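When configuring the connection by hand, the database section of modules/datax-admin/bin/env.properties looks roughly like the fragment below. The key names follow the datax-web 2.x packages as I recall them, so verify against your copy; the values are placeholders.

```properties
# Database connection for datax-admin (values are placeholders).
DB_HOST=127.0.0.1
DB_PORT=3306
DB_USERNAME=datax
DB_PASSWORD=your_password
DB_DATABASE=datax_web
```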
5. Configuration
Edit /modules/datax-admin/bin/env.properties for mail settings and /modules/datax-executor/bin/env.properties for PYTHON_PATH and DATAX_ADMIN_PORT. Adjust other default parameters such as server.port, executor.port, etc.
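The executor-side settings amount to telling the executor where the DataX entry script lives and where the admin service listens. A sketch of the relevant lines (paths and port are placeholders for your environment):

```properties
# Absolute path to the datax.py entry script of your DataX installation.
PYTHON_PATH=/home/hadoop/datax/bin/datax.py
# Port on which the datax-admin service is reachable (placeholder value).
DATAX_ADMIN_PORT=9527
```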
6. Service startup
Start all services with the provided script, verify the processes with jps (look for DataXAdminApplication and DataXExecutorApplication), and check the logs under modules/*/console.out. Use ./bin/start.sh -m {module_name} to start a single module, or ./bin/stop.sh -m {module_name} to stop one.
7. Cluster deployment
Ensure consistent database configuration and synchronized clocks across nodes. For an executor cluster, keep admin.addresses and executor.appname identical on every executor node.
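In datax-web 2.x these cluster settings live in the executor's configuration (application.yml under datax-executor, as I recall; check your package). A hedged sketch of the keys involved, with illustrative values:

```yaml
# Keep these identical on every executor node (values are illustrative).
datax:
  job:
    admin:
      addresses: http://192.168.1.10:9527   # same admin address list on all executors
    executor:
      appname: datax-executor               # same appname so nodes register as one group
```

Executors sharing one appname register with the same admin and form a single group, which is what allows routing and failover across nodes.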
8. Using DataX‑web
Through the web UI you can configure executors, create data sources (Hive, MySQL, Oracle, PostgreSQL, SQLServer, HBase, MongoDB, ClickHouse), build JSON job scripts, batch‑create tasks, monitor execution, view logs, and manage users.
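The JSON that the job builder generates follows DataX's reader/writer schema. A hand-written MySQL-to-MySQL example of the same shape is sketched below; hostnames, credentials, and table names are placeholders (note that mysqlreader takes jdbcUrl as a list while mysqlwriter takes a string):

```shell
# Write a MySQL-to-MySQL DataX job of the kind the web UI's job builder emits.
cat > /tmp/mysql2mysql.json <<'EOF'
{
  "job": {
    "setting": { "speed": { "channel": 2 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "src_user",
          "password": "src_pass",
          "column": ["id", "name", "updated_at"],
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://source-host:3306/src_db"],
            "table": ["orders"]
          }]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "dst_user",
          "password": "dst_pass",
          "column": ["id", "name", "updated_at"],
          "writeMode": "insert",
          "connection": [{
            "jdbcUrl": "jdbc:mysql://target-host:3306/dst_db",
            "table": ["orders"]
          }]
        }
      }
    }]
  }
}
EOF
```

Jobs created in the UI can be scheduled and monitored there; the JSON above is only to show the structure you would otherwise assemble by hand.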
9. Task execution policies
Choose a blocking strategy such as single-machine serial, discard subsequent schedules, or overwrite earlier schedules; set retry counts carefully, since retrying a partially completed job can duplicate data.
Conclusion
For modest data volumes and limited budgets, DataX‑web is a viable solution; future articles will cover DolphinScheduler integration and related troubleshooting.
Big Data Technology Architecture
Exploring Open Source Big Data and AI Technologies