Boost Data Sync Speed 8‑10×: Integrating Alibaba DataX into Spring Boot

This article explains how to replace a slow Kettle‑based ETL process with Alibaba DataX. It covers environment setup, compilation, Maven integration, Java invocation, and performance results showing an 8‑10× speedup when syncing more than a million records.

Java High-Performance Architecture

Our system keeps its base data in EBS. Synchronization originally ran through Kettle scripts, triggered both on a schedule and manually. As the data grew past a million rows, each run slowed to more than ten minutes.

Kettle was simple to set up but did not scale with this data volume, so we switched to Alibaba's open‑source DataX for faster ETL.

Development environment: Windows 10, JDK 1.8, Maven 3.6.3; runtime on Linux 3.10 with Python 2.7.5.

We cloned DataX from GitHub and built it with:

mvn -U clean package assembly:assembly -Dmaven.test.skip=true

The resulting package (~1.32 GB) was uploaded to the server.

Using the provided Python script, we generated a job template, for example:

python datax.py -r oraclereader -w oraclewriter > oracleetl.json

We then customized the template for our Oracle‑to‑Oracle sync.
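To make the customization step more concrete, here is a minimal sketch that renders an Oracle‑to‑Oracle job file from Java. The JSON field names (`username`, `password`, `column`, `connection`, `table`, `jdbcUrl`, `setting.speed.channel`) follow the layout of the generated template as we recall it, so they should be checked against your own oracleetl.json. The class names `OracleEndpoint` and `OracleJobTemplate`, and every URL, user, and table passed in, are hypothetical.

```java
import java.util.List;
import java.util.stream.Collectors;

/** One side of the sync; every value used with it here is a placeholder. */
final class OracleEndpoint {
    final String jdbcUrl;
    final String username;
    final String password;
    final String table;

    OracleEndpoint(String jdbcUrl, String username, String password, String table) {
        this.jdbcUrl = jdbcUrl;
        this.username = username;
        this.password = password;
        this.table = table;
    }
}

/** Renders a minimal oraclereader -> oraclewriter job file as a JSON string. */
final class OracleJobTemplate {
    private OracleJobTemplate() {}

    /** Quotes a value as a JSON string; escapes only backslashes and double quotes. */
    static String q(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    static String render(OracleEndpoint src, OracleEndpoint dst, List<String> columns, int channels) {
        String cols = columns.stream().map(OracleJobTemplate::q).collect(Collectors.joining(", "));
        return String.join("\n",
            "{",
            "  \"job\": {",
            "    \"setting\": {\"speed\": {\"channel\": " + channels + "}},",
            "    \"content\": [{",
            "      \"reader\": {",
            "        \"name\": \"oraclereader\",",
            "        \"parameter\": {",
            "          \"username\": " + q(src.username) + ",",
            "          \"password\": " + q(src.password) + ",",
            "          \"column\": [" + cols + "],",
            // Assumption: the reader template lists jdbcUrl as an array.
            "          \"connection\": [{\"table\": [" + q(src.table) + "], \"jdbcUrl\": [" + q(src.jdbcUrl) + "]}]",
            "        }",
            "      },",
            "      \"writer\": {",
            "        \"name\": \"oraclewriter\",",
            "        \"parameter\": {",
            "          \"username\": " + q(dst.username) + ",",
            "          \"password\": " + q(dst.password) + ",",
            "          \"column\": [" + cols + "],",
            // Assumption: the writer template takes jdbcUrl as a plain string.
            "          \"connection\": [{\"table\": [" + q(dst.table) + "], \"jdbcUrl\": " + q(dst.jdbcUrl) + "}]",
            "        }",
            "      }",
            "    }]",
            "  }",
            "}");
    }
}
```

The rendered string can be written to the file that is later passed to DataX with `-job`. Generating the file from code is only one option; editing the template by hand, as we did, works just as well for a single fixed job.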

DataX compilation output

In the Spring Boot project we added the required DataX JARs as system‑scoped dependencies:

<dependency>
    <groupId>com.alibaba.datax</groupId>
    <artifactId>datax-common</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/src/main/webapp/WEB-INF/lib/datax-common-0.0.1-SNAPSHOT.jar</systemPath>
</dependency>
... (other dependencies)

Because DataX artifacts are not in public Maven repositories, we referenced the local JARs directly.

We invoked DataX from Java code:

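import com.alibaba.datax.core.Engine;

// datax.home must point at the unpacked DataX distribution on the server
System.setProperty("datax.home", DATAX_HOME);
String[] args = {"-job", DATAX_HOME + "/etl_itemdata.json", "-mode", "standalone", "-jobid", "-1"};
Engine.entry(args); // declared to throw Throwable, so the caller must catch or propagate it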
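Because the sync can be triggered both on a schedule and manually, it may help to wrap the call above in a small launcher. The following is a minimal sketch, not the code we ran: `DataxEntry` and `DataxJobLauncher` are hypothetical names, and `DataxEntry` is only a stand‑in interface so that the wrapper compiles and can be tested without the DataX JARs. In the real project you would pass `Engine::entry` as the entry point.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/** Stand-in for the static Engine.entry(String[]) call (hypothetical interface). */
interface DataxEntry {
    void run(String[] args) throws Throwable;
}

/** Builds the argument array shown above and skips a run if one is already in progress. */
final class DataxJobLauncher {
    private final String dataxHome;
    private final DataxEntry entry;
    private final AtomicBoolean running = new AtomicBoolean(false);

    DataxJobLauncher(String dataxHome, DataxEntry entry) {
        this.dataxHome = dataxHome;
        this.entry = entry;
    }

    /** Same arguments as in the article, with the job file name made a parameter. */
    static String[] buildArgs(String dataxHome, String jobFile) {
        return new String[] {"-job", dataxHome + "/" + jobFile, "-mode", "standalone", "-jobid", "-1"};
    }

    /** Returns true if the job completed, false if it was skipped or failed. */
    boolean launch(String jobFile) {
        if (!running.compareAndSet(false, true)) {
            return false; // another trigger is already running this launcher; skip
        }
        try {
            System.setProperty("datax.home", dataxHome);
            entry.run(buildArgs(dataxHome, jobFile));
            return true;
        } catch (Throwable t) {
            // Catching Throwable mirrors the signature of the wrapped call.
            System.err.println("DataX job " + jobFile + " failed: " + t);
            return false;
        } finally {
            running.set(false);
        }
    }
}
```

With the DataX JARs on the classpath, usage would look like `new DataxJobLauncher(DATAX_HOME, Engine::entry).launch("etl_itemdata.json")`. Skipping an overlapping run is a design choice for this sketch; queuing the second request would be an equally valid policy.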

Running the job produced the following log, showing that 1,335,288 records were synchronized in 60 seconds, achieving an average throughput of 2.33 MB/s and 22,254 records/s.

DataX execution log
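The figures quoted from the log can be cross‑checked with a few lines of arithmetic. This sketch uses only the numbers stated above; the class name `ThroughputCheck` is hypothetical, and the bytes‑per‑record estimate assumes that 1 MB means 1024 × 1024 bytes.

```java
/** Re-derives the rates quoted from the DataX log; inputs are the numbers stated in the text. */
final class ThroughputCheck {
    private ThroughputCheck() {}

    /** Whole records per second; integer division reproduces the quoted 22,254. */
    static long recordsPerSecond(long records, long seconds) {
        return records / seconds;
    }

    /** Average payload size per record implied by the two rates. */
    static double bytesPerRecord(double bytesPerSecond, long recordsPerSecond) {
        return bytesPerSecond / recordsPerSecond;
    }

    public static void main(String[] args) {
        long rate = recordsPerSecond(1_335_288L, 60L);
        // Assumption: 1 MB = 1024 * 1024 bytes.
        double avgBytes = bytesPerRecord(2.33 * 1024 * 1024, rate);
        System.out.println(rate + " records/s, about " + Math.round(avgBytes) + " bytes per record");
    }
}
```

The record rate matches the log, and the two rates together imply rows of roughly 110 bytes each, which is plausible for base data with a handful of columns.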

Compared with Kettle, the sync speed increased by 8‑10×, meeting business requirements.
