Boost Data Sync Speed 8‑10×: Integrating Alibaba DataX into Spring Boot
This article explains how to replace a slow Kettle‑based ETL process with Alibaba DataX. It covers environment setup, compilation, Maven integration, Java invocation, and performance results showing an 8‑10× speedup when syncing more than 1.3 million records.
Our system stores its base data in EBS. We originally synced it with Kettle scripts, run both on a schedule and manually, but as the data grew past a million rows each run took more than ten minutes.
Kettle was simple to use but did not scale with our data volume, so we switched to Alibaba's open‑source DataX for faster ETL.
Development environment: Windows 10, JDK 1.8, Maven 3.6.3. Runtime environment: Linux (kernel 3.10) with Python 2.7.5.
We cloned DataX from GitHub and built it with:
mvn -U clean package assembly:assembly -Dmaven.test.skip=true
The resulting package (~1.32 GB) was uploaded to the server.
Using the provided Python script, we generated a job template:
python datax.py -r oraclereader -w oraclewriter > oracleetl.json
We then customized the template for our Oracle‑to‑Oracle sync.
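As a rough illustration of what the customized template might contain, the sketch below assembles an Oracle‑to‑Oracle job document in Java. The field layout mirrors the general shape of the reader/writer templates as we understand them; the file that datax.py generates is the authoritative reference. The class name, JDBC URLs, table, columns, credential placeholders, and channel count are all hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class OracleJobTemplateSketch {

    /** Renders strings as a JSON array, e.g. ["ID","NAME"]. Values are not escaped. */
    static String jsonArray(List<String> values) {
        return values.stream()
                .map(v -> "\"" + v + "\"")
                .collect(Collectors.joining(",", "[", "]"));
    }

    /** Builds a job document that copies one table between two Oracle databases. */
    static String buildJob(String sourceUrl, String targetUrl, String table,
                           List<String> columns, int channels) {
        String cols = jsonArray(columns);
        String tables = jsonArray(Arrays.asList(table));
        // Assumption: the reader takes a list of JDBC URLs, the writer a single URL.
        String readerUrls = jsonArray(Arrays.asList(sourceUrl));
        return String.join("\n",
            "{",
            "  \"job\": {",
            "    \"setting\": {\"speed\": {\"channel\": " + channels + "}},",
            "    \"content\": [{",
            "      \"reader\": {",
            "        \"name\": \"oraclereader\",",
            "        \"parameter\": {",
            "          \"username\": \"SRC_USER\",",      // placeholder credential
            "          \"password\": \"SRC_PASSWORD\",",  // placeholder credential
            "          \"column\": " + cols + ",",
            "          \"connection\": [{\"table\": " + tables
                    + ", \"jdbcUrl\": " + readerUrls + "}]",
            "        }",
            "      },",
            "      \"writer\": {",
            "        \"name\": \"oraclewriter\",",
            "        \"parameter\": {",
            "          \"username\": \"DST_USER\",",      // placeholder credential
            "          \"password\": \"DST_PASSWORD\",",  // placeholder credential
            "          \"column\": " + cols + ",",
            "          \"connection\": [{\"table\": " + tables
                    + ", \"jdbcUrl\": \"" + targetUrl + "\"}]",
            "        }",
            "      }",
            "    }]",
            "  }",
            "}");
    }

    public static void main(String[] args) {
        // Hypothetical hosts, table, and columns, for illustration only.
        System.out.println(buildJob(
                "jdbc:oracle:thin:@//src-host:1521/SRC",
                "jdbc:oracle:thin:@//dst-host:1521/DST",
                "ITEM_DATA", Arrays.asList("ID", "NAME"), 4));
    }
}
```

In practice the JSON file is usually edited by hand. Generating it in code is only worthwhile when many tables share the same layout.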
In the Spring Boot project we added the required DataX JARs as system‑scoped dependencies:
<dependency>
    <groupId>com.alibaba.datax</groupId>
    <artifactId>datax-common</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <scope>system</scope>
    <systemPath>${project.basedir}/src/main/webapp/WEB-INF/lib/datax-common-0.0.1-SNAPSHOT.jar</systemPath>
</dependency>
... (other dependencies)
Because DataX artifacts are not in public Maven repositories, we referenced the local JARs directly.
We invoked DataX from Java code:
import com.alibaba.datax.core.Engine;
// DATAX_HOME is the DataX installation directory; set it before the engine starts
System.setProperty("datax.home", DATAX_HOME);
String[] args = {"-job", DATAX_HOME + "/etl_itemdata.json", "-mode", "standalone", "-jobid", "-1"};
Engine.entry(args); // declared to throw Throwable, so the caller must handle or propagate it
The job log showed that 1,335,288 records were synchronized in 60 seconds, an average throughput of 2.33 MB/s and 22,254 records/s.
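The Java invocation shown earlier can be wrapped in a small helper so that a failed job does not propagate a raw Throwable into the calling service. The following is a minimal sketch, not the code used in the project: the class and method names are hypothetical, and reflection is used only so the example compiles without the DataX JARs. With the JARs on the classpath, calling Engine.entry(args) directly is simpler.

```java
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;

final class DataXLauncherSketch {

    /** Builds the same argument vector the article passes to Engine.entry. */
    static String[] buildArgs(String dataxHome, String jobFile) {
        return new String[] {
            "-job", dataxHome + "/" + jobFile,
            "-mode", "standalone",
            "-jobid", "-1"
        };
    }

    /** Runs one job; returns true on success, false if the job or the lookup failed. */
    static boolean run(String dataxHome, String jobFile) {
        System.setProperty("datax.home", dataxHome);
        try {
            // Looks up com.alibaba.datax.core.Engine.entry(String[]) at runtime.
            Class<?> engine = Class.forName("com.alibaba.datax.core.Engine");
            Method entry = engine.getMethod("entry", String[].class);
            // The cast passes the array as one argument rather than as varargs.
            entry.invoke(null, (Object) buildArgs(dataxHome, jobFile));
            return true;
        } catch (InvocationTargetException e) {
            // The engine was found, but the job itself threw.
            System.err.println("DataX job failed: " + e.getCause());
            return false;
        } catch (ReflectiveOperationException e) {
            // The DataX JARs are not on the classpath, or the method is missing.
            System.err.println("DataX engine not available: " + e);
            return false;
        }
    }

    public static void main(String[] args) {
        // "/opt/datax" is a hypothetical installation path.
        boolean ok = run("/opt/datax", "etl_itemdata.json");
        System.out.println(ok ? "sync finished" : "sync failed");
    }
}
```

Returning a boolean keeps the sketch short. A real service would more likely log the cause and surface it to whatever scheduler or endpoint triggered the sync.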
Compared with Kettle, the sync speed increased by 8‑10×, meeting business requirements.
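The reported figures can be checked with simple arithmetic. The sketch below recomputes the record throughput from the logged totals and estimates the speedup against a 600‑second Kettle run. That baseline is an assumption taken from the "more than ten minutes" figure above, so the real ratio depends on the actual Kettle run time.

```java
final class ThroughputCheck {

    /** Whole records per second, truncated the way an integer counter would report it. */
    static long recordsPerSecond(long records, long seconds) {
        return records / seconds;
    }

    /** Ratio of the old run time to the new run time. */
    static double speedup(double oldSeconds, double newSeconds) {
        return oldSeconds / newSeconds;
    }

    public static void main(String[] args) {
        long records = 1_335_288L; // from the DataX job log
        long seconds = 60L;        // from the DataX job log
        System.out.println(recordsPerSecond(records, seconds)); // prints 22254

        double kettleSeconds = 600.0; // assumed baseline: ten minutes
        System.out.println(speedup(kettleSeconds, seconds));    // prints 10.0
    }
}
```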
Java High-Performance Architecture
