How Apache SeaTunnel Redefines Data Integration for Modern Data Platforms
This article reviews the evolution of data‑integration architectures toward EtLT, explains the core capabilities of Apache SeaTunnel, and details how a Chinese data‑platform vendor applied and extended SeaTunnel to simplify batch and streaming ingestion, unify multi‑engine processing, and reduce development and operational costs.
Background
Data integration has evolved from classic ETL/ELT to an EtLT model, in which a lightweight "little t" step standardises extracted data before loading, and a final "big T" step applies business-logic transformations inside the target system. This model suits organisations that need high-quality, near-real-time data assets drawn from heterogeneous sources.
Apache SeaTunnel Overview
Apache SeaTunnel (formerly Waterdrop) is an open‑source, distributed EtLT framework that supports high‑volume batch and real‑time synchronization. A SeaTunnel job consists of three plugin types (a minimal example follows the list below):
Source connectors – parallel data extraction.
Transform connectors – optional data processing.
Sink connectors – writing to target systems.
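As a concrete illustration, here is a minimal sketch of a job definition in SeaTunnel's HOCON configuration format, wiring a built‑in FakeSource through a Filter transform into a Console sink. Exact option names vary between SeaTunnel releases, so treat this as indicative rather than definitive.

env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  FakeSource {
    result_table_name = "users"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

transform {
  # Keep only the listed fields; a stand-in for the lightweight "t" step
  Filter {
    source_table_name = "users"
    result_table_name = "users_clean"
    fields = [name, age]
  }
}

sink {
  Console {
    source_table_name = "users_clean"
  }
}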
Key capabilities:
More than 100 built‑in connectors (e.g., ClickHouse, Doris, HDFS, FTP/SFTP, Hive).
Batch‑stream unification: the same job definition can run in pure streaming mode (e.g., on Flink) or micro‑batch mode (e.g., on Spark); see the sketch after this list.
Multi‑engine support: Flink, Spark, and SeaTunnel’s native Zeta engine can be swapped without code changes.
Parallel connector execution, distributed snapshots, two‑phase commit and idempotent writes provide high throughput and exactly‑once semantics.
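Switching between batch and streaming is a one‑line change in the env block, while the engine itself is chosen at submission time rather than in the job file. The submission script names below are assumptions that differ between SeaTunnel releases; the env options follow common SeaTunnel conventions.

# Engine choice happens at submission time, not in the job file, e.g.
# (script names differ between SeaTunnel releases):
#   Zeta:  bin/seatunnel.sh --config job.conf
#   Flink: bin/start-seatunnel-flink-13-connector-v2.sh --config job.conf
#   Spark: bin/start-seatunnel-spark-3-connector-v2.sh --config job.conf

env {
  job.mode = "STREAMING"        # or "BATCH"; sources and sinks stay unchanged
  checkpoint.interval = 10000   # snapshot interval in ms, used for exactly-once recovery
}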
Integration in AISWare DataOS
DataOS originally combined DataX (single‑node batch) with Flink CDC and Filebeat (streaming), leading to two major issues:
DataX required custom distributed scheduling because it runs on a single node.
Maintaining three technology stacks (DataX, Spark, Flink CDC) increased development and operational complexity.
To resolve these problems, the team introduced SeaTunnel and took the following steps:
Removed manual resource allocation – SeaTunnel's native distributed execution eliminated the separate resource‑allocation layer that DataX required.
Technology‑stack replacement – Replaced DataX and Spark batch processing with SeaTunnel’s Zeta engine; replaced Flink CDC with SeaTunnel’s native streaming connectors.
Componentised SeaTunnel connectors – Built a visual DAG editor that converts front‑end form inputs into SeaTunnel JSON job definitions, improving readability and usability (an example of a generated definition follows this list).
Log monitoring integration – Added client‑side listeners to capture task status, data volume, QPS and other metrics for full observability.
Engine‑mix development – Enabled mixed‑engine DAGs so a single job can combine SQL‑engine and DP‑engine tasks.
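To illustrate the kind of definition such an editor might emit, here is a hypothetical sketch in SeaTunnel's JSON configuration style; the connection details are invented, and the option names follow the Jdbc source and Console sink conventions.

{
  "env": {
    "job.mode": "BATCH",
    "parallelism": 2
  },
  "source": [
    {
      "plugin_name": "Jdbc",
      "result_table_name": "orders",
      "url": "jdbc:mysql://mysql-host:3306/demo",
      "driver": "com.mysql.cj.jdbc.Driver",
      "user": "demo_user",
      "password": "demo_password",
      "query": "SELECT id, amount, created_at FROM orders"
    }
  ],
  "sink": [
    {
      "plugin_name": "Console",
      "source_table_name": "orders"
    }
  ]
}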
Connector Optimisations and Extensions
Hive connector – Switched metadata retrieval from the Metastore URI (MetaURL) to HiveServer2 JDBC, sidestepping security restrictions on direct Metastore access (see the first sketch after this list).
HDFS connector – Added recursive directory scanning, regex‑based file matching, and support for RCFile, SequenceFile, XML and JSON formats (see the second sketch after this list).
FTP/SFTP connectors – Fixed I/O leaks, improved connection caching, and ensured per‑IP credential isolation.
Custom database support – Implemented read/write for the domestic HanGao database, including row‑to‑column transformations and UDFs for data masking.
Two‑phase commit for file writes – Enforced 777 permissions on temporary directories and required the temporary and target directories to reside on the same filesystem, because the final commit relies on a rename that is only atomic within a single filesystem.
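A sketch of what the modified Hive source might look like; the hive_jdbc_url option is hypothetical, standing in for whatever name the team chose for the HiveServer2 endpoint.

source {
  Hive {
    table_name = "ods.user_events"
    # Stock connector reads metadata via the Metastore:
    # metastore_uri = "thrift://metastore-host:9083"
    # The modified connector goes through HiveServer2 JDBC instead
    # (option name below is assumed, not the upstream API):
    hive_jdbc_url = "jdbc:hive2://hs2-host:10000/ods"
  }
}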
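Similarly, a sketch of the extended HdfsFile source; file_filter_pattern appears in recent SeaTunnel releases, while the recursive flag is assumed here to represent the recursive‑scanning extension described above.

source {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode:8020"
    path = "/data/landing"
    file_format_type = "json"            # the extension also covers RCFile, SequenceFile and XML
    file_filter_pattern = ".*\\.json$"   # regex match on file names
    recursive = true                     # hypothetical flag for recursive directory scanning
  }
}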
Version Management and Community Collaboration
The team follows a branch‑based workflow:
Create a local branch based on the current SeaTunnel release.
Periodically merge upstream changes to stay up‑to‑date.
Contribute custom enhancements (connectors, bug fixes) back to the open‑source community, reducing long‑term maintenance effort.
Results and Benefits
Unified multi‑engine capability – Batch and streaming processing are consolidated under a single platform, lowering the learning curve for business users.
Simplified resource management – Manual distribution of DataX tasks is eliminated, improving scheduling efficiency and stability.
Reduced R&D and O&M costs – A single, componentised architecture reduces the need for specialised expertise across multiple stacks.
The platform now supports seamless scaling, easier maintenance, and faster onboarding of new data‑integration scenarios.
Conclusion
EtLT has become the de facto architecture for modern data integration. As enterprises adopt hybrid cloud, AI‑driven analytics, and emerging concepts such as Zero‑ETL, data fabric and data virtualisation, continuous innovation in frameworks like SeaTunnel, combined with open‑source collaboration, will drive the next wave of integration capabilities.