How Apache SeaTunnel Redefines Data Integration for Modern Data Platforms
This article reviews the evolution of data‑integration architectures toward EtLT, explains the core capabilities of Apache SeaTunnel, and details how a Chinese data‑platform vendor applied and extended SeaTunnel to simplify batch and streaming ingestion, unify multi‑engine processing, and reduce development and operational costs.
Background
Data integration has evolved from classic ETL/ELT to an EtLT model, in which a lightweight "little t" step standardises extracted data before loading, and a final "big T" step applies business-logic transformations inside the target system. This model suits organisations that need high-quality, near-real-time data assets drawn from heterogeneous sources.
Apache SeaTunnel Overview
Apache SeaTunnel (formerly Waterdrop) is an open‑source, distributed EtLT framework that supports high‑volume batch and real‑time synchronization. A SeaTunnel job consists of three plugin types (a minimal example follows the list below):
Source connectors – parallel data extraction.
Transform connectors – optional data processing.
Sink connectors – writing to target systems.
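As a concrete illustration, here is a minimal sketch of a job definition in SeaTunnel's HOCON configuration format, wiring a built‑in FakeSource through a Filter transform into a Console sink. Exact option names vary between SeaTunnel releases, so treat this as indicative rather than definitive.

env {
  parallelism = 2
  job.mode = "BATCH"
}

source {
  FakeSource {
    result_table_name = "users"
    row.num = 16
    schema = {
      fields {
        name = "string"
        age = "int"
      }
    }
  }
}

transform {
  # Keep only the listed fields; a stand-in for the lightweight "t" step
  Filter {
    source_table_name = "users"
    result_table_name = "users_clean"
    fields = [name, age]
  }
}

sink {
  Console {
    source_table_name = "users_clean"
  }
}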
Key capabilities:
More than 100 built‑in connectors (e.g., ClickHouse, Doris, HDFS, FTP/SFTP, Hive).
Batch‑stream unification: the same job definition can run in pure streaming mode (e.g., on Flink) or micro‑batch mode (e.g., on Spark); see the sketch after this list.
Multi‑engine support: Flink, Spark, and SeaTunnel’s native Zeta engine can be swapped without code changes.
Parallel connector execution, distributed snapshots, two‑phase commit and idempotent writes provide high throughput and exactly‑once semantics.
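Switching between batch and streaming is a one‑line change in the env block, while the engine itself is chosen at submission time rather than in the job file. The submission script names below are assumptions that differ between SeaTunnel releases; the env options follow common SeaTunnel conventions.

# Engine choice happens at submission time, not in the job file, e.g.
# (script names differ between SeaTunnel releases):
#   Zeta:  bin/seatunnel.sh --config job.conf
#   Flink: bin/start-seatunnel-flink-13-connector-v2.sh --config job.conf
#   Spark: bin/start-seatunnel-spark-3-connector-v2.sh --config job.conf

env {
  job.mode = "STREAMING"        # or "BATCH"; sources and sinks stay unchanged
  checkpoint.interval = 10000   # snapshot interval in ms, used for exactly-once recovery
}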
Integration in AISWare DataOS
DataOS originally combined DataX (single‑node batch) with Flink CDC and Filebeat (streaming), leading to two major issues:
DataX required custom distributed scheduling because it runs on a single node.
Maintaining three technology stacks (DataX, Spark, Flink CDC) increased development and operational complexity.
To resolve these problems, the team introduced SeaTunnel and took the following steps:
Removed manual resource allocation – SeaTunnel's native distributed execution eliminated the separate resource‑allocation layer that DataX required.
Technology‑stack replacement – Replaced DataX and Spark batch processing with SeaTunnel’s Zeta engine; replaced Flink CDC with SeaTunnel’s native streaming connectors.
Componentised SeaTunnel connectors – Built a visual DAG editor that converts front‑end form inputs into SeaTunnel JSON job definitions, improving readability and usability (an example of a generated definition follows this list).
Log monitoring integration – Added client‑side listeners to capture task status, data volume, QPS and other metrics for full observability.
Engine‑mix development – Enabled mixed‑engine DAGs so a single job can combine SQL‑engine and DP‑engine tasks.
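To illustrate the kind of definition such an editor might emit, here is a hypothetical sketch in SeaTunnel's JSON configuration style; the connection details are invented, and the option names follow the Jdbc source and Console sink conventions.

{
  "env": {
    "job.mode": "BATCH",
    "parallelism": 2
  },
  "source": [
    {
      "plugin_name": "Jdbc",
      "result_table_name": "orders",
      "url": "jdbc:mysql://mysql-host:3306/demo",
      "driver": "com.mysql.cj.jdbc.Driver",
      "user": "demo_user",
      "password": "demo_password",
      "query": "SELECT id, amount, created_at FROM orders"
    }
  ],
  "sink": [
    {
      "plugin_name": "Console",
      "source_table_name": "orders"
    }
  ]
}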
Connector Optimisations and Extensions
Hive connector – Switched metadata retrieval from the Metastore URI (MetaURL) to HiveServer2 JDBC, sidestepping security restrictions on direct Metastore access (see the first sketch after this list).
HDFS connector – Added recursive directory scanning, regex‑based file matching, and support for RCFile, SequenceFile, XML and JSON formats (see the second sketch after this list).
FTP/SFTP connectors – Fixed I/O leaks, improved connection caching, and ensured per‑IP credential isolation.
Custom database support – Implemented read/write for the domestic HanGao database, including row‑to‑column transformations and UDFs for data masking.
Two‑phase commit for file writes – Enforced 777 permissions on temporary directories and required the temporary and target directories to reside on the same filesystem, because the final commit relies on a rename that is only atomic within a single filesystem.
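A sketch of what the modified Hive source might look like; the hive_jdbc_url option is hypothetical, standing in for whatever name the team chose for the HiveServer2 endpoint.

source {
  Hive {
    table_name = "ods.user_events"
    # Stock connector reads metadata via the Metastore:
    # metastore_uri = "thrift://metastore-host:9083"
    # The modified connector goes through HiveServer2 JDBC instead
    # (option name below is assumed, not the upstream API):
    hive_jdbc_url = "jdbc:hive2://hs2-host:10000/ods"
  }
}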
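Similarly, a sketch of the extended HdfsFile source; file_filter_pattern appears in recent SeaTunnel releases, while the recursive flag is assumed here to represent the recursive‑scanning extension described above.

source {
  HdfsFile {
    fs.defaultFS = "hdfs://namenode:8020"
    path = "/data/landing"
    file_format_type = "json"            # the extension also covers RCFile, SequenceFile and XML
    file_filter_pattern = ".*\\.json$"   # regex match on file names
    recursive = true                     # hypothetical flag for recursive directory scanning
  }
}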
Version Management and Community Collaboration
The team follows a branch‑based workflow:
Create a local branch based on the current SeaTunnel release.
Periodically merge upstream changes to stay up‑to‑date.
Contribute custom enhancements (connectors, bug fixes) back to the open‑source community, reducing long‑term maintenance effort.
Results and Benefits
Unified multi‑engine capability – Batch and streaming processing are consolidated under a single platform, lowering the learning curve for business users.
Simplified resource management – Manual distribution of DataX tasks is eliminated, improving scheduling efficiency and stability.
Reduced R&D and O&M costs – A single, componentised architecture reduces the need for specialised expertise across multiple stacks.
The platform now supports seamless scaling, easier maintenance, and faster onboarding of new data‑integration scenarios.
Conclusion
EtLT has become the de facto architecture for modern data integration. As enterprises adopt hybrid cloud, AI‑driven analytics, and emerging concepts such as Zero‑ETL, data fabric and data virtualisation, continuous innovation in frameworks like SeaTunnel, combined with open‑source collaboration, will drive the next wave of integration capabilities.