
Redesigning Apache SeaTunnel: Decoupling Source and Sink APIs for Multi‑Engine Support

The presentation details the motivations, goals, and architectural redesign of Apache SeaTunnel (Incubating) to decouple its Source and Sink APIs from underlying engines, introducing unified APIs, version‑agnostic connectors, and enhanced support for Spark and Flink in both batch and streaming scenarios.

DataFunTalk

The speaker, Li Zongwen, a senior engineer at Baijian Open Source, identified four major issues in Apache SeaTunnel (Incubating): duplicated connector implementations, inconsistent parameters, difficulty supporting multiple engine versions, and challenging engine upgrades.

To address these problems, the redesign focuses on decoupling SeaTunnel from specific computation engines by refactoring the Source and Sink APIs, aiming to improve developer experience and enable unified, version‑agnostic connectors.

The redesign is organized into five parts: background and motivation, goals, overall design, Source API design, and Sink API design.

Background and Motivation: SeaTunnel is tightly coupled with Spark and Flink, inheriting their configuration parameters. This coupling leads to duplicated connector code, inconsistent parameters across engines, and difficulty supporting multiple engine versions, especially for users employing Lambda architectures with both batch (Spark) and real‑time (Flink) jobs.

Goals: Implement connectors only once, provide a unified Source and Sink API, support multiple Spark/Flink versions via a translation layer, clarify parallelism and commit logic, enable real‑time CDC for whole‑database sync, and automate metadata discovery and storage.

Overall Design: Introduces a translation layer that maps the SeaTunnel Source/Sink APIs onto each engine's native connector interfaces, and an execution flow that parses task parameters, resolves table schemas from a catalog, loads SeaTunnel connectors via SPI, translates them, and runs them on the chosen engine.
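The SPI-based loading step can be sketched in plain Java with `ServiceLoader`; the `SeaTunnelSource` interface and `pluginName()` method below are illustrative stand-ins, not the actual SeaTunnel API.

```java
import java.util.ServiceLoader;

// Illustrative connector interface; the real SeaTunnel API differs.
interface SeaTunnelSource {
    String pluginName();
}

final class ConnectorDiscovery {
    // Discover every SeaTunnelSource registered on the classpath via
    // META-INF/services, then pick the one named in the job config.
    static SeaTunnelSource load(String pluginName) {
        for (SeaTunnelSource source : ServiceLoader.load(SeaTunnelSource.class)) {
            if (source.pluginName().equals(pluginName)) {
                return source;
            }
        }
        throw new IllegalArgumentException("No connector found for: " + pluginName);
    }
}
```

Because discovery goes through SPI rather than engine-specific factories, the same connector jar can be handed to the Spark or Flink translation layer unchanged.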

Source API Design: Requirements include a unified offline/real‑time API, parallel reading (e.g., per‑partition readers), dynamic shard addition, coordinated reader management for CDC scenarios, and support for a single reader handling multiple tables. The design proposes a parallel source with an enumerator for shard discovery and a boundedness flag to distinguish batch from streaming.
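As a minimal illustration of the enumerator-plus-parallel-readers idea (the names below are hypothetical, not the real SeaTunnel interfaces), a per-partition enumerator might deal discovered splits out round-robin across the parallel readers:

```java
import java.util.ArrayList;
import java.util.List;

// Distinguishes batch (finite splits) from streaming (unbounded) reads.
enum Boundedness { BOUNDED, UNBOUNDED }

// One split per partition; each reader consumes its splits independently.
final class PartitionSplit {
    final int partition;
    PartitionSplit(int partition) { this.partition = partition; }
}

final class PartitionEnumerator {
    // Discover one split per partition and assign them round-robin
    // across `parallelism` readers, mirroring per-partition parallel reads.
    static List<List<PartitionSplit>> assign(int partitions, int parallelism) {
        List<List<PartitionSplit>> perReader = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            perReader.add(new ArrayList<>());
        }
        for (int p = 0; p < partitions; p++) {
            perReader.get(p % parallelism).add(new PartitionSplit(p));
        }
        return perReader;
    }
}
```

Keeping the enumerator as a distinct, centrally running component is what makes dynamic shard addition and CDC reader coordination possible: newly discovered splits go through the same assignment path at runtime.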

Sink API Design: Identifies three key needs: idempotent writes, distributed two‑phase commit, and aggregated commit for storage engines like Iceberg or Hudi. Corresponding APIs—SinkWriter, SinkCommitter, and SinkAggregatedCommitter—are defined, with the aggregated commit running with a single parallelism on the driver and writers/committers executing on workers.
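A hedged sketch of how the three roles could fit together (the interface shapes below are simplified guesses, not the exact SeaTunnel signatures): writers stage data and emit commit info, committers finalize per task, and a single aggregated committer merges everything into one atomic commit.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified role shapes; the real SeaTunnel interfaces carry more
// generics, checkpoint IDs, and lifecycle methods.
interface SinkWriter<T, CommitInfoT> {
    void write(T element);        // phase 1: stage data (idempotently)
    CommitInfoT prepareCommit();  // hand commit info to the committer
}

interface SinkCommitter<CommitInfoT> {
    void commit(List<CommitInfoT> infos);  // phase 2: per-task commit
}

interface SinkAggregatedCommitter<CommitInfoT, AggregatedT> {
    AggregatedT combine(List<CommitInfoT> infos);  // merge on the driver
    void commit(AggregatedT aggregated);           // one atomic commit
}

// Toy aggregator: each writer reports the files it staged; the driver
// merges them so a table format like Iceberg or Hudi commits exactly once.
final class FileListAggregator
        implements SinkAggregatedCommitter<List<String>, List<String>> {
    @Override
    public List<String> combine(List<List<String>> infos) {
        List<String> all = new ArrayList<>();
        for (List<String> files : infos) {
            all.addAll(files);
        }
        return all;
    }

    @Override
    public void commit(List<String> aggregated) {
        // A single metadata/table commit would happen here.
    }
}
```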

Engine Adaptation: For Spark and newer Flink versions, the aggregated commit can run on the driver. For older Flink versions, the design uses ProcessFunction wrappers to achieve two‑phase commit and aggregated commit logic while ensuring single‑parallel execution for the aggregator.
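The wrappers themselves depend on Flink internals, but the single-parallel aggregation they enforce can be illustrated in standalone Java (this is a sketch of the idea, not Flink code): many writers emit commit info concurrently, and exactly one aggregator drains and commits all of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Stand-alone illustration of the parallelism-1 aggregator: N concurrent
// "writers" produce commit info, and a single aggregator consumes all of
// it before performing one commit.
final class SingleAggregatorDemo {
    static List<String> run(int writers) {
        BlockingQueue<String> commitInfos = new LinkedBlockingQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(writers);
        for (int i = 0; i < writers; i++) {
            final int id = i;
            pool.submit(() -> commitInfos.add("staged-file-" + id));
        }
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }

        // The lone aggregator sees every writer's commit info exactly once.
        List<String> aggregated = new ArrayList<>();
        commitInfos.drainTo(aggregated);
        return aggregated;  // a single atomic commit would follow
    }
}
```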

The speaker concludes by inviting the community to review the draft code in the SeaTunnel repository and contribute to the project.

Big Data, Flink, data integration, Spark, Apache SeaTunnel, Sink API, Source API
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
