How SeaTunnel’s StarRocks Connector Enables High‑Performance Data Sync
This article explains SeaTunnel’s architecture and its StarRocks connector, detailing source and sink features such as field projection, predicate push‑down, parallel reading, state recovery, data type mapping, Stream Load writes, CDC support, configuration examples, and the roadmap toward exactly‑once semantics.
SeaTunnel Overview
SeaTunnel is a distributed, high‑performance, extensible data integration platform for massive offline and real‑time data synchronization and transformation. It provides abstract Source, Transform, and Sink APIs, allowing various connectors to be built.
StarRocks Connector Features
Source Features
Field projection: read only selected columns to reduce the volume of data transferred.
Predicate push‑down: filter rows during the scan via scan_filter to minimize data transferred.
Automatic type mapping: maps StarRocks data types to SeaTunnel’s internal types.
User‑defined sharding: control how many tablets each split covers via request_tablet_size; smaller values produce more splits and higher read parallelism.
Parallel read: reads data from multiple BE nodes concurrently.
State recovery: snapshots split information for fault‑tolerant restart, providing at‑least‑once semantics.
Batch mode: currently only batch reads are supported.
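Putting the source features together, a job’s source block might look like the following sketch. Option names such as scan_filter and request_tablet_size follow the SeaTunnel StarRocks connector conventions; hosts, credentials, and column names are placeholders, and exact option names may vary by connector version:

```hocon
source {
  StarRocks {
    nodeUrls = ["fe_host:8030"]       # FE HTTP endpoints (placeholder)
    username = "root"
    password = ""
    database = "demo"
    table    = "customer_1"
    # Field projection: only the declared columns are read
    schema {
      fields {
        c_custkey = BIGINT
        c_name    = STRING
      }
    }
    # Predicate push-down: filter applied during the BE scan
    scan_filter = "c_custkey > 100"
    # User-defined sharding: tablets per split; smaller => more splits
    request_tablet_size = 10
  }
}
```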
Sink Features
Uses StarRocks Stream Load API, supporting CSV and JSON formats.
Batch writes: buffers data in memory and flushes when size or count thresholds are met, or on a timed interval.
Retry logic: configurable retry count and back‑off interval.
“Too many tablet versions” errors can be mitigated by tuning BE compaction settings or by raising the write‑batch thresholds so the sink issues fewer, larger loads.
CDC support: can ingest changelog events (INSERT, UPDATE_BEFORE, UPDATE_AFTER, DELETE) when enable_upsert_delete=true.
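The sink features above map onto a config block roughly like this sketch. Option names follow the SeaTunnel StarRocks sink documentation, but exact names and defaults differ between versions, so treat them as assumptions to verify:

```hocon
sink {
  StarRocks {
    nodeUrls = ["fe_host:8030"]
    username = "root"
    password = ""
    database = "demo"
    table    = "customer_2"
    # Buffered writes: flush when either threshold is hit, or on the timer
    batch_max_rows    = 64000
    batch_max_bytes   = 5242880      # 5 MB
    batch_interval_ms = 5000
    # Retry failed Stream Load requests
    max_retries = 3
    # CDC mode: interpret INSERT/UPDATE_BEFORE/UPDATE_AFTER/DELETE row kinds
    enable_upsert_delete = true
    # Extra Stream Load parameters, e.g. the load format
    starrocks.config {
      format = "JSON"
      strip_outer_array = true
    }
  }
}
```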
Reading Mechanism
Two parallel‑read strategies exist: (1) JDBC via the FE node (low throughput) and (2) a distributed design that queries the FE for tablet metadata, then reads directly from multiple BE nodes. SeaTunnel implements the second, high‑throughput approach.
The FE provides a query‑plan API returning tablet IDs and their BE locations. The connector builds split objects based on this distribution, assigns splits to readers using modulo of split IDs, and each reader scans assigned tablets via Thrift.
Data returned in Apache Arrow format is converted to SeaTunnelRow; VARCHAR values are converted to Date, Timestamp, or String as the target schema requires.
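The split‑to‑reader assignment can be illustrated with a small sketch. This is not the connector’s actual code (which is Java); it only shows the modulo policy described above:

```python
def assign_splits(split_ids, num_readers):
    """Assign each split to a reader by split_id % num_readers."""
    assignment = {reader: [] for reader in range(num_readers)}
    for split_id in split_ids:
        assignment[split_id % num_readers].append(split_id)
    return assignment

# e.g. 6 splits distributed across 3 parallel readers
print(assign_splits(range(6), 3))
# → {0: [0, 3], 1: [1, 4], 2: [2, 5]}
```

Because the assignment depends only on the split ID and reader count, each reader can recompute its share deterministically after a restart, which is what makes the snapshot‑based state recovery straightforward.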
Data Type Mapping
A mapping between StarRocks types, Arrow types, and SeaTunnel types is provided; most common types are supported, while ARRAY, HLL, and BITMAP are not yet supported.
Write Mechanism
Sink writes use Stream Load; data is buffered and flushed in batches. Errors like “too many tablet versions” require tuning compaction threads or batch thresholds. Retry policies can be set.
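When loads outpace compaction and Stream Load fails with “too many tablet versions”, the mitigation is either on the SeaTunnel side (larger, less frequent batches, as in the sink options) or on the StarRocks side, in the BE configuration. The following be.conf fragment is a hedged sketch; the parameter names should be checked against the StarRocks documentation for your version:

```properties
# be.conf: speed up compaction so tablet versions are merged faster
cumulative_compaction_num_threads_per_disk = 4
base_compaction_num_threads_per_disk = 2
cumulative_compaction_check_interval_seconds = 2
```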
Example Configuration
An example syncs a StarRocks table customer_1 to customer_2 using field projection, a transform to strip a prefix, and a sink with JSON format. Parallelism can be set globally or per connector.
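The described job can be sketched as a complete SeaTunnel config file. The table names customer_1 and customer_2 come from the article; the SQL transform used to strip the prefix, the column names, and all connection details are illustrative assumptions:

```hocon
env {
  # Global parallelism; can also be set per connector via `parallelism`
  parallelism = 2
  job.mode    = "BATCH"
}

source {
  StarRocks {
    nodeUrls = ["fe_host:8030"]
    username = "root"
    password = ""
    database = "demo"
    table    = "customer_1"
    # Field projection: only read the columns we need
    schema { fields { c_custkey = BIGINT, c_name = STRING } }
    result_table_name = "customer_src"
  }
}

transform {
  # Illustrative: strip a fixed-length prefix from c_name via a SQL transform
  Sql {
    source_table_name = "customer_src"
    result_table_name = "customer_out"
    query = "select c_custkey, substring(c_name, 10) as c_name from customer_src"
  }
}

sink {
  StarRocks {
    source_table_name = "customer_out"
    nodeUrls = ["fe_host:8030"]
    username = "root"
    password = ""
    database = "demo"
    table    = "customer_2"
    # Write via Stream Load in JSON format
    starrocks.config { format = "JSON", strip_outer_array = true }
  }
}
```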
Future Plans
The roadmap includes exactly‑once semantics via the Stream Load transaction API (StarRocks 2.4), support for additional data types (BITMAP, HLL, ARRAY), and continued reliability improvements.
StarRocks
StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
