How InLong Guarantees Exactly‑Once Real‑Time Writes to StarRocks
This article explains how Apache InLong provides automatic, secure, high‑performance real‑time data transfer to StarRocks, detailing the transactional Stream Load API, the two‑phase commit process, Flink‑based ingestion architecture, exactly‑once guarantees, and performance test results across different parallelism levels.
Background
Apache InLong (also known as YingLong) is a one-stop, all-scenario massive data integration framework that offers automatic, secure, reliable, and high-performance data transmission. It is widely used in advertising, payment, social, gaming, AI, and other industries, handling data at the scale of hundreds of trillions of rows per day.
StarRocks Data Import Capabilities
StarRocks offers four main import methods: Stream Load, Broker Load, Routine Load, and Spark Load. InLong adopts Stream Load together with its transactional interface to achieve real-time ingestion.
Transactional Write Principle in StarRocks
StarRocks implements a classic two-phase commit (2PC) protocol. The client drives a transaction through the following REST-style APIs:
/api/transaction/begin : start a new transaction.
/api/transaction/prepare : pre-commit the transaction, persisting changes temporarily.
/api/transaction/commit : commit the transaction and make the changes permanent.
/api/transaction/rollback : roll back the transaction.
/api/transaction/load : load data within an existing transaction, identified by its label (a random label is generated if none is provided).
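To make the cycle concrete, here is a minimal Java sketch of one transactional load against these endpoints. The FE address, database, table, credentials, label, and CSV payload are illustrative assumptions, not values from the StarRocks or InLong documentation:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

// Sketch of StarRocks' transactional Stream Load cycle:
// begin -> load -> prepare -> commit, all keyed by the same label.
// Host, db, table, credentials, and payload below are assumptions.
public class TxnStreamLoadSketch {
    static final String FE = "http://127.0.0.1:8030";               // assumed FE address
    static final String AUTH = "Basic " +
            Base64.getEncoder().encodeToString("root:".getBytes()); // assumed credentials

    static final HttpClient CLIENT = HttpClient.newBuilder()
            // The FE may redirect the load request to a BE node.
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    static String call(String api, String method, String body,
                       String... headers) throws Exception {
        HttpRequest.Builder b = HttpRequest.newBuilder(URI.create(FE + api))
                .header("Authorization", AUTH)
                .method(method, body == null
                        ? HttpRequest.BodyPublishers.noBody()
                        : HttpRequest.BodyPublishers.ofString(body));
        for (int i = 0; i < headers.length; i += 2) {
            b.header(headers[i], headers[i + 1]);
        }
        HttpResponse<String> resp =
                CLIENT.send(b.build(), HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() != 200) {
            throw new IllegalStateException(api + " failed: " + resp.body());
        }
        return resp.body();
    }

    public static void main(String[] args) throws Exception {
        String label = "demo_label_1";   // one label identifies the whole transaction

        // 1. Begin: the FE creates the transaction and registers the label.
        call("/api/transaction/begin", "POST", null,
                "label", label, "db", "demo_db");

        // 2. Load: ship data into the open transaction (CSV rows here).
        call("/api/transaction/load", "PUT", "1,alice\n2,bob\n",
                "label", label, "db", "demo_db", "table", "demo_tbl");

        // 3. Prepare: pre-commit; data is persisted but not yet visible.
        call("/api/transaction/prepare", "POST", null,
                "label", label, "db", "demo_db");

        // 4. Commit: publish the version; data becomes visible to queries.
        call("/api/transaction/commit", "POST", null,
                "label", label, "db", "demo_db");
    }
}
```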
The internal workflow for each phase is:
Begin + Load : The FE leader creates a transaction and assigns it a label; the Stream Load succeeds once more than half of a tablet's replicas have been written (see the small sketch after this list).
Commit : The FE marks the transaction as committed and sends a publish-version message to the BE nodes, which then make the data visible to queries.
Rollback : If the load or the commit fails, the transaction is aborted and BE nodes asynchronously clean up the failed data. A transaction that has already been committed cannot be rolled back.
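For the majority rule in the Begin + Load step, a tiny illustration (a sketch, not StarRocks source code):

```java
// Sketch of the write-quorum rule described above: a tablet write
// succeeds once more than half of its replicas acknowledge the data.
public class QuorumRule {
    static boolean writeSucceeded(int writtenReplicas, int totalReplicas) {
        return writtenReplicas > totalReplicas / 2;
    }

    public static void main(String[] args) {
        System.out.println(writeSucceeded(2, 3)); // true: majority of 3 reached
        System.out.println(writeSucceeded(1, 3)); // false: no quorum yet
    }
}
```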
InLong Real‑Time Write Architecture
InLong uses Flink tasks to write data to StarRocks in real time. The workflow (sketched in code after this list) is as follows:
Multiple Flink tasks are launched according to the parallelism configuration.
Each task writes through StarRocks' transactional Stream Load, reusing a single transaction label for the whole of a Flink checkpoint cycle.
When a checkpoint starts, the task records the target table and transaction label in its state.
Upon receiving a checkpoint‑completion signal, the task commits all pending transactions, making the data visible.
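This checkpoint-aligned pattern maps naturally onto Flink's generic TwoPhaseCommitSinkFunction, shown in the skeleton below. It is an illustrative sketch, not InLong's actual connector code: the transaction object is simply the StarRocks label, and the starRocks*() helpers are hypothetical placeholders for the HTTP calls to the /api/transaction/* endpoints shown earlier.

```java
import org.apache.flink.api.common.typeutils.base.StringSerializer;
import org.apache.flink.api.common.typeutils.base.VoidSerializer;
import org.apache.flink.streaming.api.functions.sink.TwoPhaseCommitSinkFunction;

// Sketch of a checkpoint-aligned StarRocks sink; NOT InLong's real code.
public class StarRocksTxnSink extends TwoPhaseCommitSinkFunction<String, String, Void> {

    public StarRocksTxnSink() {
        super(StringSerializer.INSTANCE, VoidSerializer.INSTANCE);
    }

    @Override
    protected String beginTransaction() {
        // One label per checkpoint cycle; Flink keeps it in task state.
        String label = "inlong-" + System.nanoTime();
        starRocksBegin(label);              // -> /api/transaction/begin
        return label;
    }

    @Override
    protected void invoke(String label, String row, Context ctx) {
        starRocksLoad(label, row);          // -> /api/transaction/load
    }

    @Override
    protected void preCommit(String label) {
        // Checkpoint barrier arrived: persist the data, not yet visible.
        starRocksPrepare(label);            // -> /api/transaction/prepare
    }

    @Override
    protected void commit(String label) {
        // Checkpoint completed (may be replayed after a restart).
        starRocksCommit(label);             // -> /api/transaction/commit
    }

    @Override
    protected void abort(String label) {
        starRocksRollback(label);           // -> /api/transaction/rollback
    }

    // Hypothetical HTTP helpers, omitted for brevity.
    private void starRocksBegin(String label) { /* ... */ }
    private void starRocksLoad(String label, String row) { /* ... */ }
    private void starRocksPrepare(String label) { /* ... */ }
    private void starRocksCommit(String label) { /* ... */ }
    private void starRocksRollback(String label) { /* ... */ }
}
```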
Exactly‑Once Guarantees
No Duplicate Data : Flink may restart from a previous checkpoint and re-submit data it already wrote. InLong records in checkpoint state the labels of transactions that were successfully prepared; if a restored task retries a commit for a label that StarRocks has already committed, StarRocks returns an error and the duplicate is ignored (see the sketch after this list).
No Data Loss : If a write fails, Flink’s checkpoint mechanism restarts the task from the last successful checkpoint, and the source re‑consumes from the last committed offset, ensuring that all data eventually reaches StarRocks.
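The duplicate-suppression step can be made explicit with a small sketch. The helper and exception names below are hypothetical, chosen only to show why replaying commits after a restart is safe:

```java
import java.util.List;

// Sketch of idempotent commit replay after a Flink restart. The labels come
// from restored checkpoint state; commitTransaction() and the exception type
// are hypothetical placeholders, not real InLong or StarRocks classes.
public class CommitReplaySketch {

    static class AlreadyCommittedException extends RuntimeException {}

    public static void replay(List<String> restoredLabels) {
        for (String label : restoredLabels) {
            try {
                commitTransaction(label);   // retry the commit for this label
            } catch (AlreadyCommittedException e) {
                // StarRocks rejected a label it has already committed: this
                // is the duplicate from the replayed checkpoint, so skip it.
            }
        }
    }

    static void commitTransaction(String label) { /* POST /api/transaction/commit */ }
}
```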
Performance Test
A benchmark using an Iceberg → StarRocks pipeline was conducted at 1 billion rows. The test measured synchronization time and throughput under three parallelism settings:
Parallelism 1: 60 minutes, 27 Mb/s.
Parallelism 3: 10 minutes, 162 Mb/s.
Parallelism 40: 2 minutes, 800 Mb/s.
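As a sanity check, duration × throughput is roughly constant across the three runs (60 × 27 ≈ 10 × 162 ≈ 2 × 800 ≈ 1,600), which is consistent with each configuration moving the same fixed 1-billion-row dataset; higher parallelism shrinks only the wall-clock time.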
Results may vary depending on Iceberg read speed, compression, and column count.
Conclusion
StarRocks resolves read-write conflicts through data versioning and achieves atomic ingestion through FE-coordinated transaction management. InLong builds on this mechanism to provide exactly-once semantics for real-time streaming workloads.
References
Stream Load Transaction Interface: https://docs.starrocks.io/docs/loading/Stream_Load_transaction_interface/
MySQL → StarRocks real‑time sync example: https://inlong.apache.org/zh-CN/docs/next/quick_start/data_sync/mysql_starrocks_example/
StarRocks
StarRocks is an open-source project under the Linux Foundation, focused on building a high-performance, scalable analytical database that enables enterprises to build an efficient, unified lakehouse architecture. It is widely used across many industries worldwide, helping companies strengthen their data analytics capabilities.