
Building a Millisecond-Responsive Real-Time Data Engine with StarRocks, Fluss, and Paimon

This article presents a lake‑stream integrated solution that combines Apache Fluss, Apache Paimon, and StarRocks to achieve second‑level data freshness, tenfold storage cost reduction, and a single‑query access pattern for both real‑time and historical data, detailing its architecture, advantages, query modes, and future roadmap.

StarRocks

1. Background and Challenges

The traditional Lambda architecture uses Kafka + Flink for the real-time path and Hive + Spark for the batch path, which creates three core pain points:

Storage duplication: The same business data is kept for 7 days in Kafka and again in Hive, causing storage costs to multiply.

Code duplication: Separate streaming and batch pipelines often diverge, leading to inconsistent data definitions and extensive debugging effort.

Freshness limitation: Offline jobs refresh only at T+1 or hourly granularity, which cannot meet real‑time business requirements.

2. Limitations of Pure Lakehouse

Even a pure lakehouse architecture such as Flink + Paimon still ties data freshness to Flink checkpoints, because data becomes visible in a lake table only when a checkpoint commits. With a 5-minute checkpoint interval, the first-level lake table lags by up to 5 minutes, the second-level table by up to 10 minutes, and so on: latency grows linearly with the number of processing stages.
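The linear growth described above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming each layer adds one full checkpoint interval of worst-case delay; the function name is hypothetical and not part of any Flink or Paimon API.

```python
def checkpoint_bound_freshness(checkpoint_interval_min, layers):
    """Worst-case freshness (in minutes) of each lake layer.

    Assumption: every layer publishes data only when its own checkpoint
    commits, so each layer adds one checkpoint interval of delay.
    """
    return [checkpoint_interval_min * level for level in range(1, layers + 1)]


# Three processing layers with a 5-minute checkpoint interval.
latencies = checkpoint_bound_freshness(5, 3)
print(latencies)  # [5, 10, 15]
```

Under this model, a pipeline with more layers needs a shorter checkpoint interval to hold the same end-to-end freshness.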

3. Core Advantages of the Lake‑Stream Integrated Solution

3.1 Ten‑fold Storage Cost Reduction

In the Lambda architecture, Kafka typically retains data for 7 days to guarantee replay capability. In the lake-stream solution, Fluss keeps only ultra-short-term data (e.g., 6 hours); data older than the TTL is automatically tiered to Paimon lake tables. This shrinks the streaming-layer retention window from 7 days to 6 hours, reducing that layer's storage cost by an order of magnitude while unifying stream and batch storage into a single view.
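As a rough check on the "order of magnitude" figure, the two retention windows can be compared directly. This sketch assumes a constant ingest rate, so that the volume held in the streaming layer scales with the retention window; it ignores replication, compression, and the cost of the lake storage itself.

```python
from datetime import timedelta

# Retention windows quoted in the text above.
kafka_retention = timedelta(days=7)
fluss_retention = timedelta(hours=6)

# Dividing two timedeltas yields a plain float ratio.
ratio = kafka_retention / fluss_retention
print(ratio)  # 28.0
```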

3.2 Freshness Independent of Layer Count

Fluss + Paimon keeps the real-time segment at second-level latency. Long-term lake ingestion is decoupled from Flink checkpoints, so each lake table maintains a stable freshness of about 3 minutes regardless of how many processing layers exist. Freshness therefore does not accumulate as layers are added.

3.3 Union Read for Millisecond Freshness

Union Read is the core query capability of the integrated architecture. When a user issues an ordinary SELECT, StarRocks simultaneously reads the Paimon snapshot (historical data) and the Fluss incremental log (real-time data starting from the snapshot's log offset). It then performs a sort-merge on primary keys, delivering a complete result set with exactly-once semantics: one query returns both real-time and historical data.

4. Architecture Overview

The design follows three "unified" principles:

Unified data: No double-write; Fluss data older than the TTL is automatically sunk into Paimon by the Tiering Service, forming a single data flow.

Unified metadata: DLF Omni Catalog manages metadata for both Fluss and Paimon, providing a single catalog for all assets.

Unified query entry: StarRocks serves as the sole query engine; a single SQL statement can retrieve both real-time and historical data.

5. Tiering Service – Automatic Data Layering

Tiering Service is a long-running Flink job that moves data whose TTL (e.g., 6 hours) has expired from Fluss into Paimon lake tables and then deletes the expired records from Fluss. Users only need to enable a configuration switch; no custom ETL scripts are required.
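The move-then-delete behaviour can be pictured with a toy in-memory model. This is only a sketch: the real Tiering Service is a Flink job, and every name below (`Record`, `Stores`, `tier_once`) is hypothetical, as is the use of hours as the time unit.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    key: str
    value: str
    ts_hours: float  # write time in hours (hypothetical unit)


@dataclass
class Stores:
    hot: list = field(default_factory=list)   # stands in for Fluss
    lake: list = field(default_factory=list)  # stands in for Paimon


def tier_once(stores, now_hours, ttl_hours=6.0):
    """Move records older than the TTL from hot to lake; return the count moved."""
    expired = [r for r in stores.hot if now_hours - r.ts_hours > ttl_hours]
    # Append to the lake first, then drop from hot, so no record is ever lost.
    stores.lake.extend(expired)
    stores.hot = [r for r in stores.hot if now_hours - r.ts_hours <= ttl_hours]
    return len(expired)


stores = Stores(hot=[Record("a", "v1", 0.0), Record("b", "v2", 3.0), Record("c", "v3", 9.0)])
moved = tier_once(stores, now_hours=10.0)
print(moved, [r.key for r in stores.hot], [r.key for r in stores.lake])
# 2 ['c'] ['a', 'b']
```

Each record lives in exactly one store at a time, which matches the no-double-write principle described in the architecture overview.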

6. StarRocks Unified Query Entry

Creating the Fluss catalog requires a single SQL statement:

-- Replace the bootstrap address and the xxx credentials with your own values.
CREATE EXTERNAL CATALOG `fluss_catalog`
PROPERTIES (
  "type" = "fluss",
  "bootstrap.servers" = "fluss-cn-2rn4ffq4o01:9123",
  "fluss.option.client.security.protocol" = "SASL",
  "fluss.option.client.security.sasl.mechanism" = "PLAIN",
  "fluss.option.client.security.sasl.username" = "xxx",
  "fluss.option.client.security.sasl.password" = "xxx"
);

Query Modes

Default Union Read: SELECT reads both Fluss real‑time data and Paimon historical data, suitable for dashboards and real‑time reports.

$lake suffix: Adding $lake to the table name queries only the Paimon lake (historical) segment, ideal for T+1 reports and cross‑day recomputation.

$rt suffix: Adding $rt queries only the Fluss real‑time segment, useful for online troubleshooting and monitoring.
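The three modes differ only in the suffix appended to the table name, which a small helper can make explicit. This is a sketch under assumptions: the table name `fluss_catalog.demo_db.orders` is hypothetical, `build_query` is not part of any StarRocks client, and whether a name containing `$` must be quoted with backticks depends on the client and version, so check that before use.

```python
# Suffix appended to the table name for each query mode described above.
SUFFIX_BY_MODE = {"union": "", "lake": "$lake", "rt": "$rt"}


def build_query(table, mode="union", columns="*"):
    """Return a SELECT statement text for the requested query mode."""
    if mode not in SUFFIX_BY_MODE:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"SELECT {columns} FROM {table}{SUFFIX_BY_MODE[mode]}"


table = "fluss_catalog.demo_db.orders"  # hypothetical catalog.database.table
for mode in ("union", "lake", "rt"):
    print(build_query(table, mode))
# SELECT * FROM fluss_catalog.demo_db.orders
# SELECT * FROM fluss_catalog.demo_db.orders$lake
# SELECT * FROM fluss_catalog.demo_db.orders$rt
```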

7. Technical Architecture Details

Paimon Scan (Historical Segment)

Historical queries use a native C++ direct-read path with no JVM overhead. The engine supports high-throughput columnar reads and vectorized execution, and can leverage DataCache. Data is read directly from Paimon files stored on OSS, fully exploiting StarRocks' lake-query performance.

Fluss Scan (Real‑time Segment)

Real‑time queries invoke the Fluss Java client via a JNI bridge, transmitting data in Arrow format. Although a JVM layer remains, Arrow columnar transfer dramatically reduces row‑column conversion costs.

Union Read Merge

When a default query is executed, StarRocks reads Paimon snapshot N together with the Fluss log entries whose offset is greater than the offset committed with snapshot N (offset > N.commit), merges them by primary key, and guarantees exactly-once semantics. This mechanism is what delivers “one query, real‑time + historical data all at once”.
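The merge step can be illustrated with a small in-memory model. This is a sketch, not StarRocks' implementation: the `Change` type, the convention that a `None` value marks a delete, and the rule that the highest offset wins for a key are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class Change(NamedTuple):
    offset: int
    key: int
    value: Optional[str]  # None marks a delete (assumed encoding)


def latest_changes(log, snapshot_offset):
    """Keep only changes after the snapshot offset, newest per key, sorted by key."""
    latest = {}
    for change in sorted(log, key=lambda c: c.offset):
        if change.offset > snapshot_offset:
            latest[change.key] = change  # a later offset overwrites an earlier one
    return sorted(latest.values(), key=lambda c: c.key)


def union_read(snapshot, log, snapshot_offset):
    """Sort-merge a key-sorted snapshot of (key, value) rows with the newer log."""
    changes = latest_changes(log, snapshot_offset)
    result, i, j = [], 0, 0
    while i < len(snapshot) or j < len(changes):
        if j == len(changes) or (i < len(snapshot) and snapshot[i][0] < changes[j].key):
            result.append(snapshot[i])  # key is untouched by the log
            i += 1
        else:
            change = changes[j]
            if i < len(snapshot) and snapshot[i][0] == change.key:
                i += 1  # the log entry supersedes the snapshot row
            if change.value is not None:
                result.append((change.key, change.value))
            j += 1
    return result


snapshot = [(1, "a"), (2, "b"), (3, "c")]  # snapshot N, committed at offset 100
log = [
    Change(99, 1, "stale"),   # already covered by the snapshot, so skipped
    Change(101, 2, "b2"),     # update
    Change(102, 4, "d"),      # insert
    Change(103, 3, None),     # delete
    Change(104, 4, "d2"),     # later update to the same key
]
print(union_read(snapshot, log, snapshot_offset=100))
# [(1, 'a'), (2, 'b2'), (4, 'd2')]
```

Each primary key appears at most once in the output, and the entry at offset 99 is never applied twice, which is the sense in which the merged result is exactly-once.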

8. Future Roadmap

Union Read 2.0 – Skip Sort Merge: Support Fluss Delete Vector (bitmap deletion) to bypass the sort‑merge step, significantly boosting end‑to‑end performance for large primary‑key tables.

Optimizer Enhancements: Ingest Fluss statistics (row count, NDV, min/max) to enable metadata‑driven optimizations such as COUNT push‑down, LIMIT push‑down, and Time‑Travel queries.

Native Full‑Chain: Extend the native C++ scanner to the real‑time segment, using Fluss Arrow for zero‑copy transfer, integrating DataCache and predicate push‑down to Fluss servers, thereby eliminating JVM usage entirely.
