
How Fluss Unifies Lake and Stream for Real‑Time Analytics: Architecture, Benefits, and Future Roadmap

This article summarizes a talk by Alibaba Cloud senior engineer and Flink Committer Luo Yuxia on the challenges of separating lake and stream storage. It introduces the Fluss lake‑stream unified architecture, explains technical benefits such as second‑level data freshness, unified metadata, and efficient changelog generation, and outlines future plans for broader ecosystem integration.

Alibaba Cloud Big Data AI Platform
Abstract

This article compiles the presentation by Alibaba Cloud senior developer and Flink Committer Luo Yuxia at Flink Forward Asia 2024 Shanghai, covering four parts: the current lake‑stream split challenges, the Fluss lake‑stream unified architecture, its benefits, and future plans.

1. Challenges of Lake‑Stream Split

Traditional Lambda architecture separates offline (e.g., Hive) and real‑time (e.g., Kafka) processing, leading to two storage systems, duplicated pipelines, and long development cycles. Modern lake storage solutions such as Apache Paimon, Iceberg, and Hudi provide minute‑level freshness but still require a separate streaming layer for second‑level latency, re‑introducing the split.

Architecture complexity: two storage systems, two codebases, two pipelines.

Operational overhead: separate monitoring, fault tolerance, and upgrades.

Resource waste: duplicated computation.

Data issues: consistency, governance, and redundancy.

2. Fluss Lake‑Stream Unified Architecture

Fluss is a real‑time stream storage designed for analytical workloads. It writes data with millisecond latency and compacts it into standard lake formats (e.g., Paimon, Iceberg) so that external query engines can read directly from the lake files.

Fluss provides a unified catalog for Flink, exposing a single table that can read from both the stream and the lake, eliminating the need to switch catalogs.

Data distribution is aligned between Fluss and the lake: both use the same bucketing algorithm, bucket_id = hash(row) % bucket_num, ensuring that each row lands in the same bucket in both systems.
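The shared bucketing rule can be sketched as follows. This is a minimal illustration of the idea, not Fluss code; the function names (stable_hash, bucket_for) are hypothetical, and the key point is that the hash must be deterministic so both systems compute the same bucket for the same row.

```python
def stable_hash(key: str) -> int:
    """Deterministic string hash (unlike Python's salted hash()),
    so two independent systems always agree on the result."""
    h = 0
    for byte in key.encode("utf-8"):
        h = (h * 31 + byte) & 0x7FFFFFFF
    return h

def bucket_for(key: str, bucket_num: int) -> int:
    # bucket_id = hash(row) % bucket_num, as described above
    return stable_hash(key) % bucket_num

BUCKET_NUM = 4
fluss_bucket = bucket_for("order-1001", BUCKET_NUM)
lake_bucket = bucket_for("order-1001", BUCKET_NUM)
assert fluss_bucket == lake_bucket  # same row, same bucket in both systems
```

Because the assignment is a pure function of the row key and the bucket count, no coordination between the two systems is needed to keep them aligned.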

Unified Metadata

Instead of separate catalogs for stream and lake, Fluss presents one catalog and one table to Flink, simplifying metadata management.

Data Distribution Alignment

Because bucket assignments are identical, compaction can write Fluss bucket data directly to the corresponding lake bucket, avoiding shuffle and guaranteeing consistency.

Stream Read (Efficient Data Back‑tracking)

Historical data resides in the lake, real‑time data in Fluss. Fluss first reads lake data for back‑tracking, then streams the latest records, leveraging lake‑side predicate push‑down, column pruning, and high compression.
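The two-phase read described above can be sketched over mock data: historical rows come from a lake snapshot, then reading continues from the Fluss log at the offset where the snapshot ends. The data structures and the stream_read function here are stand-ins for illustration, not the Fluss API.

```python
lake_snapshot = {
    "rows": [{"id": 1, "amount": 10}, {"id": 2, "amount": 25}],
    "end_offset": 2,  # log offset already covered by the compacted snapshot
}
fluss_log = [  # offsets 0..3; offsets 0-1 are already in the snapshot
    {"id": 1, "amount": 10}, {"id": 2, "amount": 25},
    {"id": 3, "amount": 7}, {"id": 4, "amount": 40},
]

def stream_read(snapshot, log, predicate=lambda row: True):
    # Phase 1: back-track through lake data; in a real engine this is
    # where predicate push-down and column pruning pay off.
    for row in snapshot["rows"]:
        if predicate(row):
            yield row
    # Phase 2: switch to the Fluss log from where the snapshot left off,
    # so no record is read twice and none is missed.
    for row in log[snapshot["end_offset"]:]:
        if predicate(row):
            yield row

rows = list(stream_read(lake_snapshot, fluss_log, lambda r: r["amount"] > 8))
# yields ids 1 and 2 from the lake, then id 4 from the live log
```

The handoff offset is the crucial piece: it makes the lake snapshot and the log tail form one seamless, gap-free stream.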

Batch Read (Second‑Level Freshness)

Flink can perform a union read of Fluss and lake data, achieving near‑real‑time freshness for batch analytics.
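A union read for a batch query can be sketched like this: combine the lake snapshot with the not-yet-compacted rows still in Fluss, letting the fresher Fluss version of each primary key win. Mock dictionaries only; union_read is an illustrative name, not a Fluss API.

```python
# Lake snapshot: rows compacted minutes ago.
lake_rows = {1: {"id": 1, "amount": 10}, 2: {"id": 2, "amount": 25}}
# Fluss: seconds-fresh rows not yet compacted into the lake.
fluss_rows = {2: {"id": 2, "amount": 30}, 3: {"id": 3, "amount": 7}}

def union_read(lake, fluss):
    merged = dict(lake)
    merged.update(fluss)  # Fluss holds the newest version of each key
    return sorted(merged.values(), key=lambda r: r["id"])

result = union_read(lake_rows, fluss_rows)
# id=2 reflects the fresher value (30) from Fluss, not the lake's 25
```

This is why batch queries over the unified table see second-level freshness: the stale lake copy of a key is shadowed by the live copy in Fluss.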

Flink + Fluss SQL

SELECT * FROM orders

This query reads the full table via a union of stream and lake data.

SELECT COUNT(*), MAX(t), SUM(amount) FROM orders$lake

Appending $lake reads only the lake side; $lake$snapshots queries system tables.

3. Benefits of the Unified Architecture

Lake storage becomes real‑time, achieving second‑level data freshness.

All warehouse layers (ODS, DWD, ADS) see the same freshness, independent of checkpoint intervals.

More efficient changelog generation: Fluss produces second‑level changelogs, and its compaction service writes them directly in Paimon format without extra lookup overhead.

Supports multi‑writer partial updates, eliminating the single‑writer limitation of Paimon.

4. Future Plans

Union Read ecosystem: extend support beyond Flink to engines such as Spark and StarRocks.

Lake ecosystem: add support for additional lake formats such as Iceberg and Hudi.

Arrow → Parquet conversion: leverage mature Arrow‑to‑Parquet tools to reduce compaction costs.

Written by

Alibaba Cloud Big Data AI Platform

The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.
