Why Kafka Falls Short for Real‑Time Analytics and How Fluss Changes the Game
A talk at Flink Forward Asia 2024 outlined Kafka's limitations for real-time analytics (no updates, poor data exploration, costly back-tracking, and high network overhead) and introduced Fluss, a columnar streaming storage that offers low-latency reads, CDC, lake-stream integration, and an efficient Delta Join for fast, scalable analytics.
This article, based on a talk by Wu Chong (Yunxie) at Flink Forward Asia 2024, introduces Fluss, a next-generation storage solution designed for streaming analytics, and describes its open-source release.
1. Problems of Kafka in Real‑Time Analytics
Kafka does not support updates, so duplicate records accumulate in the log and must be deduplicated in Flink, which requires large state and considerable resources. It also lacks data exploration capabilities: there is no direct query interface, so teams either synchronize data to an OLAP system at extra cost or run inefficient full-scan queries through engines such as Trino. Long-term data back-tracking (replaying historical data) is limited by storage cost and read performance. Network costs are high because records are shipped whole, so consumers read all columns even when they need only a subset.
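To make the deduplication cost concrete, here is a minimal Python sketch of keep-latest deduplication over an append-only log. It is a toy model, not Flink code: the `Deduplicator` class and the `(key, version, payload)` record shape are assumptions made for illustration. The point it shows is that the job must retain one state entry per distinct key for as long as it runs.

```python
from typing import Dict, Optional, Tuple

# Hypothetical record shape for illustration: (key, version, payload).
Record = Tuple[str, int, dict]


class Deduplicator:
    """Toy keep-latest deduplication over an append-only log."""

    def __init__(self) -> None:
        # Keyed state: one entry per distinct key, retained indefinitely.
        self.latest_version: Dict[str, int] = {}

    def process(self, record: Record) -> Optional[Record]:
        """Return the record if it is newer than what was seen for its key, else None."""
        key, version, _payload = record
        if version > self.latest_version.get(key, -1):
            self.latest_version[key] = version
            return record
        return None  # duplicate or stale record, dropped


if __name__ == "__main__":
    log = [("a", 1, {}), ("b", 1, {}), ("a", 1, {}), ("a", 2, {})]
    dedup = Deduplicator()
    kept = [r for r in log if dedup.process(r) is not None]
    print(len(kept), "records kept;", len(dedup.latest_version), "state entries held")
```

Because the log cannot update a record in place, the state grows with the number of distinct keys rather than with the number of records currently of interest.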
2. Fluss: Flink Unified Streaming Storage
Fluss fills the gap for a streaming storage optimized for analytical workloads by using a columnar format based on Apache Arrow. It provides efficient column pruning, real-time updates via a log-tablet with a KV index, CDC support, and point-lookup queries.
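The effect of column pruning can be sketched with a small Python model that compares the bytes shipped by a row-oriented log with the bytes shipped when a batch is stored as one buffer per column. This is only an illustration: it uses JSON as a stand-in encoding rather than Arrow, and the function names are hypothetical, not part of any Fluss API.

```python
import json
from typing import Dict, List


def row_batch_bytes(rows: List[dict]) -> int:
    """Row-oriented log: every record is shipped whole, so all columns cross the network."""
    return sum(len(json.dumps(r).encode()) for r in rows)


def to_columnar(rows: List[dict]) -> Dict[str, bytes]:
    """Pivot a batch into one independently encoded buffer per column."""
    columns = list(rows[0].keys())
    return {c: json.dumps([r[c] for r in rows]).encode() for c in columns}


def pruned_batch_bytes(batch: Dict[str, bytes], projection: List[str]) -> int:
    """Server-side pruning: only the buffers of the projected columns are sent."""
    return sum(len(batch[c]) for c in projection)


if __name__ == "__main__":
    # 100 records with 10 integer columns each; the consumer needs only column c1.
    rows = [{f"c{i}": n * i for i in range(10)} for n in range(100)]
    batch = to_columnar(rows)
    print("row-oriented bytes:", row_batch_bytes(rows))
    print("pruned columnar bytes:", pruned_batch_bytes(batch, ["c1"]))
```

Pruning works because each column lives in its own buffer, so the server can skip unneeded columns without decoding records.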
3. Core Features of Fluss
Columnar streaming storage with Arrow‑based IPC format, enabling server‑side column pruning and up to 10× higher read throughput when most columns are skipped.
Real-time updates and CDC via a KV index built on the log-tablet and backed by a RocksDB LSM tree, allowing efficient KV point-lookups and eliminating the need for deduplication in Flink.
Lake‑stream integration: data is stored both as a real‑time stream and as lake storage (Parquet), automatically compacted and kept metadata‑consistent, enabling seamless back‑tracking and historical queries.
Union Read: combines lake storage for historical data with stream storage for low‑latency recent data, providing second‑level freshness for Lakehouse analytics.
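The update, CDC, and Union Read features above can be pictured with a toy Python model. Everything here is an assumption made for illustration: `PrimaryKeyTable` and `union_read` are hypothetical names, a dict stands in for the KV index, and a list stands in for the log. The change kinds `+I`, `-U`, `+U`, and `-D` follow Flink's changelog notation.

```python
from typing import Dict, List, Optional, Tuple

# (op, key, row); ops use Flink's changelog notation: +I, -U, +U, -D.
Change = Tuple[str, str, dict]


class PrimaryKeyTable:
    """Toy primary-key table: a KV index for current values plus an append-only changelog."""

    def __init__(self) -> None:
        self.kv: Dict[str, dict] = {}       # stand-in for the KV index
        self.changelog: List[Change] = []   # stand-in for the log

    def upsert(self, key: str, row: dict) -> None:
        old = self.kv.get(key)
        if old is None:
            self.changelog.append(("+I", key, row))
        else:
            self.changelog.append(("-U", key, old))
            self.changelog.append(("+U", key, row))
        self.kv[key] = row

    def delete(self, key: str) -> None:
        old = self.kv.pop(key, None)
        if old is not None:
            self.changelog.append(("-D", key, old))

    def lookup(self, key: str) -> Optional[dict]:
        """Point lookup against the current value of a key."""
        return self.kv.get(key)


def union_read(lake_snapshot: Dict[str, dict], snapshot_offset: int,
               changelog: List[Change]) -> Dict[str, dict]:
    """Start from the lake snapshot, then replay only the log entries written after it."""
    view = dict(lake_snapshot)
    for op, key, row in changelog[snapshot_offset:]:
        if op in ("+I", "+U"):
            view[key] = row
        elif op == "-D":
            view.pop(key, None)
        # "-U" carries the old value and needs no action for a keyed view.
    return view
```

In this model the snapshot offset is what keeps the two layers consistent: the reader takes history from the lake copy and only the tail of the log from the stream, so no record is read twice or missed.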
4. Delta Join Powered by Fluss
By leveraging Fluss's CDC stream and KV index, a new Delta Join operator replaces traditional stateful dual-stream (stream-stream) joins. For each change arriving on one side, it performs a point-lookup on the opposite side's table instead of keeping both inputs in operator state. This eliminates the large join state, reduces resource usage by up to 10×, and speeds up back-tracking from hours to minutes.
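The lookup-based join can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the actual operator: `KvTable` and `delta_join` are hypothetical names, and the join key is assumed to equal each table's primary key, which makes the join one-to-one.

```python
from typing import Dict, Optional, Tuple


class KvTable:
    """Toy stand-in for a table that supports point lookups by key."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def put(self, key: str, row: dict) -> None:
        self.rows[key] = row

    def lookup(self, key: str) -> Optional[dict]:
        return self.rows.get(key)


def delta_join(side: str, key: str, row: dict,
               other: KvTable) -> Optional[Tuple[str, dict, dict]]:
    """Join one incoming change against the opposite table with a point lookup.

    The function keeps no state of its own; all data lives in the tables.
    """
    match = other.lookup(key)
    if match is None:
        return None  # no partner yet; the partner's own change will trigger the join later
    left, right = (row, match) if side == "left" else (match, row)
    return key, left, right


if __name__ == "__main__":
    orders, payments = KvTable(), KvTable()

    orders.put("o1", {"amount": 30})
    print(delta_join("left", "o1", {"amount": 30}, payments))   # payment not written yet

    payments.put("o1", {"status": "paid"})
    print(delta_join("right", "o1", {"status": "paid"}, orders))
```

Because each side's change triggers a lookup on the other, a pair is joined as soon as the second of its two rows arrives, without the operator buffering either input.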
5. Future Roadmap
Kafka protocol compatibility to ease migration.
Deep integration with Flink for storage‑optimizer‑engine co‑optimization.
Providing a real‑time layer for Paimon, completing the lake‑stream unified architecture.
6. Open‑Source Release
Fluss was officially open‑sourced on GitHub (https://github.com/alibaba/fluss) under the Apache 2.0 license during the Flink Forward Asia 2024 keynote, with plans to donate it to the Apache Software Foundation in 2025.
