
Why Kafka Falls Short for Real‑Time Analytics and How Fluss Changes the Game

A talk at Flink Forward Asia 2024 laid out the limitations of Kafka for real‑time analytics (no updates, poor data exploration, costly back‑tracking, and high network overhead) and introduced Fluss, a columnar streaming storage system that offers low‑latency reads, CDC, lake‑stream integration, and an efficient Delta Join for scalable, fast analytics.

Alibaba Cloud Big Data AI Platform

This article, based on a talk by WU Chong (cloudx) at Flink Forward Asia 2024, introduces Fluss, a next‑generation storage solution designed for streaming analytics, and describes its open‑source release.

1. Problems of Kafka in Real‑Time Analytics

Kafka does not support updates in place, so every change to a key is appended as a new record, and deduplication has to happen downstream in Flink, where it consumes large amounts of state and compute. Kafka also lacks data exploration capabilities: it offers no direct query interface, so teams either pay to synchronize the data into an OLAP system or run inefficient full‑scan queries through engines like Trino. Long‑term data back‑tracking is limited by storage cost and read performance. Network costs are high as well, because consumers often read every column even when only a subset is needed.
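To make the deduplication cost concrete, here is a minimal sketch in plain Python of what a downstream operator has to do when the log is append‑only. The record layout (key, version, value) and the `Deduplicator` class are assumptions for illustration, not Flink or Kafka APIs.

```python
from typing import Dict, List, Optional, Tuple

Record = Tuple[str, int, int]  # (key, version, value): a hypothetical layout


class Deduplicator:
    """Keeps the newest version seen per key, as a keyed-state operator would."""

    def __init__(self) -> None:
        # One entry per distinct key; this is the state that grows with the data.
        self.latest_version: Dict[str, int] = {}

    def process(self, record: Record) -> Optional[Record]:
        key, version, _ = record
        if version <= self.latest_version.get(key, -1):
            return None  # duplicate or stale record, drop it
        self.latest_version[key] = version
        return record


# An append-only log: the update to order-1 is stored as a second record,
# and a producer retry has written order-2 twice.
log: List[Record] = [
    ("order-1", 1, 100),
    ("order-2", 1, 250),
    ("order-2", 1, 250),  # duplicate
    ("order-1", 2, 120),  # update
]

dedup = Deduplicator()
deduped = [r for r in map(dedup.process, log) if r is not None]
state_entries = len(dedup.latest_version)
```

In this toy run four stored records yield three emitted ones, and the operator holds one state entry per distinct key. With a large key space that per‑key state is what makes deduplication expensive.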

2. Fluss: Flink Unified Streaming Storage

Fluss fills a gap in the market: streaming storage optimized for analytical workloads. It uses a columnar format based on Apache Arrow and provides efficient column pruning, real‑time updates via a log‑tablet with a KV index, CDC support, and point‑lookup queries.
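The effect of column pruning on network traffic can be illustrated with a small sketch. It stores one batch column by column and returns only the requested columns. The schema, the values, and the helper names are assumptions for illustration; this is not the Arrow IPC format or a Fluss API.

```python
from typing import Dict, List, Sequence

Batch = Dict[str, List[bytes]]

# One batch stored column by column (hypothetical schema and values).
batch: Batch = {
    "order_id": [b"0001", b"0002", b"0003"],
    "amount": [b"0100", b"0250", b"0120"],
    "address": [b"x" * 64, b"y" * 64, b"z" * 64],
    "comment": [b"c" * 128, b"d" * 128, b"e" * 128],
}


def size_in_bytes(columns: Batch) -> int:
    """Total payload size of the given columns."""
    return sum(len(value) for values in columns.values() for value in values)


def read_projected(source: Batch, wanted: Sequence[str]) -> Batch:
    """Server-side pruning: only the requested columns leave the server."""
    return {name: source[name] for name in wanted}


full_bytes = size_in_bytes(batch)
projected = read_projected(batch, ["order_id", "amount"])
projected_bytes = size_in_bytes(projected)
```

With this made‑up data, the full batch is 600 bytes, while a consumer that needs only `order_id` and `amount` receives 24 bytes. A row‑oriented log would have to ship all 600 bytes and let the consumer discard the rest.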

3. Core Features of Fluss

Columnar streaming storage with Arrow‑based IPC format, enabling server‑side column pruning and up to 10× higher read throughput when most columns are skipped.

Real‑time updates and CDC via a log‑tablet backed by a RocksDB LSM tree, allowing efficient KV point‑lookups and eliminating the need for deduplication in Flink.

Lake‑stream integration: data is stored both as a real‑time stream and as lake storage (Parquet), automatically compacted and kept metadata‑consistent, enabling seamless back‑tracking and historical queries.

Union Read: combines lake storage for historical data with stream storage for low‑latency recent data, providing second‑level freshness for Lakehouse analytics.
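The Union Read idea can be sketched as follows. The sketch assumes the lake snapshot records the log offset up to which it has been compacted, so a reader can take historical rows from the snapshot and apply only the newer log entries on top. `LakeSnapshot`, `log_end_offset`, and `union_read` are hypothetical names, not the Fluss API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

LogEntry = Tuple[int, str, int]  # (offset, key, value)


@dataclass
class LakeSnapshot:
    rows: Dict[str, int]  # key -> value as of the snapshot
    log_end_offset: int   # first log offset NOT yet included in the snapshot


def union_read(snapshot: LakeSnapshot, log: List[LogEntry]) -> Dict[str, int]:
    """Merge historical rows from the lake with the recent tail of the stream."""
    result = dict(snapshot.rows)
    for offset, key, value in log:
        if offset >= snapshot.log_end_offset:  # skip what the lake already holds
            result[key] = value
    return result


# The stream holds five entries; the lake has compacted offsets 0..2.
log: List[LogEntry] = [
    (0, "a", 1),
    (1, "b", 1),
    (2, "b", 2),
    (3, "a", 5),
    (4, "c", 7),
]
snapshot = LakeSnapshot(rows={"a": 1, "b": 2}, log_end_offset=3)

view = union_read(snapshot, log)
```

The reader sees `a` and `c` from the stream tail and `b` from the lake, without re‑reading offsets the snapshot already covers.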

4. Delta Join Powered by Fluss

By leveraging Fluss’s CDC stream and KV index, a new Delta Join operator replaces traditional stateful double‑stream joins. It performs point‑lookups on the opposite side, eliminating large state, reducing resource usage by up to 10×, and speeding up back‑tracking from hours to minutes.
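A minimal sketch of the lookup‑based join described above, under simplifying assumptions: both tables are keyed by the join key, a plain dict stands in for the KV index, and retraction records are omitted. `KvTable` and `delta_join` are hypothetical names for illustration, not the Flink operator or a Fluss API.

```python
from typing import Dict, List, Optional, Tuple

Change = Tuple[str, str, str]   # (side, key, value): one changelog record
Joined = Tuple[str, str, str]   # (key, left_value, right_value)


class KvTable:
    """Toy primary-key table; the dict stands in for the storage-side KV index."""

    def __init__(self) -> None:
        self._index: Dict[str, str] = {}

    def upsert(self, key: str, value: str) -> None:
        self._index[key] = value

    def lookup(self, key: str) -> Optional[str]:
        return self._index.get(key)


def delta_join(changes: List[Change], left: KvTable, right: KvTable) -> List[Joined]:
    """For each change on one side, point-look-up the other side; keep no join state."""
    out: List[Joined] = []
    for side, key, value in changes:
        if side == "left":
            left.upsert(key, value)  # in a real system the write happens in storage
            match = right.lookup(key)
            if match is not None:
                out.append((key, value, match))
        else:
            right.upsert(key, value)
            match = left.lookup(key)
            if match is not None:
                out.append((key, match, value))
    return out


changes: List[Change] = [
    ("left", "o1", "100"),
    ("right", "o1", "paid"),
    ("left", "o2", "250"),
    ("left", "o1", "120"),
]
joined = delta_join(changes, KvTable(), KvTable())
```

The join function itself holds nothing between records. Everything it needs lives in the two tables, which is why the large operator state of a double‑stream join disappears in this design.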

5. Future Roadmap

Kafka protocol compatibility to ease migration.

Deep integration with Flink for storage‑optimizer‑engine co‑optimization.

Providing a real‑time layer for Paimon, completing the lake‑stream unified architecture.

6. Open‑Source Release

Fluss was officially open‑sourced on GitHub (https://github.com/alibaba/fluss) under the Apache 2.0 license during the Flink Forward Asia 2024 keynote, with plans to donate it to the Apache Software Foundation in 2025.
