How Fluss Unifies Stream and Lake to Power AI Data Pipelines
In the era of rapid AI growth, Fluss offers a unified lake‑stream architecture that tackles data quality, timeliness, scale, and multimodal challenges by tightly integrating Flink streaming with a high‑performance data lake, enabling seamless real‑time and batch analytics for AI workloads.
Fluss Overview
Overview: In the fast‑evolving AI landscape, data infrastructure faces challenges in quality, timeliness, scale, and multimodal processing. To address these, Alibaba Cloud’s Flink team introduced Fluss, a lake‑stream integrated architecture designed to build an efficient, unified data processing platform. This talk covers Fluss’s core design concepts, architectural advantages, real‑world scenarios, and future roadmap, detailing how deep integration of stream storage and data lake enables seamless real‑time processing and offline analysis for AI applications.
Agenda
Fluss Introduction
Fluss Lake‑Stream Architecture Design
Fluss Lake‑Stream Best Practices
Future Technical Planning and Community Ecosystem Outlook
Summary
AI Data Infrastructure Requirements
The rapid development of AI places higher demands on data infrastructure:
1) Data quality – accurate, complete metadata for model training.
2) Data timeliness – support for both historical and real‑time data (e.g., RAG, real‑time user profiling).
3) Data scale – massive datasets for large‑model training (trillions of parameters require billions of samples).
4) Multimodal data – efficient handling of images, text, audio for generative AI.
Fluss Positioning and Core Features
Flink, as a powerful compute engine, requires a tightly coupled storage system. Fluss (Flink Unified Streaming Storage) is a stream storage system built for real‑time analytics, offering low latency, high throughput, and seamless integration with data lakes.
Key capabilities:
1) Stream‑Batch Integration – real‑time read/write with millisecond latency; batch read/write with native archiving to the data lake, reducing long‑term storage costs.
2) Full‑Incremental Integration – automatic switching from OSS offline reads to real‑time reads, supporting column pruning and predicate push‑down to lower I/O overhead.
3) Lake‑Stream Integration – stream storage holds hot data (hours), lake storage holds cold data (days); built‑in lake‑stream channel services enable seamless archiving (e.g., Flink jobs automatically sync to Paimon).
4) Enhanced Features – schema management, real‑time CDC, data profiling, key‑point queries, supporting complex analytics such as dimension‑table joins and real‑time data‑warehouse construction.
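The capabilities above come together at table-creation time. As a minimal sketch, assuming the `table.datalake.enabled` option described in the Fluss documentation (column names here are hypothetical, reusing the `user_orders` example from later in this talk):

```sql
-- Hypothetical Flink SQL DDL for a Fluss primary-key table with lake
-- archiving enabled; adjust columns and options to your deployment.
CREATE TABLE user_orders (
  order_id   BIGINT,
  user_id    BIGINT,
  amount     DECIMAL(10, 2),
  order_time TIMESTAMP(3),
  PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
  -- assumed option name: tells Fluss to archive cold data to the
  -- configured data lake (e.g., Paimon) via the tiering service
  'table.datalake.enabled' = 'true'
);
```

With this one DDL statement, the table serves both millisecond-latency streaming reads and cheap long-term batch reads, rather than requiring two separate tables in two systems.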
Why Lake‑Stream Integration?
Traditional architectures suffer from:
Two separate storage systems (real‑time vs. offline), leading to high costs (long‑term retention in Kafka can cost more than 10× the equivalent storage in OSS).
Inconsistent computation results across the two pipelines.
Higher development and operation overhead due to maintaining dual ETL pipelines.
Industry trends (Kafka, Redpanda, AutoMQ) are moving toward lake‑stream designs to achieve “one data, multiple uses.”
Core Architecture Design
Metadata Federation – Flink SQL creates Fluss tables that automatically sync to the data lake (e.g., Paimon), ensuring consistent schema and primary keys.
Example: creating a Fluss table user_orders automatically generates a corresponding Paimon table, enabling Union Read for seamless full‑incremental data access.
Tiering Service – a stateless service moves hot data to the lake and reports cold data locations, supporting multiple lake formats (Paimon, Iceberg, Hudi) and compute engines (Flink, Spark, StarRocks).
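The Union Read path described above can be sketched in Flink SQL as follows. This is a hedged illustration: the batch/streaming mode switch is standard Flink SQL, while the `$lake` suffix for reading only archived lake data follows the Fluss documentation and should be verified against your version:

```sql
-- In batch mode, reading the Fluss table performs a Union Read: the
-- archived lake snapshot (e.g., in Paimon) is combined with the latest
-- changes still held in stream storage, so results cover full data.
SET 'execution.runtime-mode' = 'batch';
SELECT COUNT(*) FROM user_orders;

-- Assumed syntax: reading only the archived lake data via the $lake
-- suffix, bypassing stream storage entirely (useful for ad-hoc BI).
SELECT COUNT(*) FROM user_orders$lake;
```

Because the tiering service keeps the Paimon table's schema and primary keys in sync with the Fluss table, both reads return consistent results without a second ETL pipeline.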
Business Benefits
Cost Optimization – retaining only ~1 hour of hot data in stream storage cuts its storage cost by >90 %; cold data in the lake costs roughly one‑tenth as much as stream storage.
Efficiency Gains – unified metadata and pipelines eliminate result inconsistencies, reducing offline‑to‑online reporting gaps.
Best Practices
Quick Start – configure the data‑lake address (e.g., Paimon) in Fluss config, then launch the tiering service with a one‑click script; Docker images are provided for fast deployment (≈10 minutes).
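The data-lake address mentioned in the quick start is set in the Fluss server configuration. As a hedged sketch, the key names below follow the Fluss quick-start documentation and the warehouse path is a placeholder for your environment:

```yaml
# Assumed config fragment (server.yaml): point Fluss at a Paimon
# warehouse so the tiering service knows where to archive cold data.
datalake.format: paimon
datalake.paimon.metastore: filesystem
datalake.paimon.warehouse: oss://your-bucket/fluss-warehouse
```

After this, launching the tiering service with the bundled script starts moving hot data to the lake automatically.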
Typical Scenarios
Full‑incremental analysis: real‑time aggregation of total consumption for top‑10 customers with second‑level latency.
Data‑lake analysis: use Flink or other engines to query lake data for ad‑hoc BI.
AI Agent: real‑time sentiment analysis of user comments triggers instant coupon distribution, improving response time from minutes to seconds.
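The full-incremental top-10 scenario above maps naturally onto Flink's streaming Top-N pattern. A hedged sketch, reusing the hypothetical `user_orders` columns and the standard `ROW_NUMBER()` Top-N idiom from the Flink SQL documentation:

```sql
-- Continuously updated top-10 customers by total consumption.
-- In streaming mode, the aggregate and ranking update with
-- second-level latency as new orders arrive in Fluss.
SET 'execution.runtime-mode' = 'streaming';
SELECT user_id, total_spend
FROM (
  SELECT user_id, total_spend,
         ROW_NUMBER() OVER (ORDER BY total_spend DESC) AS rn
  FROM (
    SELECT user_id, SUM(amount) AS total_spend
    FROM user_orders
    GROUP BY user_id
  )
)
WHERE rn <= 10;
```

Run in batch mode instead, the same query performs a Union Read over lake plus stream data, giving one query text for both the full and incremental cases.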
Future Technical Planning & Community Outlook
Planned enhancements include multi‑lake and multi‑engine compatibility (Iceberg, Hudi, Presto, Trino) and Kafka protocol compatibility for seamless migration. Deep integration with Flink SQL will enable predicate push‑down to storage (e.g., filtering by order_time > '20250401' directly in the storage layer).
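The roadmap's storage-side filtering can be illustrated with the talk's own example predicate. A hedged sketch (timestamp literal and columns are illustrative):

```sql
-- With predicate push-down, Fluss would evaluate this filter in the
-- storage layer and skip non-matching data before it reaches Flink,
-- instead of scanning and filtering in the compute engine.
SELECT order_id, amount
FROM user_orders
WHERE order_time > TIMESTAMP '2025-04-01 00:00:00';
```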
The community aims to consolidate production best practices from large‑scale internal deployments (ByteDance, Taobao, Alibaba Cloud) and foster open‑source contributions from companies like Tencent, Kuaishou, Ant, JD, eBay, and more.
Conclusion
Fluss delivers an innovative lake‑stream solution that resolves cost, latency, and consistency challenges in data processing, providing a solid foundation for AI model training and real‑time applications. With expanding multi‑lake, multi‑engine support and a growing community ecosystem, Fluss is poised to become a key data system for Data + AI, unlocking enterprise data value.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.