Evolution of Flink State Storage and Compute‑Storage Separation Architecture

This article examines the evolution of Flink's state storage, discusses challenges posed by cloud‑native deployments, reviews recent community and Alibaba enhancements such as unaligned checkpoints, incremental snapshots, and the Gemini layered storage system, and proposes future directions for a compute‑storage separation architecture.

Big Data Technology & Architecture

Mar 4, 2024

Flink's state storage is a core component of stream processing. Since the seminal VLDB paper in 2017, the overall architecture has remained stable, but deployment models, storage media, and job workloads have dramatically changed, prompting a re‑evaluation of state management.

Deployment has moved from the early MapReduce era, which lacked resource isolation, to the modern cloud‑native era of Kubernetes containers. Meanwhile, hardware advances (e.g., 25 Gbps network bandwidth) have shifted storage from local disks to distributed and cloud storage, which offers larger capacity and lower cost at the expense of higher latency.

Job workloads have also grown: state that once measured a few hundred megabytes is now often measured in terabytes, especially in logistics and other data‑intensive scenarios. These trends motivate a three‑part discussion of how Flink's compute‑storage separation is evolving:

1. Why state is crucial for Flink performance: state access lies on the critical path of every record.

2. Recent community and Alibaba efforts on state storage, including unaligned checkpoints, incremental snapshots, and the Gemini layered storage system.

3. Future directions for a compute‑storage separation architecture.

Why State Is So Important for Flink

In Flink streaming, operators that need to reference earlier inputs store intermediate data in a State Table. The StateBackend provides read/write services; when state exceeds memory it spills to local disk, and it is periodically snapshotted to a distributed file system (OSS/HDFS/S3) as part of a checkpoint. Each checkpoint must be globally consistent across all tasks.

State Management Requirements and Existing Problems

State access is single‑threaded and handles one key‑value pair at a time, so throughput is highly sensitive to access latency. Local disk access can take microseconds, but a remote DFS introduces a significant bottleneck, especially for large state. Checkpointing must also remain lightweight so that it does not disturb normal processing.
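A back-of-the-envelope calculation shows why latency dominates. If every record performs one blocking state access, the task's throughput ceiling is roughly the reciprocal of the access latency. The latency figures below are illustrative assumptions, not measurements from this article.

```python
# Rough throughput ceiling for a task that performs blocking state
# accesses on a single thread. All latency figures are assumptions.

def max_records_per_second(access_latency_s, accesses_per_record=1):
    """Upper bound on records/s when every state access blocks the task thread."""
    return 1.0 / (access_latency_s * accesses_per_record)

ASSUMED_LATENCIES_S = {
    "memory hit (1 us)": 1e-6,
    "local disk (100 us)": 100e-6,
    "remote DFS (10 ms)": 10e-3,
}

for tier, latency in ASSUMED_LATENCIES_S.items():
    print(f"{tier:>20}: {max_records_per_second(latency):>12,.0f} records/s")
```

Under these assumptions, each step down the storage hierarchy costs about two orders of magnitude in achievable throughput per task.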

Cloud‑native constraints add further challenges: limited local disk size in containers, difficulty remounting disks at runtime, and the need for smooth elasticity. Existing solutions like Queryable State are deprecated due to performance impact.

State Storage Improvements – Community and Commercial Efforts

Distributed Snapshot Architecture Upgrade

Since Flink 1.11, six releases have introduced Unaligned Checkpoint, Buffer Debloating, and generic incremental checkpoints. Together these features cut checkpoint completion time under backpressure by up to 90 % while keeping the impact on checkpoint size and recovery time negligible.

Generic incremental checkpoints decouple checkpointing from state snapshotting: state changes are written continuously to a changelog on DFS, so a checkpoint only has to flush the most recent changes. Checkpoints can therefore complete within ~10 s regardless of state size, and CPU/network spikes drop by ~70 %.
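The idea can be illustrated with a toy model. This is a sketch only: the class and method names are hypothetical, in-memory lists stand in for DFS, and it is not how Flink implements the feature. It shows that a checkpoint flushes just the changelog tail, while full snapshots are taken separately in the background.

```python
# Toy model of a changelog-style incremental checkpoint.
# Names are hypothetical; this is not Flink's implementation.

class ChangelogState:
    def __init__(self):
        self.table = {}        # live key-value state served to the operator
        self.changelog = []    # ordered state changes as (key, value)
        self.dfs_log = []      # stand-in for the changelog stored on DFS
        self.uploaded = 0      # number of changelog entries already on DFS
        self.snapshots = {0: {}}    # changelog offset -> full snapshot
        self.snapshot_upto = 0      # offset covered by the latest snapshot

    def put(self, key, value):
        self.table[key] = value
        self.changelog.append((key, value))

    def upload_pending(self):
        """Runs continuously in the background, independent of checkpoints."""
        pending = self.changelog[self.uploaded:]
        self.dfs_log.extend(pending)
        self.uploaded = len(self.changelog)
        return len(pending)

    def materialize(self):
        """Periodic full snapshot, decoupled from checkpointing."""
        self.snapshot_upto = len(self.changelog)
        self.snapshots[self.snapshot_upto] = dict(self.table)

    def checkpoint(self):
        """Flush only the not-yet-uploaded tail, then record offsets."""
        flushed = self.upload_pending()
        return {"snapshot_upto": self.snapshot_upto,
                "log_upto": self.uploaded,
                "flushed_now": flushed}

    def restore(self, handle):
        """Rebuild state from a snapshot plus the changelog suffix."""
        state = dict(self.snapshots[handle["snapshot_upto"]])
        for key, value in self.dfs_log[handle["snapshot_upto"]:handle["log_upto"]]:
            state[key] = value
        return state


state = ChangelogState()
for i in range(1000):
    state.put(f"k{i % 10}", i)
state.upload_pending()   # background upload has already shipped these changes
state.put("k0", -1)      # one more change arrives just before the checkpoint
handle = state.checkpoint()
print("entries flushed by the checkpoint:", handle["flushed_now"])
print("restore matches live state:", state.restore(handle) == state.table)
```

The work done at checkpoint time depends on how many changes arrived since the last upload, not on the total state size.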

Efficient Elastic Scaling for Cloud‑Native

Flink 1.18 adds an Adaptive Scheduler REST API, allowing K8s autoscalers to perform in‑place scaling without restarting JobManager or TaskManager. Parallel state file download reduces download time by ~30 %, and upcoming 1.19 Local Rescale will eliminate redundant downloads.
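As a hedged sketch of what an autoscaler might send, the snippet below builds a rescale request without sending it. It assumes the job runs with the Adaptive Scheduler and that the endpoint is `PUT /jobs/<job-id>/resource-requirements` with a per-vertex parallelism range, as described in the Flink 1.18 elastic scaling documentation. The host, job ID, and vertex ID are placeholders, and the helper name is hypothetical.

```python
# Builds (but does not send) a rescale request for the Adaptive Scheduler.
# Host, job ID, and vertex ID are placeholders; the endpoint path and body
# shape are assumed to match the Flink 1.18 REST API documentation.
import json
import urllib.request

def build_rescale_request(base_url, job_id, vertex_parallelism):
    """vertex_parallelism maps a job vertex ID to (lower_bound, upper_bound)."""
    body = {
        vertex_id: {"parallelism": {"lowerBound": lower, "upperBound": upper}}
        for vertex_id, (lower, upper) in vertex_parallelism.items()
    }
    return urllib.request.Request(
        url=f"{base_url}/jobs/{job_id}/resource-requirements",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

req = build_rescale_request("http://localhost:8081", "JOB_ID", {"VERTEX_ID": (1, 4)})
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# urllib.request.urlopen(req) would submit it to a running JobManager.
```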

State reconstruction during rescaling is also faster: Flink 1.18 disables the write‑ahead log (WAL) while rebuilding state, and 1.19 supports merging and trimming state at the file level, cutting rebuild time by 50‑80 %.
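To see why files must be merged and trimmed on rescale, recall that keyed state is partitioned into key groups and each subtask owns a contiguous range of them. The range formula below is modeled on Flink's key-group range assignment; the planning helper is a hypothetical illustration of which old state files a new subtask needs and which part of each it keeps.

```python
# Key-group arithmetic behind rescaling. The range formula is modeled on
# Flink's key-group range assignment; the planning helper is hypothetical.

def key_group_range(max_parallelism, parallelism, subtask_index):
    """Inclusive (start, end) key-group range owned by one subtask."""
    start = (subtask_index * max_parallelism + parallelism - 1) // parallelism
    end = ((subtask_index + 1) * max_parallelism - 1) // parallelism
    return start, end

def files_to_merge_and_trim(max_parallelism, old_parallelism, new_parallelism, new_index):
    """For one new subtask, list (old_subtask, kept_key_group_range) pairs."""
    new_start, new_end = key_group_range(max_parallelism, new_parallelism, new_index)
    plan = []
    for old_index in range(old_parallelism):
        old_start, old_end = key_group_range(max_parallelism, old_parallelism, old_index)
        lo, hi = max(new_start, old_start), min(new_end, old_end)
        if lo <= hi:  # ranges overlap: fetch this state, keep only [lo, hi]
            plan.append((old_index, (lo, hi)))
    return plan

# Scale in from 4 to 3 subtasks with 128 key groups.
for i in range(3):
    print(i, key_group_range(128, 3, i), files_to_merge_and_trim(128, 4, 3, i))
```

In this example every new subtask has to combine state from two old subtasks and discard the key groups it no longer owns, which is the work that file-level merging and trimming speeds up.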

Gemini: Layered State Storage for Streaming

Gemini, an Alibaba‑internal layered storage system, combines an in‑memory MemTable, a cache, local disk, and DFS. With DFS as a storage layer beneath local disk, remote files can be loaded lazily and file‑level merging and trimming can run asynchronously, reducing job downtime from 200 s to 20 s.
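The layered lookup with lazy loading can be pictured with a small sketch. This is not Gemini's code: all names are hypothetical, and dictionaries stand in for the storage tiers. A read falls through MemTable, cache, and local disk, and a remote file is copied to local disk only when a key in it is first requested.

```python
# Toy layered read path: memtable -> cache -> local disk -> DFS.
# All names are hypothetical; dictionaries stand in for real storage tiers.

class LayeredStore:
    def __init__(self, dfs_files):
        self.memtable = {}      # most recent writes
        self.cache = {}         # hot entries read from lower tiers
        self.local_disk = {}    # file name -> {key: value}, filled lazily
        self.dfs = dfs_files    # file name -> {key: value}, complete copy
        self.remote_reads = 0   # how many files were fetched from DFS

    def put(self, key, value):
        self.memtable[key] = value

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        if key in self.cache:
            return self.cache[key]
        for name, remote_file in self.dfs.items():
            # Membership test stands in for a file index / metadata lookup.
            if key not in remote_file:
                continue
            if name not in self.local_disk:     # lazy load on first touch
                self.local_disk[name] = dict(remote_file)
                self.remote_reads += 1
            value = self.local_disk[name][key]
            self.cache[key] = value
            return value
        return None


store = LayeredStore({"file-1": {"a": 1, "b": 2}, "file-2": {"c": 3}})
print(store.get("a"), store.get("b"), store.remote_reads)
print(sorted(store.local_disk))   # only the touched file was downloaded
```

A job restored this way can start processing before all of its state has been downloaded, which is where the reduction in downtime comes from.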

State Management Compute‑Storage Separation Architecture – Evolution and Challenges

Cloud‑Native Architecture Evolution

The current Gemini design uses local disk as primary storage and DFS as secondary storage, which complicates file management and scaling. A proposed redesign makes DFS the primary storage and local disk an optional cache. This speeds up and simplifies checkpoints, enables state sharing across tasks, and allows remote compaction and load balancing.

However, moving all state to remote storage would degrade performance due to higher access latency.

Flink's Existing State Access Thread Model

State access follows a single‑thread, single‑record model, which makes latency a critical factor. A read that misses the block cache blocks the task thread on disk I/O, and the penalty is especially large when the data lives on DFS. Shifting to non‑blocking I/O with a larger thread pool can improve throughput by up to 6×, though the processing order of records with the same key must be preserved.
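A minimal sketch of the ordering constraint follows. It swaps the thread pool for asyncio coroutines to stay short, and it is not Flink's design: records are routed to "lanes" by key, each lane processes its records sequentially, and different lanes overlap their simulated reads. The latency, lane count, and all names are illustrative assumptions.

```python
# Sketch: overlap slow state reads while keeping per-key order.
# Records with the same key always land in the same sequential lane.
import asyncio
import time

SIMULATED_READ_LATENCY_S = 0.01   # assumed remote read latency

async def lane_worker(queue, state, applied):
    while True:
        item = await queue.get()
        if item is None:          # sentinel: no more records for this lane
            return
        key, value = item
        await asyncio.sleep(SIMULATED_READ_LATENCY_S)   # stand-in for a remote read
        state[key] = state.get(key, 0) + value
        applied.setdefault(key, []).append(value)       # record processing order

async def process(records, lanes):
    queues = [asyncio.Queue() for _ in range(lanes)]
    state, applied = {}, {}
    workers = [asyncio.create_task(lane_worker(q, state, applied)) for q in queues]
    for key, value in records:
        queues[key % lanes].put_nowait((key, value))    # same key -> same lane
    for q in queues:
        q.put_nowait(None)
    await asyncio.gather(*workers)
    return state, applied

records = [(i % 8, i) for i in range(40)]
for lanes in (1, 8):
    started = time.perf_counter()
    state, applied = asyncio.run(process(records, lanes))
    print(f"lanes={lanes}: {time.perf_counter() - started:.2f}s")
```

With one lane every read is serialized; with eight lanes the reads for different keys overlap, while each key still sees its updates in arrival order.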

Summary

State access is vital for Flink's throughput; cloud‑native environments impose stricter storage requirements; community features like unaligned and incremental checkpoints are widely adopted; Gemini demonstrates a layered storage approach; and future work will focus on a true compute‑storage separation architecture that balances performance, elasticity, and resource efficiency.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
