Evolution of Flink State Storage and Compute‑Storage Separation Architecture
This article examines the evolution of Flink's state storage, discusses challenges posed by cloud‑native deployments, reviews recent community and Alibaba enhancements such as unaligned checkpoints, incremental snapshots, and the Gemini layered storage system, and proposes future directions for a compute‑storage separation architecture.
Flink's state storage is a core component of stream processing. Since the seminal VLDB paper in 2017, the overall architecture has remained stable, but deployment models, storage media, and job workloads have dramatically changed, prompting a re‑evaluation of state management.
Deployment has moved from the early MapReduce era, which had no resource isolation, to the cloud-native era of Kubernetes containers. Over the same period, hardware advances (e.g., 25 Gbps network bandwidth) have shifted storage from local disks to distributed and cloud storage, which offers larger capacity and lower cost at the expense of higher latency.
Job workloads have also grown: what was once a few hundred megabytes of state is now often measured in terabytes, especially in logistics and other data‑intensive scenarios. These trends motivate the three‑part discussion of Flink's compute‑storage separation evolution.
Why state is crucial for Flink performance: state access lies on the critical path of every record.
Recent community and Alibaba efforts on state storage, including unaligned checkpoints, incremental snapshots, and the Gemini layered storage system.
Future directions for a compute‑storage separation architecture.
Why State Is So Important for Flink
In Flink streaming, operators that need to reference earlier inputs store intermediate data in a State Table. The StateBackend provides read/write services; when state exceeds memory, it is spilled to local disk and periodically snapshotted to distributed file systems (OSS/HDFS/S3). Consistent global checkpoints are required across all tasks.
State Management Requirements and Existing Problems
State access is single-threaded and handles one key-value pair at a time, so throughput is highly sensitive to access latency. Local disk access can be microsecond-level, but remote DFS access is far slower and becomes a significant bottleneck, especially for large state sizes. Checkpointing must also remain lightweight so that it does not disturb normal processing.
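The latency sensitivity can be made concrete with a little arithmetic. The sketch below assumes a record needs two blocking state accesses (one read, one write) and uses illustrative latencies of 10 microseconds for local disk and 1 millisecond for remote DFS; these numbers are assumptions chosen for the example, not measurements from the article.

```python
def max_records_per_second(accesses_per_record: int, access_latency_s: float) -> float:
    """Upper bound on throughput for one task thread when every state access blocks."""
    return 1.0 / (accesses_per_record * access_latency_s)


# Illustrative latencies only (assumptions, not measured values).
ASSUMED_LATENCIES_S = {
    "local disk": 10e-6,
    "remote DFS": 1e-3,
}

for medium, latency in ASSUMED_LATENCIES_S.items():
    bound = max_records_per_second(accesses_per_record=2, access_latency_s=latency)
    print(f"{medium}: at most {bound:,.0f} records/s per task thread")
```

Under these assumptions the bound drops from roughly 50,000 to 500 records per second per thread, which is why a blocking access model copes poorly with remote storage.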
Cloud‑native constraints add further challenges: limited local disk size in containers, difficulty remounting disks at runtime, and the need for smooth elasticity. Existing solutions like Queryable State are deprecated due to performance impact.
State Storage Improvements – Community and Commercial Efforts
Distributed Snapshot Architecture Upgrade
Over the six releases since Flink 1.11, the community has introduced Unaligned Checkpoints, Buffer Debloating, and generic incremental checkpoints. Together these features reduce checkpoint latency under backpressure by up to 90 % while keeping the impact on checkpoint size and recovery time negligible.
Generic incremental checkpoints decouple checkpointing from state snapshotting: state changes are written to a changelog, and a checkpoint only needs to upload changelog data to DFS. Checkpoints can therefore complete within ~10 s regardless of state size, and CPU/network spikes are reduced by ~70 %.
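The decoupling can be illustrated with a toy model. This is a minimal sketch and not Flink's implementation: the class `ChangelogBackend` and its methods are hypothetical names, the in-memory list stands in for a changelog on DFS, and full snapshots are kept in a dictionary keyed by log offset.

```python
from dataclasses import dataclass, field


@dataclass
class ChangelogBackend:
    """Toy model: every write is appended to a durable changelog."""
    table: dict = field(default_factory=dict)                    # live state
    log: list = field(default_factory=list)                      # stand-in for the changelog on DFS
    snapshots: dict = field(default_factory=lambda: {0: {}})     # log offset -> full snapshot
    base_offset: int = 0                                         # offset of the latest full snapshot

    def put(self, key, value):
        self.table[key] = value
        self.log.append((key, value))  # assumed to be uploaded in the background

    def checkpoint(self):
        # Only metadata is recorded, so the cost does not grow with state size.
        return (self.base_offset, len(self.log))

    def materialize(self):
        # Runs on its own schedule, independent of checkpoints.
        self.base_offset = len(self.log)
        self.snapshots[self.base_offset] = dict(self.table)

    def restore(self, handle):
        # Rebuild state from the base snapshot plus the changelog tail.
        base, end = handle
        state = dict(self.snapshots[base])
        for key, value in self.log[base:end]:
            state[key] = value
        return state


backend = ChangelogBackend()
backend.put("a", 1)
backend.put("b", 2)
first = backend.checkpoint()
backend.materialize()
backend.put("a", 3)
second = backend.checkpoint()
print(backend.restore(first), backend.restore(second))
```

The point of the sketch is that `checkpoint()` never copies the state table; the expensive full snapshot happens in `materialize()`, which can be scheduled to smooth out CPU and network usage.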
Efficient Elastic Scaling for Cloud‑Native
Flink 1.18 adds an Adaptive Scheduler REST API, allowing K8s autoscalers to perform in‑place scaling without restarting JobManager or TaskManager. Parallel state file download reduces download time by ~30 %, and upcoming 1.19 Local Rescale will eliminate redundant downloads.
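As a rough illustration of how an autoscaler might drive in-place scaling, the sketch below builds (but does not send) a rescale request. The endpoint path `/jobs/<job-id>/resource-requirements` and the `lowerBound`/`upperBound` body reflect my understanding of the Flink 1.18 documentation and should be verified against your Flink version; the REST address, job ID, and vertex ID are placeholders, and `build_rescale_request` is a hypothetical helper name.

```python
import json
import urllib.request


def build_rescale_request(rest_url, job_id, parallelism_by_vertex):
    """Build a PUT request asking the adaptive scheduler for new parallelism bounds.

    parallelism_by_vertex maps a job vertex ID to a (lower, upper) tuple.
    The request is only constructed here, never sent.
    """
    body = {
        vertex_id: {"parallelism": {"lowerBound": lower, "upperBound": upper}}
        for vertex_id, (lower, upper) in parallelism_by_vertex.items()
    }
    return urllib.request.Request(
        url=f"{rest_url}/jobs/{job_id}/resource-requirements",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Placeholder values for illustration only.
request = build_rescale_request(
    "http://localhost:8081", "<job-id>", {"<vertex-id>": (1, 4)}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request (for example with `urllib.request.urlopen`) is left out on purpose so the sketch can be inspected without a running cluster.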
State reconstruction has also been improved: disabling the write-ahead log (WAL) in 1.18 and supporting file-level merge-cut in 1.19 cut rebuild time by 50-80 %.
Gemini: Layered State Storage for Streaming
Gemini, an Alibaba‑internal layered storage system, combines an in‑memory MemTable, a cache, local disk, and DFS. It treats DFS as the source of truth, using local disk as optional cache, enabling lazy loading of remote files and asynchronous file‑level merge‑cut, reducing job stop time from 200 s to 20 s.
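The layered read path described above can be sketched as a toy model. This is not Gemini's code: `LayeredStore` is a hypothetical class, whole files are modeled as dictionaries, the local cache uses simple first-in-first-out eviction, and a real system would use indexes or filters rather than scanning every file.

```python
class LayeredStore:
    """Toy layered lookup: memtable first, then files fetched lazily from DFS."""

    def __init__(self, dfs_files, cache_capacity=2):
        self.memtable = {}               # newest writes, in memory
        self.local_cache = {}            # file name -> contents, filled on demand
        self.dfs_files = dfs_files       # source of truth: file name -> contents, oldest first
        self.cache_capacity = cache_capacity
        self.remote_reads = 0            # counts simulated DFS fetches

    def put(self, key, value):
        self.memtable[key] = value

    def _load_file(self, name):
        # Lazy loading: a file is fetched from DFS only when a read needs it.
        if name not in self.local_cache:
            self.remote_reads += 1
            if len(self.local_cache) >= self.cache_capacity:
                oldest = next(iter(self.local_cache))   # FIFO eviction
                del self.local_cache[oldest]
            self.local_cache[name] = self.dfs_files[name]
        return self.local_cache[name]

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newer files before older ones so the latest value wins.
        for name in reversed(list(self.dfs_files)):
            contents = self._load_file(name)
            if key in contents:
                return contents[key]
        return None


store = LayeredStore({"file-1": {"a": 1, "b": 2}, "file-2": {"a": 10}})
print(store.get("a"), store.remote_reads)
print(store.get("b"), store.remote_reads)
```

Because nothing is downloaded before the first read, a restarted or rescaled task in this model can begin serving requests immediately and pay the remote fetch cost only for files it actually touches.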
State Management Compute‑Storage Separation Architecture – Evolution and Challenges
Cloud‑Native Architecture Evolution
The current Gemini design uses local disk as primary storage and DFS as secondary storage, which complicates file management and scaling. A proposed redesign makes DFS the primary storage and local disk an optional cache. This simplifies and speeds up checkpointing, enables state sharing across tasks, and allows remote compaction and load balancing.
However, moving all state to remote storage would degrade performance due to higher access latency.
Flink's Existing State Access Thread Model
State access follows a single-thread, single-record model, which makes latency a critical factor. A read that misses the block cache blocks the task thread on disk I/O, and the penalty is especially severe when the data has to be fetched from DFS. Shifting to non-blocking I/O with a larger thread pool can improve throughput by up to 6×, provided that ordering guarantees for records with the same key are preserved.
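The ordering constraint can be illustrated with a small asynchronous sketch. It is not Flink's threading model: it uses Python's `asyncio` instead of a thread pool, `process_stream` and `fake_remote_read` are hypothetical names, and the 10 ms sleep is an assumed stand-in for a remote state read. Reads for different keys overlap, while a per-key lock keeps records of the same key in arrival order.

```python
import asyncio
from collections import defaultdict


async def fake_remote_read(key):
    """Stand-in for a slow remote state read (assumed 10 ms latency)."""
    await asyncio.sleep(0.01)
    return f"state-of-{key}"


async def process_stream(records, read_state, max_in_flight=8):
    """Overlap state reads across keys while preserving per-key order."""
    key_locks = defaultdict(asyncio.Lock)        # serializes records that share a key
    limiter = asyncio.Semaphore(max_in_flight)   # caps concurrent state reads
    completed = []

    async def handle(key, value):
        # asyncio.Lock wakes waiters in FIFO order, so same-key records
        # finish in the order in which they were submitted.
        async with key_locks[key]:
            async with limiter:
                state = await read_state(key)
                completed.append((key, value, state))

    await asyncio.gather(*(handle(k, v) for k, v in records))
    return completed


records = [("a", 1), ("b", 1), ("a", 2), ("c", 1), ("b", 2)]
for key, value, state in asyncio.run(process_stream(records, fake_remote_read)):
    print(key, value, state)
```

Records for keys "a", "b", and "c" wait on their reads concurrently, so total time is governed by the longest per-key chain rather than by the number of records.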
Summary
State access is vital for Flink's throughput; cloud‑native environments impose stricter storage requirements; community features like unaligned and incremental checkpoints are widely adopted; Gemini demonstrates a layered storage approach; and future work will focus on a true compute‑storage separation architecture that balances performance, elasticity, and resource efficiency.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
