Fine‑Grained Resource Management in Apache Flink: Scenarios, Mechanism, Efficiency, Allocation Strategies, and Limitations
This article explains Apache Flink's fine-grained resource management. It covers typical use cases, the slot-based mechanism, how the feature improves resource efficiency, the default allocation strategy, and current limitations, and it includes example code for configuring slot sharing groups.
Apache Flink strives to automatically derive sensible default resource requirements for all applications out of the box. Users who want to fine-tune resource consumption for specific scenarios can do so through fine-grained resource management.
1. Typical Scenarios That May Benefit from Fine‑Grained Resource Management
1) Tasks have significantly different parallelism.
2) The entire pipeline requires more resources than fit into a single slot/TaskManager.
3) Batch jobs in which different stages have markedly different resource needs.
2. How It Works
As described in the Flink architecture documentation, the task execution resources of a TaskManager are split into multiple slots. The slot is the basic unit of both resource scheduling and resource requirements in Flink's runtime.
With fine‑grained management, a slot request can include a user‑specified resource profile. Flink respects the requested profile and dynamically allocates a fully matching slot from the TaskManager's available resources. For example, a slot requiring 0.25 CPU cores and 1 GB memory is allocated as slot 1.
Previously, resource demands only specified the number of slots without a detailed profile (coarse‑grained management). The TaskManager had a fixed number of identical slots to satisfy those demands.
If no resource profile is specified, Flink automatically determines one based on the TaskManager's total resources and the configured number of task slots, similar to coarse‑grained management. In the example, a TaskManager with 1 CPU core and 4 GB memory and 2 slots creates slot 2 with 0.5 CPU and 2 GB memory for unspecified requests.
After allocating slot 1 and slot 2, the TaskManager retains 0.25 CPU core and 1 GB memory as free resources, which can be further partitioned to satisfy additional demands.
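The arithmetic in this example can be replayed with a small sketch. The `Resources` class below is a hypothetical stand-in written for illustration, not a Flink class, and it assumes that an unspecified request simply receives the TaskManager's total resources divided by the configured slot count, as described above.

```java
// Illustrative model only: these classes are hypothetical and not part of Flink's API.
final class SlotCarvingExample {

    static final class Resources {
        final double cpuCores;
        final int memoryMB;

        Resources(double cpuCores, int memoryMB) {
            this.cpuCores = cpuCores;
            this.memoryMB = memoryMB;
        }

        // What is left after carving `other` out of this budget.
        Resources minus(Resources other) {
            return new Resources(cpuCores - other.cpuCores, memoryMB - other.memoryMB);
        }

        // Default profile for a request without a spec: total resources / configured slots.
        Resources dividedBy(int slots) {
            return new Resources(cpuCores / slots, memoryMB / slots);
        }

        @Override
        public String toString() {
            return cpuCores + " CPU, " + memoryMB + " MB";
        }
    }

    public static void main(String[] args) {
        Resources total = new Resources(1.0, 4096);   // TaskManager: 1 core, 4 GB
        int configuredSlots = 2;

        Resources slot1 = new Resources(0.25, 1024);          // user-specified profile
        Resources slot2 = total.dividedBy(configuredSlots);   // request without a profile
        Resources free = total.minus(slot1).minus(slot2);     // still available for new slots

        System.out.println("slot 1: " + slot1);   // slot 1: 0.25 CPU, 1024 MB
        System.out.println("slot 2: " + slot2);   // slot 2: 0.5 CPU, 2048 MB
        System.out.println("free:   " + free);    // free:   0.25 CPU, 1024 MB
    }
}
```

The remainder matches the 0.25 CPU core and 1 GB of free resources stated in the text.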
For more details, see Section 4, Resource Allocation Strategy.
3. How It Improves Resource Efficiency
Previously, Flink used coarse-grained management: tasks were deployed into predefined, usually identical slots, with no notion of how many resources each slot contained. For many jobs this is sufficient, and when all tasks share one slot their load peaks can even offset one another (peak shaving). It can, however, lead to under-utilization when tasks have varying parallelism or resource needs.
Fine-grained management allows slots with different resource specifications, so each slot can be sized to the tasks it hosts. The gain is largest when tasks differ in parallelism or when expensive resources such as GPUs are needed by only some of the tasks.
4. Resource Allocation Strategy
This section discusses Flink's slot partitioning mechanism and allocation strategy, including how Flink selects a TaskManager to carve out a slot and how it allocates TaskManagers on native Kubernetes or YARN. The strategy is pluggable; the default implementation is described here.
When a slot request arrives (e.g., 0.25 CPU and 1 GB memory), Flink scans registered TaskManagers and picks the first one with sufficient free resources, creates a new slot with the requested profile, and returns resources to the TaskManager when the slot is released.
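A minimal sketch of this first-fit behaviour follows. The `TaskManager` class and the `allocateSlot` method are hypothetical names chosen for illustration and only model CPU and memory; the actual default strategy inside Flink is more involved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: these classes are hypothetical and not part of Flink's API.
final class FirstFitSketch {

    static final class TaskManager {
        final String id;
        double freeCpu;
        int freeMemoryMB;

        TaskManager(String id, double freeCpu, int freeMemoryMB) {
            this.id = id;
            this.freeCpu = freeCpu;
            this.freeMemoryMB = freeMemoryMB;
        }

        boolean canFit(double cpu, int memoryMB) {
            return cpu <= freeCpu && memoryMB <= freeMemoryMB;
        }

        // Called when a slot is released: its resources become free again.
        void release(double cpu, int memoryMB) {
            freeCpu += cpu;
            freeMemoryMB += memoryMB;
        }
    }

    // Scans TaskManagers in registration order and carves the slot out of the first fit.
    // Returns null when nothing fits; starting a new TaskManager is not modelled here.
    static TaskManager allocateSlot(List<TaskManager> registered, double cpu, int memoryMB) {
        for (TaskManager tm : registered) {
            if (tm.canFit(cpu, memoryMB)) {
                tm.freeCpu -= cpu;
                tm.freeMemoryMB -= memoryMB;
                return tm;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<TaskManager> registered = new ArrayList<>();
        registered.add(new TaskManager("tm-1", 0.125, 512));   // almost fully used
        registered.add(new TaskManager("tm-2", 1.0, 4096));    // empty

        TaskManager chosen = allocateSlot(registered, 0.25, 1024);
        System.out.println(chosen.id + " free: " + chosen.freeCpu + " CPU, " + chosen.freeMemoryMB + " MB");
        // tm-2 free: 0.75 CPU, 3072 MB

        chosen.release(0.25, 1024);
        System.out.println(chosen.id + " free: " + chosen.freeCpu + " CPU, " + chosen.freeMemoryMB + " MB");
        // tm-2 free: 1.0 CPU, 4096 MB
    }
}
```

Because the scan stops at the first TaskManager that fits, the outcome depends on registration order rather than on which TaskManager would be left with the least waste.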
The current strategy may lead to resource fragmentation. For example, suppose two slot requests each need 3 GB of heap memory while a TaskManager has only 4 GB of heap in total. Flink then starts two TaskManagers, each of which leaves 1 GB unused. Future strategies may allocate heterogeneous TaskManagers, sized to the job's slot requests, to reduce fragmentation.
Users must ensure that the total resources configured for a slot sharing group do not exceed the TaskManager's total resources, otherwise jobs will fail unexpectedly.
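One way to catch this before submitting a job is a simple component-wise comparison. The sketch below is an assumption about how such a check could look: the `Spec` class and `fitsInto` method are hypothetical, and only three resource components are modelled.

```java
// Illustrative pre-submission check: hypothetical classes, not part of Flink's API.
final class SlotSharingGroupCheck {

    static final class Spec {
        final double cpuCores;
        final int taskHeapMB;
        final int managedMB;

        Spec(double cpuCores, int taskHeapMB, int managedMB) {
            this.cpuCores = cpuCores;
            this.taskHeapMB = taskHeapMB;
            this.managedMB = managedMB;
        }
    }

    // A group can only be scheduled if every component fits into one TaskManager.
    static boolean fitsInto(Spec group, Spec taskManager) {
        return group.cpuCores <= taskManager.cpuCores
                && group.taskHeapMB <= taskManager.taskHeapMB
                && group.managedMB <= taskManager.managedMB;
    }

    public static void main(String[] args) {
        Spec taskManager = new Spec(1.0, 4096, 1024);
        Spec smallGroup = new Spec(0.5, 100, 0);
        Spec oversizedGroup = new Spec(2.0, 100, 0);   // asks for more CPU than the TaskManager has

        System.out.println(fitsInto(smallGroup, taskManager));       // true
        System.out.println(fitsInto(oversizedGroup, taskManager));   // false
    }
}
```

Note that a single oversized component is enough to make the group unschedulable, even if all other components fit.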
5. Limitations
Fine‑grained resource management is an experimental feature, and not all functionalities of the default scheduler are supported yet. Current limitations include:
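No support for elastic scaling. Elastic scaling currently supports only slot requests without a specified resource profile.
The TaskManager redundancy configuration (slotmanager.redundant-taskmanager-num) is ignored.
The evenly spread out slot strategy is not supported; cluster.evenly-spread-out-slots has no effect.
Web UI integration is limited. Slots can have different resource specs, but the UI shows only the slot number, not the resource details.
Integration with batch jobs is limited. Batch workloads must run with all edges of type BLOCKING, which is configured by setting fine-grained.shuffle-mode.all-blocking to true. This may affect performance.
Mixed resource requirements are discouraged. If only part of a job specifies resources, the unspecified slot requests can be fulfilled by slots of any size, so the resources they actually receive may differ across executions or failovers.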
Slot allocation may not be optimal because the problem is multi‑dimensional bin packing (NP‑hard).
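The last point can be illustrated with a small two-dimensional example. The sketch below is a simplified first-fit model written for this article, not Flink's implementation: the method name `taskManagersNeeded` is hypothetical, a new TaskManager is assumed to be started whenever nothing fits, and every request is assumed to fit into an empty TaskManager.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only: hypothetical code, not part of Flink's API.
final class PackingOrderSketch {

    // Each request is {cpuCores, memoryMB}. Requests are placed first-fit in arrival order.
    static int taskManagersNeeded(double[][] requests, double cpuCapacity, double memCapacityMB) {
        List<double[]> free = new ArrayList<>();   // {freeCpu, freeMemMB} per TaskManager
        for (double[] request : requests) {
            double[] target = null;
            for (double[] tm : free) {
                if (tm[0] >= request[0] && tm[1] >= request[1]) {
                    target = tm;
                    break;
                }
            }
            if (target == null) {                  // nothing fits: start a new TaskManager
                target = new double[] {cpuCapacity, memCapacityMB};
                free.add(target);
            }
            target[0] -= request[0];
            target[1] -= request[1];
        }
        return free.size();
    }

    public static void main(String[] args) {
        // Same four requests, two different arrival orders, TaskManagers of 1 CPU / 4 GB.
        double[][] arrival = {{0.5, 1024}, {0.5, 1024}, {0.5, 3072}, {0.5, 3072}};
        double[][] reordered = {{0.5, 1024}, {0.5, 3072}, {0.5, 1024}, {0.5, 3072}};

        System.out.println(taskManagersNeeded(arrival, 1.0, 4096));     // 3
        System.out.println(taskManagersNeeded(reordered, 1.0, 4096));   // 2
    }
}
```

In the first order, the two small requests exhaust the CPU of the first TaskManager while leaving 2 GB of memory stranded, so each large request needs its own TaskManager. Pairing one small with one large request fills both dimensions exactly, but a greedy strategy cannot know that in advance.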
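6. Configuration and Example
Fine-grained resource management is enabled by setting the following configuration key to true:
    cluster.fine-grained-resource-management.enabled
The snippet below defines two slot sharing groups with explicit resources and assigns them to operators. Here someStream and the elided (...) arguments stand for the job's own stream and functions.
    import org.apache.flink.api.common.operators.SlotSharingGroup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    final StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

    // Group "a": 1 CPU core and 100 MB of task heap memory.
    SlotSharingGroup ssgA = SlotSharingGroup.newBuilder("a")
        .setCpuCores(1.0)
        .setTaskHeapMemoryMB(100)
        .build();

    // Group "b": 0.5 CPU cores and 100 MB of task heap memory.
    SlotSharingGroup ssgB = SlotSharingGroup.newBuilder("b")
        .setCpuCores(0.5)
        .setTaskHeapMemoryMB(100)
        .build();

    someStream.filter(...).slotSharingGroup("a")  // Set the slot sharing group by name "a".
        .map(...).slotSharingGroup(ssgB);         // Directly set the slot sharing group with name and resource.

    env.registerSlotSharingGroup(ssgA);           // Then register the resource of group "a".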
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
