Mastering Kubernetes Cluster Autoscaler: Real‑World Challenges & Solutions
This article explores how Volcano Engine's VKE uses the Kubernetes Cluster Autoscaler to achieve elastic scaling. It covers the component's core functions, a customer's high‑throughput workload, the four major scaling problems encountered along the way, and practical recommendations for improving performance, reliability, and cost efficiency.
What is Cluster Autoscaler (CA)?
Cluster Autoscaler automatically adjusts the size of a Kubernetes cluster by adding nodes when pending Pods cannot be scheduled and removing under‑utilized nodes when their usage falls below a threshold, thereby reducing costs and improving efficiency.
CA Workflow
Gather cluster data (nodes, pod status, pending Pods, failed nodes, etc.).
Execute scaling‑out logic.
Execute scaling‑in logic.
Finish the cycle.
Wait for a configured interval and repeat.
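The steps above amount to a periodic reconcile loop. The minimal Go sketch below illustrates the shape of that loop; gatherSnapshot, runScaleUp, and runScaleDown are illustrative names rather than the actual cluster-autoscaler APIs, and the 10-second interval mirrors CA's default scan interval.

```go
package main

import "time"

// Snapshot is an illustrative stand-in for the cluster state CA gathers
// each iteration: nodes, scheduled Pods, pending Pods, unready nodes.
type Snapshot struct {
	PendingPods int
	IdleNodes   []string
}

func gatherSnapshot() Snapshot {
	// In the real component: list nodes and Pods via the API server.
	return Snapshot{}
}

func runScaleUp(s Snapshot) {
	if s.PendingPods > 0 {
		// Simulate scheduling onto candidate node-pool templates,
		// then request new nodes from the cloud provider.
	}
}

func runScaleDown(s Snapshot) {
	for range s.IdleNodes {
		// Drain and remove nodes whose utilization stayed below the
		// threshold for the configured duration.
	}
}

func main() {
	const scanInterval = 10 * time.Second // CA's default scan interval
	for {
		s := gatherSnapshot()    // 1. gather cluster data
		runScaleUp(s)            // 2. scaling-out logic
		runScaleDown(s)          // 3. scaling-in logic
		time.Sleep(scanInterval) // 4-5. finish the cycle, wait, repeat
	}
}
```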
Customer Scenario
Multiple task types with varying CPU/GPU requirements.
Peak load exceeds 20,000 Pods.
Heavy workload runs overnight.
Long‑running jobs with variable durations.
Large container images cause long pull times.
The customer used Cluster Autoscaler to automatically add nodes for pending Pods and to delete idle nodes after jobs completed. They also aligned Pod resource requests with node specifications to improve packing efficiency.
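To see why that alignment matters, consider a rough capacity calculation: a node's Pod capacity is bounded by allocatable resources divided by per‑Pod requests. The node and Pod sizes in this Go sketch are hypothetical.

```go
package main

import "fmt"

// podsPerNode returns how many Pods with the given CPU/memory requests
// fit on a node, ignoring DaemonSets and system reservations for simplicity.
func podsPerNode(nodeCPUm, nodeMemMi, podCPUm, podMemMi int) int {
	byCPU := nodeCPUm / podCPUm
	byMem := nodeMemMi / podMemMi
	if byCPU < byMem {
		return byCPU
	}
	return byMem
}

func main() {
	// Hypothetical 16-core / 64 GiB node with 3.5-core / 7 GiB Pod requests:
	// CPU allows 4 Pods, memory allows 9, so about 2 cores are stranded.
	fmt.Println(podsPerNode(16000, 65536, 3500, 7168)) // 4
	// Rounding requests to 4 cores / 16 GiB packs the node exactly.
	fmt.Println(podsPerNode(16000, 65536, 4000, 16384)) // 4
}
```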
Problems and Solutions
Problem 1: Low Scaling Success Rate
During large‑scale expansions, many nodes failed to initialize because cloud‑disk write throughput was too low, causing timeouts and repeated retry loops.
Solution: Limit the number of nodes created concurrently (e.g., 100 at a time) to smooth disk I/O and improve overall success rate.
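A minimal Go sketch of this throttling idea follows; createNodesBatched and the createNode callback are hypothetical stand‑ins for the actual cloud‑provider call. A counting semaphore keeps at most the configured number of creations in flight at once.

```go
package main

import (
	"fmt"
	"sync"
)

// createNodesBatched requests `total` nodes but keeps at most `maxInFlight`
// creations running concurrently, smoothing cloud-disk and API pressure.
func createNodesBatched(total, maxInFlight int, createNode func(i int) error) {
	sem := make(chan struct{}, maxInFlight) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < total; i++ {
		sem <- struct{}{} // blocks once maxInFlight creations are running
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := createNode(i); err != nil {
				// In practice: retry with backoff instead of giving up.
				fmt.Println("node", i, "failed:", err)
			}
		}(i)
	}
	wg.Wait()
}

func main() {
	// E.g. 500 nodes requested, but only 100 created at a time.
	createNodesBatched(500, 100, func(i int) error { return nil })
}
```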
Problem 2: Large Container Images Slow Scaling
Huge images and high‑frequency pulls saturated network and disk bandwidth, extending node‑initialization time.
Solution: Use a custom system image that pre‑installs required container images, eliminating the need for post‑creation pulls and reducing end‑to‑end scaling time from >22 minutes to under 5 minutes for 500 nodes.
Problem 3: Multi‑Pool Interference Delays Scale‑In
A global scale‑in cooldown timer, reset whenever any pool expanded, prevented idle pools from scaling in promptly.
Solution: Move the cooldown timer to a per‑pool level so each pool’s scale‑in decision is independent, dramatically cutting unnecessary resource usage.
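To illustrate the fix, here is a minimal Go sketch of per‑pool cooldown bookkeeping: a map of last scale‑up times replaces the single shared timestamp. The type and method names are hypothetical, and the 10‑minute value mirrors CA's default scale‑down delay after a scale‑up.

```go
package main

import (
	"sync"
	"time"
)

// perPoolCooldown tracks the last scale-up per node pool, so one pool's
// expansion no longer delays scale-in decisions for every other pool.
type perPoolCooldown struct {
	mu       sync.Mutex
	lastUp   map[string]time.Time
	cooldown time.Duration
}

func (c *perPoolCooldown) RecordScaleUp(pool string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.lastUp[pool] = time.Now()
}

// CanScaleDown consults only the given pool's timer, not a global one.
func (c *perPoolCooldown) CanScaleDown(pool string) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Since(c.lastUp[pool]) >= c.cooldown
}

func main() {
	c := &perPoolCooldown{lastUp: map[string]time.Time{}, cooldown: 10 * time.Minute}
	c.RecordScaleUp("gpu-pool")
	_ = c.CanScaleDown("cpu-pool") // true: unaffected by gpu-pool's expansion
}
```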
Problem 4: Excessive Pending Pods Block Scaling
When pending Pods surged to roughly 18,000, CA spent excessive time in the scheduling‑prediction phase, especially with node‑affinity rules, delaying scaling actions.
Findings: pending Pod count, node‑affinity usage, estimated node count, and number of pools all increase prediction latency, which in the worst case grows on the order of O(n³).
Community‑proposed mitigations: limit the maximum number of nodes considered in calculations and cap per‑pool computation time (e.g., 10 seconds).
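Recent upstream cluster‑autoscaler releases expose options along these lines (for example, --max-nodes-per-scaleup and --max-nodegroup-binpacking-duration, the latter defaulting to 10 seconds). The Go sketch below shows the underlying idea, a first‑fit bin‑packing estimate that stops early at a node cap or time budget; estimateNodes and its types are illustrative, not the real estimator code.

```go
package main

import "time"

type podRequest struct{ cpu int } // millicores; memory omitted for brevity
type nodeShape struct{ cpu int }

// estimateNodes runs a simplified bin-packing simulation for one pool but
// stops once the per-pool time budget or node cap is hit, trading a
// slightly conservative estimate for bounded loop latency.
func estimateNodes(pending []podRequest, node nodeShape, maxNodes int, budget time.Duration) int {
	deadline := time.Now().Add(budget)
	nodes, freeCPU := 0, 0
	for _, p := range pending {
		if time.Now().After(deadline) || nodes >= maxNodes {
			break // cap reached: handle the rest in a later iteration
		}
		if p.cpu > freeCPU { // open a new simulated node (first-fit)
			nodes++
			freeCPU = node.cpu
		}
		freeCPU -= p.cpu
	}
	return nodes
}

func main() {
	// 18,000 pending 0.5-core Pods against 16-core nodes, with a cap of
	// 1,000 nodes and a 10-second computation budget per pool.
	pods := make([]podRequest, 18000)
	for i := range pods {
		pods[i] = podRequest{cpu: 500}
	}
	_ = estimateNodes(pods, nodeShape{cpu: 16000}, 1000, 10*time.Second)
}
```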
Resource Elasticity Recommendations
Package large static container images into the cloud‑server system image for faster node provisioning.
Control the number of pending Pods to keep scaling stable and avoid overwhelming CA.
Disable autoscaling for pools that do not require elasticity to reduce unnecessary computation.
Volcano Engine Developer Services
The Volcano Engine Developer Community (Volcano Engine's TOD community) connects the platform with developers, offering cutting‑edge technical content and diverse events, nurturing a vibrant developer culture, and co‑building an open‑source ecosystem.