How Airbnb Dynamically Scales Kubernetes Clusters with Custom Autoscaler
Airbnb migrated its services to Kubernetes and, over four years, evolved from manually scaled homogeneous clusters to heterogeneous clusters with automated scaling. Along the way it introduced a custom gRPC expander for the Cluster Autoscaler that supports weighted-priority node group selection and plug-in extensibility, reducing costs and operational overhead.
Airbnb migrated almost all online services to Kubernetes, operating hundreds of clusters with thousands of nodes, and needed dynamic scaling to handle large traffic fluctuations.
Airbnb's Kubernetes clusters
The evolution occurred in three stages:
Stage 1: Homogeneous clusters with manual scaling.
Stage 2: Multiple cluster types with independent scaling.
Stage 3: Heterogeneous clusters with automated scaling.
Stage 1: Homogeneous clusters, manual scaling
Initially each service ran on dedicated machines; capacity was manually allocated and rarely reduced.
Stage 2: Multiple cluster types, independent scaling
Different workloads required distinct configurations, so Airbnb abstracted them into cluster types and introduced the Kubernetes Cluster Autoscaler. The Autoscaler adds nodes when pods are pending and removes underutilized nodes, which saved about 5% of cloud costs.
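The add-and-remove behavior described above can be sketched as a single evaluation pass. This is a simplified model rather than the Cluster Autoscaler's actual algorithm: the `Node` type, the `plan_scaling` function, the fixed `node_cpu` size, and the `utilization_threshold` parameter are all hypothetical names chosen for illustration, and the real Autoscaler simulates scheduling instead of summing CPU requests.

```python
import math
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cpu_capacity: float   # cores available on the node
    cpu_requested: float  # cores requested by pods already scheduled there


def plan_scaling(pending_pod_cpu, nodes, node_cpu=4.0, utilization_threshold=0.5):
    """Return (nodes_to_add, removable_node_names) for one evaluation pass.

    Assumption: every new node offers `node_cpu` cores and CPU is the only
    resource considered.
    """
    # Scale up: add enough nodes to cover the CPU requested by pending pods.
    total_pending = sum(pending_pod_cpu)
    nodes_to_add = math.ceil(total_pending / node_cpu) if total_pending > 0 else 0

    # Scale down: only consider removals when nothing is waiting to schedule.
    removable = []
    if nodes_to_add == 0:
        removable = [
            n.name
            for n in nodes
            if n.cpu_requested / n.cpu_capacity < utilization_threshold
        ]
    return nodes_to_add, removable


if __name__ == "__main__":
    print(plan_scaling([1.5, 3.0], []))
    print(plan_scaling([], [Node("a", 4.0, 1.0), Node("b", 4.0, 3.0)]))
```

In this sketch a node is only a removal candidate; a real autoscaler would also check that the pods on it can be rescheduled elsewhere before draining it.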
Stage 3: Heterogeneous clusters, automated scaling
With over 30 cluster types and 100 clusters, management became cumbersome; consolidating into heterogeneous clusters under a single control plane reduced testing overhead and improved utilization, enabling more sophisticated scaling strategies.
Cluster Autoscaler improvements
Custom gRPC expander
Before scaling up, the Cluster Autoscaler simulates scheduling the pending pods against each node group and filters out the groups that cannot fit them; its Expander component then chooses which of the remaining node groups to scale. The default random expander was insufficient for Airbnb's cost and instance-type requirements, so they implemented a priority expander and later a weighted-priority expander.
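One way to read "weighted priority" is: walk priority tiers from highest to lowest, and within the first tier that has a viable node group, pick one at random in proportion to its weight. The sketch below follows that reading. It is an assumption about the semantics, not Airbnb's implementation, and the `Option` type, `pick_weighted_priority` function, and tier format are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Option:
    node_group_id: str
    node_count: int


def pick_weighted_priority(options, tiers, rng=random):
    """Pick one scale-up option.

    `options` are the node groups that survived the scheduling simulation.
    `tiers` is a list of {node_group_id: weight} dicts, highest priority first.
    """
    by_id = {o.node_group_id: o for o in options}
    for tier in tiers:
        # Keep only groups in this tier that are viable and have a positive weight.
        candidates = [(by_id[g], w) for g, w in tier.items() if g in by_id and w > 0]
        if candidates:
            opts, weights = zip(*candidates)
            return rng.choices(opts, weights=weights, k=1)[0]
    # No configured group is viable: fall back to a uniform random choice.
    return rng.choice(options) if options else None


if __name__ == "__main__":
    viable = [Option("spot-large", 2), Option("ondemand-large", 2)]
    tiers = [{"spot-large": 3, "spot-xlarge": 1}, {"ondemand-large": 1}]
    print(pick_weighted_priority(viable, tiers))
```

Here the on-demand group is chosen only when no group in the first tier is viable, which is how a cost preference could be expressed under these assumptions.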
The solution separates the expansion logic from the core Autoscaler via a plug‑in gRPC expander consisting of a client built into the Autoscaler and an external gRPC server that returns the best options.
service Expander {
  rpc BestOptions (BestOptionsRequest) returns (BestOptionsResponse);
}
// Field numbers below are added so the messages parse as proto3;
// the snippet as quoted omits them.
message BestOptionsRequest {
  repeated Option options = 1;
  map<string, k8s.io.api.core.v1.Node> nodeInfoMap = 2;
}
message BestOptionsResponse {
  repeated Option options = 1;
}
message Option {
  // ID of node to uniquely identify the nodeGroup
  string nodeGroupId = 1;
  int32 nodeCount = 2;
  string debug = 3;
  repeated k8s.io.api.core.v1.Pod pod = 4;
}

The design meets three requirements: extensibility for other users, independent deployment for rapid business changes, and seamless integration with the Autoscaler ecosystem.
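The separation between the Autoscaler and the expansion logic can be illustrated without a real gRPC stack. The sketch below mirrors the request and response messages as plain Python dataclasses and treats the selection logic as a swappable function. Everything here is hypothetical scaffolding: the class names, the `Strategy` type, the `prefer_fewest_nodes` example, and in particular the fallback of returning all options unchanged are assumptions, not the upstream behavior.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Option:
    node_group_id: str
    node_count: int
    debug: str = ""
    pods: List[str] = field(default_factory=list)  # pod names stand in for Pod objects


@dataclass
class BestOptionsRequest:
    options: List[Option]
    node_info_map: Dict[str, dict] = field(default_factory=dict)


@dataclass
class BestOptionsResponse:
    options: List[Option]


# A strategy is any function from a request to the options it prefers.
Strategy = Callable[[BestOptionsRequest], List[Option]]


class ExpanderService:
    """Server-side handler: the strategy can change without touching the caller."""

    def __init__(self, strategy: Strategy):
        self._strategy = strategy

    def best_options(self, request: BestOptionsRequest) -> BestOptionsResponse:
        try:
            chosen = self._strategy(request)
        except Exception:
            chosen = []
        # Assumed fallback: if the strategy fails or picks nothing, return every option.
        if not chosen:
            return BestOptionsResponse(options=list(request.options))
        return BestOptionsResponse(options=chosen)


def prefer_fewest_nodes(request: BestOptionsRequest) -> List[Option]:
    """Example strategy: keep the options that need the fewest new nodes."""
    if not request.options:
        return []
    fewest = min(o.node_count for o in request.options)
    return [o for o in request.options if o.node_count == fewest]


if __name__ == "__main__":
    req = BestOptionsRequest(options=[Option("group-a", 3), Option("group-b", 1)])
    print(ExpanderService(prefer_fewest_nodes).best_options(req).options)
```

Because the caller only depends on the request and response shapes, a team could redeploy the service with a different strategy without rebuilding the Autoscaler, which is the independent-deployment property the design aims for.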
Airbnb has run this approach in production since 2022 without issues. The custom expander was upstreamed to the Cluster Autoscaler and will be available in v1.24.0.
Conclusion
Over four years Airbnb advanced its Kubernetes cluster configuration, contributing custom Autoscaler extensions that enable cost‑aware, multi‑instance‑type scaling strategies while reducing operational overhead.
