Design and Implementation of a New Tiered Resource Guarantee System for Elastic Cloud Containers
Didi's new tiered resource-guarantee system for elastic cloud containers defines S, A, and B priority levels with explicit over-commit rules and upgrades the operating system, Kubernetes, kube-odin, service-tree, and CMP components. The result: CPU contention cut by up to 80%, lower latency, more reliable scaling, and reduced operational costs.
In the previous two articles we explained the implementation of elastic cloud co-location (mixed-deployment) technology and the Kubernetes scheduling strategy. This article delves into the construction of the new tiered container guarantee system to help readers better understand the practical results of Didi's elastic cloud.
During peak travel periods, the surge in demand puts huge pressure on stability. When traffic spikes, CPU utilization on individual physical machines rises sharply, containers compete for resources, and both micro‑level (single‑machine) and macro‑level (cluster) contention increase, leading to higher failure rates for container scheduling and scaling.
With cost reduction as a constraint and no new compute servers to add, the elastic cloud needed a systematic way to guarantee resource supply and container stability in high-pressure scenarios. A new tiered guarantee system was introduced, defining explicit resource-guarantee levels for containers based on their priority.
The early tiered system only provided three levels (1/2/3) that simply distinguished container priority without linking it to underlying resource guarantees. This caused several problems:
Severe resource contention
High business latency and frequent spikes
High scaling‑failure probability
Inaccurate capacity estimation
Difficulty evaluating the number of physical machines needed
These issues stem from the lack of a concrete resource guarantee. At the single‑machine level, the total specifications of containers often exceed the physical machine’s capacity (over‑commit). At the cluster level, the sum of service quotas exceeds the total physical resources.
Two solution ideas were explored:
Ensuring that the entire core‑service chain does not over‑commit – proved infeasible because core services already consume most of the CPU.
Ensuring that the most important services in the core-service chain do not over-commit, while defining explicit over-commit rules for the remaining services. This second approach was adopted.
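The article does not reproduce the full rule table, but the shape of such tiered over-commit rules can be sketched as follows. The ratios here are hypothetical illustrations, not Didi's published values:

```python
# Hypothetical tiered over-commit rules: S-level containers get their full
# CPU request, while A and B tolerate progressively more over-commit.
# The actual ratios used by Didi's elastic cloud are not given in the article.
OVERCOMMIT_RATIO = {"S": 1.0, "A": 1.5, "B": 2.0}

def physical_cores_needed(requested_cores: float, tier: str) -> float:
    """Physical cores that must be reserved to back a container's request."""
    return requested_cores / OVERCOMMIT_RATIO[tier]

# An S-level request maps 1:1 to physical cores; a B-level request may
# share each physical core with another B-level container.
print(physical_cores_needed(8, "S"))  # 8.0
print(physical_cores_needed(8, "B"))  # 4.0
```

Under rules of this shape, only the top tier is fully exempt from over-commit, which keeps the scheme feasible even when core services consume most of the CPU.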
With these rules, the new system achieved four major benefits:
CPU contention for S‑level services reduced by 60‑80%, for A‑level by 30‑50%.
Average 99th-percentile latency decreased by 7-20%, and maximum latency decreased by 5-15%.
Peak-time capacity for S-level services increased by over 30%, and the same services could also shrink their resources by over 30% when demand fell.
Overall business metrics improved: CPU contention dropped 65‑75%, spikes almost disappeared, latency fell 5‑25%, and stress‑test contention reduced 60‑70%.
To support the new guarantee system, the elastic cloud upgraded components from the bottom up, covering the operating‑system layer, Kubernetes scheduling layer, kube‑odin layer, service tree, and the CMP system.
Operating-System Layer: Introduced priority-weighted scheduling, CPU burst for short-term over-use, priority-based memory reclamation, tiered watermarks, and bandwidth throttling for low-priority containers.
In user space, kubelet, IRMAS and a new single‑machine scheduler were enhanced to recognize tiered containers, collect and report their information, and perform extreme‑case resource compression and recovery in coordination with Kubernetes.
Kubernetes Scheduling Layer: Added a minimum-resource-guarantee policy and a tiered-balanced scattering strategy to avoid placing too many containers of the same tier on a single physical machine.
Kube-odin Layer: Provided new tier-related APIs, billing support for tiered containers, and extended quota operation interfaces.
Service Tree: Added support for tiered cluster information while maintaining compatibility.
CMP System: Managed quota cost accounts for S/A/B levels, introduced quota control and application modules to standardize quota requests and link them to physical resources.
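The tiered-balanced scattering strategy mentioned for the Kubernetes scheduling layer can be sketched in a few lines. The function names and scoring scheme below are illustrative, not Didi's actual scheduler plugin:

```python
# Sketch of tiered-balanced scattering (illustrative, not Didi's plugin):
# among feasible nodes, prefer the one hosting the fewest containers of the
# pod's own tier, so e.g. S-level containers are not packed onto one machine.
def score_node(node_tier_counts: dict, pod_tier: str) -> int:
    # Fewer same-tier containers -> higher score.
    return -node_tier_counts.get(pod_tier, 0)

def pick_node(nodes: dict, pod_tier: str) -> str:
    """nodes maps node name -> {tier: container count}; returns best node."""
    return max(nodes, key=lambda name: score_node(nodes[name], pod_tier))

nodes = {
    "node-1": {"S": 3, "B": 1},
    "node-2": {"S": 1, "B": 4},
}
print(pick_node(nodes, "S"))  # node-2: fewer S-level containers
print(pick_node(nodes, "B"))  # node-1: fewer B-level containers
```

In a real scheduler this score would be one term among several (resource fit, affinity, and the minimum-resource-guarantee policy), but the spreading intent is the same.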
The new guarantee system focuses on three key resources:
Per-machine CPU guarantee: After applying each tier's over-commit ratio, the physical CPU required by all containers on a machine must not exceed its usable capacity (e.g., a 40-core machine provides 36 usable cores; if the requested cores after over-commit total no more than 36, the guarantee holds).
Per-machine memory guarantee: Tiered watermarks and priority-based reclamation ensure that low-priority containers are reclaimed first, protecting high-priority containers.
Cluster-level quota guarantee: Quota reflects a service's total resource request and is linked to physical machine resources, enabling accurate provisioning and effective resource control.
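The per-machine CPU guarantee reduces to simple arithmetic. A sketch using the 40-core example above, with hypothetical per-tier over-commit ratios (the article does not publish the real values):

```python
# Per-machine CPU guarantee check (sketch). Per the article's example, a
# 40-core machine offers 36 usable cores; the over-commit ratios per tier
# are hypothetical placeholders.
USABLE_CORES = 36
RATIO = {"S": 1.0, "A": 1.5, "B": 2.0}

def guarantee_holds(containers) -> bool:
    """containers: list of (requested_cores, tier) pairs. The guarantee
    holds if the physical cores implied by all requests, after dividing
    each by its tier's over-commit ratio, fit in the usable capacity."""
    needed = sum(cores / RATIO[tier] for cores, tier in containers)
    return needed <= USABLE_CORES

# 16 S-level + 15 A-level + 20 B-level cores -> 16 + 10 + 10 = 36 physical.
print(guarantee_holds([(16, "S"), (15, "A"), (20, "B")]))  # True
print(guarantee_holds([(20, "S"), (15, "A"), (20, "B")]))  # False
```

The cluster-level quota guarantee is the same inequality summed over all machines: the quotas granted to services, after over-commit, must be backed by the physical cores actually present.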
In summary, the legacy tiered system suffered from resource contention, latency spikes, scaling failures, inaccurate capacity planning, and difficulty estimating physical machine needs. The new tiered system introduces clear over‑commit rules, guarantees resources at both the machine and cluster levels, and adds comprehensive strategies for container scheduling, runtime protection, and resource control, thereby improving stability and reducing operational costs.
Didi Tech
Official Didi technology account