Elastic Cloud Mixed Deployment: Architecture, Scheduling, Isolation, and Future Directions
Didi's Elastic Cloud uses mixed deployment to co‑locate diverse services, employing tiered guarantees, custom Kubernetes scheduling, profiling, rescheduling, and isolation‑cluster techniques to boost utilization while preserving QoS, with a roadmap for broader automation and interference detection.
Elastic Cloud is the underlying container platform that has been running Didi’s core services for more than seven years. "Mixed deployment" (混部) refers to placing different business services on the same physical or virtual machines to improve cluster resource utilization while preserving the quality of critical services.
The article focuses on the mixed‑deployment aspects of Elastic Cloud, covering its evolution, core technical capabilities, current online mixed‑deployment status, and future plans.
Definition of Mixed Deployment
Mixed deployment means deploying services with different characteristics onto the same host to increase overall resource usage and reduce total cost. It can be classified into online mixed deployment (online services in public clusters, and online services co‑located with storage in isolation clusters) and offline mixed deployment (online services mixed with offline tasks).
Key Technical Challenges
How to grade services and define QoS for each level.
How to build fine‑grained service profiles to guide cluster scheduling and reduce resource contention.
Kernel‑level resource isolation (CPU, memory, I/O, cache, network) to guarantee high‑priority services.
Detection of performance interference and guidance for eviction and scheduling optimization.
Phase 1: Public‑Cluster Online Mixed Deployment
In 2017 Didi started moving workloads to the cloud, first adopting Docker and cgroup, then gradually standardizing on Kubernetes. As more services (ride‑hailing, maps, etc.) migrated, container density increased, leading to serious resource contention and latency spikes.
Elastic Cloud Tiered Guarantee System
To hold peak‑period CPU utilization at roughly 50%, a tiered guarantee system was built that provides cluster‑level quota management and priority‑based scheduling.
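The quota side of the tiered guarantee can be pictured as a per‑tier admission check. This is a minimal sketch: the tier names (`S1`–`S3`) and quota numbers are illustrative assumptions, not Didi's published configuration.

```python
# Hypothetical cluster-level CPU quota (in cores) per service tier.
TIER_QUOTAS_CORES = {"S1": 4000, "S2": 2000, "S3": 1000}

class QuotaAdmission:
    """Admit pods only while their tier still has quota headroom."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)
        self.used = {tier: 0 for tier in quotas}

    def admit(self, tier, cpu_request):
        # Reject if this request would push the tier past its cluster quota.
        if self.used[tier] + cpu_request > self.quotas[tier]:
            return False
        self.used[tier] += cpu_request
        return True

admission = QuotaAdmission(TIER_QUOTAS_CORES)
print(admission.admit("S3", 800))   # fits inside the S3 quota
print(admission.admit("S3", 300))   # would exceed it, so rejected
```

In a real system the priority dimension would additionally let higher tiers preempt lower ones at schedule time; that part is omitted here.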
Kubernetes Scheduling Support
The scheduler selects a node for a new pod through two phases: pre‑selection (Predicates) and scoring (Priorities). Custom strategies such as ActualBalancedResourceAllocation, BalancedResourceAllocation, ActualLeastResourceAllocation, LeastResourceAllocation, InterPodAffinityPriority, NodeAffinityPriority, and TaintTolerationPriority are used to balance resource usage and respect topology constraints.
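The "Actual" variants differ from their upstream counterparts by scoring on measured usage rather than declared requests. Here is an illustrative sketch of that distinction; the 0–10 score range follows upstream Kubernetes convention, but the exact formulas Didi uses are not public.

```python
def least_resource_allocation(requested_cpu, allocatable_cpu):
    """LeastResourceAllocation-style: prefer nodes with more free *requested* capacity."""
    free = max(allocatable_cpu - requested_cpu, 0)
    return 10 * free / allocatable_cpu

def actual_least_resource_allocation(used_cpu, capacity_cpu):
    """ActualLeastResourceAllocation-style: same idea, but over *actual* usage."""
    free = max(capacity_cpu - used_cpu, 0)
    return 10 * free / capacity_cpu

# A node whose services over-request CPU looks full by requests but is
# nearly idle by actual usage -- the two strategies disagree sharply.
print(actual_least_resource_allocation(used_cpu=8, capacity_cpu=64))    # → 8.75
print(least_resource_allocation(requested_cpu=56, allocatable_cpu=64))  # → 1.25
```

Blending both signals lets the scheduler pack by requests without ignoring real hot spots.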
Rescheduling
Because cluster resources change (scale‑in/out, machine replacement) and workload patterns evolve, a rescheduling service periodically inspects nodes and triggers the scheduler to relocate pods that no longer meet the optimal placement criteria.
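A rescheduling pass of this kind can be sketched as a comparison between a pod's current placement score and the best available node. The scoring function, field names, and gap threshold below are assumptions for illustration, not the service's real internals.

```python
def rescheduling_pass(pods, nodes, score, gap_threshold=2.0):
    """Flag pods whose current node scores well below the best candidate node."""
    to_move = []
    for pod in pods:
        current = score(pod, pod["node"])
        best = max(score(pod, n) for n in nodes)
        if best - current > gap_threshold:
            to_move.append(pod["name"])  # hand off to the scheduler for relocation
    return to_move

# Toy scoring: prefer nodes with more free CPU after placing the pod.
def free_cpu_score(pod, node):
    return node["free_cpu"] - pod["cpu"]

hot = {"name": "hot-node", "free_cpu": 2}
cold = {"name": "cold-node", "free_cpu": 30}
pods = [{"name": "svc-a", "cpu": 1, "node": hot}]
print(rescheduling_pass(pods, [hot, cold], free_cpu_score))  # → ['svc-a']
```

Running such a pass periodically keeps placements near‑optimal as nodes come and go, without blocking the hot scheduling path.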
CPU usage charts show that public‑cluster online services maintain about 50 % utilization, confirming the effectiveness of the tiered guarantee system.
Phase 2: Public‑Cluster Offline Mixed Deployment
Further increasing online deployment density would cause severe contention; therefore, the strategy shifts to mixing offline tasks with online services during low‑load periods.
Offline tasks are scheduled based on the remaining CPU capacity of each host after accounting for online usage. Two scaling approaches are used:
Horizontal scaling : Adjust the number of offline pods on a node according to utilization predictions.
Vertical scaling : Adjust the resource spec of a single offline pod on each node to fill the residual capacity.
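The vertical‑scaling idea reduces to a residual‑capacity computation: size the offline pod to whatever the predicted online peak leaves free. The safety margin and parameter names below are illustrative assumptions.

```python
def offline_pod_cpu(node_capacity, predicted_online_peak, safety_margin=0.1):
    """CPU (cores) an offline pod may safely use on this node.

    Reserve a safety margin off the top, then subtract the predicted
    online peak; never return a negative allocation.
    """
    residual = node_capacity * (1 - safety_margin) - predicted_online_peak
    return max(residual, 0.0)

# 64-core node, online services predicted to peak at 32 cores:
print(offline_pod_cpu(node_capacity=64, predicted_online_peak=32))  # → 25.6
```

Horizontal scaling applies the same residual figure but divides it across a variable number of fixed‑size offline pods instead of resizing one.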
Container profiling predicts the maximum online utilization for the next hour, enabling the scheduler to allocate safe offline resources. Two prediction algorithms are used: a 7‑day historical ratio (now deprecated) and a weighted‑average method that combines 7‑day, 1‑day, and 1‑hour histories, achieving higher accuracy.
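The weighted‑average method can be sketched as a blend of the peaks observed over each history window. The weights and the use of window maxima here are assumptions; the article does not publish the actual coefficients.

```python
def predict_next_hour(util_7d, util_1d, util_1h, weights=(0.2, 0.3, 0.5)):
    """Predict next-hour peak utilization from three history windows.

    Each argument is a sequence of utilization samples (0.0-1.0) for the
    7-day, 1-day, and 1-hour windows; recent windows get higher weight.
    """
    w7, w1, wh = weights
    return w7 * max(util_7d) + w1 * max(util_1d) + wh * max(util_1h)

# Recent samples dominate, so a quiet last hour pulls the prediction down
# even if the weekly history shows higher peaks.
prediction = predict_next_hour([0.5, 0.6], [0.4, 0.55], [0.45])
print(round(prediction, 3))
```

Weighting the 1‑hour window most heavily is what lets this method react faster than the deprecated pure 7‑day ratio.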
Phase 3: Isolation‑Cluster Mixed Deployment
Isolation clusters (e.g., dedicated Redis, MQ, or ingress services) have very low CPU utilization, leaving large mixed‑deployment potential. However, these services are latency‑sensitive and traditionally avoid mixing.
The current focus is to mix low‑priority online services into isolation clusters to raise CPU peak utilization while preserving the quality of the primary services.
Key technical components:
Kubernetes scheduling : Custom resources (e.g., mix-mid-cpu) encode the amount of CPU reserved for mixed services based on profiling.
Node‑level container limits : Bypass or adjust per‑node container count limits for mixed services.
Rule‑engine injection : Adapt generic scheduling rules (taint tolerations, topology spread, etc.) for isolation clusters.
Rescheduling : Ensure that mixed services do not interfere with the original isolation‑cluster workloads.
Eviction logic : Triggered by business metric anomalies, exceeding mixed‑deployment thresholds, interference detection, or manual commands. Eviction can be pod‑level, node‑level, or service‑level, with destinations prioritized as mixed‑cluster → self‑built IDC cluster → public cloud.
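The destination ordering in the eviction logic above (mixed cluster → self‑built IDC cluster → public cloud) amounts to a priority‑ordered fallback. This sketch uses a hypothetical free‑CPU check as a stand‑in for the real admission logic.

```python
# Destinations in the priority order described above.
DESTINATIONS = ["mixed-cluster", "self-built-idc", "public-cloud"]

def pick_destination(cpu_needed, free_cpu_by_dest):
    """Return the first destination (in priority order) with enough capacity."""
    for dest in DESTINATIONS:
        if free_cpu_by_dest.get(dest, 0) >= cpu_needed:
            return dest
    return None  # nowhere to place the evicted pods; escalate to an operator

# Mixed cluster is too full, so the pod falls back to the self-built IDC.
print(pick_destination(8, {"mixed-cluster": 4, "self-built-idc": 16}))  # → self-built-idc
```

The same fallback works whether the eviction is pod‑level, node‑level, or service‑level; only `cpu_needed` changes.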
Future Outlook
With the steady‑state cloud migration plan, public‑cluster size may stay the same or shrink, making isolation clusters the main source of mixed resources. The roadmap includes further strengthening of cluster scheduling, service profiling, node isolation, interference detection, and anomaly perception to achieve full mixed deployment across all service types.
Didi Tech
Official Didi technology account