Elastic Cloud Mixed Deployment: Architecture, Scheduling, Isolation, and Future Directions
Didi's Elastic Cloud uses mixed deployment to co‑locate diverse services, employing tiered guarantees, custom Kubernetes scheduling, profiling, rescheduling, and isolation‑cluster techniques to boost utilization while preserving QoS, with a roadmap for broader automation and interference detection.
Elastic Cloud is the underlying container platform that has been running Didi’s core services for more than seven years. "Mixed deployment" (混部) refers to placing different business services on the same physical or virtual machines to improve cluster resource utilization while preserving the quality of critical services.
The article focuses on the mixed‑deployment aspects of Elastic Cloud, covering its evolution, core technical capabilities, current online mixed‑deployment status, and future plans.
Definition of Mixed Deployment
Mixed deployment means deploying services with different characteristics onto the same host to increase overall resource usage and reduce total cost. It can be classified into online mixed deployment (online services in public clusters, and online services co‑located with storage in isolation clusters) and offline mixed deployment (online services mixed with offline tasks).
Key Technical Challenges
How to grade services and define QoS for each level.
How to build fine‑grained service profiles to guide cluster scheduling and reduce resource contention.
Kernel‑level resource isolation (CPU, memory, I/O, cache, network) to guarantee high‑priority services.
Detection of performance interference and guidance for eviction and scheduling optimization.
Phase 1: Public‑Cluster Online Mixed Deployment
In 2017 Didi started moving workloads to the cloud, first adopting Docker and cgroup, then gradually standardizing on Kubernetes. As more services (ride‑hailing, maps, etc.) migrated, container density increased, leading to serious resource contention and latency spikes.
Elastic Cloud Tiered Guarantee System
To hold peak‑period CPU utilization at roughly 50%, a tiered guarantee system was built that provides cluster‑level quota management and priority‑based scheduling.
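The quota side of the tiered guarantee can be pictured as a per‑tier admission check. This is a minimal sketch: the tier names (`S1`–`S3`) and quota numbers are illustrative assumptions, not Didi's published configuration.

```python
# Hypothetical cluster-level CPU quota (in cores) per service tier.
TIER_QUOTAS_CORES = {"S1": 4000, "S2": 2000, "S3": 1000}

class QuotaAdmission:
    """Admit pods only while their tier still has quota headroom."""

    def __init__(self, quotas):
        self.quotas = dict(quotas)
        self.used = {tier: 0 for tier in quotas}

    def admit(self, tier, cpu_request):
        # Reject if this request would push the tier past its cluster quota.
        if self.used[tier] + cpu_request > self.quotas[tier]:
            return False
        self.used[tier] += cpu_request
        return True

admission = QuotaAdmission(TIER_QUOTAS_CORES)
print(admission.admit("S3", 800))   # fits inside the S3 quota
print(admission.admit("S3", 300))   # would exceed it, so rejected
```

In a real system the priority dimension would additionally let higher tiers preempt lower ones at schedule time; that part is omitted here.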
Kubernetes Scheduling Support
The scheduler selects a node for a new pod through two phases: pre‑selection (Predicates) and scoring (Priorities). Custom strategies such as ActualBalancedResourceAllocation, BalancedResourceAllocation, ActualLeastResourceAllocation, LeastResourceAllocation, InterPodAffinityPriority, NodeAffinityPriority, and TaintTolerationPriority are used to balance resource usage and respect topology constraints.
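The "Actual" variants differ from their upstream counterparts by scoring on measured usage rather than declared requests. Here is an illustrative sketch of that distinction; the 0–10 score range follows upstream Kubernetes convention, but the exact formulas Didi uses are not public.

```python
def least_resource_allocation(requested_cpu, allocatable_cpu):
    """LeastResourceAllocation-style: prefer nodes with more free *requested* capacity."""
    free = max(allocatable_cpu - requested_cpu, 0)
    return 10 * free / allocatable_cpu

def actual_least_resource_allocation(used_cpu, capacity_cpu):
    """ActualLeastResourceAllocation-style: same idea, but over *actual* usage."""
    free = max(capacity_cpu - used_cpu, 0)
    return 10 * free / capacity_cpu

# A node whose services over-request CPU looks full by requests but is
# nearly idle by actual usage -- the two strategies disagree sharply.
print(actual_least_resource_allocation(used_cpu=8, capacity_cpu=64))    # → 8.75
print(least_resource_allocation(requested_cpu=56, allocatable_cpu=64))  # → 1.25
```

Blending both signals lets the scheduler pack by requests without ignoring real hot spots.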
Rescheduling
Because cluster resources change (scale‑in/out, machine replacement) and workload patterns evolve, a rescheduling service periodically inspects nodes and triggers the scheduler to relocate pods that no longer meet the optimal placement criteria.
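A rescheduling pass of this kind can be sketched as a comparison between a pod's current placement score and the best available node. The scoring function, field names, and gap threshold below are assumptions for illustration, not the service's real internals.

```python
def rescheduling_pass(pods, nodes, score, gap_threshold=2.0):
    """Flag pods whose current node scores well below the best candidate node."""
    to_move = []
    for pod in pods:
        current = score(pod, pod["node"])
        best = max(score(pod, n) for n in nodes)
        if best - current > gap_threshold:
            to_move.append(pod["name"])  # hand off to the scheduler for relocation
    return to_move

# Toy scoring: prefer nodes with more free CPU after placing the pod.
def free_cpu_score(pod, node):
    return node["free_cpu"] - pod["cpu"]

hot = {"name": "hot-node", "free_cpu": 2}
cold = {"name": "cold-node", "free_cpu": 30}
pods = [{"name": "svc-a", "cpu": 1, "node": hot}]
print(rescheduling_pass(pods, [hot, cold], free_cpu_score))  # → ['svc-a']
```

Running such a pass periodically keeps placements near‑optimal as nodes come and go, without blocking the hot scheduling path.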
CPU usage charts show that public‑cluster online services maintain about 50 % utilization, confirming the effectiveness of the tiered guarantee system.
Phase 2: Public‑Cluster Offline Mixed Deployment
Further increasing online deployment density would cause severe contention; therefore, the strategy shifts to mixing offline tasks with online services during low‑load periods.
Offline tasks are scheduled based on the remaining CPU capacity of each host after accounting for online usage. Two scaling approaches are used:
Horizontal scaling : Adjust the number of offline pods on a node according to utilization predictions.
Vertical scaling : Adjust the resource spec of a single offline pod on each node to fill the residual capacity.
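The vertical‑scaling idea reduces to a residual‑capacity computation: size the offline pod to whatever the predicted online peak leaves free. The safety margin and parameter names below are illustrative assumptions.

```python
def offline_pod_cpu(node_capacity, predicted_online_peak, safety_margin=0.1):
    """CPU (cores) an offline pod may safely use on this node.

    Reserve a safety margin off the top, then subtract the predicted
    online peak; never return a negative allocation.
    """
    residual = node_capacity * (1 - safety_margin) - predicted_online_peak
    return max(residual, 0.0)

# 64-core node, online services predicted to peak at 32 cores:
print(offline_pod_cpu(node_capacity=64, predicted_online_peak=32))  # → 25.6
```

Horizontal scaling applies the same residual figure but divides it across a variable number of fixed‑size offline pods instead of resizing one.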
Container profiling predicts the maximum online utilization for the next hour, enabling the scheduler to allocate safe offline resources. Two prediction algorithms are used: a 7‑day historical ratio (now deprecated) and a weighted‑average method that combines 7‑day, 1‑day, and 1‑hour histories, achieving higher accuracy.
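The weighted‑average method can be sketched as a blend of the peaks observed over each history window. The weights and the use of window maxima here are assumptions; the article does not publish the actual coefficients.

```python
def predict_next_hour(util_7d, util_1d, util_1h, weights=(0.2, 0.3, 0.5)):
    """Predict next-hour peak utilization from three history windows.

    Each argument is a sequence of utilization samples (0.0-1.0) for the
    7-day, 1-day, and 1-hour windows; recent windows get higher weight.
    """
    w7, w1, wh = weights
    return w7 * max(util_7d) + w1 * max(util_1d) + wh * max(util_1h)

# Recent samples dominate, so a quiet last hour pulls the prediction down
# even if the weekly history shows higher peaks.
prediction = predict_next_hour([0.5, 0.6], [0.4, 0.55], [0.45])
print(round(prediction, 3))
```

Weighting the 1‑hour window most heavily is what lets this method react faster than the deprecated pure 7‑day ratio.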
Phase 3: Isolation‑Cluster Mixed Deployment
Isolation clusters (e.g., dedicated Redis, MQ, or ingress services) have very low CPU utilization, leaving large mixed‑deployment potential. However, these services are latency‑sensitive and traditionally avoid mixing.
The current focus is to mix low‑priority online services into isolation clusters to raise CPU peak utilization while preserving the quality of the primary services.
Key technical components:
Kubernetes scheduling : Custom resources (e.g., mix-mid-cpu) encode the amount of CPU reserved for mixed services based on profiling.
Node‑level container limits : Bypass or adjust per‑node container count limits for mixed services.
Rule‑engine injection : Adapt generic scheduling rules (taint tolerations, topology spread, etc.) for isolation clusters.
Rescheduling : Ensure that mixed services do not interfere with the original isolation‑cluster workloads.
Eviction logic : Triggered by business metric anomalies, exceeding mixed‑deployment thresholds, interference detection, or manual commands. Eviction can be pod‑level, node‑level, or service‑level, with destinations prioritized as mixed‑cluster → self‑built IDC cluster → public cloud.
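The destination ordering in the eviction logic above (mixed cluster → self‑built IDC cluster → public cloud) amounts to a priority‑ordered fallback. This sketch uses a hypothetical free‑CPU check as a stand‑in for the real admission logic.

```python
# Destinations in the priority order described above.
DESTINATIONS = ["mixed-cluster", "self-built-idc", "public-cloud"]

def pick_destination(cpu_needed, free_cpu_by_dest):
    """Return the first destination (in priority order) with enough capacity."""
    for dest in DESTINATIONS:
        if free_cpu_by_dest.get(dest, 0) >= cpu_needed:
            return dest
    return None  # nowhere to place the evicted pods; escalate to an operator

# Mixed cluster is too full, so the pod falls back to the self-built IDC.
print(pick_destination(8, {"mixed-cluster": 4, "self-built-idc": 16}))  # → self-built-idc
```

The same fallback works whether the eviction is pod‑level, node‑level, or service‑level; only `cpu_needed` changes.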
Future Outlook
With the steady‑state cloud migration plan, public‑cluster size may stay the same or shrink, making isolation clusters the main source of mixed resources. The roadmap includes further strengthening of cluster scheduling, service profiling, node isolation, interference detection, and anomaly perception to achieve full mixed deployment across all service types.
Didi Tech
Official Didi technology account