How Didi’s Elastic Cloud Scales Millions of Nodes with Advanced k8s Scheduling
This article details Didi’s Elastic Cloud container platform, explaining its Kubernetes‑based scheduling architecture, custom pre‑filter and scoring extensions, service‑profiling‑driven placement, rescheduling mechanisms, rule‑engine integration, upgrade strategies from k8s 1.12 to 1.20, and the stability framework that keeps a massive multi‑tenant fleet running reliably.
Scheduling Overview
Elastic Cloud is an internal container platform built on Kubernetes. It manages millions of nodes and pods and adds custom pre‑filter, filter, and scoring extensions to enforce business‑specific placement policies.
Scheduling Chain Diagram: High‑level view of the Elastic Cloud scheduling pipeline.
K8s Scheduling Capabilities: Native predicate and priority phases and the enhancements made for Elastic Cloud.
K8s Version Upgrade: Migration from version 1.12 to 1.20, addressing large cluster size, heterogeneous workloads, and surrounding components.
Custom Scheduling Enhancements
Service Profiling / Real‑Usage Scheduling: A Prometheus‑based engine (zhivago) collects per‑node usage metrics, computes seven‑day maximum values, and feeds them into the scheduler to avoid hotspot creation.
Rescheduling (kube‑rescheduler): Periodically scans for pods that no longer satisfy the "good case" criteria and triggers migration to healthier nodes.
Rule Engine (Galahad + Webhook): Configurable policy injection via a mutating webhook, allowing dynamic adjustment of annotations, labels, resources, taints/tolerations, and affinity rules without code changes.
Upgrade Strategies
Dual‑Cluster Replacement: Deploy a brand‑new 1.20 master and components, gradually shift workloads, and retain the ability to roll back.
In‑Place Upgrade: Upgrade the master and all components in a single cluster, then upgrade kubelet on each node, ensuring zero‑downtime for workloads.
The in‑place upgrade was selected for its lower operational cost and seamless traffic migration.
Stability Framework
End‑to‑end trace from apiserver → webhook → etcd.
Detailed scheduling failure logs, scoring traces, and pod/node snapshots.
Core metrics such as scheduling success rate, latency, and component QPS visualized on dashboards.
Automated regression testing, fault‑injection drills, and emergency balancing for high CPU utilization.
Key Metrics and Results
After applying profiling‑driven placement and rescheduling, the average cluster utilization reached ~50% with no major incidents. Hotspot rates dropped significantly, and the system maintained 99.99% availability.
Service Profiling / Real‑Usage Scheduling Details
zhivago collects metrics via Prometheus, normalizes and aggregates them, and stores the 7‑day maximum CPU and memory usage per node. The scheduler treats these values as effective capacity limits, preventing placement that would create hotspots.
A representative query for the per‑host seven‑day CPU maximum looks like this:

```
sum(max_over_time(ddcloud_pod_cpu_used{region="beijing", set="X2234", prod_line="wyc", cpu_limit="8", flexible="true"}[7d])) by (m_host)
```
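As a rough illustration of how such per‑host maxima might feed placement decisions, here is a minimal Go sketch that treats the seven‑day maximum as already‑consumed capacity; the HostUsage type and function names are hypothetical, not Elastic Cloud’s actual code.

```go
package profiling

// HostUsage holds the seven-day maximum usage zhivago reports for one host.
// Field names are illustrative; real values come from the Prometheus query above.
type HostUsage struct {
	AllocatableMilliCPU int64 // node allocatable CPU in millicores
	MaxUsedMilliCPU     int64 // sum(max_over_time(...[7d])) aggregated per host
}

// EffectiveFree treats the seven-day maximum as already-consumed capacity,
// so placement is judged against observed usage rather than declared requests.
func EffectiveFree(u HostUsage) int64 {
	free := u.AllocatableMilliCPU - u.MaxUsedMilliCPU
	if free < 0 {
		return 0
	}
	return free
}

// Fits reports whether a pod requesting reqMilliCPU would push the host past
// its effective capacity, i.e. create a hotspot under the real-usage model.
func Fits(u HostUsage, reqMilliCPU int64) bool {
	return EffectiveFree(u) >= reqMilliCPU
}
```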
Rescheduling (kube‑rescheduler)
Rescheduling runs four periodic tasks; a minimal sketch of the loop structure follows the list:
Every 10 minutes, scan for pods that violate the "good case" criteria and enqueue them.
Every minute, attempt migration for enqueued pods.
Every minute, re‑check failed migrations and re‑enqueue if needed.
Daily, generate reports on rescheduling success rate, failure count, and unrecovered pods.
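A minimal sketch of how these four cadences could be wired together with Go tickers is shown below; the queue handling and function names are hypothetical placeholders for kube‑rescheduler’s real logic.

```go
package main

import (
	"log"
	"time"
)

// Placeholder hooks; kube-rescheduler's real logic lives behind these.
func scanBadCasePods() []string { return nil } // pods violating the "good case" criteria
func migrate(pod string) error  { return nil } // move one pod to a healthier node
func retryFailedMigrations()    {}             // re-check failed migrations and re-enqueue them
func writeDailyReport()         {}             // success rate, failure count, unrecovered pods

func main() {
	var pending []string

	scan := time.NewTicker(10 * time.Minute) // enqueue bad-case pods
	move := time.NewTicker(1 * time.Minute)  // attempt migrations for the queue
	retry := time.NewTicker(1 * time.Minute) // re-check failed migrations
	report := time.NewTicker(24 * time.Hour) // daily report

	for {
		select {
		case <-scan.C:
			pending = append(pending, scanBadCasePods()...)
		case <-move.C:
			for _, p := range pending {
				if err := migrate(p); err != nil {
					log.Printf("migrate %s: %v", p, err)
				}
			}
			pending = pending[:0]
		case <-retry.C:
			retryFailedMigrations()
		case <-report.C:
			writeDailyReport()
		}
	}
}
```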
Rule Engine (Galahad + Webhook)
Galahad stores policy objects (annotations, labels, resource requests/limits, taints/tolerations, affinity) and the mutating webhook injects them into Pods or Nodes at creation time. This makes scheduling policies configurable without redeploying code and supports per‑business, per‑cluster, and per‑workload policies.
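As a rough sketch of the patch‑building step such a webhook might perform, the Go snippet below turns a hypothetical Policy object into an RFC 6902 JSON patch of the kind a mutating admission webhook returns to the apiserver; the types and field set are illustrative, not Galahad’s actual schema.

```go
package galahad

import "encoding/json"

// Policy is a hypothetical stand-in for a Galahad policy object:
// the fields the webhook is allowed to inject into a pod at creation time.
type Policy struct {
	Annotations map[string]string   `json:"annotations,omitempty"`
	Labels      map[string]string   `json:"labels,omitempty"`
	Tolerations []map[string]string `json:"tolerations,omitempty"` // simplified toleration shape
}

// patchOp is one RFC 6902 JSON-patch operation, the format a
// mutating admission webhook returns to the apiserver.
type patchOp struct {
	Op    string      `json:"op"`
	Path  string      `json:"path"`
	Value interface{} `json:"value"`
}

// BuildPatch turns a policy into a JSON patch against a pod.
// Escaping of "/" and "~" in map keys (RFC 6901) is omitted for brevity.
func BuildPatch(p Policy) ([]byte, error) {
	var ops []patchOp
	for k, v := range p.Annotations {
		ops = append(ops, patchOp{Op: "add", Path: "/metadata/annotations/" + k, Value: v})
	}
	for k, v := range p.Labels {
		ops = append(ops, patchOp{Op: "add", Path: "/metadata/labels/" + k, Value: v})
	}
	if len(p.Tolerations) > 0 {
		ops = append(ops, patchOp{Op: "add", Path: "/spec/tolerations", Value: p.Tolerations})
	}
	return json.Marshal(ops)
}
```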
Kubernetes Version Upgrade (1.12 → 1.20)
Challenges:
Cluster size far exceeds the community‑recommended 5,000‑node limit, increasing the impact of any failure.
Complex workloads including stateful services, immutable‑IP containers, and static pods.
Wide range of surrounding components (kube‑odin, operators, controllers) that needed compatibility adjustments.
Two upgrade paths were evaluated:
Dual‑cluster replacement: build a new 1.20 control plane, migrate workloads gradually, keep the old cluster for rollback.
In‑place upgrade: upgrade the existing control plane to 1.20, then upgrade kubelet on each node, migrate workloads transparently.
The in‑place upgrade was chosen for lower cost and seamless traffic migration. Steps included upgrading the API server and controller‑manager, ensuring API compatibility, then rolling out kubelet upgrades node‑by‑node while monitoring stability.
Scheduler Framework Integration
All custom pre‑filter, filter, and scoring logic was re‑implemented using the Scheduler Framework’s PreFilter, Filter, and Score extension points, providing a clean separation from the core scheduler.
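A minimal sketch of what one such plugin could look like is below, shaped after the upstream Scheduler Framework interfaces (the import path and signatures match recent Kubernetes releases but may differ slightly by version); the scoring formula and in‑memory usage maps are illustrative, not Elastic Cloud’s actual implementation.

```go
// Package realusage sketches a Score plugin that prefers nodes with more
// headroom above their seven-day maximum usage, steering pods away from hotspots.
package realusage

import (
	"context"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// RealUsage scores nodes by seven-day-maximum CPU headroom.
type RealUsage struct {
	maxUsedMilliCPU     map[string]int64 // node name -> 7d max CPU usage (fed by zhivago)
	allocatableMilliCPU map[string]int64 // node name -> allocatable CPU
}

var _ framework.ScorePlugin = &RealUsage{}

func (r *RealUsage) Name() string { return "RealUsageScoring" }

// Score maps each node's remaining real-usage headroom into the framework's
// score range, so hotter nodes receive lower scores.
func (r *RealUsage) Score(ctx context.Context, state *framework.CycleState, pod *v1.Pod, nodeName string) (int64, *framework.Status) {
	alloc := r.allocatableMilliCPU[nodeName]
	if alloc == 0 {
		return 0, framework.NewStatus(framework.Error, "no capacity data for node "+nodeName)
	}
	free := alloc - r.maxUsedMilliCPU[nodeName]
	if free < 0 {
		free = 0
	}
	return framework.MaxNodeScore * free / alloc, nil
}

func (r *RealUsage) ScoreExtensions() framework.ScoreExtensions { return nil }
```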
Observability and Reliability
Trace from apiserver → webhook → etcd for each scheduling decision.
Failure details, scoring logs, and snapshots of pod and node state captured for debugging.
Cluster‑wide metrics (success rate, latency, QPS) displayed on dashboards; a small instrumentation sketch follows this list.
Automated regression tests and fault‑injection drills validate scheduler behavior under adverse conditions.
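As one concrete illustration of the metrics layer, here is a minimal sketch using prometheus/client_golang; the metric names are hypothetical stand‑ins for the dashboard series mentioned above.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metric names; the real dashboards track scheduling success
// rate, end-to-end latency, and component QPS under their own naming scheme.
var (
	scheduleAttempts = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "elastic_cloud_schedule_attempts_total",
		Help: "Scheduling attempts partitioned by result (success/failure).",
	}, []string{"result"})

	scheduleLatency = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "elastic_cloud_schedule_latency_seconds",
		Help:    "End-to-end scheduling latency.",
		Buckets: prometheus.DefBuckets,
	})
)

// Observe records one scheduling decision; success rate is then derived
// on the dashboard as success / (success + failure).
func Observe(success bool, seconds float64) {
	result := "failure"
	if success {
		result = "success"
	}
	scheduleAttempts.WithLabelValues(result).Inc()
	scheduleLatency.Observe(seconds)
}
```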
Future Work
Finer‑grained business scheduling policies.
Enhanced isolation capabilities for multi‑tenant clusters.
Support for multiple concurrent schedulers.
Further improvements to observability and metric granularity.