
From Bare Metal to Kubernetes: Lessons and Pitfalls of Didi’s Elastic Cloud

This article recounts Didi’s Elastic Cloud journey: why a private cloud was needed, why containers and Kubernetes were chosen, and how the overall architecture, product features, networking, monitoring, logging and storage were designed. It closes with practical insights and future plans.

dbaplus Community

Background and Motivation

Didi operates tens of thousands of physical servers. Without a private-cloud platform, resource utilization was low and scaling out new business units was difficult. A stable, elastic platform was required to allocate resources on demand, support rapid scaling, provide consistent environments and reduce manual operations.

Why Containers and Kubernetes

Traditional VMs (KVM/Xen) carry high overhead, while deploying mixed workloads side by side without them provides little isolation. Containers give lightweight isolation and immutable images, simplifying configuration and deployment. Among container orchestration systems, Kubernetes was chosen for its container-centric design, mature ecosystem, active community and extensible architecture. Didi’s platform runs on Kubernetes 1.6, with an upgrade to 1.8 planned.

Overall Architecture

The Elastic Cloud platform consists of a central Kubernetes cluster (control plane + worker nodes), a management console, authentication services and an SDN network layer. Two product families are offered:

Static Container Groups – lightweight VMs built on containers, providing fixed IP/hostname and local storage.

Elastic Scaling Groups – pure micro‑service containers supporting stateless and stateful workloads, integrated with Didi’s deployment, monitoring and logging systems.

Network Design

The SDN solution uses ONOS + Open vSwitch (OVS). Each container receives an overlay IP, enabling three communication modes:

Intra‑host container communication via the ovs‑int bridge.

Inter‑host container communication via ovs‑tunnel bridges and ONOS‑controlled tunnels.

Container‑to‑physical‑machine communication via overlay gateways with VXLAN encapsulation.
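The three modes above can be read as a simple decision on where the two endpoints live. The following is a minimal sketch of that decision, not Didi's implementation: the `Endpoint` type, the `select_path` function and the sample addresses are hypothetical names introduced for illustration, and the sketch assumes the source endpoint is always a container.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    INTRA_HOST = "ovs-int bridge"
    INTER_HOST = "ovs-tunnel bridge + ONOS-controlled tunnel"
    TO_PHYSICAL = "overlay gateway with VXLAN encapsulation"


@dataclass(frozen=True)
class Endpoint:
    ip: str
    host: str           # physical host the endpoint lives on
    is_container: bool  # False for a plain physical machine


def select_path(src: Endpoint, dst: Endpoint) -> Path:
    """Pick the communication mode for traffic from a container to dst."""
    if not src.is_container:
        raise ValueError("this sketch only models traffic that starts in a container")
    if not dst.is_container:
        return Path.TO_PHYSICAL   # leaves the overlay through a gateway
    if src.host == dst.host:
        return Path.INTRA_HOST    # stays on the local bridge
    return Path.INTER_HOST        # crosses hosts through a tunnel
```

For example, two containers on the same host resolve to `Path.INTRA_HOST`, while a container talking to a physical machine resolves to `Path.TO_PHYSICAL`.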

IP allocation varies by product type:

Static groups – fixed IP and hostname.

Stateful scaling groups – fixed IP/hostname with optional IP‑pool support.

Stateless scaling groups – dynamic IP pool.
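One way to picture the difference between fixed and pooled allocation is an allocator that reserves an address per workload name for static and stateful groups, and recycles addresses for stateless groups. This is an illustrative sketch only: the `IPPool` class, the group-type strings and the CIDR are assumptions, and the optional IP-pool support for stateful groups is not modelled.

```python
import ipaddress

FIXED_TYPES = {"static", "stateful"}  # keep the same IP across restarts
POOLED_TYPES = {"stateless"}          # draw from a shared dynamic pool


class IPPool:
    def __init__(self, cidr: str):
        self._free = [str(ip) for ip in ipaddress.ip_network(cidr).hosts()]
        self._reserved = {}  # workload name -> fixed IP

    def allocate(self, name: str, group_type: str) -> str:
        if group_type in FIXED_TYPES:
            # First request reserves an address; later requests reuse it.
            if name not in self._reserved:
                self._reserved[name] = self._take()
            return self._reserved[name]
        if group_type in POOLED_TYPES:
            return self._take()
        raise ValueError(f"unknown group type: {group_type}")

    def release(self, ip: str, group_type: str) -> None:
        # Only pooled addresses go back to the free list;
        # fixed addresses stay reserved for their workload.
        if group_type in POOLED_TYPES:
            self._free.append(ip)

    def _take(self) -> str:
        if not self._free:
            raise RuntimeError("IP pool exhausted")
        return self._free.pop(0)
```

With this shape, a restarted stateful pod named `db-0` gets its previous address back, whereas a stateless pod may receive any free address.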

Container Creation Process

The workflow is driven by the Kubernetes scheduler and Kubelet:

Kubelet receives a pod assignment and creates a base container without network configuration.

Kubelet launches the CNI plugin, passing the pod name, container ID and network‑namespace.

The CNI plugin requests an IP from the IP controller.

The IP controller selects a subnet, checks availability and asks the SDN IPAM for a virtual port (overlay IP).

IPAM returns the port and synchronises the information with ONOS.

The IP controller stores the port; for bound‑IP pods it records the association for future reuse, then returns the port details to the CNI plugin.

The CNI plugin configures a veth pair, sets routes, and attaches the container to the appropriate OVS bridge.

The CNI plugin reports success to Kubelet.

Kubelet launches the business containers inside the now‑configured network namespace.
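The hand-offs between the CNI plugin, the IP controller and the SDN IPAM can be sketched as an in-memory simulation. Everything here is a stand-in under stated assumptions: `FakeIPAM`, `IPController`, `cni_add`, the subnet values and the port-ID format are hypothetical, and the real plugin's veth, route and OVS configuration is reduced to a comment.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    port_id: str
    ip: str


class FakeIPAM:
    """Stand-in for the SDN IPAM: issues virtual ports and records the ONOS sync."""

    def __init__(self):
        self._issued = {}         # subnet -> number of ports issued so far
        self.synced_to_onos = []  # port IDs "pushed" to ONOS

    def create_port(self, subnet: str) -> Port:
        index = self._issued.get(subnet, 0) + 1  # index 0 is the network address
        self._issued[subnet] = index
        ip = str(ipaddress.ip_network(subnet)[index])
        port = Port(port_id=f"port-{ip}", ip=ip)
        self.synced_to_onos.append(port.port_id)
        return port


class IPController:
    def __init__(self, ipam: FakeIPAM, subnets: dict):
        self._ipam = ipam
        self._capacity = dict(subnets)  # subnet -> free addresses remaining
        self._bound = {}                # pod name -> Port, for bound-IP pods

    def request_port(self, pod_name: str, bind_ip: bool = False) -> Port:
        if bind_ip and pod_name in self._bound:
            return self._bound[pod_name]  # reuse the recorded association
        # Select the first subnet that still has availability.
        subnet = next((s for s, free in self._capacity.items() if free > 0), None)
        if subnet is None:
            raise RuntimeError("no subnet has free addresses")
        port = self._ipam.create_port(subnet)
        self._capacity[subnet] -= 1
        if bind_ip:
            self._bound[pod_name] = port
        return port


def cni_add(controller: IPController, pod_name: str, container_id: str,
            netns: str, bind_ip: bool = False) -> dict:
    """What the CNI plugin does when Kubelet invokes it for a new pod."""
    port = controller.request_port(pod_name, bind_ip)
    # A real plugin would now create the veth pair, set routes and attach
    # the container side to the OVS bridge inside `netns`.
    return {"pod": pod_name, "container": container_id, "netns": netns,
            "ip": port.ip, "port": port.port_id}
```

Calling `cni_add` twice for the same bound-IP pod returns the same address, which mirrors the reuse behaviour described in the workflow.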

Monitoring and Logging

Monitoring combines host‑level metrics with container‑level data collected via cAdvisor and cgroup statistics. All metrics are sent to the Odin visualization system.

Basic monitoring – captures container‑specific metrics unavailable from host agents.

Business monitoring – mirrors physical‑machine monitoring for application‑level insight.
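To illustrate what container-level data derived from cgroup statistics can look like, the sketch below turns two readings of a cumulative CPU counter into a utilization percentage and parses a `memory.stat`-style text. It assumes cgroup v1 conventions (a cumulative nanosecond counter such as `cpuacct.usage`, and `key value` lines in `memory.stat`); the function names are hypothetical, and how Didi's agents actually collect or ship these values to Odin is not described in the source.

```python
def cpu_percent(usage_ns_start: int, usage_ns_end: int,
                interval_s: float, cpu_limit_cores: float) -> float:
    """CPU utilization relative to the container's CPU limit.

    usage_ns_* are two samples of a cumulative CPU-time counter in
    nanoseconds, taken interval_s seconds apart.
    """
    if interval_s <= 0 or cpu_limit_cores <= 0:
        raise ValueError("interval and CPU limit must be positive")
    used_cores = (usage_ns_end - usage_ns_start) / 1e9 / interval_s
    return 100.0 * used_cores / cpu_limit_cores


def parse_memory_stat(text: str) -> dict:
    """Parse 'key value' lines into a dict of integers, skipping malformed lines."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats
```

A host-level agent cannot produce the first number on its own, because it depends on the per-container limit; this is the kind of metric the basic monitoring layer exists to capture.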

Logging is near‑real‑time (≈2 min latency), persisted to Ceph or external storage, and supports rich post‑mortem analysis.

Storage Solutions

Two volume types are supported:

Host‑path volumes – used by static groups for high‑performance local storage.

Ceph network volumes – used by elastic scaling groups for replicated, migratable storage.

OverlayFS is the default container filesystem; when the backing filesystem is XFS, it should be created with ftype=1, which OverlayFS relies on to work correctly and perform well.
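A host check for the ftype setting could be sketched as a small parser over the text printed by `xfs_info`. The sample string in the usage note is illustrative rather than captured from a real machine, and `xfs_ftype_enabled` is a hypothetical helper name; the sketch only relies on the output containing an `ftype=<digit>` token.

```python
import re
from typing import Optional


def xfs_ftype_enabled(xfs_info_output: str) -> Optional[bool]:
    """Return True/False for ftype=1/ftype=0, or None if the token is absent."""
    match = re.search(r"\bftype=(\d)", xfs_info_output)
    if match is None:
        return None  # cannot tell from this output
    return match.group(1) == "1"
```

For instance, `xfs_ftype_enabled("naming   =version 2   bsize=4096   ascii-ci=0 ftype=1")` reports that the filesystem is suitable, while a result of `False` indicates the filesystem would need to be recreated before hosting OverlayFS layers.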

Image Marketplace

Images are layered to separate responsibilities:

Base environment images – built by platform admins, contain agents, tools and system configuration.

Service environment images – maintained by SREs, add runtime dependencies for a specific service line.

Service images – built from developers’ Dockerfiles, based on the service environment image.
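The layering can be modelled as a parent chain, where each image names the image it builds on. The sketch below resolves that chain from the base upward; the image names and the `lineage` helper are hypothetical and do not come from Didi's registry.

```python
from typing import Dict, List, Optional


def lineage(image: str, parents: Dict[str, Optional[str]]) -> List[str]:
    """Return the chain of images from the base image up to `image`."""
    chain = []
    seen = set()
    current: Optional[str] = image
    while current is not None:
        if current in seen:
            raise ValueError(f"cycle detected at {current}")
        seen.add(current)
        chain.append(current)
        current = parents[current]  # KeyError if an image is unknown
    return list(reversed(chain))


# Hypothetical example: one image per layer of responsibility.
EXAMPLE_PARENTS = {
    "base-env": None,                 # platform admins
    "svc-env-java": "base-env",       # SREs for one service line
    "order-service": "svc-env-java",  # built from a developer's Dockerfile
}
```

Here `lineage("order-service", EXAMPLE_PARENTS)` walks back through the service environment image to the base environment image, which makes it easy to see which team owns each layer of a given service image.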

Future Directions

Key planned improvements include:

Finer‑grained isolation beyond memory/CPU (e.g., cache‑level isolation).

Elastic scaling for static container groups to reduce operational cost.

Intelligent scaling algorithms that predict workload peaks and pre‑scale resources.

Contributing low‑customisation solutions back to the open‑source community.
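The article does not describe how the planned predictive scaling would work. Purely as an illustration of the idea of pre-scaling ahead of a peak, the sketch below forecasts the next peak as the mean of earlier peaks for the same time slot, adds headroom, and converts the result into a replica count. The function name, the parameters and the averaging approach are all assumptions.

```python
import math
from statistics import mean
from typing import Sequence


def replicas_for_next_peak(past_peaks_qps: Sequence[float],
                           per_replica_qps: float,
                           headroom_pct: int = 20,
                           min_replicas: int = 1) -> int:
    """Replica count to provision before the next expected peak."""
    if not past_peaks_qps:
        raise ValueError("need at least one historical peak")
    if per_replica_qps <= 0:
        raise ValueError("per-replica capacity must be positive")
    forecast = mean(past_peaks_qps)                   # naive forecast
    target = forecast * (100 + headroom_pct) / 100    # add safety margin
    return max(min_replicas, math.ceil(target / per_replica_qps))
```

A production algorithm would likely account for trend and seasonality; the point here is only that the replica count is computed from a forecast rather than from the current load.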

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
