Design and Optimization of a Host Lifecycle Management Platform for Elastic Cloud
The host-lifecycle platform (mmachine) standardizes and automates machine onboarding, maintenance, and decommissioning in DiDi's elastic cloud. Built on a four-layer architecture, a custom asynchronous scheduler, a backup-machine pool, cloud-native batching, and idle-resource governance, it turned a manual, day-level process into a reliable, minute-level workflow while cutting costs.
In 2020, bringing a machine online was a manual process that required hopping between eight services and repeating many hand-run steps. As DiDi's business moved onto the cloud, the elastic cloud absorbed large numbers of physical machines; onboarding them added up to hundreds of manual steps, and scaling onboarding throughput demanded enormous human effort.
DevOps – Standards First
Standardization is crucial in DevOps. All machines in the elastic cloud are managed through a service tree. Previously, manual management led to chaotic service-tree mappings, so the elastic cloud defined service-tree node standards and tied the host lifecycle to those nodes. The defined processes are as follows (sketched in code after the list):
Machine onboarding: from a backup node to kube-node-init, then to the online kube-node for alert association.
Machine maintenance: online machines are mounted to a maintain node, repaired, then returned to kube-node-init for re‑onboarding.
Machine decommissioning: after container migration, machines are moved to pre-offline.backup for shutdown.
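To make the transitions concrete, here is a minimal Go sketch of the service-tree moves each process performs. The node names backup, kube-node-init, kube-node, maintain, and pre-offline.backup come from the text; the moveNode helper and the exact path strings are illustrative, not the platform's real API.

```go
package main

import "fmt"

// Service-tree nodes named in the lifecycle processes.
const (
	nodeBackup     = "backup"
	nodeInit       = "kube-node-init"
	nodeOnline     = "kube-node" // alerts attach here
	nodeMaintain   = "maintain"
	nodePreOffline = "pre-offline.backup"
)

// moveNode is a hypothetical helper that remounts a host under
// a different service-tree node via the service-tree API.
func moveNode(host, from, to string) {
	fmt.Printf("move %s: %s -> %s\n", host, from, to)
}

func onboard(host string) {
	moveNode(host, nodeBackup, nodeInit)
	moveNode(host, nodeInit, nodeOnline)
}

func maintain(host string) {
	moveNode(host, nodeOnline, nodeMaintain)
	moveNode(host, nodeMaintain, nodeInit) // re-onboard after repair
}

func decommission(host string) {
	moveNode(host, nodeOnline, nodePreOffline) // after migration
}

func main() {
	onboard("host-01")
	maintain("host-01")
	decommission("host-01")
}
```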
Process Decomposition and Requirement Analysis
After the standards were defined, the onboarding, offline, and maintenance processes were broken down step by step (see diagram). The analysis surfaced the following functional requirements for the platform (modeled in code after the list):
Long‑running, stream‑type tasks must be executed asynchronously.
The platform depends on many third‑party services, so it must support skip, retry, and pause operations.
Repeated steps should be freely composable to improve flexibility.
Tasks should be presented as work orders following a double‑check principle.
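One way to picture these requirements together: a work order is a list of composable steps, each of which can be paused, skipped, or retried, and the order runs only after a second person approves it. The sketch below captures that shape; all type and field names are assumptions, not mmachine's actual schema.

```go
package main

import "fmt"

// StepState tracks one step inside a work order.
type StepState int

const (
	StepPending StepState = iota
	StepRunning
	StepPaused  // operator paused a flaky third-party call
	StepSkipped // operator skipped a non-critical step
	StepFailed  // eligible for retry
	StepDone
)

// Step is a reusable unit (e.g. "remove taint") that flows can
// compose freely.
type Step struct {
	Name    string
	State   StepState
	Retries int
}

// WorkOrder groups steps and enforces the double-check rule.
type WorkOrder struct {
	ID       string
	Creator  string
	Approver string // must differ from Creator
	Steps    []Step
}

// Approved implements double-check: a second person signs off
// before the order may execute.
func (w *WorkOrder) Approved() bool {
	return w.Approver != "" && w.Approver != w.Creator
}

func main() {
	w := WorkOrder{ID: "wo-1", Creator: "alice", Approver: "bob",
		Steps: []Step{{Name: "remove-taint"}}}
	fmt.Println("approved:", w.Approved())
}
```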
Architecture Design and Code Development
The system is divided into four layers (sketched as interfaces after the list):
Admission layer – user and machine access control.
Control layer – work‑order creation, execution, and closure.
Scheduling layer – task scheduling and distribution.
Execution layer – step composition, task execution, and status feedback.
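One way to express the layer boundaries is as Go interfaces. The names below are illustrative, not the platform's real API.

```go
package lifecycle

import "context"

// Admission gates user and machine access.
type Admission interface {
	Authorize(ctx context.Context, user, host string) error
}

// Control owns work-order creation, execution, and closure.
type Control interface {
	CreateOrder(ctx context.Context, host string) (orderID string, err error)
	CloseOrder(ctx context.Context, orderID string) error
}

// Scheduler picks queued tasks and distributes them.
type Scheduler interface {
	Dispatch(ctx context.Context, orderID string) error
}

// Executor composes steps, runs tasks, and reports status back.
type Executor interface {
	Run(ctx context.Context, orderID string) error
	Report(ctx context.Context, orderID, status string) error
}
```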
Most services are written in Go. The admission and control layers use common web frameworks, while the scheduling and execution layers require an asynchronous task‑scheduling framework. Initially, the open‑source framework machinery was evaluated, but it lacked pause/skip support, so a custom scheduler was built. The first version of the host‑lifecycle service (named mmachine) was launched after two months.
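The pause/skip gap is the interesting part. One plausible design, sketched below under my own assumptions about mmachine's internals, gives each running step a control channel and checks it between units of work.

```go
package main

import (
	"fmt"
	"time"
)

type signal int

const (
	sigPause signal = iota
	sigResume
	sigSkip
)

// runStep executes a long-running step as small units of work,
// polling the control channel between units so an operator can
// pause, resume, or skip it (the features machinery lacked).
func runStep(name string, units int, ctrl <-chan signal) {
	for i := 0; i < units; i++ {
		select {
		case s := <-ctrl:
			if s == sigSkip {
				fmt.Println(name, "skipped")
				return
			}
			if s == sigPause {
				for s := range ctrl { // block until resume or skip
					if s == sigResume {
						break
					}
					if s == sigSkip {
						fmt.Println(name, "skipped")
						return
					}
				}
			}
		default:
		}
		time.Sleep(50 * time.Millisecond) // one unit of real work
	}
	fmt.Println(name, "done")
}

func main() {
	ctrl := make(chan signal, 1)
	go runStep("init-network", 20, ctrl)
	time.Sleep(200 * time.Millisecond)
	ctrl <- sigSkip
	time.Sleep(100 * time.Millisecond)
}
```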
Offline, Repair, and Re‑Onboarding
The first version automated only onboarding; offline and repair still required manual intervention. Because some services reached their containers by fixed IP, those containers could not be migrated freely, which made manual offlining risky. A "1-2-10" principle was introduced to decide when a machine can be safely decommissioned, based on each cluster's replica count and its number of non-running containers:
If a cluster has fewer than 5 replicas and more than 1 non-running container, the machine cannot be decommissioned.
If a cluster has 5–20 replicas and more than 2 non-running containers, the machine cannot be decommissioned.
If a cluster has more than 20 replicas and its non-running containers exceed 10% of total replicas, the machine cannot be decommissioned.
After container migration and a final safety check respecting the 1‑2‑10 rule, the machine is shut down.
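The rule translates directly into a predicate. A minimal sketch, using the thresholds quoted above (function and parameter names are mine):

```go
package main

import "fmt"

// canDecommission applies the 1-2-10 rule to one cluster that
// has containers on the host being drained.
func canDecommission(replicas, nonRunning int) bool {
	switch {
	case replicas < 5:
		return nonRunning <= 1
	case replicas <= 20:
		return nonRunning <= 2
	default: // more than 20 replicas
		return float64(nonRunning) <= 0.10*float64(replicas)
	}
}

func main() {
	fmt.Println(canDecommission(4, 2))   // false: small cluster
	fmt.Println(canDecommission(100, 8)) // true: under the 10% cap
}
```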
Backup Machine Management
To accelerate resource delivery, a backup-machine pool was created. Removing the taint from a backup machine lets it move from the offline pool to the public pool within minutes. The module provides automatic onboarding into the offline pool, rapid scaling into the public pool when capacity is needed, and fast return when the system team reclaims the machines.
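Assuming the backup pool is implemented with a Kubernetes node taint, promotion could look like the client-go sketch below. The taint key elastic-cloud/backup and the function name are hypothetical.

```go
package backup

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// backupTaint is a hypothetical taint key that keeps backup
// machines unschedulable while they sit in the offline pool.
const backupTaint = "elastic-cloud/backup"

// Promote removes the backup taint so the node can start
// accepting pods in the public pool within minutes.
func Promote(ctx context.Context, cs kubernetes.Interface, node string) error {
	n, err := cs.CoreV1().Nodes().Get(ctx, node, metav1.GetOptions{})
	if err != nil {
		return err
	}
	kept := n.Spec.Taints[:0]
	for _, t := range n.Spec.Taints {
		if t.Key != backupTaint {
			kept = append(kept, t)
		}
	}
	n.Spec.Taints = kept
	_, err = cs.CoreV1().Nodes().Update(ctx, n, metav1.UpdateOptions{})
	return err
}
```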
Cloud‑Native Management
Cloud-native expansion introduced new challenges: thousands of VMs are provisioned daily, and onboarding each one touches multiple downstream services, each with its own rate-limit policy. Initialization took more than five minutes per machine, and large-scale work orders could stall.
Solutions included the following (batching is sketched after the list):
Batching machines in groups of 200 to respect rate limits.
Using pre‑built VM images to skip lengthy initialization, reducing the onboarding flow from 23 steps to 10.
Adding an isRunning flag and indexing critical DB fields to avoid lock contention in the scheduler.
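The batching approach is simple to express: run at most 200 hosts concurrently, and finish each batch before starting the next so downstream rate limits hold. The batch size comes from the text; everything else below is illustrative.

```go
package main

import (
	"fmt"
	"sync"
)

const batchSize = 200 // matches downstream rate-limit budgets

// processInBatches splits hosts into groups of batchSize and
// completes each group before starting the next, so no downstream
// service ever sees more than batchSize concurrent requests.
func processInBatches(hosts []string, handle func(string) error) {
	for start := 0; start < len(hosts); start += batchSize {
		end := start + batchSize
		if end > len(hosts) {
			end = len(hosts)
		}
		var wg sync.WaitGroup
		for _, h := range hosts[start:end] {
			wg.Add(1)
			go func(h string) { // at most batchSize in flight
				defer wg.Done()
				if err := handle(h); err != nil {
					fmt.Println("retry later:", h, err)
				}
			}(h)
		}
		wg.Wait() // drain this batch before the next one starts
	}
}

func main() {
	hosts := make([]string, 450)
	for i := range hosts {
		hosts[i] = fmt.Sprintf("vm-%03d", i)
	}
	processInBatches(hosts, func(h string) error { return nil })
}
```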
Cost Optimization
Server costs dominate the elastic cloud budget. Through container governance, host scale-down, and idle-resource management, the platform reduces server spend. Machines are classified as online, buffer, or low-load. An idle-resource control module notifies owners, allows exemptions for machines expected to run at low load, and automatically returns machines that stay idle long term.
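As a rough illustration of the classification: the text names only the three buckets, so the thresholds below are assumptions.

```go
package cost

// MachineClass mirrors the three buckets used in cost review.
type MachineClass int

const (
	Online  MachineClass = iota // serving traffic
	Buffer                      // held back for bursts
	LowLoad                     // candidate for governance
)

// Classify is an illustrative rule, not the platform's real
// criteria: reserved capacity is buffer, near-zero utilization
// is low-load, everything else is online.
func Classify(cpuUtil float64, reserved bool) MachineClass {
	switch {
	case reserved:
		return Buffer
	case cpuUtil < 0.05:
		return LowLoad
	default:
		return Online
	}
}
```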
Accelerating Machine Decommissioning
Decommissioning remained slow due to serial container migration (≈10 min per host, max 50 hosts/day) and orphan containers. The process was changed to parallel migration per cluster, increasing throughput to 100 hosts per hour while still respecting the 1‑2‑10 rule. A three‑step notification and escalation policy ensures orphan containers are handled promptly.
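A sketch of the parallel shape, under my own assumptions about the callbacks involved: containers on a host are grouped by cluster, each cluster migrates in its own goroutine, and each goroutine re-checks the 1-2-10 rule before touching anything.

```go
package main

import (
	"fmt"
	"sync"
)

// migrateHost drains one host: containers that belong to
// different clusters move in parallel, but each cluster first
// passes the 1-2-10 safety gate so it never loses too many
// replicas at once.
func migrateHost(host string, byCluster map[string][]string,
	safe func(cluster string) bool,
	move func(cluster, container string) error) {

	fmt.Println("draining", host)
	var wg sync.WaitGroup
	for cluster, containers := range byCluster {
		wg.Add(1)
		go func(cluster string, containers []string) {
			defer wg.Done()
			if !safe(cluster) {
				fmt.Println(cluster, "skipped: would violate 1-2-10")
				return
			}
			for _, c := range containers {
				if err := move(cluster, c); err != nil {
					fmt.Println("migration failed:", cluster, c, err)
					return
				}
			}
		}(cluster, containers)
	}
	wg.Wait()
}

func main() {
	byCluster := map[string][]string{"svc-a": {"c1", "c2"}, "svc-b": {"c3"}}
	migrateHost("host-07", byCluster,
		func(string) bool { return true },
		func(cl, c string) error { fmt.Println("moved", cl, c); return nil })
}
```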
Low‑Load Resource Governance
In 2022, machines idle for more than ten days were identified and categorized as expected low load (e.g., new machine rooms not yet in service) or unexpected low load (e.g., forgotten resources). The idle-resource module automatically notifies owners, allows exemptions, and returns unclaimed machines, improving overall utilization.
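The governance flow can be read as a small decision function. The ten-day threshold comes from the text; the 30-day grace period before automatic return is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

type idleAction string

const (
	actionKeep   idleAction = "keep"           // below threshold or exempted
	actionNotify idleAction = "notify-owner"   // idle past ten days
	actionReturn idleAction = "return-to-pool" // owner never claimed it
)

// decide applies the notify -> exempt -> return flow to one machine.
func decide(idleSince time.Time, exempted bool, now time.Time) idleAction {
	idle := now.Sub(idleSince)
	switch {
	case idle < 10*24*time.Hour:
		return actionKeep
	case exempted: // expected low load, e.g. a new machine room
		return actionKeep
	case idle < 30*24*time.Hour: // grace period: an assumption
		return actionNotify
	default:
		return actionReturn
	}
}

func main() {
	now := time.Now()
	fmt.Println(decide(now.Add(-15*24*time.Hour), false, now)) // notify-owner
}
```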
Conclusion
The host lifecycle management platform (mmachine) transformed a manual, day‑level onboarding process into an automated, minute‑level workflow. Continuous optimizations—standardized processes, parallel task flows, backup‑machine management, cloud‑native adaptations, and cost‑saving measures—significantly improved efficiency, stability, and resource utilization for the elastic cloud.