Didi's Seven‑Layer Access Platform: Service Governance, Stability Practices, and Cloud‑Native Exploration
Didi’s Seven‑Layer Access Platform handles millions of QPS and hundreds of billions of daily requests across thousands of services. It delivers highly stable routing with sub‑millisecond average forwarding latency, built on an Nginx‑based data plane and a self‑developed control plane that covers service discovery, rate limiting, observability, and strict change controls. The platform is now evolving toward a cloud‑native, mesh‑enabled sidecar architecture.
Didi's Seven‑Layer Access Platform is responsible for all east‑west and north‑south HTTP traffic across the company. It handles peak request rates of several million QPS, daily request volumes of hundreds of billions, thousands of domain names, thousands of services, and tens of thousands of forwarding rules. Its stable and efficient operation is essential for guaranteeing Didi’s business continuity.
Overview
Since its launch at the end of 2014, the platform has grown to carry all of the company’s HTTP traffic, with peaks of millions of QPS, daily volumes in the hundreds of billions of requests, and tens of thousands of forwarding rules. Its targets are 99.99% availability and an average forwarding latency under 1 ms, in support of business stability and engineering efficiency.
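To make the 99.99% target concrete, the short calculation below converts an availability figure into a downtime budget. It is plain arithmetic rather than anything taken from Didi's tooling, and the function name is hypothetical.

```python
MINUTES_PER_DAY = 24 * 60


def downtime_budget_minutes(availability: float, days: int) -> float:
    """Maximum unavailable minutes allowed over `days` days.

    `availability` is a fraction, e.g. 0.9999 for "four nines".
    """
    return (1.0 - availability) * days * MINUTES_PER_DAY


if __name__ == "__main__":
    for days in (30, 365):
        budget = downtime_budget_minutes(0.9999, days)
        print(f"{days} days: {budget:.2f} minutes")
```

At four nines the budget is about 4.3 minutes per 30 days, or roughly 52.6 minutes per year, which is why a single badly handled change can consume the whole allowance.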
Architecture
The system is divided into a data plane and a control plane. The data plane is built on open‑source Nginx, providing high‑stability, high‑performance, multi‑protocol, and secure traffic ingress and service governance. The control plane includes a self‑developed configuration‑change platform, observability integration, service governance, service discovery, and security modules.
Service Governance Capability Building
Service Discovery: After evaluating four community solutions for dynamically updating Nginx upstreams, Didi designed its own upstream dynamic‑update module. It supports incremental server updates, exposes a lightweight RESTful HTTP reload API, stays fully compatible with existing configurations, and persists its state. The module is integrated with the company‑wide name service DSI (Didi Service Information), enabling discovery for thousands of services.
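The idea of an incremental server update can be sketched as a set difference between the servers an upstream currently holds and the list a name service reports. This is a minimal illustration, not the Didi module: the `Server` type and `diff_upstream` function are hypothetical, and the source does not describe the module's actual data model or API.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True, order=True)
class Server:
    """One upstream member (hypothetical model for illustration)."""
    host: str
    port: int
    weight: int = 1


def diff_upstream(
    current: Iterable[Server], desired: Iterable[Server]
) -> Tuple[List[Server], List[Server], List[Server]]:
    """Return (to_add, to_remove, to_update), each sorted for determinism.

    Servers are identified by (host, port); a change in any other
    attribute, such as weight, counts as an update.
    """
    cur = {(s.host, s.port): s for s in current}
    des = {(s.host, s.port): s for s in desired}
    to_add = sorted(des[k] for k in des.keys() - cur.keys())
    to_remove = sorted(cur[k] for k in cur.keys() - des.keys())
    to_update = sorted(
        des[k] for k in des.keys() & cur.keys() if des[k] != cur[k]
    )
    return to_add, to_remove, to_update
```

Applying only the three resulting lists, rather than rewriting the whole server list and reloading, is what keeps an update proportional to the size of the change instead of the size of the upstream.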
Pre‑plan Capabilities: The platform provides multi‑dimensional rate limiting (with automatic threshold recommendation and adaptive thresholds), traffic splitting for multi‑active deployments, and fine‑grained fault injection based on error‑rate and latency metrics.
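One common way to implement multi‑dimensional rate limiting is to keep a token bucket per combination of request attributes and admit a request only when every matching bucket has a token. The sketch below assumes that approach; the source does not say which algorithm the platform uses, and the class names, rule format, and dimension names (`domain`, `uri`) are illustrative.

```python
import time
from typing import Dict, List, Optional, Tuple


class TokenBucket:
    """Classic token bucket: `rate` tokens per second, capacity `burst`."""

    def __init__(self, rate: float, burst: int, now: float) -> None:
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def has_token(self, now: float) -> bool:
        # Refill according to elapsed time, then report availability.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        return self.tokens >= 1.0

    def take(self) -> None:
        self.tokens -= 1.0


class MultiDimLimiter:
    """Each rule is (dimension_names, rate, burst); all rules must pass."""

    def __init__(self, rules: List[Tuple[Tuple[str, ...], float, int]]) -> None:
        self.rules = rules
        self.buckets: Dict[tuple, TokenBucket] = {}

    def allow(self, request: Dict[str, str], now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        matched = []
        for dims, rate, burst in self.rules:
            key = (dims, tuple(request.get(d) for d in dims))
            bucket = self.buckets.get(key)
            if bucket is None:
                bucket = self.buckets[key] = TokenBucket(rate, burst, now)
            matched.append(bucket)
        # Check first, consume second, so a rejection spends no tokens.
        if all(b.has_token(now) for b in matched):
            for b in matched:
                b.take()
            return True
        return False
```

With one rule on `("domain",)` and another on `("domain", "uri")`, a hot URI is throttled by its own bucket without exhausting the budget shared by the rest of the domain.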
Observability: The platform emits standardized request data to the SRM monitoring system, providing fine‑grained, multi‑dimensional metrics and dashboards for the entire company’s HTTP traffic.
Stability Construction
Zero‑Risk Prevention: The team identified four major risk categories (code‑change anomalies, capacity shortage, external‑network HA gaps, and operational mistakes) and applies strict controls such as cross‑day change windows, double‑check procedures, and isolated operation domains.
Engine Upgrade: The team migrated from the legacy Tengine 2.1.0 (based on Nginx 1.6.2) to a modern Nginx version, fixing hidden bugs, adding missing features (e.g., worker_shutdown_timeout, reuseport), and improving extensibility. The upgrade fixed eight critical bugs, delivered fifteen functional improvements, resolved traffic jitter during releases, and improved the consistent‑hash algorithms.
Configuration‑Change Risk Control (Heim Platform): The Heim platform abstracts and constrains configuration models, provides syntax and semantic checks, enforces mandatory reviews, and supports staged rollouts, double‑check mechanisms, and automatic risk alerts. Together these controls have effectively eliminated configuration‑induced P1/P2 incidents.
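The value of a constrained configuration model is that it can be checked mechanically before any change reaches the engine. The following sketch shows what such syntax and semantic checks might look like for forwarding rules. The rule fields (`domain`, `path`, `upstream`) and the `check_rules` function are assumptions made for illustration; the source does not describe Heim's actual model or checks.

```python
from typing import Dict, List, Set, Tuple


def check_rules(rules: List[Dict[str, str]], known_upstreams: Set[str]) -> List[str]:
    """Return a list of human-readable problems; empty means the change passes."""
    errors: List[str] = []
    first_seen: Dict[Tuple[str, str], int] = {}
    for i, rule in enumerate(rules):
        # Syntax-level check: every required field must be present and non-empty.
        missing = [f for f in ("domain", "path", "upstream") if not rule.get(f)]
        if missing:
            errors.append(f"rule {i}: missing field(s) {', '.join(missing)}")
            continue
        if not rule["path"].startswith("/"):
            errors.append(f"rule {i}: path must start with '/'")
        # Semantic check: the target must be a service the platform knows about.
        if rule["upstream"] not in known_upstreams:
            errors.append(f"rule {i}: unknown upstream {rule['upstream']!r}")
        # Semantic check: two rules must not claim the same domain and path.
        key = (rule["domain"], rule["path"])
        if key in first_seen:
            errors.append(f"rule {i}: duplicates rule {first_seen[key]}")
        else:
            first_seen[key] = i
    return errors
```

A change pipeline would refuse to start a staged rollout while this list is non‑empty, so review effort goes to intent rather than to catching typos.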
Cloud‑Native Era Exploration
Multi‑Protocol Support: The team developed a Thrift codec module for Nginx that supports multiple serialization protocols (TBinaryProtocol, TCompactProtocol) and transports (TSocket, TFramedTransport), works without IDL, and offers a modular, high‑performance design.
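A proxy can work without IDL because the Thrift message envelope is self‑describing: the method name, message type, and sequence id can be read without knowing the argument schema. The sketch below decodes that envelope for one combination, TFramedTransport with strict‑mode TBinaryProtocol, following the published Thrift wire format. It is not the Didi Nginx module, and the function name is hypothetical.

```python
import struct
from typing import Tuple

VERSION_MASK = 0xFFFF0000
VERSION_1 = 0x80010000  # strict TBinaryProtocol version marker
MESSAGE_TYPES = {1: "CALL", 2: "REPLY", 3: "EXCEPTION", 4: "ONEWAY"}


def parse_message_header(data: bytes) -> Tuple[str, str, int]:
    """Return (method_name, message_type, seqid) from one framed message.

    Layout: 4-byte frame length, then 4-byte version|type word,
    4-byte name length, name bytes, 4-byte sequence id (all big-endian).
    """
    if len(data) < 4:
        raise ValueError("need 4 bytes for the frame length")
    (frame_len,) = struct.unpack_from(">i", data, 0)
    if frame_len < 0 or len(data) < 4 + frame_len:
        raise ValueError("incomplete frame")
    frame = data[4:4 + frame_len]
    if len(frame) < 8:
        raise ValueError("frame too short for a message header")
    word, name_len = struct.unpack_from(">Ii", frame, 0)
    if (word & VERSION_MASK) != VERSION_1:
        raise ValueError("not a strict TBinaryProtocol message")
    if name_len < 0 or len(frame) < 8 + name_len + 4:
        raise ValueError("truncated method name or sequence id")
    name = frame[8:8 + name_len].decode("utf-8")
    (seqid,) = struct.unpack_from(">i", frame, 8 + name_len)
    return name, MESSAGE_TYPES.get(word & 0xFF, "UNKNOWN"), seqid
```

The method name is enough for routing and rate limiting per interface, and the sequence id lets a proxy pair replies with calls, all without touching the serialized arguments.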
Meshification of the Access Engine: The team is investigating how to turn the centralized engine into a mesh component, in order to address extreme stability requirements, operational efficiency, shared‑cluster risks, and uncertain capacity boundaries. Early experiments focus on client‑side mesh integration to close traffic loops.
Future Outlook
The team envisions a cloud‑native access engine that becomes a sidecar‑based, universally applicable application‑layer traffic management and service‑governance layer, similar to TCP/IP or SDN. A one‑stop seven‑layer platform will provide self‑service capabilities for both developers and SREs, reducing communication overhead and improving delivery efficiency. Multi‑model microservice governance will combine SDK‑based, centralized, and mesh approaches to ensure long‑term stability and scalability.
Conclusion
The platform now supports large‑scale service governance (over 400 rate‑limit plans, 2,400 rate‑limited APIs, and 400 traffic‑split plans), company‑wide service discovery for thousands of services, and comprehensive observability. It maintains availability above 99.99% and average forwarding latency below 1 ms, has run stably for three years without interruption, and continues to evolve toward a cloud‑native, mesh‑enabled architecture.
