Baidu’s Low‑Intrusion, High‑Performance Service Mesh: Architecture & Lessons
This article details Baidu’s internal service‑mesh deployment, explaining why traditional RPC‑based governance fell short, how a sidecar‑based mesh decouples governance from frameworks, and the technical challenges and solutions for low‑intrusion, high‑performance, fault‑tolerant traffic management across tens of thousands of microservices.
Background
Baidu’s many product lines have migrated to microservices, creating tens of thousands of services that demand robust governance. Existing RPC frameworks (C++, Go, PHP, etc.) offered uneven capabilities, low efficiency, and poor global observability, leading to repeated development of advanced features like dynamic circuit breaking and timeout handling.
Why Service Mesh?
To address these pain points, Baidu introduced a service mesh that decouples governance from the underlying RPC frameworks and pushes capabilities to sidecars. The mesh provides unified stability functions and traffic‑control interfaces, built collaboratively across departments.
Technical Challenges
Low intrusion: Enable seamless migration for hundreds of product lines and millions of instances without code changes.
High performance: Reduce latency and CPU overhead to meet the strict requirements of core services like search and recommendation.
Heterogeneous system integration: Bridge multiple languages, unify interfaces, and integrate existing discovery, routing, and fault‑tolerance systems.
Mesh reliability: Ensure the mesh itself remains stable under production loads.
Overall Architecture
The solution is built on open‑source Istio + Envoy, customized for Baidu’s environment. Core components include a Mesh Control Center, Access Center (sidecar injection and version management), Configuration Center (stability and traffic policies), Operations Center (runtime interventions), Pilot (routing management), Envoy data plane, and integration with internal naming services and RPC adapters.
Access Methods
Traditional iptables‑based traffic hijacking caused high latency at scale. Baidu adopted a local loopback IP approach instead: Envoy intercepts service‑discovery requests and so hijacks traffic transparently, while a local naming agent monitors Envoy's health and falls back to direct connections if Envoy becomes unavailable.
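The fallback behavior described above can be sketched roughly as follows. This is a minimal illustration, not Baidu's implementation: the class `LocalNamingAgent`, the `Endpoint` type, the loopback address constant, and the two callbacks are all hypothetical names introduced here, and the sketch assumes that discovery answers are what steer traffic to the sidecar.

```python
from dataclasses import dataclass
from typing import Callable, List

# Assumption: the sidecar listens on the loopback interface of the same host.
LOOPBACK_HOST = "127.0.0.1"


@dataclass(frozen=True)
class Endpoint:
    host: str
    port: int


class LocalNamingAgent:
    """Hypothetical stand-in for the local naming agent.

    While the sidecar is healthy, discovery answers point at the loopback
    address so traffic flows through the proxy. When it is not, the agent
    returns the real backends and the caller connects directly.
    """

    def __init__(
        self,
        envoy_port: int,
        envoy_healthy: Callable[[], bool],
        lookup_real: Callable[[str], List[Endpoint]],
    ) -> None:
        self._envoy_port = envoy_port
        self._envoy_healthy = envoy_healthy  # hypothetical health probe
        self._lookup_real = lookup_real      # hypothetical naming lookup

    def resolve(self, service: str) -> List[Endpoint]:
        if self._envoy_healthy():
            # Hijack: hand back the sidecar's loopback listener.
            return [Endpoint(LOOPBACK_HOST, self._envoy_port)]
        # Fallback: bypass the mesh and return the real endpoints.
        return self._lookup_real(service)
```

Because the decision is made per discovery request, an unhealthy sidecar only removes governance features for that instance rather than taking its traffic down.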
For services that cannot use traffic hijacking, a proxy‑less mode adapts existing RPC frameworks to the xDS API served by Istio, allowing mesh‑based governance without sidecar injection.
Extreme Performance Optimization
Community Envoy exhibited high latency and CPU usage, especially under large fan‑out scenarios. Baidu identified the single‑process, multi‑threaded libevent model as the bottleneck and replaced it with the high‑performance bthread coroutine model from the internal brpc framework, creating a “brpc‑Envoy” variant. Users can switch between the original and high‑performance models via Pilot.
Benchmarks show >60% CPU reduction, >70% average latency reduction, and >75% tail‑latency improvement compared to open‑source Envoy and other industry solutions.
Stability Governance
The mesh provides unified fault‑tolerance, detection, and intervention capabilities. Advanced retry and circuit‑breaking strategies dynamically adjust based on latency percentiles, while feedback‑driven load balancing reduces impact from faulty instances. Unified degradation interfaces enable rapid, consistent response to large‑scale incidents.
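To make the idea of percentile‑driven circuit breaking concrete, here is a small sketch. It is an assumption‑laden illustration rather than the mesh's actual policy: the `LatencyBreaker` class, its parameters, and the nearest‑rank percentile method are all choices made for this example.

```python
import math
from collections import deque
from typing import Iterable


def percentile(samples: Iterable[float], q: float) -> float:
    """Nearest-rank percentile of a non-empty sample set, with q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q / 100 * len(ordered)))
    return ordered[rank - 1]


class LatencyBreaker:
    """Hypothetical breaker that opens when a latency percentile is too high.

    It keeps a sliding window of recent latencies and reports "open" when
    the chosen percentile exceeds the limit. A real breaker would also need
    half-open probing and error-rate signals, which are omitted here.
    """

    def __init__(self, window: int = 100, q: float = 99.0,
                 limit_ms: float = 200.0, min_samples: int = 20) -> None:
        self._samples = deque(maxlen=window)  # oldest samples drop off
        self._q = q
        self._limit_ms = limit_ms
        self._min_samples = min_samples

    def record(self, latency_ms: float) -> None:
        self._samples.append(latency_ms)

    def is_open(self) -> bool:
        # Too little data: stay closed rather than trip on noise.
        if len(self._samples) < self._min_samples:
            return False
        return percentile(self._samples, self._q) > self._limit_ms
```

Keying the decision on a percentile rather than the mean lets a single slow outlier pass while a sustained shift in tail latency trips the breaker.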
Deployments have increased availability from “two nines” to “four nines” and reduced avalanche‑type failures by over 44% year‑over‑year.
Traffic Management
By modeling service graphs with Istio CRDs and a custom configuration center, Baidu achieves global call‑graph visibility and fine‑grained traffic scheduling, including per‑instance traffic mirroring for testing and gradual rollouts.
Ecosystem Collaboration
The mesh’s unified control interface powers surrounding systems such as automatic parameter tuning, automatic fault mitigation, self‑healing, and traffic steering. These systems consume mesh metrics and policies via the xDS protocol, enabling consistent behavior across diverse RPC frameworks.
Self‑Stability Assurance
Multi‑level fallback mechanisms ensure traffic continuity: Envoy instances revert to direct connections on local failures, while an external control plane can blacklist problematic proxies within minutes. Configuration releases are staged with gray‑scale rollout controls, and regular chaos‑engineering tests inject failures to validate resilience.
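One common way to implement a staged (gray‑scale) configuration rollout is to hash each instance into a stable bucket and widen the admitted range stage by stage. The sketch below assumes that approach. The stage percentages, function names, and instance‑ID format are hypothetical, and the source does not say how Baidu selects instances.

```python
import hashlib

# Hypothetical rollout stages, as percentages of the fleet.
STAGES = (1, 10, 50, 100)


def bucket(instance_id: str) -> int:
    """Map an instance ID to a stable bucket in the range [0, 100)."""
    digest = hashlib.sha256(instance_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def receives_new_config(instance_id: str, stage_percent: int) -> bool:
    """True if the instance falls inside the current rollout stage.

    Buckets are deterministic, so an instance admitted at one stage stays
    admitted at every later, wider stage.
    """
    return bucket(instance_id) < stage_percent
```

With stable buckets, rolling back means lowering the percentage, and the same instances leave the rollout in reverse order.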
Conclusion
Since its inception in late 2019, Baidu’s mesh has been deployed across dozens of product lines, covering over 80% of core modules and handling traffic on the order of 10^14 requests per day. The platform delivers low‑intrusion, low‑cost, standardized service governance, dramatically reducing iteration cycles and improving overall system stability.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
21CTO
21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.
