Baidu's Internal Service Mesh Practice: Architecture, Challenges, and Optimizations
This article details Baidu's internal adoption of a service mesh built on Istio and Envoy, covering the motivations, architectural design, low‑intrusion integration methods, aggressive performance tuning, stability and traffic governance capabilities, the surrounding ecosystem of tools, and the resulting operational benefits.
The majority of Baidu's product lines have completed their microservice transformation, producing tens of thousands of services that demand stronger service governance. Traditional RPC‑framework‑based governance suffers from inconsistent capabilities across frameworks, low iteration efficiency, and insufficient global observability.
To fundamentally solve these pain points, Baidu introduced a service mesh that decouples governance from RPC frameworks, pushes capabilities to sidecars, and provides unified stability and traffic control interfaces across the organization.
The mesh implementation faced several technical challenges: achieving low intrusion for hundreds of product lines with millions of instances, maintaining ultra‑low latency for latency‑sensitive core services, integrating heterogeneous language frameworks and existing governance systems, and ensuring high reliability of the mesh itself.
The overall architecture is built on the open‑source Istio + Envoy stack, customized for Baidu's internal scenarios. It includes a Mesh Control Center (access, configuration, and operations), a control plane (istio‑pilot), a data plane (Envoy), dependency components such as the internal naming service, RPC adapters, monitoring, and PaaS support, as well as a surrounding governance ecosystem (auto‑tuning, fault auto‑location, chaos engineering, etc.).
Two transparent migration approaches were designed: a local loopback‑IP traffic‑hijacking scheme that injects sidecars via service discovery, and a proxy‑less solution that adapts various RPC frameworks directly to Istio's xDS protocol without traffic hijacking. Both enable zero code changes in business modules.
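To make the loopback‑hijacking idea concrete, here is a minimal sketch (all names, ports, and structures are illustrative assumptions, not Baidu's actual implementation): instead of rewriting iptables rules, the naming client returns the local sidecar address on the loopback interface, keeping the real destination as metadata so the sidecar can route to the actual upstream. Business code keeps resolving the logical service name unchanged.

```python
# Hypothetical sketch of loopback-IP traffic hijacking via service discovery.
# The naming client rewrites resolved endpoints to the local sidecar;
# the original destination survives as metadata for the sidecar's routing.

SIDECAR_PORT = 15001  # assumed local sidecar listener port


def resolve(service_name, naming_lookup, mesh_enabled=True):
    """Return endpoints for a service.

    naming_lookup: callable mapping service name -> list of (ip, port).
    With the mesh enabled, every endpoint is rewritten to the sidecar on
    127.0.0.1; the real destination is kept so the sidecar can pick the
    actual upstream instance.
    """
    real_endpoints = naming_lookup(service_name)
    if not mesh_enabled:
        return [{"ip": ip, "port": port} for ip, port in real_endpoints]
    return [
        {"ip": "127.0.0.1", "port": SIDECAR_PORT,
         "original_dst": f"{ip}:{port}"}
        for ip, port in real_endpoints
    ]


endpoints = resolve("search-backend", lambda name: [("10.0.0.5", 8000)])
```

Because the rewrite happens inside service discovery rather than in the network stack, the business module needs no code changes and no root‑level iptables configuration, which is what makes the approach low‑intrusion.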
Performance optimization revealed that the community Envoy’s single‑process libevent model caused significant latency and CPU overhead. Baidu extended Envoy with the high‑performance brpc bthread model, creating a brpc‑Envoy variant that reduces CPU usage by over 60% and average latency by more than 70%, with long‑tail latency improvements of around 75%. Ongoing research on eBPF and DPDK promises further gains.
Stability governance includes advanced fault tolerance (dynamic retries, circuit breaking), rapid fault detection (minutes instead of hours), and unified intervention and degradation policies. These measures lifted availability from two nines to four nines for key modules and cut losses from avalanche (cascading‑failure) incidents by 44% year over year.
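A minimal sketch of how bounded retries combine with circuit breaking (all thresholds and names here are invented for illustration; the article does not describe Baidu's actual policy): the breaker opens after consecutive failures so retries stop amplifying load on an unhealthy upstream, which is exactly the failure mode behind avalanche incidents.

```python
import time


class CircuitBreaker:
    """Toy circuit breaker: opens after `max_failures` consecutive
    failures, rejects calls while open, and allows a probe request
    after `reset_after` seconds (half-open). Thresholds illustrative."""

    def __init__(self, max_failures=3, reset_after=30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open).
        return time.monotonic() - self.opened_at >= self.reset_after

    def record(self, success):
        if success:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()


def call_with_retries(fn, breaker, retries=2):
    """Retry a failing call a bounded number of times, respecting the breaker."""
    for _ in range(retries + 1):
        if not breaker.allow():
            raise RuntimeError("circuit open")
        try:
            result = fn()
            breaker.record(True)
            return result
        except Exception:
            breaker.record(False)
    raise RuntimeError("all retries failed")
```

In a mesh, this logic lives in the sidecar rather than in each RPC framework, which is what makes the policy uniform across languages and tunable centrally.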
For traffic governance, Baidu built a global service graph using Istio CRDs, standardized golden‑metric storage, and fine‑grained traffic scheduling down to instance level, including traffic mirroring for testing. This greatly improved observability and control of complex call chains.
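As a concrete example of driving traffic scheduling through Istio CRDs, the sketch below builds an Istio VirtualService manifest, as a Python dict, that splits traffic between two subsets by weight and mirrors a copy to a shadow subset for testing. The host and subset names are invented for illustration; the manifest fields themselves (`route`, `weight`, `mirror`, `mirrorPercentage`) follow Istio's `networking.istio.io` API.

```python
def make_virtual_service(host, stable_subset, canary_subset,
                         canary_weight, mirror_subset=None):
    """Build an Istio VirtualService manifest (as a dict) that splits
    traffic by weight and optionally mirrors requests to a test subset.
    Host and subset names are caller-supplied examples."""
    http_rule = {
        "route": [
            {"destination": {"host": host, "subset": stable_subset},
             "weight": 100 - canary_weight},
            {"destination": {"host": host, "subset": canary_subset},
             "weight": canary_weight},
        ]
    }
    if mirror_subset is not None:
        # Mirrored traffic is fire-and-forget: responses are discarded,
        # so the shadow subset sees production traffic without risk.
        http_rule["mirror"] = {"host": host, "subset": mirror_subset}
        http_rule["mirrorPercentage"] = {"value": 100.0}
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": f"{host}-split"},
        "spec": {"hosts": [host], "http": [http_rule]},
    }


vs = make_virtual_service("search-backend", "stable", "canary", 10,
                          mirror_subset="shadow")
```

Generating such manifests programmatically from the global service graph is one way the scheduling decisions described above can be pushed down to individual instances.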
The mesh also enabled a surrounding ecosystem: an automatic parameter tuning system that adjusts timeout and weight settings based on real‑time metrics, a fault auto‑sense and mitigation system that automates pre‑plan execution, and standardized xDS protocols to integrate diverse surrounding tools.
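A toy sketch of the parameter‑auto‑tuning idea (the rule, constants, and function name are illustrative assumptions, not Baidu's actual policy): derive a timeout from a recent window of observed latencies, for example the p99 plus headroom, clamped to safe bounds, and push the result out through the mesh's configuration channel.

```python
import statistics


def tune_timeout_ms(latencies_ms, headroom=1.5, floor=50, ceiling=5000):
    """Suggest a timeout from observed latencies: take the p99 of the
    recent window, multiply by a headroom factor, and clamp to bounds.
    All constants are illustrative defaults, not production values."""
    if len(latencies_ms) < 2:
        return ceiling  # too little data: fail safe with the max timeout
    # statistics.quantiles with n=100 yields 99 cut points; index 98 is p99.
    p99 = statistics.quantiles(latencies_ms, n=100)[98]
    return int(min(ceiling, max(floor, p99 * headroom)))
```

The point of running this loop against mesh‑exported golden metrics, rather than per‑framework counters, is that one tuner can cover every service regardless of its RPC framework or language.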
Since the project’s inception at the end of 2019, the mesh has been deployed in dozens of product lines within two years, covering over 80% of core modules and handling trillions of daily requests with near‑zero integration cost. It has dramatically reduced governance iteration cycles, lowered operational costs, and significantly enhanced overall system stability.
Baidu Intelligent Testing