How Baidu Engineers Scalable Service Governance: Capacity, Traffic, and Stability
This interview details Baidu's practical approach to microservice governance, covering its definition, the evolution from ad‑hoc scaling to automated capacity, traffic, and stability engineering, and the challenges of data collection, standardized interfaces, and decision‑making policies for large‑scale systems.
In the second session of "Geek Talks," Baidu's recommendation architecture team shares how they define and implement service governance for their microservice ecosystem.
Q1: Definition of Service Governance
Service governance means keeping all microservices in a reasonable operating state, which includes:
Reasonable capacity: redundancy is neither excessive nor insufficient.
Health: services can tolerate short‑term anomalies without long‑term outages.
Observability: traffic topology and key metrics are transparent, monitorable, and controllable.
Q2: When Governance Started
Governance efforts began around late 2016 and early 2017 when the recommendation system was first built as a fast‑deployment "quick‑service" to support rapid business growth. By 2018, resource waste and technical debt became evident, prompting a shift toward sustainable, automated governance.
Q3: What Is Governed?
Three layers form the governance framework:
Capacity Governance: Automatic scaling mechanisms ensure reasonable resource utilization. Real‑time load monitoring and stress‑testing build capacity models for each application, establishing baselines to avoid waste.
Traffic Governance: Goals are observability, monitorability, controllability, and automation. Baidu adopts an open‑source Service Mesh (Istio + Envoy) with extensive customizations and performance tuning, providing unified load‑balancing, traffic‑break, traffic‑blackhole, and traffic‑mirroring capabilities.
Stability Engineering: A monitoring and alerting system is combined with chaos engineering, which injects failures at various levels (container, host, network) to expose hidden risks. A "Resilience Index" scores the system's tolerance to failures, guiding improvement.
Capacity Governance in Practice
The team built an Application Lifecycle Management (ALM) platform that automatically generates capacity models through stress testing, enabling precise resource allocation. In 2020, this approach reclaimed over 10,000 machines, and recent integration with INF provides real‑time feedback, bringing capacity adjustment closer to serverless dynamics.
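As a rough illustration of how a capacity model derived from stress testing could translate into an instance count, consider the sketch below. It assumes a per-instance QPS ceiling and a fixed redundancy factor; the names `CapacityModel`, `required_instances`, and `reclaimable` are hypothetical and are not part of ALM, whose actual model is not described in the interview.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class CapacityModel:
    """Hypothetical capacity model for a single application."""
    max_qps_per_instance: float  # sustainable QPS per instance, from stress tests
    redundancy: float = 0.25     # assumed headroom kept above peak load (25%)


def required_instances(model: CapacityModel, peak_qps: float) -> int:
    """Smallest instance count that serves peak_qps plus the headroom."""
    if model.max_qps_per_instance <= 0:
        raise ValueError("max_qps_per_instance must be positive")
    target_qps = peak_qps * (1.0 + model.redundancy)
    # Always keep at least one instance, even with no observed traffic.
    return max(1, math.ceil(target_qps / model.max_qps_per_instance))


def reclaimable(model: CapacityModel, peak_qps: float, deployed: int) -> int:
    """Instances that could be returned without dropping below the baseline."""
    return max(0, deployed - required_instances(model, peak_qps))
```

For example, with 500 QPS per instance, 25% headroom, and a 10,000 QPS peak, the baseline is 25 instances, so a deployment of 40 would have 15 reclaimable instances under these assumptions.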
Traffic Governance Details
Beyond the Service Mesh, Baidu constructs a global Service Graph that aggregates connection data and traffic metrics from both the mesh and the internal brpc framework. This graph offers a full view of service dependencies, simplifies data‑center migrations, and standardizes intervention capabilities even for services that are not fully mesh‑enabled.
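One way to picture such a graph is as a directed graph whose edges carry traffic metrics merged from several sources. The sketch below is a minimal in-memory version under that assumption; the `ServiceGraph` class, the `(caller, callee, qps)` record shape, and the service names are hypothetical and do not reflect Baidu's actual data model.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple

Edge = Tuple[str, str, float]  # (caller, callee, observed QPS); assumed record shape


class ServiceGraph:
    """Hypothetical dependency graph merged from several telemetry sources."""

    def __init__(self) -> None:
        self._qps: Dict[str, Dict[str, float]] = defaultdict(dict)

    def ingest(self, edges: Iterable[Edge]) -> None:
        """Merge edge records; QPS for the same caller/callee pair is summed."""
        for caller, callee, qps in edges:
            self._qps[caller][callee] = self._qps[caller].get(callee, 0.0) + qps

    def edge_qps(self, caller: str, callee: str) -> float:
        """Total observed QPS on one edge, or 0.0 if the edge is unknown."""
        return self._qps.get(caller, {}).get(callee, 0.0)

    def downstream(self, service: str) -> Set[str]:
        """All services reachable from `service` (breadth-first traversal)."""
        seen: Set[str] = set()
        queue = deque([service])
        while queue:
            current = queue.popleft()
            for callee in self._qps.get(current, {}):
                if callee not in seen and callee != service:
                    seen.add(callee)
                    queue.append(callee)
        return seen
```

A query such as `downstream("gateway")` then lists every service that would be affected by moving or throttling the gateway, which is the kind of question a data-center migration needs answered.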
Stability Engineering and Chaos Experiments
Chaos experiments are run regularly, injecting failures from single containers to entire clusters. Results are scored: common failures receive higher scores to encourage focus on frequent issues, while rare, complex failures receive lower scores. The scoring informs the "Resilience Index" used to assess overall system stability.
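The interview does not give the formula behind the index, so the following is only a sketch of one plausible scheme: each experiment carries a weight (larger for frequent failure types, as described above), and the index is the weighted share of experiments the system tolerated, scaled to 0-100. The `ExperimentResult` type, the weights, and the pass/fail criterion are all assumptions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ExperimentResult:
    """Hypothetical outcome of one chaos experiment."""
    name: str
    weight: float    # assumed: larger for common failure types
    tolerated: bool  # assumed: True if the service stayed healthy during injection


def resilience_index(results: Sequence[ExperimentResult]) -> float:
    """Weighted percentage (0-100) of injected failures the system tolerated."""
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("at least one experiment with positive weight is required")
    earned = sum(r.weight for r in results if r.tolerated)
    return 100.0 * earned / total
```

With this weighting, failing a common scenario such as a single container kill lowers the index more than failing a rare cluster-wide outage, which matches the stated intent of steering attention toward frequent issues.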
Automation Challenges
Three main challenges arise:
Acquiring reliable data for decision‑making, especially for service‑level and quality metrics that vary across business units.
Designing standardized operation interfaces: capacity interfaces via ALM and traffic interfaces via the Service Mesh control plane.
Balancing policy sensitivity and accuracy: overly aggressive automation can cause system jitter, while overly conservative policies respond to issues too slowly.
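The sensitivity trade-off in the last point is commonly handled with hysteresis and a cooldown window, sketched below. This is a generic technique rather than Baidu's documented policy; the `ScalingPolicy` class, its thresholds, and the 300-second cooldown are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class ScalingPolicy:
    """Hypothetical scaling policy with a dead band and a cooldown."""
    scale_up_at: float = 0.75    # assumed utilization that triggers scale-up
    scale_down_at: float = 0.40  # assumed utilization that triggers scale-down
    cooldown_s: float = 300.0    # assumed minimum gap between two actions
    _last_action_ts: float = float("-inf")

    def decide(self, utilization: float, now_s: float) -> str:
        """Return 'scale_up', 'scale_down', or 'hold' for one observation."""
        # Cooldown: ignore signals right after an action to avoid jitter.
        if now_s - self._last_action_ts < self.cooldown_s:
            return "hold"
        if utilization >= self.scale_up_at:
            self._last_action_ts = now_s
            return "scale_up"
        if utilization <= self.scale_down_at:
            self._last_action_ts = now_s
            return "scale_down"
        # Dead band between the two thresholds: no action.
        return "hold"
```

Widening the gap between the two thresholds or lengthening the cooldown makes the policy more conservative; narrowing them makes it more sensitive, at the cost of more frequent adjustments.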
Mindset Shift
The speaker emphasizes moving from manual, experience‑based processes to code‑embedded, observable, and automated mechanisms. Embedding operational knowledge into code prevents knowledge loss, reduces reliance on individual expertise, and lowers the cost of onboarding new developers.
