How Baidu Engineers Scalable Service Governance: Capacity, Traffic, and Stability

This interview details Baidu's practical approach to microservice governance, covering its definition, the evolution from ad‑hoc scaling to automated capacity, traffic, and stability engineering, and the challenges of data collection, standardized interfaces, and decision‑making policies for large‑scale systems.

Baidu Geek Talk

May 26, 2021

In the second session of "Geek Talks," Baidu's recommendation architecture team shares how they define and implement service governance for their microservice ecosystem.

Q1: Definition of Service Governance

Service governance means keeping all microservices in a reasonable operating state, which includes:

Reasonable capacity: redundancy is neither excessive nor insufficient.

Health: services can tolerate short‑term anomalies without long‑term outages.

Observability: traffic topology and key metrics are transparent, monitorable, and controllable.

Q2: When Governance Started

Governance efforts trace back to late 2016 and early 2017, when the recommendation system was first built for fast deployment (a "quick-service" approach) to support rapid business growth. By 2018, resource waste and technical debt had become evident, prompting a shift toward sustainable, automated governance.

Q3: What Is Governed?

Three layers form the governance framework:

Capacity Governance: Automatic scaling mechanisms keep resource utilization reasonable. Real-time load monitoring and stress testing build a capacity model for each application, establishing baselines that avoid waste.

Traffic Governance: The goals are observability, monitorability, controllability, and automation. Baidu adopts an open-source Service Mesh (Istio + Envoy) with extensive customizations and performance tuning, providing unified load-balancing, traffic-break, traffic-blackhole, and traffic-mirroring capabilities.

Stability Engineering: A monitoring and alerting system is combined with chaos engineering, which injects failures at various levels (container, host, network) to expose hidden risks. A "Resilience Index" scores the system's tolerance to failures and guides improvement.

Capacity Governance in Practice

The team built an Application Lifecycle Management (ALM) platform that automatically generates capacity models through stress testing, enabling precise resource allocation. In 2020, this approach reclaimed over 10,000 machines, and recent integration with INF provides real‑time feedback, bringing capacity adjustment closer to serverless dynamics.
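The idea of deriving a capacity model from stress-test data can be illustrated with a minimal sketch. It is not ALM's actual implementation: the `StressSample` fields, the CPU and latency limits, and the redundancy factor are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class StressSample:
    """One hypothetical stress-test measurement for a single instance."""
    qps: float       # load applied to the instance
    cpu_util: float  # observed CPU utilization, 0.0 to 1.0
    p99_ms: float    # observed p99 latency in milliseconds


def max_safe_qps(samples, cpu_limit=0.6, p99_limit_ms=100.0):
    """Highest tested QPS reached before any limit is first violated."""
    best = None
    for s in sorted(samples, key=lambda s: s.qps):
        if s.cpu_util > cpu_limit or s.p99_ms > p99_limit_ms:
            break  # stop at the first load level that breaks a limit
        best = s.qps
    if best is None:
        raise ValueError("no stress-test sample satisfies the limits")
    return best


def required_instances(peak_qps, per_instance_qps, redundancy=0.25):
    """Instances needed to serve peak traffic plus a redundancy margin."""
    return math.ceil(peak_qps * (1 + redundancy) / per_instance_qps)


samples = [
    StressSample(100, 0.20, 20),
    StressSample(200, 0.40, 35),
    StressSample(300, 0.58, 60),
    StressSample(400, 0.75, 140),
]
per_instance = max_safe_qps(samples)
print(per_instance, required_instances(12000, per_instance))
```

Comparing the computed instance count with what is actually deployed gives a baseline for spotting over-provisioned applications, which is the kind of waste the capacity model is meant to expose.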

Traffic Governance Details

Beyond the Service Mesh, Baidu constructs a global Service Graph that aggregates connection data and traffic metrics from both the mesh and the internal brpc framework. This graph offers a full view of service dependencies, simplifies data-center migrations, and standardizes intervention capabilities even for services that are not fully mesh-enabled.
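A rough sketch of how such a graph might be assembled is shown here. The edge format `(caller, callee, qps)`, the service names, and the assumption that each call is reported by exactly one source (so traffic can simply be summed) are hypothetical and not taken from Baidu's system.

```python
from collections import defaultdict, deque


def build_service_graph(*edge_sources):
    """Merge (caller, callee, qps) records from several sources into one graph."""
    graph = defaultdict(lambda: defaultdict(float))
    for source in edge_sources:
        for caller, callee, qps in source:
            graph[caller][callee] += qps  # assumes sources do not double-report
    return {caller: dict(callees) for caller, callees in graph.items()}


def downstream_of(graph, service):
    """All services reachable from `service`, found by breadth-first search."""
    seen = set()
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for callee in graph.get(current, {}):
            if callee not in seen and callee != service:
                seen.add(callee)
                queue.append(callee)
    return seen


# Hypothetical edges: one list from the mesh, one from the RPC framework.
mesh_edges = [("feed-api", "ranker", 1200.0), ("ranker", "feature-store", 3000.0)]
brpc_edges = [("feed-api", "ranker", 300.0), ("ranker", "user-profile", 900.0)]

graph = build_service_graph(mesh_edges, brpc_edges)
print(sorted(downstream_of(graph, "feed-api")))
```

A transitive query like `downstream_of` is the sort of question a migration plan needs answered: which services must be reachable from the new data center before a given service can move.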

Stability Engineering and Chaos Experiments

Chaos experiments are run regularly, injecting failures from single containers to entire clusters. Results are scored: common failures receive higher scores to encourage focus on frequent issues, while rare, complex failures receive lower scores. The scoring informs the "Resilience Index" used to assess overall system stability.
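The interview does not give the formula behind the index, so the following is only one plausible reading: each experiment carries a weight (larger for common failures), and the index is the weighted share of experiments the system tolerated. The `ChaosResult` type, the weights, and the 0 to 100 scale are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ChaosResult:
    """Outcome of one hypothetical chaos experiment."""
    fault: str       # description of the injected failure
    weight: float    # higher for common failures, lower for rare ones
    tolerated: bool  # True if the service stayed healthy during the fault


def resilience_index(results):
    """Weighted percentage of injected failures that the system tolerated."""
    total = sum(r.weight for r in results)
    if total <= 0:
        raise ValueError("at least one experiment with positive weight is required")
    earned = sum(r.weight for r in results if r.tolerated)
    return 100.0 * earned / total


results = [
    ChaosResult("single container killed", weight=5, tolerated=True),
    ChaosResult("host unreachable", weight=3, tolerated=True),
    ChaosResult("entire cluster down", weight=2, tolerated=False),
]
print(resilience_index(results))
```

Under this weighting, failing a common scenario costs more points than failing a rare one, which matches the stated intent of steering attention toward frequent issues first.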

Automation Challenges

Three main challenges arise:

Acquiring reliable data for decision‑making, especially for service‑level and quality metrics that vary across business units.

Designing standardized operation interfaces: capacity interfaces via ALM and traffic interfaces via the Service Mesh control plane.

Balancing policy sensitivity and accuracy: overly aggressive automation can cause system jitter, while overly conservative policies respond to problems too late.
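The third challenge can be made concrete with a small sketch of a common damping technique: separate high and low thresholds, a confirmation window, and a cooldown after each action. This is a generic pattern, not the policy Baidu describes, and the `ScalingPolicy` class and its default values are hypothetical.

```python
class ScalingPolicy:
    """Decides scale actions from utilization samples while damping jitter."""

    def __init__(self, high=0.75, low=0.35, confirm=3, cooldown=5):
        self.high = high          # scale out when utilization stays above this
        self.low = low            # scale in when utilization stays below this
        self.confirm = confirm    # consecutive samples needed before acting
        self.cooldown = cooldown  # samples to wait after an action
        self._above = 0
        self._below = 0
        self._cooldown_left = 0

    def observe(self, utilization):
        """Feed one sample; returns 'scale_out', 'scale_in', or 'hold'."""
        self._above = self._above + 1 if utilization > self.high else 0
        self._below = self._below + 1 if utilization < self.low else 0
        if self._cooldown_left > 0:
            self._cooldown_left -= 1
            return "hold"
        if self._above >= self.confirm:
            return self._act("scale_out")
        if self._below >= self.confirm:
            return self._act("scale_in")
        return "hold"

    def _act(self, action):
        # Reset the streak counters and start the cooldown window.
        self._above = self._below = 0
        self._cooldown_left = self.cooldown
        return action


policy = ScalingPolicy(confirm=3, cooldown=2)
for u in [0.9, 0.3, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]:
    print(u, policy.observe(u))
```

Raising `confirm` and `cooldown` makes the policy more conservative (fewer false actions, slower reaction), while lowering them makes it more sensitive; tuning those two knobs is the trade-off described in the list above.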

Mindset Shift

The speaker emphasizes moving from manual, experience‑based processes to code‑embedded, observable, and automated mechanisms. Embedding operational knowledge into code prevents knowledge loss, reduces reliance on individual expertise, and lowers the cost of onboarding new developers.
