
How Baidu Achieves 5‑9+ Availability: Inside Its Stability Engineering and Observability

This article dissects Baidu Search's ultra‑large micro‑service architecture, detailing the challenges of maintaining five‑nine‑plus availability, the diverse failure modes, and the step‑by‑step evolution of its observability stack—from early log‑only analysis to the kepler1.0/kepler2.0 tracing, full‑log indexing, custom span‑id generation, and compression techniques that together enable rapid root‑cause diagnosis at massive scale.

Baidu Geek Talk

Chapter 1: Challenges

In a massive micro‑service ecosystem, failures are inevitable, so Baidu treats them as a normal state and focuses on three availability‑governance dimensions: system resilience, loss‑mitigation mechanisms, and rapid cause‑identification. This article concentrates on the third dimension—speeding up root‑cause analysis.

Complex System vs. Strict Availability

Baidu Search consists of an offline pipeline that builds multi‑petabyte indexes and an online layer that serves billions of queries per day. The system spans hundreds of services, runs on hundreds of thousands of machines, and must maintain better than 99.999% uptime, which leaves a budget of only about five minutes of downtime per year.
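For a concrete sense of what that budget means, here is a quick back‑of‑the‑envelope check of the downtime allowed at different availability targets:

```python
# Downtime budget implied by an availability target (back-of-the-envelope check).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

def downtime_budget_minutes(availability: float) -> float:
    """Minutes of downtime per year allowed at a given availability level."""
    return (1.0 - availability) * MINUTES_PER_YEAR

print(downtime_budget_minutes(0.999))    # three nines -> ~525.6 minutes/year
print(downtime_budget_minutes(0.99999))  # five nines  -> ~5.26 minutes/year
```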

Diverse Stability Issues

Failures fall into three broad categories: (1) PV loss (queries that never return results), (2) search‑effect problems (missing or mis‑ranked results, slow response), and (3) capacity issues (resource exhaustion leading to crashes). All categories share a common need for rich data collection and automated analysis.

Chapter 2: Introducing and Localizing Solutions

Before 2014, Baidu relied on raw service logs and sparse metrics, which required heavy manual effort and left many blind spots. Guided by Google's Dapper tracing paper, Baidu built kepler1.0, a query‑sampling system that generates call chains and selective annotations, and complemented it with a Prometheus‑based metrics collector for container‑level observability.
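As an illustration only (the hash function, sampling rate, and policy below are assumptions, not kepler1.0's actual implementation), a Dapper‑style head‑based sampling decision can be as simple as hashing the query ID at the entry point:

```python
import hashlib

# Hypothetical sampling rate; kepler1.0's real rate and policy are not stated here.
SAMPLE_RATE = 1 / 1000

def should_trace(query_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Head-based sampling: the decision is made once at the entry point and
    propagated downstream so every service in the call chain agrees on it."""
    digest = hashlib.md5(query_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate
```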

2.1 kepler1.0 Overview

kepler1.0 captures sampled query traces and exposes them through a web UI, letting engineers retrieve the full call graph and associated logs for a given query ID.

2.2 Metrics Collection Exploration

By integrating an open‑source Prometheus exporter with Baidu's PaaS metadata, kepler1.0 added multi‑dimensional container metrics, allowing fine‑grained capacity monitoring in mixed‑deployment clusters.
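A minimal sketch of that idea, assuming a Prometheus client library and illustrative label names (Baidu's internal exporter and metadata schema are not public): container metrics become multi‑dimensional by attaching PaaS metadata as labels.

```python
from prometheus_client import Gauge, start_http_server

# Label names here are illustrative assumptions, not Baidu's internal schema.
container_cpu = Gauge(
    "container_cpu_usage_cores",
    "CPU cores in use per container",
    ["app", "instance", "cluster", "host"],
)

def report(sample: dict) -> None:
    """Attach PaaS metadata to each raw container sample so the metric can be
    sliced by app, cluster, or host in capacity dashboards."""
    container_cpu.labels(
        app=sample["app"],
        instance=sample["instance"],
        cluster=sample["cluster"],
        host=sample["host"],
    ).set(sample["cpu_cores"])

if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus to scrape
```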

2.3 Early Impact

Scenario 1 – Rejection & Effect Issues: forced sampling of problematic queries provided deterministic trace and log data, dramatically reducing manual analysis time.

Scenario 2 – Speed Problems: precise timestamps in the call‑graph revealed long‑tail processing stages, leading to async TCP‑connect and callback‑blocking optimizations.

Scenario 3 – Capacity Issues: a new container‑level metric system supplied the missing visibility for capacity‑management dashboards.

Chapter 3: Innovation – Unlocking Value

3.1 Motivation

Open‑source tracing solutions could neither handle Baidu's scale (tens of petabytes of logs, trillions of spans) nor provide the completeness needed to debug effect‑related issues. The goal became full‑query tracing with minimal resource overhead.

3.2 Full‑Log Indexing

Each machine creates a location index composed of ip, inode, offset, and length (20 bytes total). This index is stored locally and enables O(1) retrieval of any log segment without scanning files. Flexible secondary indexes (e.g., query terms, user IDs) support ad‑hoc queries and stream‑processing pipelines.
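A minimal sketch of such a fixed‑width location record; the individual field widths (4‑byte IPv4, 8‑byte inode, 4‑byte offset, 4‑byte length) are assumptions chosen to match the 20‑byte figure above, not a documented layout.

```python
import socket
import struct

# 20-byte record: 4-byte IPv4 + 8-byte inode + 4-byte offset + 4-byte length.
# Field widths are an assumption consistent with the 20-byte figure above.
LOCATION_FORMAT = "!4sQII"

def pack_location(ip: str, inode: int, offset: int, length: int) -> bytes:
    return struct.pack(LOCATION_FORMAT, socket.inet_aton(ip), inode, offset, length)

def unpack_location(record: bytes):
    ip_raw, inode, offset, length = struct.unpack(LOCATION_FORMAT, record)
    return socket.inet_ntoa(ip_raw), inode, offset, length

def read_log_segment(path: str, offset: int, length: int) -> bytes:
    """Seek straight to the indexed byte range instead of scanning the file."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)
```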

3.2.1 Full Call‑Graph

A span consists of a parent span‑id, a span‑id, the child ip:port, and start/end timestamps. kepler2.0 replaces the random‑number span‑id with a deterministic derivation: starting from 0 at the root, each downstream hop adds its IP value, so a unique span‑id propagates through the call chain without the parent span‑id ever being stored. This reduces per‑span writes from four records to two, cutting storage by roughly 60%.
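A literal reading of that scheme as a hedged sketch (treating the IP as a 32‑bit integer is an assumption, and collision handling and the role of the port are not covered here):

```python
import socket
import struct

ROOT_SPAN_ID = 0  # the root of the call graph starts at zero

def ip_to_int(ip: str) -> int:
    """Interpret a dotted-quad IPv4 address as a 32-bit integer."""
    return struct.unpack("!I", socket.inet_aton(ip))[0]

def child_span_id(parent_span_id: int, child_ip: str) -> int:
    """Derive the downstream span-id deterministically from the parent's id and
    the child's IP, so the parent span-id never has to be stored with the span."""
    return parent_span_id + ip_to_int(child_ip)

# Propagation example: root -> service A -> service B.
a_id = child_span_id(ROOT_SPAN_ID, "10.12.34.56")
b_id = child_span_id(a_id, "10.98.76.54")
```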

3.2.2 Data Compression

Custom compression exploits domain characteristics (a small sketch of the delta‑plus‑varint idea follows this list):

Timestamp deltas + PForDelta achieve ~70% reduction for high‑fan‑out services.

Because internal hosts sit in the 10.0.0.0/8 space, IP addresses are stored as their last three bytes, saving 25% per address.

Protobuf varint encoding stores small integers in only as many bytes as they need, rather than in fixed 64‑bit fields.

Packed repeated fields eliminate per‑element tags, saving ~25% for a typical fan‑out of 40.
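A hedged sketch of the first two ideas combined: delta‑encode the timestamps of a span's downstream calls, then varint‑encode the small deltas. This illustrates the principle rather than Baidu's exact PForDelta implementation.

```python
from typing import List

def encode_varint(value: int) -> bytes:
    """Protobuf-style varint: 7 bits per byte, high bit marks continuation."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_timestamps(timestamps_us: List[int]) -> bytes:
    """Store the first timestamp absolutely, then only the deltas between
    successive calls (assumed sorted, so deltas are non-negative); small deltas
    varint-encode into 1-2 bytes instead of 8-byte absolute microsecond values."""
    encoded = bytearray(encode_varint(timestamps_us[0]))
    for prev, cur in zip(timestamps_us, timestamps_us[1:]):
        encoded += encode_varint(cur - prev)
    return bytes(encoded)

# Example: 40 downstream calls issued within a few milliseconds of each other.
ts = [1_700_000_000_000_000 + 150 * i for i in range(40)]
print(len(encode_timestamps(ts)), "bytes vs", 8 * len(ts), "bytes uncompressed")
```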

3.2.3 Application Benefits

Full‑log and full‑call‑graph capabilities enable concrete use cases:

Historical Query Recall: engineers traced a missing Baidu Baike result to a failed shard in the index library, fixed it, and restored the expected recall.

Cache‑Induced Effect Bugs: by linking cache‑write queries (the "disturber") with downstream queries (the "victim") through horizontal correlation (see the sketch after this list), Baidu identified and repaired dirty‑cache scenarios that previously broke the trace chain.

Capacity Auditing: container‑level metrics combined with log indexes allowed rapid consumption‑audit queries across billions of logs.
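A toy sketch of that horizontal correlation (record fields and the join logic are assumptions for illustration, not kepler2.0's actual pipeline): traces are grouped by the cache key they touched, and each later read is paired with the write that populated the entry.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical trace records; field names are assumptions for illustration.
cache_writes = [
    {"query_id": "q1", "cache_key": "k42", "ts": 100},  # the "disturber"
]
cache_reads = [
    {"query_id": "q2", "cache_key": "k42", "ts": 250},  # a potential "victim"
    {"query_id": "q3", "cache_key": "k99", "ts": 260},
]

def correlate(writes: List[dict], reads: List[dict]) -> Dict[str, List[str]]:
    """Group traces by cache key, then pair each later read with the earlier
    write that populated that entry, linking disturber and victim queries."""
    writers_by_key = defaultdict(list)
    for w in writes:
        writers_by_key[w["cache_key"]].append(w)
    victims = defaultdict(list)
    for r in reads:
        for w in writers_by_key.get(r["cache_key"], []):
            if w["ts"] <= r["ts"]:
                victims[w["query_id"]].append(r["query_id"])
    return dict(victims)

print(correlate(cache_writes, cache_reads))  # {'q1': ['q2']}
```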

These data‑driven observability improvements eliminated many blind spots, turning root‑cause analysis from a bottleneck into a streamlined process. The next phase will focus on abstracting manual analysis experience into automated, intelligent fault‑resolution workflows.

Conclusion: By building a zero‑cost, in‑place log index, a deterministic span‑id scheme, and a highly compressed call graph, Baidu Search achieved near‑real‑time, full‑coverage observability that supports its five‑nine‑plus availability target.

Tags: observability, metrics, distributed tracing, large-scale systems, Baidu Search, kepler, availability engineering