How Baidu Achieved Five-Nines-Plus Availability: Inside Its Tracing and Observability Innovations
This article examines Baidu Search's massive microservice architecture and the observability, tracing, and metrics techniques (Kepler 1.0, Kepler 2.0, and Prometheus integration) that enable five-nines-plus availability, debugging of any individual query, and efficient capacity management.
Baidu Search is one of the largest, most stable internet services, widely used as a reliability benchmark. The article explores the fine‑grained techniques behind its high availability, focusing on stability analysis, tracing, and metrics.
Chapter 1: Challenges
In massive micro‑service systems, failures are inevitable. Availability governance targets three pillars: system resilience, loss‑mitigation, and rapid root‑cause analysis; the article concentrates on the third.
Baidu Search faces three primary fault categories: PV loss (failure to return query results), search‑result quality issues, and capacity failures.
Chapter 2: Introducing and Localizing Solutions
Before 2014, fault analysis relied on sparse logs and metrics, which made it slow and heavily manual. Inspired by Google's Dapper tracing system, Baidu built Kepler 1.0 for query-level tracing and adopted Prometheus-compatible metrics, creating a hybrid observability platform.
2.1 Kepler 1.0 Overview
Kepler 1.0 provides end‑to‑end tracing and partial annotation for each query, enabling quick lookup of the processing path across services.
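The lookup described above can be illustrated with a Dapper-style span model. This is only a minimal sketch, not Kepler's actual data model: the Span fields, the call_paths helper, and the service names are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    trace_id: str             # one id per query (hypothetical field names)
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    service: str


def call_paths(spans: List[Span], trace_id: str) -> List[str]:
    """Return every root-to-leaf service path recorded for one query."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        if span.trace_id == trace_id:
            children.setdefault(span.parent_id, []).append(span)

    paths: List[str] = []

    def walk(span: Span, prefix: List[str]) -> None:
        path = prefix + [span.service]
        kids = children.get(span.span_id, [])
        if not kids:
            paths.append(" -> ".join(path))
        for kid in kids:
            walk(kid, path)

    for root in children.get(None, []):
        walk(root, [])
    return paths


# Hypothetical spans for two queries.
SPANS = [
    Span("q1", "a", None, "frontend"),
    Span("q1", "b", "a", "ranker"),
    Span("q1", "c", "b", "index"),
    Span("q1", "d", "a", "ads"),
    Span("q2", "e", None, "frontend"),
]
```

Calling call_paths(SPANS, "q1") walks only the spans that share the query's trace id, which is what makes a per-query lookup cheap once spans are keyed by that id.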
2.2 General Metrics Collection
Prometheus exporters were integrated with Baidu's PaaS system to expose container‑level metrics, supporting multi‑dimensional capacity management for mixed‑deployment clusters.
2.3 Early Application Scenarios
Rejected queries and result-quality issues – forcing a problematic query to be sampled made its trace and logs directly retrievable.
Speed issues – fine-grained timestamps reconstructed the timing of the full call chain, which led to optimizations such as asynchronous TCP connects and callback handling.
Capacity issues – multi‑dimensional container metrics enabled precise resource auditing and consumption analysis.
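The forced-sampling idea in the first scenario can be sketched as a sampling decision with an override. This is an assumption about how such a switch might look, not Kepler's implementation: should_trace, its parameters, and the hash-based bucketing are all hypothetical.

```python
import hashlib


def should_trace(query_id: str, sample_rate: float, forced: bool = False) -> bool:
    """Decide whether to record a trace for this query.

    A forced query is always traced. Otherwise the query id is hashed into
    [0, 1) so that every service reaches the same decision for the same query.
    """
    if forced:
        return True
    digest = hashlib.sha256(query_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing the query id rather than drawing a random number keeps the decision consistent across services, so a sampled query is traced end to end instead of in fragments.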
Chapter 3: Innovation – Unlocking Value
Kepler 1.0 and Prometheus made the system observable, but because tracing was sampled, most queries left no trace. Kepler 2.0 decoupled tracing from logging and collects traces and logs for every query with minimal overhead.
3.1 Full‑log Indexing
Logs are indexed in place: each entry is identified by a four-field "location" (ip, inode, offset, length), so it can be fetched directly from the host and file that hold it, in O(1), without scanning the entire file system.
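A minimal sketch of how such a location could be produced and resolved on a single host follows. The function names and the inode-to-path map are assumptions for illustration; the article does not describe how Kepler 2.0 resolves an inode to a file, and the ip field is carried here only to show where a request would be routed.

```python
import os
from typing import Dict, NamedTuple


class Location(NamedTuple):
    ip: str      # host holding the log file (used for routing, not read here)
    inode: int   # identifies the file even if it is renamed by log rotation
    offset: int  # byte offset of the entry within the file
    length: int  # entry size in bytes


def append_log(path: str, line: bytes, ip: str) -> Location:
    """Append one log entry and return the location that indexes it."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(line)
    return Location(ip, os.stat(path).st_ino, offset, len(line))


def build_inode_map(log_dir: str) -> Dict[int, str]:
    """Map inode numbers to paths for the regular files in one directory."""
    return {
        os.stat(entry.path).st_ino: entry.path
        for entry in os.scandir(log_dir)
        if entry.is_file()
    }


def read_log(loc: Location, inode_map: Dict[int, str]) -> bytes:
    """Fetch exactly one entry with a single seek and read."""
    with open(inode_map[loc.inode], "rb") as f:
        f.seek(loc.offset)
        return f.read(loc.length)
```

The index stores only four small fields per entry while the log text stays where it was written, which is why this style of indexing adds little storage and no copy of the log itself.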
3.2 Full‑call‑graph
A derived span_id algorithm lets each span's place in the call graph be computed from its own ID, which eliminates the need to store parent IDs and cuts storage by about 60 %. Custom compression (timestamp deltas, IP truncation, protobuf varint, and packed repeated fields) further reduces size.
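The article does not spell out the derivation rule, so the sketch below substitutes one simple possibility: a dotted path in which the parent's ID is a prefix of the child's. It also shows timestamp deltas encoded as base-128 varints, the same wire encoding protobuf uses for integers. All function names are hypothetical, and the dotted string is chosen for readability rather than compactness.

```python
from typing import List, Optional, Tuple


def child_span_id(parent_id: str, call_index: int) -> str:
    """Derive a child's id from its parent's id and the call's position."""
    return f"{parent_id}.{call_index}"


def parent_of(span_id: str) -> Optional[str]:
    """Recover the parent id from the span id alone (None for the root)."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a base-128 varint."""
    if n < 0:
        raise ValueError("varint sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def pack_timestamps(timestamps: List[int]) -> bytes:
    """Store the first timestamp and then each delta, all as varints."""
    out = bytearray()
    prev = 0
    for ts in timestamps:  # assumed to be non-decreasing
        out += encode_varint(ts - prev)
        prev = ts
    return bytes(out)


def unpack_timestamps(data: bytes) -> List[int]:
    """Invert pack_timestamps by summing the decoded deltas."""
    out: List[int] = []
    pos = 0
    prev = 0
    while pos < len(data):
        delta, pos = decode_varint(data, pos)
        prev += delta
        out.append(prev)
    return out
```

Because consecutive timestamps within one query differ by small amounts, each delta usually fits in one or two varint bytes instead of a fixed eight, which is the kind of saving the compression list above points at.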
3.3 Impact
Full tracing made it possible to diagnose recall failures for individual past queries, to follow cache-related result-quality bugs along the call chain, and to correlate related queries with one another, dramatically reducing blind spots in fault analysis.
These data‑centric advances eliminated major obstacles in problem investigation and set the stage for automated, intelligent failure handling in Baidu Search.
21CTO
