How Baidu Achieved 5‑9+ Availability: Inside Its Tracing and Observability Innovations

This article examines the massive micro-service architecture behind Baidu Search and the observability, tracing, and metrics techniques (Kepler 1.0, Kepler 2.0, and Prometheus integration) that support better-than-five-nines availability, full-query debugging, and efficient capacity management.

21CTO

Jul 11, 2021

Baidu Search is one of the largest, most stable internet services, widely used as a reliability benchmark. The article explores the fine‑grained techniques behind its high availability, focusing on stability analysis, tracing, and metrics.

Chapter 1: Challenges

In massive micro‑service systems, failures are inevitable. Availability governance targets three pillars: system resilience, loss‑mitigation, and rapid root‑cause analysis; the article concentrates on the third.

Baidu Search faces three primary fault categories: PV loss (failure to return query results), search‑result quality issues, and capacity failures.
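
To make the "five nines" target concrete, the arithmetic can be sketched as follows. Treating availability as the fraction of queries that return results is an assumption made here for illustration, and the function names and query counts are hypothetical, not Baidu's figures.

```python
# Illustrative arithmetic only; the query counts used below are made up.

def availability(total_queries: int, lost_queries: int) -> float:
    """Fraction of queries (PV) that returned results."""
    if total_queries <= 0:
        raise ValueError("total_queries must be positive")
    return 1.0 - lost_queries / total_queries


def yearly_downtime_minutes(target: float) -> float:
    """Minutes of full outage per 365-day year that a target allows."""
    return (1.0 - target) * 365 * 24 * 60


if __name__ == "__main__":
    # 10 lost queries per million corresponds to five nines.
    print(availability(1_000_000, 10))
    # Five nines leaves a budget of roughly 5.26 minutes per year.
    print(round(yearly_downtime_minutes(0.99999), 2))
```

At this level, a single outage of a few minutes consumes the whole yearly budget, which is why fast root-cause analysis matters as much as resilience.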

Chapter 2: Introducing and Localizing Solutions

Before 2014, fault analysis relied on sparse logs and metrics, leading to low efficiency and heavy manual effort. Inspired by Dapper, Baidu built Kepler 1.0 for query‑level tracing and adopted Prometheus‑compatible metrics, creating a hybrid observability platform.

2.1 Kepler 1.0 Overview

Kepler 1.0 provides end‑to‑end tracing and partial annotation for each query, enabling quick lookup of the processing path across services.
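
A Dapper-style trace can be pictured as a set of spans that share a trace ID and point at their parents. The sketch below is a minimal illustration under that assumption; the `Span` fields, the `call_path` helper, and the service names are hypothetical and do not describe Kepler's actual data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Span:
    trace_id: str                 # shared by every span of one query
    span_id: str                  # unique within the trace
    parent_id: Optional[str]      # None for the root span
    service: str
    start_us: int
    end_us: int
    # (timestamp, message) pairs attached by the service
    annotations: List[Tuple[int, str]] = field(default_factory=list)


def call_path(spans: List[Span], trace_id: str) -> List[str]:
    """Return the services of one trace in depth-first call order."""
    children: Dict[Optional[str], List[Span]] = {}
    for s in spans:
        if s.trace_id == trace_id:
            children.setdefault(s.parent_id, []).append(s)

    path: List[str] = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for s in sorted(children.get(parent_id, []), key=lambda x: x.start_us):
            path.append("  " * depth + s.service)
            visit(s.span_id, depth + 1)

    visit(None, 0)
    return path


# Hypothetical spans for two queries, t1 and t2.
SPANS = [
    Span("t1", "1", None, "frontend", 0, 900),
    Span("t1", "2", "1", "ranker", 100, 700),
    Span("t1", "3", "2", "index", 150, 600),
    Span("t1", "4", "1", "cache", 710, 800),
    Span("t2", "1", None, "frontend", 0, 50),
]

if __name__ == "__main__":
    print("\n".join(call_path(SPANS, "t1")))
```

Looking up a query then amounts to filtering by its trace ID and walking the parent links, which is what makes the processing path quick to reconstruct.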

2.2 General Metrics Collection

Prometheus exporters were integrated with Baidu's PaaS system to expose container‑level metrics, supporting multi‑dimensional capacity management for mixed‑deployment clusters.

2.3 Early Application Scenarios

Rejected queries and result-quality issues – forcing the sampling of problematic queries made their traces and logs directly retrievable.

Speed issues – fine-grained timestamps restored the timing of the full call chain, which led to optimizations such as asynchronous TCP connects and callback handling.

Capacity issues – multi-dimensional container metrics enabled precise resource auditing and consumption analysis.
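
The forced sampling mentioned in the first scenario can be sketched as a sampling decision with an override. This is a minimal illustration, assuming a per-query `forced` flag and hash-based sampling; the function name, the flag, and the bucket count are hypothetical and not taken from Kepler.

```python
import zlib


def should_trace(query_id: str, forced: bool, sample_rate: float) -> bool:
    """Decide whether to record a trace for this query.

    A forced query is always traced, so an engineer can replay a
    problematic query and be sure its trace and logs are kept.
    Other queries are sampled deterministically by hashing the ID,
    so every service reaches the same decision for the same query.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    if forced:
        return True
    bucket = zlib.crc32(query_id.encode("utf-8")) % 10_000
    return bucket < int(sample_rate * 10_000)


if __name__ == "__main__":
    print(should_trace("q-123", forced=True, sample_rate=0.0))   # forced wins
    print(should_trace("q-123", forced=False, sample_rate=0.0))  # never sampled
```

Hashing the query ID rather than drawing a random number keeps the decision consistent across services, so a sampled trace is never missing spans from the middle of the chain.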

Chapter 3: Innovation – Unlocking Value

Kepler 1.0 and Prometheus made the system observable, but sampling meant that only a fraction of queries could be traced. Kepler 2.0 decoupled tracing from logging and achieved trace and log collection for every query with minimal overhead.

3.1 Full‑log Indexing

Logs are indexed in place using a four-field "location" (ip, inode, offset, length). Given a location, a log entry can be fetched in O(1) with a direct seek, without scanning the file system.
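
The idea of a location can be sketched on a single host as follows. This is a minimal illustration, assuming a local agent keeps an inode-to-path map; the `Location` type, the helper names, and the log lines are hypothetical, and the `ip` field, which would select the host in a real deployment, is carried but not used here.

```python
import os
import tempfile
from typing import Dict, NamedTuple


class Location(NamedTuple):
    ip: str       # host that holds the log file
    inode: int    # identifies the file even if it is renamed or rotated
    offset: int   # byte offset of the entry within the file
    length: int   # byte length of the entry


def append_log(path: str, ip: str, line: bytes) -> Location:
    """Append one log entry and return where it landed."""
    with open(path, "ab") as f:
        f.seek(0, os.SEEK_END)
        offset = f.tell()
        f.write(line)
    return Location(ip, os.stat(path).st_ino, offset, len(line))


def read_log(inode_to_path: Dict[int, str], loc: Location) -> bytes:
    """Fetch one entry with a single seek and read, no scanning."""
    with open(inode_to_path[loc.inode], "rb") as f:
        f.seek(loc.offset)
        return f.read(loc.length)


def demo() -> bytes:
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "search.log")
        append_log(path, "10.0.0.1", b"query=q1 status=ok\n")
        loc = append_log(path, "10.0.0.1", b"query=q2 status=reject\n")
        # Assumed: a host-local agent maintains this inode-to-path map.
        return read_log({loc.inode: path}, loc)


if __name__ == "__main__":
    print(demo())
```

Only the small location tuple needs to be stored in the index; the log text itself stays where the service wrote it.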

3.2 Full‑call‑graph

A derived span_id algorithm eliminates the need to store parent IDs, cutting storage by about 60%. Custom compression (timestamp deltas, IP truncation, protobuf varint encoding, and packed repeated fields) further reduces size.
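
The source does not spell out the derivation rule, so the sketch below assumes one simple possibility: a child's span ID is its parent's ID plus the index of the call, which makes the parent recoverable from the ID alone. It also shows timestamp deltas written as protobuf-style varints. All function names and sample values are hypothetical.

```python
from typing import List, Optional


def child_span_id(parent_span_id: str, call_index: int) -> str:
    """Derive a child ID from its parent (assumed scheme, e.g. '0.2.1')."""
    return f"{parent_span_id}.{call_index}"


def parent_of(span_id: str) -> Optional[str]:
    """Recover the parent ID without it ever having been stored."""
    head, sep, _ = span_id.rpartition(".")
    return head if sep else None


def encode_varint(n: int) -> bytes:
    """Encode a non-negative integer as a protobuf base-128 varint."""
    if n < 0:
        raise ValueError("varint sketch handles non-negative integers only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_timestamps(timestamps_us: List[int]) -> bytes:
    """Store the first timestamp, then only the delta to the previous one.

    Expects non-decreasing timestamps, as within one span's event list.
    """
    out = bytearray()
    prev = 0
    for t in timestamps_us:
        out += encode_varint(t - prev)
        prev = t
    return bytes(out)


if __name__ == "__main__":
    sid = child_span_id(child_span_id("0", 2), 1)
    print(sid, parent_of(sid))
    stamps = [1_000_000, 1_000_150, 1_000_900]
    absolute = sum(len(encode_varint(t)) for t in stamps)
    print(absolute, len(encode_timestamps(stamps)))
```

Small deltas fit in one or two varint bytes where absolute timestamps need several, and concatenating the varints back to back resembles the payload of a packed repeated field.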

3.3 Impact

Full tracing solved historical query-level recall failures, enabled call-chain analysis of cache-related result-quality bugs, and supported horizontal correlation of related queries, dramatically reducing blind spots in fault analysis.

These data‑centric advances eliminated major obstacles in problem investigation and set the stage for automated, intelligent failure handling in Baidu Search.
