
Observability Practices in Baidu Search Platform: Real‑time Metrics, Tracing, Logging, and Topology at Hundred‑Billion Scale

This article explains how Baidu's search middle‑platform adopts cloud‑native observability—covering metrics, distributed tracing, log querying, and topology analysis—to ensure high availability, performance, and controllability for a system handling hundreds of billions of requests across hundreds of thousands of micro‑service instances.

Baidu Intelligent Testing

Baidu's search middle‑platform not only receives massive Aladdin traffic but also builds search capabilities for various vertical businesses, now handling traffic at the hundred‑billion level backed by thousands of micro‑service modules and hundreds of thousands of instances. Ensuring high availability, performance, and controllability requires comprehensive, multi‑dimensional observability.

1. Cloud‑Native Observability

Observability extends traditional monitoring: it provides a high‑level view across every link of a distributed system, plus detailed drill‑down when problems occur. Its three core elements are Metrics, Traces, and Logs; we add a fourth, Topology analysis, which offers macro‑level insight into traffic spikes and service capacity.

2. Necessity in Cloud‑Native Architecture

Micro‑services, containers, and serverless increase system complexity, making rapid anomaly detection and clear system visualization essential.

3. Challenges Faced

Massive system scale: billions of daily requests and hundreds of billions of logs make conventional storage prohibitively expensive.

Scenario‑level observation: metrics explode to millions when breaking down by business scenarios, demanding efficient aggregation.

Macro‑level topology analysis: identifying root causes of traffic surges or latency spikes requires topology tools.

4. Our Solutions

4.1 Log Query & Distributed Tracing

We store only seed log metadata (logid, IP, timestamp) in a KV store. When a user queries a logid, we retrieve the associated IP and timestamp, fetch the full log from that instance, parse the downstream IPs it called, and iteratively traverse downstream instances to reconstruct the full call‑graph topology. To accelerate retrieval from large log files, we exploit their time ordering: an N‑way binary search over file offsets (via fseek) narrows the search to a small time window, achieving sub‑100 ms per‑instance retrieval and overall query times within seconds.
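The traversal above can be sketched as follows. This is a minimal illustration, not Baidu's implementation: the KV store and per‑instance log files are stood in by in‑memory dictionaries, and the names (`SEED_KV`, `fetch_entry`, `build_topology`) are hypothetical. The time‑ordered log is searched with a standard binary search (`bisect`) as a stand‑in for the N‑way fseek described in the text.

```python
import bisect
from collections import deque

# Hypothetical stand-in for the KV store of seed metadata: logid -> (ip, timestamp).
SEED_KV = {"logid-1": ("10.0.0.1", 1700000000)}

# Hypothetical stand-in for per-instance, time-ordered log files:
# ip -> sorted list of (timestamp, logid, downstream_ips).
INSTANCE_LOGS = {
    "10.0.0.1": [(1699999990, "other", []),
                 (1700000000, "logid-1", ["10.0.0.2"])],
    "10.0.0.2": [(1700000001, "logid-1", [])],
}

def fetch_entry(ip, logid, ts):
    """Binary-search the time-ordered log by timestamp, then scan a
    small window around the hit for the matching logid."""
    log = INSTANCE_LOGS[ip]
    i = bisect.bisect_left(log, (ts,))
    for j in range(max(0, i - 2), min(len(log), i + 3)):
        if log[j][1] == logid:
            return log[j]
    return None

def build_topology(logid):
    """Iteratively follow downstream IPs to reconstruct the call graph
    as a list of (caller_ip, callee_ip) edges."""
    ip, ts = SEED_KV[logid]
    edges, queue, seen = [], deque([(ip, ts)]), {ip}
    while queue:
        cur_ip, cur_ts = queue.popleft()
        entry = fetch_entry(cur_ip, logid, cur_ts)
        if entry is None:
            continue
        for child in entry[2]:
            edges.append((cur_ip, child))
            if child not in seen:
                seen.add(child)
                queue.append((child, entry[0]))
    return edges
```

In a real deployment `fetch_entry` would be an RPC to the target instance that seeks into the on‑disk log; only the seed metadata ever needs centralized storage.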

4.2 Metrics Monitoring

We embed a dependency library in each instance to collect and pre‑aggregate metrics. Collectors poll these pre‑aggregated values and write them to a TSDB, discarding raw instance‑level data. This design reduces on‑instance overhead to negligible levels and enables real‑time feedback (≈2 s) for QPS, latency, and percentile calculations using bucketed histograms (30 ms buckets, ≤15 ms error).

4.3 Topology Analysis

Traffic is colored with scenario identifiers and propagated via RPC. Each span carries its scenario tag and parent span name, allowing the system to reconstruct full call topologies from stored span metrics when a user selects a scenario.
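A minimal sketch of that reconstruction, assuming span records shaped as dicts with `scenario`, `name`, and `parent` fields (field names are our assumption, not Baidu's schema): filter spans by the selected scenario tag, then link each span to its parent span name to recover the edges of the call topology.

```python
def build_scenario_topology(spans, scenario):
    """Reconstruct the call topology for one traffic scenario from
    stored span metrics: keep spans carrying that scenario tag and
    emit (parent_span, child_span) edges."""
    edges = set()
    for s in spans:
        if s["scenario"] != scenario:
            continue
        if s["parent"]:  # root spans have no parent edge
            edges.add((s["parent"], s["name"]))
    return sorted(edges)
```

Because the scenario identifier is propagated through RPC with the traffic itself, no join against raw logs is needed: the span metrics alone carry enough context to rebuild the per‑scenario topology on demand.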

5. Conclusion

By integrating the four observability pillars—Metrics, Traces, Logs, and Topology—we have built a low‑cost, high‑performance monitoring platform that supports Baidu's search middle‑platform at massive scale, and it underpins downstream products such as historical snapshots, intelligent alerts, and refusal analysis.

Tags: cloud native · observability · metrics · logging · tracing · topology