How Baidu’s Search Platform Achieves Billion‑Scale Observability in a Cloud‑Native Era

This article explains how Baidu’s search middle‑platform implements full‑stack observability—metrics, tracing, logs, and topology analysis—to reliably monitor and troubleshoot a system handling billions of requests across hundreds of thousands of microservice instances.

Baidu Geek Talk

Mar 3, 2021

1. Cloud Native and Observability

What is Observability?

Observability extends traditional monitoring: it provides a high‑level view of every link in a distributed system and supports fine‑grained analysis when problems occur, so developers and operators can understand how the system is behaving. It is usually described in terms of three kinds of telemetry:

Metrics

Traces

Logs

Beyond these three, Baidu adds a fourth element—Topology analysis—which offers a macro‑level view of request flows, helping locate the source of traffic spikes or latency anomalies.

Why Observability Matters in Cloud‑Native Environments

Microservices, containers, and serverless architectures increase system complexity and reduce centralized control, making rapid fault isolation and clear system visibility essential.

2. Challenges Faced

Massive System Scale

With billions of daily requests, the search platform runs thousands of microservice modules and hundreds of thousands of instances. Storing full request traces in a centralized store would require hundreds of machines, making traditional solutions prohibitively expensive.

From Application‑Level to Scenario‑Level Monitoring

Business scenarios now vary widely, and a single application may host dozens of scenarios with very different traffic volumes. Application‑level metrics are dominated by the high‑traffic scenarios, so an anomaly in a low‑traffic scenario barely moves them. Catching such anomalies requires fine‑grained, scenario‑level metrics, which pushes metric cardinality into the millions.
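The masking effect can be seen with a little arithmetic. The sketch below uses made‑up traffic numbers (the scenario names and counts are illustrative assumptions, not figures from Baidu) to show that a scenario failing half of its requests can leave the application‑level error rate almost unchanged.

```python
# Hypothetical traffic numbers, chosen only to illustrate the masking effect.
scenarios = {
    "high_traffic": {"requests": 1_000_000, "errors": 1_000},  # 0.1% errors
    "low_traffic": {"requests": 1_000, "errors": 500},         # 50% errors
}


def error_rate(requests: int, errors: int) -> float:
    """Error rate of a single scenario."""
    return errors / requests


def application_error_rate(per_scenario: dict) -> float:
    """Error rate when all scenarios are lumped into one application metric."""
    total_requests = sum(s["requests"] for s in per_scenario.values())
    total_errors = sum(s["errors"] for s in per_scenario.values())
    return total_errors / total_requests


app_rate = application_error_rate(scenarios)
scenario_rates = {name: error_rate(**s) for name, s in scenarios.items()}

print(f"application-level error rate: {app_rate:.4%}")
for name, rate in scenario_rates.items():
    print(f"{name}: {rate:.4%}")
```

The application‑level figure stays well under 1%, while the per‑scenario view makes the failing scenario obvious.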

Macro‑Level Topology Analysis

When overall traffic surges, latency percentiles rise, or rejection rates increase, operators need topology tools to assess capacity impact across services and guide decisions such as scaling buffers for specific scenarios.

3. Solutions Implemented

Log Query and Distributed Tracing

To avoid storing petabytes of logs, Baidu stores only a small seed of log metadata (logid, IP, timestamp) in a KV store at the traffic entry point. When a user queries a logid, the system retrieves the corresponding IP and timestamp, fetches the full log from the target instance, parses downstream instance information, and repeats the process to reconstruct the complete call‑graph topology.

(1) Record logid, IP, and timestamp at the entry layer into KV storage.

(2) Look up the IP and timestamp for a given logid when it is queried.

(3) Fetch the full log from the instance identified by that IP and timestamp.

(4) Parse the log to obtain the downstream instance IPs and timestamps.

(5) Repeat steps 3‑4 for each downstream instance to build the full call chain.
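The iterative lookup described above can be sketched as a breadth‑first walk. This is a minimal illustration, not Baidu's implementation: the in‑memory `SEED` and `LOGS` dictionaries stand in for the KV store and the per‑instance log files, and the `downstream=ip@timestamp` log format is an assumption invented for the example.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Hop = Tuple[str, int]  # (instance IP, timestamp)


def reconstruct_call_graph(
    logid: str,
    seed_store: Dict[str, Hop],
    fetch_log: Callable[[str, int, str], str],
    parse_downstream: Callable[[str], List[Hop]],
) -> Dict[Hop, List[Hop]]:
    """Walk from the entry instance outward, one log fetch per instance."""
    entry = seed_store[logid]          # seed lookup in the KV store
    graph: Dict[Hop, List[Hop]] = {}
    queue = deque([entry])
    while queue:
        hop = queue.popleft()
        if hop in graph:               # already visited
            continue
        ip, timestamp = hop
        log_text = fetch_log(ip, timestamp, logid)   # fetch the full log
        children = parse_downstream(log_text)        # parse downstream hops
        graph[hop] = children
        queue.extend(children)         # repeat for every downstream instance
    return graph


# Hypothetical stand-ins for the KV store and the per-instance logs.
SEED = {"log-1": ("10.0.0.1", 100)}
LOGS = {
    ("10.0.0.1", 100): "downstream=10.0.0.2@105,10.0.0.3@106",
    ("10.0.0.2", 105): "downstream=10.0.0.4@110",
    ("10.0.0.3", 106): "downstream=",
    ("10.0.0.4", 110): "downstream=",
}


def fetch_log(ip: str, timestamp: int, logid: str) -> str:
    return LOGS[(ip, timestamp)]


def parse_downstream(log_text: str) -> List[Hop]:
    _, _, payload = log_text.partition("downstream=")
    hops = []
    for item in payload.split(","):
        if item:
            ip, timestamp = item.split("@")
            hops.append((ip, int(timestamp)))
    return hops


graph = reconstruct_call_graph("log-1", SEED, fetch_log, parse_downstream)
for parent, children in graph.items():
    print(parent, "->", children)
```

Only the seed record is stored centrally; every other piece of the trace is read on demand from the instance that produced it.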

To speed up slow trace queries, Baidu uses a time‑based N‑way search, a generalization of binary search. Because the log file is ordered by time, it can be divided into N segments; the search jumps to the segment whose time range covers the target timestamp and repeats the process within it. This reduces per‑instance trace latency to under 100 ms and overall user query time to seconds.
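A rough sketch of such a search is shown below. It works on an in‑memory, ascending list of timestamps as a stand‑in for seeking to byte offsets in a real log file; the function name `n_way_search` and the default `n = 4` are assumptions made for illustration.

```python
from typing import Sequence


def n_way_search(timestamps: Sequence[int], target: int, n: int = 4) -> int:
    """Return the index of the first entry whose timestamp is >= target.

    `timestamps` must be sorted in ascending order. Each round probes
    evenly spaced positions that split the current range into about n
    segments, then keeps only the segment that can contain the target.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    lo, hi = 0, len(timestamps)  # the answer always lies in [lo, hi]
    while lo < hi:
        step = max(1, (hi - lo) // n)
        new_lo, new_hi = lo, hi
        # Probe the last entry of each segment, left to right.
        for probe in range(lo + step - 1, hi, step):
            if timestamps[probe] < target:
                new_lo = probe + 1   # target is to the right of this probe
            else:
                new_hi = probe       # target is at or before this probe
                break
        lo, hi = new_lo, new_hi
    return lo


log_times = [100, 105, 105, 110, 120, 130, 145, 150, 160]
print(n_way_search(log_times, 110))  # index of the first entry at or after t=110
```

With `n = 2` this degenerates to ordinary binary search; a larger `n` trades more probes per round for fewer rounds, which helps when the probes in a round can be issued together.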

Metrics Monitoring

The monitoring architecture embeds a lightweight library in each instance to collect raw metrics, perform local pre‑aggregation, and push the aggregated data to a collector. The collector further aggregates instance‑level data into scenario‑ or service‑level metrics before storing them in a TSDB, discarding raw instance metrics to save space.
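The two aggregation stages can be sketched as follows. This is an illustrative model only: the class names `InstanceAggregator` and `Collector`, and the choice of request count and mean latency as the metrics, are assumptions rather than details from the source.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Aggregate:
    count: int = 0
    total_latency_ms: float = 0.0

    def add(self, latency_ms: float) -> None:
        self.count += 1
        self.total_latency_ms += latency_ms

    def merge(self, other: "Aggregate") -> None:
        self.count += other.count
        self.total_latency_ms += other.total_latency_ms


class InstanceAggregator:
    """Runs inside one instance and pre-aggregates per scenario."""

    def __init__(self) -> None:
        self._by_scenario: Dict[str, Aggregate] = defaultdict(Aggregate)

    def record(self, scenario: str, latency_ms: float) -> None:
        self._by_scenario[scenario].add(latency_ms)

    def flush(self) -> Dict[str, Aggregate]:
        """Hand over the current window and start a fresh one."""
        snapshot = dict(self._by_scenario)
        self._by_scenario = defaultdict(Aggregate)
        return snapshot


class Collector:
    """Merges pushes from many instances; keeps no per-instance data."""

    def __init__(self) -> None:
        self._by_scenario: Dict[str, Aggregate] = defaultdict(Aggregate)

    def ingest(self, pushed: Dict[str, Aggregate]) -> None:
        for scenario, aggregate in pushed.items():
            self._by_scenario[scenario].merge(aggregate)

    def scenario_metrics(self) -> Dict[str, Tuple[int, float]]:
        """Return scenario -> (request count, mean latency in ms)."""
        return {
            scenario: (agg.count, agg.total_latency_ms / agg.count)
            for scenario, agg in self._by_scenario.items()
            if agg.count
        }


instance_a, instance_b = InstanceAggregator(), InstanceAggregator()
instance_a.record("web_search", 20.0)
instance_a.record("web_search", 40.0)
instance_b.record("web_search", 60.0)
instance_b.record("image_search", 90.0)

collector = Collector()
collector.ingest(instance_a.flush())
collector.ingest(instance_b.flush())
print(collector.scenario_metrics())
```

Because each instance pushes a few aggregates per window instead of one record per request, the volume sent to the collector depends on the number of scenarios rather than on traffic.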

This design achieves a 2‑second end‑to‑end feedback loop with negligible resource overhead. For percentile latency calculation, Baidu replaces full sorting with bucket‑based counting: each request increments a bucket corresponding to its latency; during percentile computation, the bucket containing the target percentile is identified and linear interpolation within the bucket yields an approximate value. With a 30 ms bucket size, the error stays within 15 ms, which is acceptable for performance monitoring.
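The bucket‑counting approach can be sketched in a few lines. The class name `LatencyHistogram`, the 3000 ms upper bound, and the overflow bucket are assumptions for illustration; only the 30 ms bucket width and the linear interpolation step come from the description above.

```python
class LatencyHistogram:
    """Fixed-width latency buckets with interpolated percentiles."""

    def __init__(self, bucket_ms: int = 30, max_ms: int = 3000) -> None:
        self.bucket_ms = bucket_ms
        # The final bucket collects everything at or above max_ms.
        self.counts = [0] * (max_ms // bucket_ms + 1)
        self.total = 0

    def record(self, latency_ms: float) -> None:
        index = int(latency_ms // self.bucket_ms)
        index = max(0, min(index, len(self.counts) - 1))
        self.counts[index] += 1
        self.total += 1

    def percentile(self, p: float) -> float:
        """Approximate the p-th percentile (0 < p <= 100) without sorting."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        if not 0 < p <= 100:
            raise ValueError("p must be in (0, 100]")
        rank = p / 100 * self.total   # how many samples lie at or below the answer
        seen = 0
        for index, count in enumerate(self.counts):
            if count and seen + count >= rank:
                # Assume samples are spread evenly inside the bucket.
                fraction = (rank - seen) / count
                return (index + fraction) * self.bucket_ms
            seen += count
        raise AssertionError("unreachable: rank never exceeds total")


histogram = LatencyHistogram()
for latency in range(300):       # one sample at each of 0, 1, ..., 299 ms
    histogram.record(latency)

print(histogram.percentile(50), histogram.percentile(99))
```

Recording a sample is a single array increment, and a percentile query scans a fixed number of buckets regardless of how many requests were recorded.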

Topology Analysis

Traffic is “colored” with scenario identifiers and propagated via RPC to each service. Each span (as defined in the Dapper paper) carries its scenario tag and parent span name, allowing the system to reconstruct parent‑child relationships. Span information is stored as metrics, enabling the platform to extract all spans for a given scenario and assemble a complete call topology.
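Assembling a topology from such span metrics might look like the sketch below. The `SpanMetric` record, its field names, and the sample service names are hypothetical; the source only states that each span carries a scenario tag and its parent span name.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class SpanMetric(NamedTuple):
    scenario: str            # the "color" propagated with the request
    span: str                # name of this span
    parent: Optional[str]    # name of the parent span, None at the entry
    requests: int            # request count reported for this span


def build_topology(
    spans: Iterable[SpanMetric], scenario: str
) -> Dict[str, Dict[str, int]]:
    """Return parent -> {child: request count} for a single scenario."""
    edges: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for metric in spans:
        if metric.scenario != scenario or metric.parent is None:
            continue
        edges[metric.parent][metric.span] += metric.requests
    return {parent: dict(children) for parent, children in edges.items()}


span_metrics = [
    SpanMetric("web_search", "gateway", None, 1000),
    SpanMetric("web_search", "ranker", "gateway", 1000),
    SpanMetric("web_search", "index", "ranker", 3000),
    SpanMetric("web_search", "index", "ranker", 500),      # a second instance
    SpanMetric("image_search", "thumbnailer", "gateway", 200),
]

topology = build_topology(span_metrics, "web_search")
for parent, children in topology.items():
    print(parent, "->", children)
```

Summing request counts per edge gives each link in the topology a traffic weight, which is the kind of information needed to trace where a surge originates.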

4. Conclusion

The four core observability elements—Metrics, Traces, Logs, and Topology—have been fully realized in Baidu’s search middle‑platform, powering downstream products such as historical snapshots, intelligent alerts, and rejection analysis. Future work aims to build self‑adaptive mechanisms that automatically tolerate and recover from anomalies, further enhancing system resilience.
