How Baidu’s Search Platform Achieves Billion‑Scale Observability in a Cloud‑Native Era
This article explains how Baidu’s search middle‑platform implements full‑stack observability—metrics, tracing, logs, and topology analysis—to reliably monitor and troubleshoot a system handling billions of requests across hundreds of thousands of microservice instances.
1. Cloud Native and Observability
What is Observability?
Observability extends traditional monitoring: it gives a high-level view of every link in a distributed system and supports fine-grained analysis when problems occur, so developers and operators can understand how the system is actually behaving. It is usually described in terms of three kinds of telemetry:
Metrics
Traces
Logs
Beyond these three, Baidu adds a fourth element—Topology analysis—which offers a macro‑level view of request flows, helping locate the source of traffic spikes or latency anomalies.
Why Observability Matters in Cloud‑Native Environments
Microservices, containers, and serverless architectures increase system complexity and reduce centralized control, making rapid fault isolation and clear system visibility essential.
2. Challenges Faced
Massive System Scale
With billions of daily requests, the search platform runs thousands of microservice modules and hundreds of thousands of instances. Storing full request traces in a centralized store would require hundreds of machines, making traditional solutions prohibitively expensive.
From Application‑Level to Scenario‑Level Monitoring
Business scenarios now vary widely; a single application may host dozens of scenarios with vastly different traffic volumes. Application-level metrics alone miss anomalies in low-traffic scenarios, because a failure there is drowned out by the healthy high-traffic scenarios in the same aggregate. Fine-grained, scenario-level metrics are therefore needed, which pushes metric cardinality into the millions.
Macro‑Level Topology Analysis
When overall traffic surges, latency percentiles rise, or rejection rates increase, operators need topology tools to assess capacity impact across services and guide decisions such as scaling buffers for specific scenarios.
3. Solutions Implemented
Log Query and Distributed Tracing
To avoid storing petabytes of logs, Baidu stores only a small seed of log metadata (logid, IP, timestamp) in a KV store at the traffic entry point. When a user queries a logid, the system retrieves the corresponding IP and timestamp, fetches the full log from the target instance, parses downstream instance information, and repeats the process to reconstruct the complete call‑graph topology.
Step 1: Record logid, IP, and timestamp at the entry layer into KV storage.
Step 2: Look up the IP and timestamp for a given logid when it is queried.
Step 3: Fetch the full log from the instance identified by that IP and timestamp.
Step 4: Parse the log to obtain each downstream instance's IP and timestamp.
Step 5: Repeat steps 3 and 4 for every downstream instance until the full call chain is built.
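The lookup loop described above can be sketched as a breadth-first walk. This is a minimal illustration, not Baidu's implementation: the in-memory dictionaries `SEED_KV` and `INSTANCE_LOGS`, the `fetch_log` helper, and the record layout are all hypothetical stand-ins for the real KV store and the per-instance log service.

```python
from collections import deque

# Hypothetical stand-ins for the real systems:
#   SEED_KV       : entry-layer KV store, logid -> (ip, timestamp)
#   INSTANCE_LOGS : logs held on each instance, (ip, logid) -> parsed record
SEED_KV = {"log-42": ("10.0.0.1", 1000)}
INSTANCE_LOGS = {
    ("10.0.0.1", "log-42"): {"downstream": [("10.0.0.2", 1001), ("10.0.0.3", 1002)]},
    ("10.0.0.2", "log-42"): {"downstream": [("10.0.0.4", 1003)]},
    ("10.0.0.3", "log-42"): {"downstream": []},
    ("10.0.0.4", "log-42"): {"downstream": []},
}


def fetch_log(ip, timestamp, logid):
    """Pretend to ask the instance at `ip` for its log entry near `timestamp`."""
    return INSTANCE_LOGS[(ip, logid)]


def reconstruct_call_graph(logid):
    """Return the (caller_ip, callee_ip) edges observed for one request."""
    root_ip, root_ts = SEED_KV[logid]        # look up the seed recorded at the entry layer
    edges = []
    seen = {root_ip}
    queue = deque([(root_ip, root_ts)])
    while queue:
        ip, ts = queue.popleft()
        record = fetch_log(ip, ts, logid)    # fetch the full log from that instance
        for child_ip, child_ts in record["downstream"]:   # parse downstream hops
            edges.append((ip, child_ip))
            if child_ip not in seen:         # visit each instance only once
                seen.add(child_ip)
                queue.append((child_ip, child_ts))
    return edges
```

Calling `reconstruct_call_graph("log-42")` walks the four sample instances starting from the entry node. Only the seed lives in central storage; everything else is read on demand from the instances themselves, which is what keeps the storage cost low.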
To accelerate long-running traces, Baidu uses a time-based N-way search, a generalization of binary search. Because log entries are written in time order, the log file can be partitioned into N time segments, and the search repeatedly jumps into the segment that contains the target timestamp. This reduces per-instance trace latency to under 100 ms and overall user query time to seconds.
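A minimal sketch of such an N-way search follows, assuming the log is sorted by timestamp. A plain Python list of timestamps stands in for the log file here; in a real file the probes would be byte offsets aligned to line boundaries. The function name `n_way_search` and the default fan-out of 8 are illustrative choices, not values from the source.

```python
def n_way_search(timestamps, target, n=8):
    """Return the index of the first entry with timestamp >= target.

    `timestamps` must be sorted in ascending order. Returns len(timestamps)
    when every entry is older than `target`.
    """
    lo, hi = 0, len(timestamps)   # entries before lo are < target, entries from hi on are >= target
    while lo < hi:
        # Split the remaining range into roughly n segments and probe
        # the last entry of each segment, left to right.
        step = max(1, (hi - lo) // n)
        for probe in range(lo + step - 1, hi, step):
            if timestamps[probe] >= target:
                hi = probe        # the answer is in this segment (or is the probe itself)
                break
            lo = probe + 1        # the whole segment is older than the target
    return lo
```

With `n=2` this degenerates to ordinary binary search. A larger fan-out takes fewer rounds at the cost of more probes per round, which can pay off when the probes of one round are issued together.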
Metrics Monitoring
The monitoring architecture embeds a lightweight library in each instance to collect raw metrics, perform local pre‑aggregation, and push the aggregated data to a collector. The collector further aggregates instance‑level data into scenario‑ or service‑level metrics before storing them in a TSDB, discarding raw instance metrics to save space.
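The two aggregation stages can be sketched as follows. This is an illustrative model only: the class names `LocalAggregator` and `Collector`, the (count, sum) representation, and the scenario names are assumptions, and a real deployment would push snapshots over the network on a timer rather than through direct method calls.

```python
from collections import defaultdict


class LocalAggregator:
    """In-instance library: folds raw samples into (count, sum) per scenario."""

    def __init__(self):
        self._stats = defaultdict(lambda: [0, 0.0])

    def record(self, scenario, latency_ms):
        entry = self._stats[scenario]
        entry[0] += 1
        entry[1] += latency_ms

    def flush(self):
        """Return the pre-aggregated snapshot and reset local state."""
        snapshot = {name: tuple(value) for name, value in self._stats.items()}
        self._stats.clear()
        return snapshot


class Collector:
    """Merges instance snapshots into scenario-level totals.

    Instance identity is dropped on ingest, so only scenario-level
    series would reach the TSDB.
    """

    def __init__(self):
        self._totals = defaultdict(lambda: [0, 0.0])

    def ingest(self, snapshot):
        for scenario, (count, total) in snapshot.items():
            self._totals[scenario][0] += count
            self._totals[scenario][1] += total

    def scenario_points(self):
        """Rows to store: scenario -> (request count, mean latency in ms)."""
        return {s: (c, t / c) for s, (c, t) in self._totals.items() if c}
```

Because each instance ships one small snapshot per interval instead of one record per request, the volume sent to the collector depends on the number of scenarios rather than on traffic.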
This design achieves a 2‑second end‑to‑end feedback loop with negligible resource overhead. For percentile latency calculation, Baidu replaces full sorting with bucket‑based counting: each request increments a bucket corresponding to its latency; during percentile computation, the bucket containing the target percentile is identified and linear interpolation within the bucket yields an approximate value. With a 30 ms bucket size, the error stays within 15 ms, which is acceptable for performance monitoring.
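The bucket-based percentile estimate can be sketched as below. The class name `LatencyHistogram`, the default of 100 buckets, and the clamping of out-of-range values into the last bucket are assumptions made for the example; only the 30 ms bucket width and the interpolation idea come from the text.

```python
class LatencyHistogram:
    """Fixed-width latency buckets with interpolated percentile estimates."""

    def __init__(self, bucket_ms=30, num_buckets=100):
        self.bucket_ms = bucket_ms
        self.counts = [0] * num_buckets
        self.total = 0

    def record(self, latency_ms):
        # O(1) per request: pick the bucket, clamp overflow into the last one.
        idx = min(int(latency_ms // self.bucket_ms), len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def percentile(self, p):
        """Estimate the p-th percentile (0 < p <= 100) in milliseconds."""
        if self.total == 0:
            raise ValueError("no samples recorded")
        rank = p / 100 * self.total          # how many samples lie at or below the answer
        seen = 0
        for idx, count in enumerate(self.counts):
            if count and seen + count >= rank:
                # Assume samples are spread evenly inside the bucket.
                fraction = (rank - seen) / count
                return (idx + fraction) * self.bucket_ms
            seen += count
        return len(self.counts) * self.bucket_ms
```

Recording costs one increment per request and a percentile query scans a fixed number of buckets, so no sorting is needed. For values inside the histogram range, the estimate lands in the same bucket as the exact value, so the bucket width is the worst-case error. The error is much smaller when latencies are spread evenly within the bucket.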
Topology Analysis
Traffic is “colored” with scenario identifiers and propagated via RPC to each service. Each span (as defined in the Dapper paper) carries its scenario tag and parent span name, allowing the system to reconstruct parent‑child relationships. Span information is stored as metrics, enabling the platform to extract all spans for a given scenario and assemble a complete call topology.
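Rebuilding a scenario's topology from span metrics can be sketched as follows. The tuple layout `(scenario, span_name, parent_span_name, request_count)`, the sample data, and the function names are hypothetical; the source only states that each span carries its scenario tag and its parent span name.

```python
from collections import defaultdict

# Hypothetical span metrics: (scenario, span_name, parent_span_name, request_count)
SPAN_METRICS = [
    ("image_search", "gateway", None, 1200),
    ("image_search", "ranker", "gateway", 1200),
    ("image_search", "feature_store", "ranker", 3600),
    ("video_search", "gateway", None, 300),
    ("video_search", "ranker", "gateway", 300),
]


def build_topology(span_metrics, scenario):
    """Return (roots, children) for one scenario.

    `children` maps a parent span name to a list of (child span name, count).
    """
    roots = []
    children = defaultdict(list)
    for tag, span, parent, count in span_metrics:
        if tag != scenario:
            continue                      # keep only spans colored with this scenario
        if parent is None:
            roots.append(span)
        else:
            children[parent].append((span, count))
    return roots, dict(children)


def render(span, children, depth=0):
    """Return the call tree under `span` as indented lines (assumes no cycles)."""
    lines = ["  " * depth + span]
    for child, _count in children.get(span, []):
        lines.extend(render(child, children, depth + 1))
    return lines
```

Filtering by scenario tag before linking parents to children is what separates the call graph of one scenario from another, even when both pass through the same services.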
4. Conclusion
The four core observability elements—Metrics, Traces, Logs, and Topology—have been fully realized in Baidu’s search middle‑platform, powering downstream products such as historical snapshots, intelligent alerts, and rejection analysis. Future work aims to build self‑adaptive mechanisms that automatically tolerate and recover from anomalies, further enhancing system resilience.
