How Baidu’s Search Platform Achieves Billion‑Scale Observability in a Cloud‑Native Era
This article explains why observability is critical in cloud‑native architectures and describes how Baidu’s search middle‑platform supports hundred‑billion‑level traffic through low‑cost real‑time metrics, distributed tracing, log querying, and topology analysis, while addressing the challenges of massive microservice scale, scenario‑level monitoring, and efficient resource usage.
Observability Overview
Observability is a superset of traditional monitoring: it provides a high‑level view of the state of every service and call chain in a distributed system, and enables detailed root‑cause analysis when failures occur. In cloud‑native environments it is considered a core characteristic of the architecture.
The four fundamental elements are:
Metrics monitoring
Distributed tracing
Log querying
Topology analysis
Challenges in Baidu Search Middle‑Platform
Ultra‑large system scale – Hundreds of thousands of instances and billions of daily requests generate petabyte‑level log volumes. Storing raw logs for Dapper‑style tracing would require hundreds of machines.
From application‑level to scenario‑level monitoring – Different business scenarios produce vastly different traffic volumes. Monitoring only at the application level can miss anomalies in low‑traffic scenarios, while fine‑grained scenario metrics increase the total metric count to the million‑level, stressing aggregation and computation.
Macro‑level topology analysis – Sudden traffic spikes, latency percentile shifts, or rising rejection rates demand a system‑wide view of service topology to assess capacity buffers and guide routing or feature‑toggle decisions.
Implemented Solutions
Log Query and Distributed Tracing
Only a small seed of log metadata (logid, IP, timestamp) is stored in a KV store at the traffic entry point. When a user queries by logid, the system retrieves the associated IP and timestamp, fetches the full log from the target instance, and reconstructs the call chain.
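As a rough illustration, this seed‑based lookup might look like the following Python sketch. The kv client interface, the port, and the /logs endpoint are assumptions for illustration, not Baidu’s actual interfaces.

```python
import json
import urllib.request

def record_seed(kv, logid, ip, timestamp):
    """At the traffic entry point, persist only the tiny seed record.

    kv is assumed to be a client for a distributed KV store exposing
    put/get; the real store and schema are not specified in the article.
    """
    kv.put(logid, json.dumps({"ip": ip, "ts": timestamp}))

def query_by_logid(kv, logid):
    """Resolve the seed, then pull the full log from the owning instance."""
    seed = json.loads(kv.get(logid))
    # Hypothetical per-instance endpoint that serves local log slices.
    url = f"http://{seed['ip']}:8080/logs?ts={seed['ts']}&logid={logid}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode()
```

Because only the seed is centralized, storage cost stays proportional to request count rather than to total log volume; the heavy log data remains on the instances that produced it.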
To avoid full‑file greps, a time‑based dynamic N‑partition search is applied. The algorithm estimates the file offset from the requested minute (e.g., 15 min ≈ 1/4 hour), performs an fseek to that position, reads a small window, and iteratively narrows the range. This yields sub‑100 ms per‑instance query latency and overall second‑level response time.
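A minimal sketch of that seek‑based narrowing is shown below, assuming each log line starts with an “HH:MM:SS” timestamp and the file covers one hour; the function names and window size are illustrative, not the production implementation.

```python
import os

def _minute_of(line):
    """Minute field of a log line; assumes lines start with 'HH:MM:SS'."""
    return int(line[3:5])

def find_offset(path, target_minute, window_bytes=4096):
    """Narrow an hourly log file down to the region covering target_minute.

    The first probe is a proportional estimate (minute 15 of 60 lands
    about 1/4 of the way into the file); subsequent probes halve the
    remaining range, reading one line per seek instead of grepping
    the whole file.
    """
    size = os.path.getsize(path)
    lo, hi = 0, size
    guess = size * target_minute // 60          # initial time-based estimate
    with open(path, "rb") as f:
        while hi - lo > window_bytes:
            f.seek(guess)
            if guess:
                f.readline()                    # realign to a line boundary
            line = f.readline().decode(errors="replace")
            if not line or _minute_of(line) >= target_minute:
                hi = guess                      # target lies earlier
            else:
                lo = f.tell()                   # target lies later
            guess = (lo + hi) // 2              # plain bisection afterwards
    return lo                                   # scan sequentially from here
```

Each probe costs one seek and one line read, so even multi‑gigabyte hourly files converge in a few dozen iterations, consistent with the sub‑100 ms per‑instance latency described above.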
Metrics Monitoring
A lightweight library is embedded in each online instance to collect raw metrics. The library performs local pre‑aggregation, then a collector polls the instances, writes the aggregated data to a time‑series database (TSDB), and discards per‑instance raw metrics. This design reduces on‑instance overhead to near‑zero and provides a 2‑second feedback loop from metric change to dashboard display.
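A toy version of such an embedded pre‑aggregation library might look like this; the class and method names are invented for illustration. In the design described above, a collector would poll something like harvest() over RPC, aggregate across instances, and write the result to the TSDB.

```python
import threading
from collections import defaultdict

class MetricsLib:
    """Embedded in each instance; keeps only aggregated values locally."""

    def __init__(self):
        self._lock = threading.Lock()
        self._counts = defaultdict(int)        # requests per metric name
        self._latency_sum = defaultdict(float) # summed latency per name

    def record(self, name, latency_ms):
        """Called on the request path; O(1), no raw sample is retained."""
        with self._lock:
            self._counts[name] += 1
            self._latency_sum[name] += latency_ms

    def harvest(self):
        """Called by the polling collector: return and reset aggregates,
        so per-request raw data never accumulates on the instance."""
        with self._lock:
            snapshot = {
                name: {"count": c, "avg_ms": self._latency_sum[name] / c}
                for name, c in self._counts.items() if c
            }
            self._counts.clear()
            self._latency_sum.clear()
        return snapshot
```

Keeping only running aggregates is what holds the on‑instance overhead near zero; a short polling interval then gives the 2‑second end‑to‑end feedback loop.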
Percentile latency is computed using a bucket‑based method: request latencies are placed into 30 ms buckets. To obtain a percentile, the bucket containing the target rank is identified and linear interpolation within the bucket yields the estimate. The approach delivers sub‑15 ms error with minimal CPU and memory consumption.
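The bucket math can be sketched as follows, using the 30 ms width from the text; the function names and interpolation details are illustrative assumptions.

```python
BUCKET_MS = 30  # bucket width from the text

def add_sample(buckets, latency_ms):
    """Drop one latency sample into its 30 ms bucket."""
    idx = int(latency_ms // BUCKET_MS)
    buckets[idx] = buckets.get(idx, 0) + 1

def percentile(buckets, p):
    """Estimate the p-th percentile (0 < p <= 100) from bucket counts."""
    total = sum(buckets.values())
    rank = total * p / 100.0
    seen = 0
    for idx in sorted(buckets):
        count = buckets[idx]
        if seen + count >= rank:
            # Linear interpolation inside the bucket holding the target rank.
            fraction = (rank - seen) / count
            return (idx + fraction) * BUCKET_MS
        seen += count
    return 0.0  # no samples recorded

buckets = {}
for ms in (12, 47, 51, 95, 230):
    add_sample(buckets, ms)
print(percentile(buckets, 99))
```

Since any sample is at most one bucket width away from the interpolated estimate, the error is bounded by the 30 ms bucket size, matching the sub‑15 ms typical error reported above.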
Topology Analysis
At the entry point traffic is “colored” with a scenario identifier. The color information is propagated via RPC to downstream services. Each span records the scenario ID and its parent span name, and the span data is emitted as metrics through the same collection pipeline. When a user supplies a scenario ID, the system extracts all related metrics, reconstructs the full call topology, and visualizes it.
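Reconstruction then reduces to filtering span metrics by scenario ID and linking each span to its parent. A toy sketch, assuming each metric row carries (scenario_id, span, parent); the row layout is an assumption for illustration:

```python
from collections import defaultdict

def build_topology(span_metrics, scenario_id):
    """Return a parent -> children adjacency map for one scenario."""
    edges = defaultdict(set)
    for row in span_metrics:
        if row["scenario_id"] != scenario_id:
            continue
        if row["parent"]:
            edges[row["parent"]].add(row["span"])
    return edges

spans = [
    {"scenario_id": "wiki", "span": "ranker", "parent": "frontend"},
    {"scenario_id": "wiki", "span": "index", "parent": "ranker"},
]
print(build_topology(spans, "wiki"))
# {'frontend': {'ranker'}, 'ranker': {'index'}}
```

Because spans are emitted as metrics rather than raw trace logs, the topology stays queryable in real time at scenario granularity without Dapper‑style full‑trace storage.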
Results and Future Directions
The integrated observability platform now supports real‑time monitoring of a system handling billions of requests per day with petabyte‑scale log volume, while keeping resource consumption lightweight. Built on this foundation, Baidu has launched downstream products such as historical snapshots, intelligent alerts, and rejection analysis, and is exploring self‑adaptive mechanisms that automatically tolerate and recover from anomalies.