Automated and Intelligent Analysis of Baidu Search Stability Issues
The team automated fault diagnosis for Baidu Search by combining a side-index for instant log lookup, streaming incremental analysis, exhaustive rule templates, a feature-engineering pipeline, query-scene reconstruction, entropy-based ranking, per-second timeline views, and chaos-engineered fault injection. Together, these techniques achieve up to 99% fault-analysis accuracy and second-level, module-granular stability tracing.
This article continues the story of Baidu Search stability analysis, focusing on how the team made fault diagnosis automated and intelligent to improve the efficiency of problem tracing.
It first outlines eight major challenges that must be overcome to reach second-level diagnosis: fast log retrieval, balancing real-time analysis with accuracy, comprehensive fault description, feature engineering, query-scene reconstruction, cascade-fault perception, deep feature mining, and handling unknown faults.
Challenge 1: Achieving rapid log search by pushing a subset of logs to a side‑index module, enabling O(1) retrieval of log locations.
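The side-index idea can be sketched as a hash map from query ID to log locations, so a lookup never scans raw logs. This is a minimal illustration; the names `SideIndex` and `LogLocation` are invented for the example, not Baidu's actual API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LogLocation:
    host: str      # machine that holds the raw log
    path: str      # log file on that machine
    offset: int    # byte offset of the record in that file

class SideIndex:
    """Illustrative side-index: collectors push (queryID -> location)
    entries as logs are written, so later retrieval is O(1) average-case
    instead of a full log scan."""
    def __init__(self):
        # queryID -> list of locations (one query touches many modules)
        self._index = {}

    def push(self, query_id: str, loc: LogLocation) -> None:
        self._index.setdefault(query_id, []).append(loc)

    def lookup(self, query_id: str):
        # O(1) dict lookup; no log files are touched here
        return self._index.get(query_id, [])
```

A diagnosis job can then jump straight to the bytes it needs on the indicated host and file.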
Challenge 2: Resolving the trade‑off between real‑time analysis and completeness of logs through incremental, stream‑based analysis that triggers on every new log fragment.
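One way to picture the incremental trade-off is an analyzer that updates running statistics on every arriving fragment, so a best-effort answer is always available and is refined as more logs stream in. The fragment shape (`reject_reason` field) is an assumption for illustration.

```python
from collections import Counter

class StreamingAnalyzer:
    """Illustrative incremental analyzer: each new log fragment updates
    running statistics immediately, rather than waiting for all logs of
    a query to arrive before analyzing."""
    def __init__(self):
        self.reason_counts = Counter()

    def on_fragment(self, fragment: dict) -> None:
        # Triggered for every new rejection signal or log update.
        reason = fragment.get("reject_reason")
        if reason:
            self.reason_counts[reason] += 1

    def snapshot(self) -> dict:
        # Current best-effort view; accuracy improves as fragments arrive.
        return dict(self.reason_counts)
```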
Challenge 3: Providing a complete and systematic representation of diverse fault rules.
Challenge 4: Designing a feature‑engineering pipeline that extracts presence/value features from raw logs and maps them to fault reasons.
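A toy version of the presence-feature idea: each keyword maps to a binary feature, and the resulting vector is matched against fault-rule patterns. The keywords, rule names, and vectors here are invented for illustration.

```python
def extract_features(raw_log: str, keywords: list) -> list:
    """Hypothetical presence-feature extractor: one binary feature per
    keyword, set to 1 if the keyword appears in the raw log line."""
    return [1 if kw in raw_log else 0 for kw in keywords]

# A fault rule is then simply a required feature pattern (illustrative).
FAULT_RULES = {
    "backend_timeout": [1, 0, 1],
    "queue_full":      [0, 1, 0],
}

def match_rule(features: list) -> str:
    """Map a feature vector to a fault reason, or 'unknown' if no rule fits."""
    for reason, pattern in FAULT_RULES.items():
        if features == pattern:
            return reason
    return "unknown"
```

Real pipelines would also extract numeric features (latencies, queue depths) alongside these binary ones, as the article notes.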
Challenge 5: Reconstructing the full query execution tree by sorting span IDs to recover the exact order of primary and retry requests.
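Assuming span IDs are dotted paths where each extra component denotes a child request (a common tracing convention, used here only for illustration), sorting them component-by-component recovers dispatch order and parent/child structure:

```python
def build_dispatch_tree(span_ids):
    """Sketch: treat span IDs like '0', '0.0', '0.1', '0.1.0' as dotted
    paths; sorting numerically by component recovers the exact order of
    primary and retry requests, and prefix-stripping yields the tree."""
    def key(sid):
        # Numeric sort per component so '0.10' sorts after '0.2'
        return [int(p) for p in sid.split(".")]
    ordered = sorted(span_ids, key=key)
    children = {}
    for sid in ordered:
        parent = sid.rsplit(".", 1)[0] if "." in sid else None
        children.setdefault(parent, []).append(sid)
    return ordered, children
```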
Challenge 6: Using an entropy‑based ranking algorithm to surface dimensions with strong aggregation of fault occurrences.
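The entropy intuition: if faults concentrate on a few values of a dimension (say, one host), that dimension's value distribution has low entropy and is a strong root-cause candidate. A minimal sketch, with record and dimension shapes assumed for the example:

```python
import math
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy (bits) of the empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_dimensions(fault_records, dimensions):
    """Sketch of entropy-driven ranking: score each dimension (e.g. 'host',
    'module') by the entropy of its values across fault records, and sort
    ascending so the most concentrated dimension comes first."""
    scored = [(dim, entropy([r[dim] for r in fault_records])) for dim in dimensions]
    return sorted(scored, key=lambda pair: pair[1])
```

For example, if every fault record shares one host but spreads across modules, `host` gets entropy 0 and surfaces first.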
Challenge 7: Implementing a timeline mechanism that aggregates fault counts per second and visualizes their evolution.
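The per-second aggregation itself is simple bucketing by truncated timestamp and rejection reason; a sketch, with the `(timestamp, reason)` event shape assumed for illustration:

```python
from collections import defaultdict

def build_timeline(events):
    """Sketch: bucket fault events into per-second bins keyed by
    (second, reason), so the count and trend of each rejection reason
    can be plotted over time. `events` yields (unix_ts, reason) pairs."""
    timeline = defaultdict(int)
    for ts, reason in events:
        timeline[(int(ts), reason)] += 1  # truncate to whole seconds
    return dict(timeline)
```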
Challenge 8: Applying chaos engineering to inject controlled failures, thereby generating labeled samples for unknown fault detection.
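The labeled-sample idea can be shown with a toy chaos wrapper: it forces a known failure with some probability and records exactly what was injected, so the detection system gains ground-truth examples. Everything here (wrapper name, fault type, record shape) is illustrative.

```python
import random

def inject_fault(handler, fault_rate=0.1, label_sink=None):
    """Illustrative chaos wrapper: with probability `fault_rate` the wrapped
    handler fails in a controlled way, and the injected fault is recorded
    as a labeled sample for the detection knowledge base."""
    def wrapped(request):
        if random.random() < fault_rate:
            if label_sink is not None:
                label_sink.append({"request": request, "label": "injected_timeout"})
            raise TimeoutError("chaos-injected timeout")
        return handler(request)
    return wrapped
```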
Based on these challenges, the article describes eight concrete techniques:
1. Index Mirroring: Online collectors push logs to a side-index keyed by queryID, allowing instant location lookup.
2. Streaming Analysis: Automatic, incremental analysis is triggered by any new rejection signal or log update.
3. Complete Label Set: An exhaustive template of possible module-level failure reasons is built and applied across all mandatory modules.
4. Feature Engineering Engine: Rule-based extraction converts raw logs into binary or numeric features, which are then represented as vectors for matching against fault rules.
5. Single-Query Scene Reconstruction: By correlating span_id information across modules, the full dispatch tree is rebuilt and abnormal logs are aggregated along the path.
6. Intelligent Ranking: Entropy-driven scoring ranks dimensions with the strongest fault clustering, aiding root-cause identification.
7. Timeline Analysis: A per-second aggregation view shows the count and trend of each rejection reason, supporting rapid diagnosis.
8. Chaos Engineering: Controlled fault injection enriches the knowledge base with labeled samples, improving detection of previously unseen issues.
The article also covers two additional topics:
Long-Tail Batch Analysis: By periodically extracting tail-latency queries, traversing their call graphs breadth-first, and pinpointing the last abnormal module, the system isolates the root cause of latency outliers.
Full-Process Abnormal State Tracking: Correlating queries that hit dirty caches with those that bypass caches enables end-to-end tracing of abnormal states, allowing the team to identify stable, reproducible faults.
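The long-tail batch-analysis step (breadth-first traversal of a query's call graph to the last abnormal module) can be sketched as follows; the call-graph and abnormal-set shapes are assumptions for the example:

```python
from collections import deque

def last_abnormal_module(call_graph, root, abnormal):
    """Sketch: BFS over the call graph (module -> downstream modules) and
    return the abnormal module reached last in BFS order, i.e. the one
    deepest along the call path, as the root-cause candidate."""
    last = None
    queue = deque([root])
    seen = {root}
    while queue:
        mod = queue.popleft()
        if mod in abnormal:
            last = mod  # deepest abnormal module seen so far
        for nxt in call_graph.get(mod, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return last
```

The intuition: if both a frontend and a leaf backend log anomalies, the leaf is the more likely origin, and upstream anomalies are just propagation.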
In summary, the combination of observability foundations (logging, tracing, metrics) and the eight advanced techniques yields a fault‑analysis accuracy of up to 99% and enables second‑level, module‑granular diagnosis, dramatically improving Baidu Search’s availability and user experience.
Baidu Geek Talk