How AI‑Powered Fault Localization Transforms Automated Testing at Scale
This article explores Baidu's intelligent testing practices, covering spectrum‑based root‑cause localization, error‑code driven build‑system diagnostics, revenue‑change stop‑loss decision workflows, and search UI case‑level tracing, illustrating how data, algorithms, and engineering combine to reduce manual effort and accelerate issue resolution.
1. Spectrum‑Based Fault Localization
Fault localization aims to identify the cause of a failure quickly after a build or test run, cutting manual investigation time and labor cost. Baidu applies spectrum‑based fault localization, which analyzes execution data (test results and code coverage) to rank suspicious code elements. For each instrumented code block, four counts are collected: ef (failing tests that execute the block), ep (passing tests that execute it), nf (failing tests that do not execute it), and np (passing tests that do not execute it). These counts feed suspiciousness formulas such as Tarantula, Ochiai, and Overlap, which produce a score used to rank statements or blocks so that developers can focus on the highest‑risk code across unit, functional, and diff testing.
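To make the scoring concrete, here is a minimal sketch of two of the formulas named above, using the standard published definitions of Tarantula and Ochiai. The block names and counts in `spectrum` are invented for illustration, and `rank_blocks` is a hypothetical helper, not part of any Baidu tooling.

```python
import math

def tarantula(ef, ep, nf, np_):
    """Tarantula: share of failing tests hitting the block, normalized
    against the share of passing tests hitting it."""
    total_fail = ef + nf
    total_pass = ep + np_
    fail_ratio = ef / total_fail if total_fail else 0.0
    pass_ratio = ep / total_pass if total_pass else 0.0
    denom = fail_ratio + pass_ratio
    return fail_ratio / denom if denom else 0.0

def ochiai(ef, ep, nf, np_):
    """Ochiai: ef / sqrt((ef + nf) * (ef + ep))."""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0

def rank_blocks(spectrum, formula):
    """Return (block, score) pairs sorted from most to least suspicious."""
    scored = [(block, formula(*counts)) for block, counts in spectrum.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Illustrative counts (ef, ep, nf, np) for 2 failing and 8 passing tests.
spectrum = {
    "parse_query": (2, 8, 0, 0),   # executed by every test
    "apply_filter": (2, 1, 0, 7),  # executed mostly by failing tests
    "render_page": (0, 8, 2, 0),   # never executed by a failing test
}

ranking = rank_blocks(spectrum, ochiai)
for block, score in ranking:
    print(f"{block}: {score:.3f}")
```

A block executed by every failing test and by few passing tests rises to the top of the ranking, while a block that every test executes scores in the middle because it does not discriminate between passing and failing runs.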
2. Build‑System Localization via Error Codes
In large‑scale CI pipelines, abnormal builds generate error codes that can be automatically mapped to concrete error reasons. Baidu defines three strategies: automatic labeling, self‑healing, and issue‑closure. Automatic labeling extracts error codes from logs, matches them against a predefined mapping table, and annotates the failure category, saving manual triage time. The self‑healing strategy triggers predefined recovery actions (e.g., environment restart) when specific error‑code conditions are met, with configurable thresholds for timeout, module, or memory usage. For cases requiring human intervention, an issue card is auto‑created in the tracking system, containing the error code, inferred cause, and suggested remediation; the issue is considered closed only after manual verification, achieving a 94% closure rate in pilot projects.
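The automatic labeling and routing described above could be sketched roughly as follows. Everything specific here is an assumption made for illustration: the `E` plus four digits code format, the entries in `ERROR_CODE_TABLE`, and the action name `restart_environment` are hypothetical and do not come from Baidu's build system.

```python
import re

# Hypothetical mapping table: error code -> (failure category, self-healing action).
# An action of None means no automatic recovery is defined for that code.
ERROR_CODE_TABLE = {
    "E1001": ("environment", "restart_environment"),
    "E2002": ("compile", None),
    "E3003": ("dependency", None),
}

# Assumed code format for this sketch: the letter E followed by four digits.
ERROR_CODE_PATTERN = re.compile(r"\b(E\d{4})\b")

def label_build_log(log_text):
    """Extract the first known error code from a log and decide how to route it."""
    for code in ERROR_CODE_PATTERN.findall(log_text):
        if code in ERROR_CODE_TABLE:
            category, action = ERROR_CODE_TABLE[code]
            return {
                "code": code,
                "category": category,
                "self_heal": action,            # recovery action to trigger, if any
                "needs_issue": action is None,  # otherwise open an issue card
            }
    # No known code found: leave the failure for manual triage.
    return {"code": None, "category": "unknown", "self_heal": None, "needs_issue": True}

healed = label_build_log("step 3 failed: E1001 build agent unreachable")
triaged = label_build_log("link step failed: E2002 undefined symbol")
unknown = label_build_log("build aborted without an error code")
```

Keeping the mapping in a table rather than in branching code means new error codes can be labeled by adding a row, which is one plausible reason to drive the three strategies from a single lookup.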
3. Revenue‑Change Stop‑Loss Decision Localization
To protect commercial revenue, Baidu implements a full‑stack decision workflow consisting of alarm reception, diagnostic analysis, fault‑feature extraction, stop‑loss recommendation, and action plan generation. Alarms are categorized by product line and metric coverage (system stability, user metrics, macro and business process indicators). Diagnostic strategies assess risk level, identify associated risk metrics, pinpoint the exact anomaly time, and perform log‑trace localization to isolate the faulty module. Estimated loss (PV) is calculated, and a stop‑loss recommendation is produced, which can be customized via a strategy library. The end‑to‑end process now supports automatic alarm triggering, diagnosis, and actionable stop‑loss proposals.
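As a rough illustration of three of these steps (pinpointing the anomaly time, estimating loss, and producing a recommendation), consider the sketch below. The source does not say how Baidu computes any of these, so the drop‑ratio rule, the loss formula (baseline minus observed PV), the threshold values, and the recommendation labels are all assumptions chosen only to show the shape of the workflow.

```python
def first_anomaly_index(baseline, observed, drop_ratio=0.2):
    """Return the first index where observed PV falls at least drop_ratio below baseline."""
    for i, (expected, actual) in enumerate(zip(baseline, observed)):
        if expected > 0 and (expected - actual) / expected >= drop_ratio:
            return i
    return None

def estimate_pv_loss(baseline, observed, start):
    """Sum the PV shortfall from the anomaly start onward (assumed loss model)."""
    return sum(max(e - a, 0) for e, a in zip(baseline[start:], observed[start:]))

def recommend(loss, rollback_threshold=1000):
    """Map estimated loss to a stop-loss action (hypothetical strategy-library entry)."""
    return "rollback" if loss >= rollback_threshold else "observe"

# Illustrative per-minute PV series; the numbers are made up.
baseline = [1000, 1000, 1000, 1000, 1000]
observed = [990, 1005, 700, 600, 650]

start = first_anomaly_index(baseline, observed)
loss = estimate_pv_loss(baseline, observed, start) if start is not None else 0
action = recommend(loss)
print(start, loss, action)
```

In a real strategy library the threshold and the action would presumably vary by product line and risk level; here they are fixed constants to keep the example short.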
4. Search UI Case‑Level Localization
For UI presentation issues, Baidu builds a comprehensive log‑trace system that stores only seed information to conserve resources. Logs are traversed recursively along a topology map to reconstruct request chains, while traffic control and timeout mechanisms keep the traversal safe. The localization logic combines topology data with case‑level alerts and applies regex extraction to the logs to identify root causes such as missing resources or unintended deletions. This enables minute‑level alert reception and automated pinpointing of problematic UI cases.
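The recursive traversal and regex extraction can be sketched as below. The log line format, the module names, and the cause pattern are hypothetical, and a simple depth limit stands in for the traffic control and timeout safeguards mentioned above.

```python
import re

# Hypothetical seed log format: "span=<id> parent=<id or -> module=<name> msg=<text>"
LINE_PATTERN = re.compile(r"span=(\S+) parent=(\S+) module=(\S+) msg=(.*)")
# Hypothetical root-cause pattern: a resource reported as missing or deleted.
CAUSE_PATTERN = re.compile(r"resource (\S+) (missing|deleted)")

def parse_logs(lines):
    """Build span details and a parent -> children topology from raw log lines."""
    spans, children = {}, {}
    for line in lines:
        match = LINE_PATTERN.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed format
        span, parent, module, msg = match.groups()
        spans[span] = (module, msg)
        children.setdefault(parent, []).append(span)
    return spans, children

def find_causes(span, spans, children, max_depth=10, path=()):
    """Walk the request chain recursively and collect regex-matched root causes."""
    if max_depth < 0 or span not in spans:
        return []  # depth limit guards against runaway or cyclic chains
    module, msg = spans[span]
    chain = path + (module,)
    causes = []
    match = CAUSE_PATTERN.search(msg)
    if match:
        causes.append((" -> ".join(chain), match.group(1), match.group(2)))
    for child in children.get(span, []):
        causes.extend(find_causes(child, spans, children, max_depth - 1, chain))
    return causes

logs = [
    "span=a parent=- module=frontend msg=request received",
    "span=b parent=a module=ranker msg=ok",
    "span=c parent=a module=card_render msg=resource hero_image missing",
]
spans, children = parse_logs(logs)
causes = find_causes("a", spans, children)
print(causes)
```

Each result carries the module chain that led to the match, so an alert for a UI case can point at both the failing module and the path the request took to reach it.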
