How AI Transforms Physical Network Fault Localization: From Manual to LLM‑Powered Precision

This article explains how Baidu Cloud evolved its physical network fault‑location workflow—from manual analysis and integrated multi‑signal algorithms to AI‑driven reasoning with large language models—highlighting structured prompting, multi‑agent collaboration, and measurable improvements in accuracy and automation.

Baidu Intelligent Cloud Tech Hub

1 The Evolution of Physical Network Fault Localization

In physical networks, a device failure can trigger a cascade of abnormal metric alerts, making rapid root‑cause identification akin to finding a needle in a haystack for operations teams.

Over years of network operations, Baidu Cloud built various monitoring platforms:

White‑box monitoring: fault discovery and location based on switch logs.

Black‑box monitoring: fault discovery via self‑diagnosis.

Multi‑plane monitoring: backbone‑plane level alerts and fault location.

Traffic monitoring: fault discovery from traffic spikes.

Transmission monitoring: alerts for the transport network.

Change‑order platform: querying change records.

AAA audit platform: identity authentication and audit.

1.1 Manual Positioning

Each platform operated independently, with no strong correlation between them. Assuming a single platform is 80% accurate and platforms err independently, the probability that at least one of two platforms is correct is 1 - 0.2^2 = 96%, and with three platforms it is 1 - 0.2^3 = 99.2%. Aggregating results from several platforms therefore greatly improves accuracy.
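The 96% and 99.2% figures can be reproduced with a short calculation. This is a minimal sketch that assumes each platform is right with the same probability and that platforms err independently; the function name is illustrative and not from the original article.

```python
def at_least_one_correct(p_single: float, n_platforms: int) -> float:
    """Probability that at least one of n independent platforms is right.

    Assumes every platform has the same accuracy p_single and that
    their errors are statistically independent.
    """
    return 1.0 - (1.0 - p_single) ** n_platforms


for n in (1, 2, 3):
    # Rounded to three decimals to hide floating-point noise.
    print(n, round(at_least_one_correct(0.8, n), 3))
```

In practice the independence assumption is optimistic, because platforms that observe the same device often fail in correlated ways, so these values are better read as an upper bound.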

When a single platform could not provide clear positioning, engineers manually collected data from all platforms and used their experience to make a judgment.

1.2 Integrated Positioning

In early 2024, Baidu Cloud automated the "discover‑locate‑mitigate" workflow with the "Houyi" platform. Houyi aggregates signals from the white‑box, black‑box, traffic, transmission, change‑order, trace 2.0, and multi‑plane platforms, among others, and runs algorithms over them to produce a combined fault location, achieving higher accuracy.

The integrated approach also supports automatic mitigation, drastically reducing the impact time of business faults.

1.3 AI Positioning

Although integrated positioning improves precision, it has limitations: increasing logical complexity, higher maintenance cost, difficulty adding new logic, and lack of transparent reasoning for operators.

Large language models (LLMs) excel at reasoning and analysis, offering several advantages when introduced into fault localization:

Powerful inference can identify the most probable faulty device or link from diverse signals.

LLMs can provide detailed reasoning, explaining why a particular device is inferred as faulty.

Maintenance and evolution become easier; incorrect reasoning can be quickly adjusted and redeployed.

Prompts are easy to test on platforms such as Wenxin Yiyan.

In the Houyi platform, structured prompts and multi‑agent techniques are used to iteratively refine LLM reasoning.

2 AI‑Based Network Fault Localization Practice

Currently, Baidu Cloud employs the Ernie‑4.0‑8k model for AI positioning, presenting both integrated and AI results for comparison.

Data preprocessing includes normalizing alerts, removing duplicates, and assigning weights (e.g., lowering weight for frequent CRC alerts).
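The preprocessing steps can be sketched as follows. This is an illustration only: it assumes each raw alert is a comma-separated line of the form "source,device,alert type" (the shape of the input example in section 2.1), and the weight table and its values are hypothetical rather than Houyi's actual configuration.

```python
# Hypothetical weights: alert types not listed here default to 1.0.
ALERT_WEIGHTS = {"crc error": 0.2}


def preprocess(raw_lines):
    """Normalize, deduplicate, and weight raw alert lines.

    Each line is assumed to look like 'source,device,alert type'.
    """
    seen = set()
    alerts = []
    for line in raw_lines:
        parts = [p.strip() for p in line.split(",", 2)]
        if len(parts) != 3:
            continue  # skip malformed lines
        source, device, alert_type = parts
        # Normalize case so that 'Traffic Drop' and 'traffic drop' match.
        key = (source.lower(), device, alert_type.lower())
        if key in seen:
            continue  # drop exact duplicates
        seen.add(key)
        alerts.append({
            "source": key[0],
            "device": device,
            "type": key[2],
            "weight": ALERT_WEIGHTS.get(key[2], 1.0),
        })
    return alerts
```

Down-weighting frequent alert types, rather than dropping them, keeps them available as supporting evidence without letting them dominate the conclusion.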

2.1 Structured Prompt Application

A structured prompt template includes:

Role: You are a network monitoring and analysis expert skilled at identifying faulty devices or fiber‑optic failures from alert signals.
Task: Given a set of device‑failure alerts, find the faulty device or fiber‑optic failure.
Reward: High performance bonuses and 100× payment for each successful location.
Input format example: white_box_event,HD-M2NJ-111111.Int,traffic drop
Rules: prioritize alert types, count distinct types per device, treat duplicate types as one, ensure a single conclusion.
Output format: Faulty device: {device} (or include reasoning if needed).
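One way to assemble such a prompt in code is shown below. It is a minimal sketch: the section texts are copied from the template above, while the function name, the "Alerts:" label, and the layout of the final string are assumptions made for illustration.

```python
PROMPT_SECTIONS = [
    ("Role", "You are a network monitoring and analysis expert skilled at "
             "identifying faulty devices or fiber-optic failures from alert signals."),
    ("Task", "Given a set of device-failure alerts, find the faulty device "
             "or fiber-optic failure."),
    ("Reward", "High performance bonuses and 100x payment for each successful location."),
    ("Rules", "prioritize alert types, count distinct types per device, "
              "treat duplicate types as one, ensure a single conclusion."),
    ("Output format", "Faulty device: {device}"),
]


def build_prompt(alert_lines):
    """Join the fixed template sections with the alert lines for one incident."""
    header = "\n".join(f"{name}: {text}" for name, text in PROMPT_SECTIONS)
    alerts = "\n".join(alert_lines)
    # The "Alerts:" label is an assumption; the article does not show
    # how the alert list is attached to the template.
    return f"{header}\nAlerts:\n{alerts}"


prompt = build_prompt(["white_box_event,HD-M2NJ-111111.Int,traffic drop"])
```

Keeping the sections in a list makes it easy to adjust a single rule and redeploy, which is the maintenance advantage described in section 1.3.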

2.2 AI Positioning Example

Using the structured prompt and the preprocessed signals, the Ernie model produces fault‑location results.

[Figure: example of an Ernie fault‑location result]

2.3 Comparison of AI and Integrated Positioning

Daily alert tracking shows AI positioning can identify additional faulty components compared to integrated positioning. For example, case 171728:

Integrated positioning identified a leaf device; AI positioning also identified a spine device, revealing the true link‑level fault.

2.4 Letting LLM Explain Its Reasoning

By requesting the LLM to output both the fault device and its inference logic, operators obtain transparent reasoning, which aids further optimization.
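A reply that carries both the conclusion and the reasoning can be split apart for display and logging. The sketch below assumes the model answers with a "Faulty device:" line followed by a "Reasoning:" section; the "Reasoning:" label is a hypothetical convention, since the article only specifies the "Faulty device: {device}" part.

```python
import re


def parse_response(text):
    """Return (device, reasoning) extracted from a model reply.

    Either element is None when the corresponding section is missing.
    """
    device_match = re.search(r"^Faulty device:\s*(.+)$", text, flags=re.MULTILINE)
    # DOTALL lets the reasoning run over several lines to the end of the reply.
    reasoning_match = re.search(
        r"^Reasoning:\s*(.+)", text, flags=re.MULTILINE | re.DOTALL
    )
    device = device_match.group(1).strip() if device_match else None
    reasoning = reasoning_match.group(1).strip() if reasoning_match else None
    return device, reasoning
```

Returning None instead of raising lets the caller decide how to handle a reply that ignores the requested format, for example by retrying the request.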

2.5 Multi‑Agent Assisted Positioning

A dual‑agent setup uses Ernie as the primary LLM and Llama‑2‑70b as an auxiliary agent. The workflow:

Ernie produces result L1.

Llama‑2 produces result L2 and reasoning R1.

If L1 equals L2, return the result.

Otherwise, feed L2 and R1 back to Ernie for a refined result L3.

Return L3.
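The workflow above can be sketched as a small function. The two callables stand in for the Ernie and Llama‑2 calls and are hypothetical; the wording of the follow-up prompt is likewise an assumption, since the article does not show how L2 and R1 are fed back.

```python
from typing import Callable, Tuple

PrimaryAgent = Callable[[str], str]               # prompt -> location
AuxiliaryAgent = Callable[[str], Tuple[str, str]]  # prompt -> (location, reasoning)


def locate_with_two_agents(
    prompt: str, primary: PrimaryAgent, auxiliary: AuxiliaryAgent
) -> str:
    """Run the dual-agent check and return the final fault location."""
    l1 = primary(prompt)
    l2, r1 = auxiliary(prompt)
    if l1 == l2:
        return l1  # both agents agree, no refinement needed
    # Disagreement: give the primary agent the second opinion and ask again.
    followup = (
        f"{prompt}\n\n"
        f"Another analyst concluded: {l2}\n"
        f"Their reasoning: {r1}\n"
        "Reconsider the alerts and give a single final conclusion."
    )
    return primary(followup)  # this is L3
```

Because the models are passed in as plain callables, the control flow can be exercised with stub functions before any real model is wired in.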

[Figures: examples of the dual‑agent process]

3 Summary and Outlook

Baidu Cloud has successfully introduced AI‑driven fault localization into backbone network quality monitoring and is extending it to data‑center physical networks, gateway fault detection, and beyond.

Future enhancements include incorporating network topology information, adding more alert sources, and further refining LLM reasoning for even higher precision.
