How AI Is Revolutionizing Physical Network Fault Localization

This article explains how Baidu Cloud evolved from manual and integrated network fault detection to AI-driven localization using large language models, detailing structured prompting, multi‑agent workflows, and real‑world comparisons that demonstrate improved accuracy and faster mitigation.

Baidu Geek Talk

Jul 15, 2024

Physical network devices can fail and trigger a cascade of abnormal metric alerts, making rapid root‑cause identification a needle‑in‑a‑haystack challenge for operations teams.

To address this, Baidu Cloud has built a suite of monitoring platforms over the years, including white‑box log analysis, black‑box self‑discovery, multi‑plane backbone monitoring, traffic anomaly detection, transport monitoring, change‑order tracking, and AAA audit. Each platform detects faults independently, but their data are siloed, which limits overall accuracy.

1. Development of Physical Network Fault Localization

1.1 Manual Localization

Assume each platform alone detects a fault with 80% accuracy and that the platforms err independently. The chance that two platforms both miss is 0.2 × 0.2 = 4%, so combining them raises confidence to 96%; with three platforms the miss rate falls to 0.2³ = 0.8%, giving 99.2%. Historically, operators manually aggregated signals from all platforms and applied expert judgment to achieve reliable fault pinpointing.
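The arithmetic behind these figures generalizes to any number of platforms. The following is a minimal sketch, assuming that "confidence" means the probability that at least one platform catches the fault and that platforms fail independently; the function name `combined_confidence` is hypothetical and not part of any Baidu tooling.

```python
def combined_confidence(single_accuracy: float, platforms: int) -> float:
    """Probability that at least one of `platforms` independent monitors detects the fault."""
    # Every platform must miss for the fault to go undetected.
    miss_all = (1.0 - single_accuracy) ** platforms
    return 1.0 - miss_all


for n in (1, 2, 3):
    # Rounded to three decimals to hide floating-point noise.
    print(n, round(combined_confidence(0.8, n), 3))
```

In practice monitoring platforms often share blind spots, so real-world gains are likely smaller than this independence assumption suggests.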

1.2 Integrated Localization

In early 2024 Baidu Cloud launched the “Houyi Fault Localization” platform, which aggregates signals from white‑box, black‑box, traffic, transport, change‑order, Trace 2.0, and multi‑plane monitors. An algorithm performs “integrated localization,” automating fault detection and boosting accuracy. The improved precision also enabled an “automatic mitigation” capability that quickly isolates business‑impacting failures.

1.3 AI Localization

Although integrated localization greatly improves accuracy, it brings challenges: exponential growth in logical complexity, higher maintenance burden, difficulty adding new logic, and a lack of transparent reasoning for operators. Large language models (LLMs) excel at inference and explanation, making them attractive for fault localization.

LLMs can reason across heterogeneous signals to pinpoint the most likely faulty device or link.

They can provide detailed reasoning, explaining why a particular fault is inferred.

Prompt adjustments are fast, allowing quick strategy tweaks without code changes.

Prompt testing can be done directly in LLM playgrounds such as Wenxin Yiyan.

2. AI‑Based Network Fault Localization Practice

Baidu Cloud currently uses the Ernie‑4.0‑8k model on the Qianfan platform, presenting both integrated and AI localization results for side‑by‑side comparison.

2.1 Structured Prompt Engineering

The team designed a structured prompt template that defines the AI’s role, task, reward, input format, priority rules, and output format.

Role: You are a network monitoring and analysis expert capable of identifying faulty devices or fiber links from alarm data.
Task: Given a set of alarm records, find the faulty device or fiber link.
Reward: End‑of‑year performance bonus and 100× payment per successful localization.
Input format (example): white_box_event,HD-M2NJ-111111.Int,flow_drop
Priority rules:
1) Higher‑priority alarm types indicate higher fault likelihood;
2) Multiple distinct alarm types on the same device increase confidence;
3) Duplicate alarms of the same type count only once;
…
7) Always output a final conclusion.
Output format:
Fault Device: {device}
or, with reasoning:
Fault Device: {device}
Reasoning: {logic}
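A template like this might be filled in programmatically before each model call. The sketch below is illustrative only: the field names `source`, `device`, and `alarm_type` are an interpretation of the single example record, the template text is abridged (the reward and priority rules are omitted), and deduplicating alarms in code rather than leaving it to the model is an assumption, not something the article describes.

```python
from typing import Iterable, List, NamedTuple


class Alarm(NamedTuple):
    source: str      # monitoring platform, e.g. "white_box_event" (assumed meaning)
    device: str      # device or link identifier
    alarm_type: str  # e.g. "flow_drop"


def parse_alarm(line: str) -> Alarm:
    """Split one 'source,device,alarm_type' record into its three fields."""
    source, device, alarm_type = (field.strip() for field in line.split(","))
    return Alarm(source, device, alarm_type)


def dedupe(alarms: Iterable[Alarm]) -> List[Alarm]:
    """Keep the first occurrence of each identical alarm, preserving order."""
    seen = set()
    unique = []
    for alarm in alarms:
        if alarm not in seen:
            seen.add(alarm)
            unique.append(alarm)
    return unique


# Abridged, hypothetical template; doubled braces survive str.format as literals.
PROMPT_TEMPLATE = """Role: You are a network monitoring and analysis expert.
Task: Given the alarm records below, find the faulty device or fiber link.
Input format: source,device,alarm_type
Alarms:
{alarms}
Output format:
Fault Device: {{device}}
Reasoning: {{logic}}"""


def build_prompt(raw_lines: Iterable[str]) -> str:
    """Parse raw alarm lines, drop duplicates, and render the prompt text."""
    alarms = dedupe(parse_alarm(line) for line in raw_lines if line.strip())
    body = "\n".join(",".join(alarm) for alarm in alarms)
    return PROMPT_TEMPLATE.format(alarms=body)
```

Calling `build_prompt` with a list of raw alarm lines returns a single string ready to send to the model. Keeping the template in one constant mirrors the article's point that strategy changes can be made by editing the prompt rather than the surrounding code.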

2.2 Example AI Localization Result

Using the structured prompt and processed alarm signals, the Ernie model produces a fault diagnosis, illustrated in the figure below.

2.3 AI vs Integrated Localization Comparison

Daily alarm tracking shows that AI localization can identify additional faulty components. In case #171728, integrated localization flagged a leaf device, while AI also detected a spine device, matching the true fault chain between the leaf and spine.

2.4 Getting LLM Reasoning Logic

By requesting the model to output its reasoning, operators receive explanations such as “Device BD‑XXXXXXXXX‑SC‑37.Int shows flow drop and white‑box logs, with no conflicting transport alerts, indicating a high fault probability.”
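Because the prompt fixes the reply to a "Fault Device: ... Reasoning: ..." shape, the answer can be separated from the explanation mechanically. The parser below is a minimal sketch under that assumption; the `Diagnosis` type and `parse_diagnosis` function are hypothetical names, and a production parser would need to tolerate replies that drift from the requested format.

```python
import re
from typing import NamedTuple, Optional


class Diagnosis(NamedTuple):
    device: str
    reasoning: Optional[str]  # None when the model gave no explanation


# The device is taken as the first whitespace-free token after the label;
# the reasoning, if present, runs to the end of the reply.
_PATTERN = re.compile(
    r"Fault Device:\s*(?P<device>\S+)(?:\s+Reasoning:\s*(?P<reasoning>.+))?",
    re.DOTALL,
)


def parse_diagnosis(reply: str) -> Optional[Diagnosis]:
    """Extract the device and optional reasoning, or None if the label is missing."""
    match = _PATTERN.search(reply)
    if match is None:
        return None
    reasoning = match.group("reasoning")
    return Diagnosis(match.group("device"), reasoning.strip() if reasoning else None)
```

Returning `None` for an unparseable reply lets the caller decide whether to retry the model or fall back to the integrated result.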

2.5 Multi‑Agent Assisted Localization

The workflow employs a primary agent (Ernie‑4.0‑8k) and a secondary agent (Llama 2 70B). The steps are:

1) Ernie produces an initial result L1.

2) Llama 2 generates result L2 and reasoning R1.

3) If L1 equals L2, return the result.

4) Otherwise, feed L2 and R1 back to Ernie for a refined result L3.

5) Return L3 as the final diagnosis.

This approach leverages divergent model perspectives to improve accuracy, as demonstrated in the example figures.
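The control flow of this workflow can be sketched in a few lines. In the example below each agent is modeled as a plain Python callable so the code runs without any vendor SDK; the type aliases, the `localize` function, and the wording of the refinement prompt are all assumptions made for illustration, not the team's implementation.

```python
from typing import Callable, Tuple

# Hypothetical agent signatures: real calls to Ernie or Llama 2 would live inside them.
PrimaryAgent = Callable[[str], str]                # prompt -> faulty device
SecondaryAgent = Callable[[str], Tuple[str, str]]  # prompt -> (faulty device, reasoning)


def localize(prompt: str, primary: PrimaryAgent, secondary: SecondaryAgent) -> str:
    """Return the agreed device, or the primary agent's refined answer on disagreement."""
    l1 = primary(prompt)         # initial result L1 from the primary agent
    l2, r1 = secondary(prompt)   # result L2 and reasoning R1 from the secondary agent
    if l1 == l2:
        return l1                # both agents agree, so no refinement is needed
    # On disagreement, hand L2 and R1 back to the primary agent for result L3.
    refine_prompt = (
        f"{prompt}\n\n"
        f"Another analyst concluded: {l2}\n"
        f"Their reasoning: {r1}\n"
        "Reconsider the alarms and give your final conclusion."
    )
    return primary(refine_prompt)
```

Note that the primary agent is called a second time only when the two results differ, so the extra latency and cost are paid only in contested cases.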

3. Summary and Outlook

AI‑driven fault localization is now deployed in Baidu Cloud’s backbone network monitoring, delivering measurable improvements. The team plans to extend the capability to data‑center physical networks and gateway fault detection, and to further enhance accuracy by incorporating topology data and additional alarm sources.

