How Baidu’s Lingxi Agent Uses LLMs to Automate Network Fault Diagnosis
This article traces Baidu's evolution from manual network fault analysis to a multi‑agent AI platform, describing how the Lingxi intelligent agent uses large language models, MCP tools, and established agent design patterns to automate latency queries, generate analysis reports, and integrate with existing monitoring services.
1. A Typical Operations Scenario
Operators often ask, “Is there a latency issue for job‑0bb39798bbfe6a4e at 2:30 AM?” They log into a high‑performance network latency platform, input the task ID and time, view a heatmap, and manually interpret whether any latency anomalies exist.
The heatmap shows low latency within the same TOR group (diagonal squares) and slightly higher latency across TOR groups, with even larger delays across rooms, indicating no obvious abnormal latency. Interpreting this requires experienced staff familiar with network topology.
Now, by entering the same query into the Lingxi intelligent agent, the system automatically performs the analysis and returns a comprehensive report.
2. Baidu Network Operations Evolution
2.1 Manual Period
Initially, fault handling relied on manual inspection of devices and scripts written from experience, resulting in long detection and resolution times.
2.2 Platform Construction Period
With rapid cloud growth, multiple monitoring platforms were built, covering both white‑box (device logs, metrics, alerts) and black‑box (ping mesh, radar scans) methods. Over a dozen platforms now provide timely alerts and integrate with an automatic mitigation system.
White‑box: logs, device metrics, traffic monitoring, high‑performance server metrics.
Black‑box: ping mesh, radar scans for backbone, data‑center, gateway monitoring, special scenarios (packet modification, cut‑over), and performance analysis.
2.3 Large‑Model and Agent Exploration Period
In 2024, Baidu introduced large‑model capabilities for comprehensive fault localization, followed in 2025 by specialized agents for network management, aiming to provide smarter, more convenient operations.
Deep analysis of various scenarios using collected data and LLMs to generate detailed reports.
Long‑tail fault analysis where LLMs, combined with platform data, pinpoint hard‑to‑detect issues.
Network‑fault portal allowing non‑technical users to query via natural language, with LLM‑generated explanations and visual results.
3. Lingxi Agent Architecture
3.1 LLM Selection
The agent is model‑agnostic but currently runs on Baidu Qianfan’s OpenAI‑compatible API, using DeepSeek‑v3 as the primary reasoning model, ERNIE‑4.5‑turbo‑128k for specific scenarios, and a tao‑8k embeddings model for RAG.
3.2 Design Patterns
The agent combines several proven patterns:
Multi‑Agent pattern: multiple agents with distinct roles cooperate via A2A.
ReAct pattern: iterative reasoning, action, observation loop.
Planning pattern: LLM generates step‑by‑step plans.
Tool Use pattern: LLM invokes external MCP tools (e.g., latency monitoring, topology queries).
These patterns are orchestrated to process a user query such as the latency task example, generating a plan, calling the appropriate MCP tool, receiving data, and producing a final analysis.
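The cycle described above can be sketched as a minimal ReAct loop. Everything here is illustrative: the llmFunc/toolFunc stand‑ins, the "FINAL:" stop convention, and the step budget are assumptions, not Lingxi's actual implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// llmFunc and toolFunc are stand-ins for the real model call and the
// MCP tool dispatch, respectively.
type llmFunc func(history string) (thought, action string)
type toolFunc func(action string) (observation string)

// reactLoop runs reason -> act -> observe until the model emits a final
// answer or the step budget is exhausted. The "FINAL:" prefix is an
// illustrative stop signal, not the agent's real protocol.
func reactLoop(llm llmFunc, callTool toolFunc, query string, maxSteps int) string {
	history := "Question: " + query
	for i := 0; i < maxSteps; i++ {
		thought, action := llm(history)
		if strings.HasPrefix(action, "FINAL:") {
			return strings.TrimPrefix(action, "FINAL:")
		}
		obs := callTool(action)
		history += fmt.Sprintf("\nThought: %s\nAction: %s\nObservation: %s", thought, action, obs)
	}
	return "step budget exhausted"
}

func main() {
	// Stub LLM: first asks for latency data, then concludes.
	calls := 0
	llm := func(history string) (string, string) {
		calls++
		if calls == 1 {
			return "need latency data", "query_latency(job-0bb39798bbfe6a4e)"
		}
		return "latency looks normal", "FINAL:no anomaly"
	}
	tool := func(action string) string { return "p99 latency within TOR group is low" }
	fmt.Println(reactLoop(llm, tool, "latency issue?", 5))
}
```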
3.3 Frontend Development
The frontend is built with Comate Zulu. Complex visualizations are generated by first letting the LLM produce JSON data, then filling a web template to render a polished page within seconds, avoiding the high token cost of full HTML generation.
3.4 Example Think Tool (Go)
package mcp

import (
	"context"

	"github.com/kataras/golog"
	"github.com/mark3labs/mcp-go/mcp"
)

// thinkAndPlan is the handler for the think_and_plan tool. It performs no
// external I/O; its value lies in making the model write out its thought,
// plan, and action before proceeding.
func (s *Server) thinkAndPlan(_ context.Context, req mcp.CallToolRequest) (*mcp.CallToolResult, error) {
	golog.Infof("thinkAndPlan report start")
	defer golog.Infof("thinkAndPlan report end")
	return mcp.NewToolResultText("think_stop"), nil
}

// registerThinkTool registers the think_and_plan tool with the MCP server.
func (s *Server) registerThinkTool() {
	s.Server.AddTool(mcp.NewTool("think_and_plan",
		mcp.WithDescription("A tool for systematic thinking and planning. It helps the user work through complex problems or tasks in stages, organizing thinking, planning, and action steps. The tool emphasizes combining thought, plan, and action, tracked by thoughtNumber. It fetches no new information and changes no database; it only appends ideas to memory. Use it when complex reasoning or a form of cached memory is needed."),
		mcp.WithString("thought", mcp.Required(), mcp.Description("The current thought: an analysis of the problem, a hypothesis, an insight, a reflection, or a summary of the previous step. Deep thinking and logical deduction are the core of each step.")),
		mcp.WithString("plan", mcp.Required(), mcp.Description("The plan drawn up for the current task, decomposing a complex problem into multiple executable steps.")),
		mcp.WithString("action", mcp.Required(), mcp.Description("The next action to take based on the current thought and plan. It must be concrete, executable, and verifiable, and may name one or more tools to call next.")),
		mcp.WithString("thoughtNumber", mcp.Required(), mcp.Description("The number of the current thinking step, used to track and replay the whole thinking and planning process for later review and optimization.")),
	), s.thinkAndPlan)
}

3.5 MCP Tool Configuration Example
{
"lingxi_basic": "http://xxx.xxx.xxx.xxx:8890/sse",
"3a": "http://xxx.xxx.xxx.xxx:8889/sse",
"cr7-server": "http://xxx.xxx.xxx.xxx:8090/sse",
"cover-server": "http://xxx.xxx.xxx.xxx:8393/sse"
}

3.6 Agent Integration Example
[
{"name": "Knowledge_Base_Query_Agent", "address": "http://xxx.xxx.xxx.xxx:8889/knowledge_agent"},
{"name": "Web_Search_Agent", "address": "http://xxx.xxx.xxx.xxx:8899/websearch_agent"},
{"name": "Customer_Support_Agent", "address": "http://xxx.xxx.xxx.xxx:8787/customer_support_agent"},
{"name": "Data_Analysis_Agent", "address": "http://xxx.xxx.xxx.xxx:8000/customer_support_agent"}
]

4. Outlook
4.1 A2A (Agent‑to‑Agent)
Google’s open A2A protocol enables agents to interoperate, allowing Lingxi to delegate network‑fault queries to specialized agents and reducing integration friction across teams.
4.2 RAG and Other Technologies
Future work includes exploring Retrieval‑Augmented Generation (RAG) and further scaling of multi‑agent collaborations.
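A minimal retrieval step for such a RAG pipeline can be sketched as brute‑force cosine similarity over embedded chunks. The toy vectors and document texts below are invented; a real deployment would embed chunks with the tao‑8k model and serve them from a vector index rather than a linear scan.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// doc pairs a text chunk with its embedding. In production the vectors would
// come from an embedding model; here they are toy 3-d vectors.
type doc struct {
	Text string
	Vec  []float64
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// topK returns the k docs most similar to the query vector; this brute-force
// scan stands in for a real vector index.
func topK(query []float64, docs []doc, k int) []doc {
	sorted := append([]doc(nil), docs...)
	sort.Slice(sorted, func(i, j int) bool {
		return cosine(query, sorted[i].Vec) > cosine(query, sorted[j].Vec)
	})
	if k > len(sorted) {
		k = len(sorted)
	}
	return sorted[:k]
}

func main() {
	kb := []doc{
		{"TOR group latency baseline", []float64{1, 0, 0}},
		{"cut-over runbook", []float64{0, 1, 0}},
		{"radar scan coverage", []float64{0, 0, 1}},
	}
	hits := topK([]float64{0.9, 0.1, 0}, kb, 1)
	fmt.Println(hits[0].Text)
}
```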
5. Conclusion
From the 2009 “box computing” concept to today’s AI‑enhanced operations, Baidu’s Lingxi intelligent agent demonstrates how large models can empower network fault diagnosis, providing faster, more accurate, and user‑friendly solutions.
Baidu Intelligent Cloud Tech Hub
We share the cloud tech topics you care about. Feel free to leave a message and tell us what you'd like to learn.