How Large Language Model AI Agents Transform Intelligent Operations and On‑Call Support
This article details the design and implementation of a large‑model‑driven intelligent operations dialogue system, covering intent recognition, routing, multi‑agent planning, RAG, workflow, ReAct, reflection, tree‑search techniques, evaluation challenges, and future multi‑agent collaboration for on‑call support.
Background Overview
Enterprise systems are becoming increasingly complex, demanding robust observability platforms. eBay's Sherlock team builds a company‑wide monitoring platform that ingests metrics, logs, traces, and events, providing alerting, anomaly detection, and root‑cause analysis. On‑call engineers handle repetitive, high‑frequency support requests such as API usage queries, PromQL construction, missing metrics, alert failures, and resource quota adjustments.
Challenges in On‑Call Support
High repetition: many requests can be answered from documentation.
High response cost: manual handling consumes developer time.
Low knowledge transfer efficiency: new users struggle to learn the platform.
AI Agent Solution
To address these challenges, the team introduces AI Agents powered by large language models (LLMs). Compared with traditional FAQ or rule‑based systems, AI Agents offer stronger semantic understanding, code analysis, and dynamic Q&A capabilities for complex support scenarios.
1. Retrieval‑Augmented Generation (RAG)
RAG splits internal documents, stores them in a vector database, and retrieves semantically relevant chunks as context for the LLM, improving answer accuracy for internal knowledge queries.
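The retrieval step can be sketched as follows. This is a minimal illustration, not the team's implementation: a bag-of-words counter stands in for a real embedding model, a Python list stands in for the vector database, and the sample documents and the names `embed`, `chunk`, `retrieve`, and `build_prompt` are all made up for this example.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy embedding: term counts (assumption, replaces a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def chunk(document, size=12):
    """Split a document into chunks of at most `size` words."""
    words = document.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_prompt(query, chunks):
    """Place the retrieved chunks in front of the question as LLM context."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Hypothetical documentation snippets, already chunk-sized.
DOCS = [
    "To raise a metrics quota open a quota request in the self-service portal",
    "Alerts fail to fire when the rule expression returns no series",
    "PromQL rate() computes the per-second average increase of a counter",
]

print(build_prompt("how do I raise my quota", DOCS))
```

In a production setup the `embed` and `retrieve` functions would be replaced by calls to an embedding model and a vector store; the prompt assembly step stays essentially the same.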
2. Workflow
Beyond RAG, a workflow mechanism maps user requests to predefined process graphs, routing events and metrics queries to appropriate knowledge bases or modules.
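One way to picture a predefined process graph is a dictionary of nodes in which each node does its work and names the next node. The sketch below is an assumption-laden illustration: the node names, the keyword-based classifier, and the knowledge-base labels are hypothetical and only show the routing idea.

```python
def classify(state):
    """Pick the next node from keywords in the request (toy classifier)."""
    text = state["request"].lower()
    if "event" in text:
        return "events_kb"
    if "metric" in text:
        return "metrics_kb"
    return "general_docs"

def make_lookup(kb_name):
    """Build a node that records which knowledge base should be queried."""
    def lookup(state):
        state["knowledge_base"] = kb_name
        return "answer"
    return lookup

def answer(state):
    """Terminal node: a real system would call the LLM here."""
    state["answer"] = f"(answer drafted from {state['knowledge_base']})"
    return None

# Hypothetical process graph: node name -> function returning the next node name.
GRAPH = {
    "classify": classify,
    "events_kb": make_lookup("events"),
    "metrics_kb": make_lookup("metrics"),
    "general_docs": make_lookup("docs"),
    "answer": answer,
}

def run(request, start="classify"):
    """Walk the graph from `start` until a node returns None."""
    state = {"request": request, "path": []}
    node = start
    while node is not None:
        state["path"].append(node)
        node = GRAPH[node](state)
    return state

print(run("Why is my metric missing?")["path"])
```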
3. AI Agents
Agents are built on LangChain and extended with tool calling, planning, memory, and environment feedback. This lets them reason over multiple steps and adjust execution based on what each tool returns.
4. ReAct (Reasoning and Acting)
ReAct introduces a “think‑then‑act” paradigm: at each step the agent first reasons explicitly and outputs a reasoning trace, then executes the corresponding action and feeds the resulting observation into the next round of reasoning. This improves decision accuracy in multi‑step problems.
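The loop can be sketched in a few lines. This is a minimal sketch under stated assumptions: `scripted_llm` is a hard-coded stand-in for a real model call, and the `query_metric` tool and its return value are hypothetical.

```python
def scripted_llm(history):
    """Stand-in for an LLM call; returns (thought, action, argument)."""
    observations = [h for h in history if h.startswith("Observation: ")]
    if not observations:
        return ("I need the current error rate first.", "query_metric", "error_rate")
    value = observations[-1][len("Observation: "):]
    return ("The error rate is known, so I can answer.", "finish", f"Current {value}")

# Hypothetical tool table; a real tool would query the monitoring backend.
TOOLS = {"query_metric": lambda name: f"{name}=0.02"}

def react(question, llm=scripted_llm, tools=TOOLS, max_steps=5):
    """Alternate thought -> action -> observation until the model finishes."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        thought, action, arg = llm(history)
        history.append(f"Thought: {thought}")
        if action == "finish":
            return arg, history
        observation = tools[action](arg)
        history.append(f"Action: {action}[{arg}]")
        history.append(f"Observation: {observation}")
    return None, history  # step budget exhausted without a final answer

final_answer, trace = react("What is the error rate?")
print("\n".join(trace))
```

The `max_steps` cap is a design choice worth keeping in any real loop, since it bounds cost when the model never emits a finishing action.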
5. Reflection
After each task, the agent reviews its actions and learns from both failures (e.g., tool‑call errors) and successes. Effective execution traces are stored for future reuse, which shifts the agent from static prompts to experience‑driven behavior.
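A simple way to picture this is an experience store that records what worked and what failed per task type, then turns that record into prompt hints. The class and method names below are hypothetical, and an in-memory dictionary stands in for whatever storage a real system would use.

```python
class ExperienceStore:
    """Keeps successful traces and failure notes, keyed by task type."""

    def __init__(self):
        self.successes = {}  # task type -> most recent list of steps that worked
        self.lessons = {}    # task type -> list of failure notes

    def reflect(self, task_type, steps, succeeded, error=None):
        """Record the outcome of one finished task."""
        if succeeded:
            self.successes[task_type] = list(steps)
        else:
            last = steps[-1] if steps else "(no step)"
            note = f"Avoid: {last} failed with {error}"
            self.lessons.setdefault(task_type, []).append(note)

    def hints(self, task_type):
        """Build text to prepend to the prompt for a new task of this type."""
        lines = []
        if task_type in self.successes:
            lines.append("Previously successful steps: " + " -> ".join(self.successes[task_type]))
        lines.extend(self.lessons.get(task_type, []))
        return "\n".join(lines)

store = ExperienceStore()
store.reflect("missing_metrics", ["query_metric(namespace=None)"], False, "namespace required")
store.reflect("missing_metrics", ["list_namespaces", "query_metric"], True)
print(store.hints("missing_metrics"))
```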
6. Tree of Thought (ToT)
ToT expands reasoning from a linear chain to a tree, so the agent can explore multiple solution paths in parallel and apply self‑consistency voting across them to make responses more stable.
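The sketch below shows the two ideas together: a beam-limited tree expansion followed by a majority vote over the leaf conclusions. It is an illustration only. In a real ToT setup an LLM proposes and scores the thoughts, whereas here a hypothetical lookup table (`TREE`) and a trivial score function take their place.

```python
from collections import Counter

def tree_of_thought(root, expand, score, beam_width=2, depth=2):
    """Grow reasoning paths level by level, keeping the best `beam_width` paths."""
    frontier = [[root]]
    for _ in range(depth):
        candidates = [path + [thought] for path in frontier for thought in expand(path)]
        if not candidates:
            break
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
    return frontier

def vote(answers):
    """Self-consistency: return the most common answer and its share of votes."""
    winner, count = Counter(answers).most_common(1)[0]
    return winner, count / len(answers)

# Hypothetical thought tree for one diagnostic question.
TREE = {
    "alert silent": ["check rule", "check metric", "check notifier"],
    "check rule": ["metric missing"],
    "check metric": ["metric missing"],
    "check notifier": ["webhook down"],
}

paths = tree_of_thought(
    "alert silent",
    expand=lambda path: TREE.get(path[-1], []),
    score=len,  # placeholder score: every path at the same depth ties
    beam_width=3,
)
print(vote([path[-1] for path in paths]))
```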
7. Monte Carlo Tree Search (MCTS)
MCTS integrates reinforcement‑learning style scoring into ToT, evaluating each node with a scoring model or reward function to dynamically optimize reasoning paths.
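A compact sketch of the search loop follows. It makes several assumptions: nodes are scored directly by a reward function instead of a random rollout (matching the scoring-model idea in the text), UCB1 is used for selection, and the step names and reward values are invented for the example.

```python
import math

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

    def ucb1(self, c=1.4):
        """Mean reward plus an exploration bonus; unvisited nodes go first."""
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts(root_state, expand, reward, iterations=200):
    root = Node(root_state)
    for _ in range(iterations):
        node = root
        # Selection: follow the highest UCB1 score down to a leaf.
        while node.children:
            node = max(node.children, key=Node.ucb1)
        # Expansion: once a leaf has been visited, add its children.
        if node.visits > 0:
            node.children = [Node(s, node) for s in expand(node.state)]
            if node.children:
                node = node.children[0]
        # Evaluation: score the node (stands in for a rollout or scoring model).
        r = reward(node.state)
        # Backpropagation: update statistics along the path to the root.
        while node is not None:
            node.visits += 1
            node.value += r
            node = node.parent
    # Recommend the most-visited first step.
    return max(root.children, key=lambda n: n.visits).state

# Hypothetical diagnostic steps and per-step usefulness scores.
STEP_SCORE = {"check_rule": 0.2, "check_metric": 0.9, "check_notifier": 0.4}

def expand(path):
    return [path + (s,) for s in STEP_SCORE] if len(path) < 2 else []

def reward(path):
    return sum(STEP_SCORE[s] for s in path) / len(path) if path else 0.0

print(mcts((), expand, reward))
```

Each iteration costs one scoring call, which in an LLM setting means one model request, so the iteration budget translates directly into API expense.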
Solution Architecture
1. Intent Recognition
The system parses user inputs, handling domain‑specific terms (e.g., "sherlock", "Raptor", "platform metrics") using a private vocabulary and entity extraction, then enriches the query with RAG to resolve ambiguities.
2. Routing
After intent detection, requests are routed to specialized agents (e.g., Q&A Agent for documentation lookup, diagnostic agents for data loss). A two‑stage recall‑plus‑ranking approach first recalls candidate agents based on keywords, then uses few‑shot LLM prompting with chain‑of‑thought and self‑consistency to rank them.
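The two stages can be sketched as below. Everything specific here is an assumption: the agent names, their keyword sets, and `make_stub_llm`, which replays scripted answers in place of sampling a real few-shot, chain-of-thought prompt several times.

```python
import re
from collections import Counter

# Hypothetical agents and the keywords used to recall them.
AGENTS = {
    "qa_agent": {"docs", "api", "how", "usage"},
    "data_loss_agent": {"missing", "lost", "gap", "metrics"},
    "quota_agent": {"quota", "limit", "increase"},
}

def recall(query, agents=AGENTS):
    """Stage 1: cheap keyword recall of candidate agents."""
    words = set(re.findall(r"[a-z]+", query.lower()))
    return [name for name, keywords in agents.items() if words & keywords]

def rank(query, candidates, llm, samples=5):
    """Stage 2: sample the LLM several times and order agents by votes."""
    votes = Counter(llm(query, candidates) for _ in range(samples))
    return [name for name, _ in votes.most_common()]

def route(query, llm, fallback="qa_agent"):
    candidates = recall(query)
    if not candidates:
        return fallback
    if len(candidates) == 1:
        return candidates[0]  # no need to spend LLM calls
    return rank(query, candidates, llm)[0]

def make_stub_llm(answers):
    """Stand-in for repeated LLM sampling: replays a fixed list of answers."""
    replies = iter(answers)
    def llm(query, candidates):
        choice = next(replies)
        return choice if choice in candidates else candidates[0]
    return llm

stub = make_stub_llm(["data_loss_agent", "qa_agent", "data_loss_agent",
                      "data_loss_agent", "qa_agent"])
print(route("My metrics are missing, how do I fix it?", stub))
```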
3. Planning
Initially the team attempted autonomous planning but found human‑crafted SOPs (Standard Operating Procedures) more reliable. SOP libraries encode common diagnostic flows; when no SOP matches, the LLM dynamically generates a plan.
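The SOP-first lookup with an LLM fallback might look like the following minimal sketch. The intent labels, the SOP steps, and the `llm_planner` callable are all hypothetical placeholders, not the team's actual library.

```python
# Hypothetical SOP library: intent label -> ordered diagnostic steps.
SOP_LIBRARY = {
    "missing_metrics": [
        "verify metric name and labels",
        "check ingestion status",
        "check quota usage",
        "escalate to on-call",
    ],
    "alert_not_firing": [
        "validate rule expression",
        "check notification channel",
    ],
}

def make_plan(intent, llm_planner):
    """Prefer a human-written SOP; fall back to an LLM-generated plan."""
    if intent in SOP_LIBRARY:
        return {"source": "sop", "steps": list(SOP_LIBRARY[intent])}
    return {"source": "llm", "steps": llm_planner(intent)}

def stub_planner(intent):
    """Stand-in for an LLM call that drafts a plan for an unseen intent."""
    return [f"gather context for {intent}", "propose next diagnostic step"]

print(make_plan("missing_metrics", stub_planner)["source"])
print(make_plan("dashboard_slow", stub_planner)["steps"])
```

Recording the `source` field alongside the steps makes it easy to track how often the fallback is used, which can indicate where a new SOP would be worth writing.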
4. Controlled Execution
The final execution uses a single‑path best‑first strategy: at each step the LLM selects the most promising tool, guided by self‑consistency voting, avoiding costly tree‑search while maintaining effectiveness.
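As a rough sketch of this strategy, the loop below samples several tool proposals per step, executes only the majority choice, and never branches. The proposal script, tool names, and tool outputs are invented for illustration; in practice `propose` would be an LLM call.

```python
from collections import Counter

def execute(goal, propose, tools, samples=3, max_steps=5):
    """Single-path execution: vote on the next tool, run it, repeat."""
    trace = []
    for _ in range(max_steps):
        votes = Counter(propose(goal, trace) for _ in range(samples))
        tool = votes.most_common(1)[0][0]
        if tool == "done":
            break
        trace.append((tool, tools[tool]()))
    return trace

def make_stub_propose(script):
    """Stand-in for LLM sampling: row i holds the samples returned at step i."""
    calls = {"n": 0}
    def propose(goal, trace):
        row = script[min(len(trace), len(script) - 1)]
        choice = row[calls["n"] % len(row)]
        calls["n"] += 1
        return choice
    return propose

# Hypothetical tools returning canned observations.
TOOLS = {
    "check_ingestion": lambda: "ingestion healthy",
    "check_quota": lambda: "quota exceeded",
}

SCRIPT = [
    ["check_ingestion", "check_ingestion", "check_quota"],
    ["check_quota", "check_quota", "check_quota"],
    ["done", "done", "check_quota"],
]

print(execute("metrics missing", make_stub_propose(SCRIPT), TOOLS))
```

Compared with tree search, the cost per step is fixed at `samples` model calls, at the price of not being able to backtrack from a poor choice.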
5. Scalability via MCP
The platform adopts an architecture based on the Model Context Protocol (MCP). MCP servers handle tool registration and invocation, while agents act as MCP clients. This plug‑and‑play design makes it easy to add new capabilities (e.g., metric analysis, anomaly detection) and decouples agents from specific tools.
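The decoupling can be illustrated with a plain-Python registry. This is explicitly not the MCP SDK or its wire protocol: `ToolServer`, its methods, and the `detect_anomaly` tool are hypothetical names that only mirror the pattern of listing tools and calling them by name.

```python
class ToolServer:
    """Toy stand-in for a tool server: registers tools and invokes them by name."""

    def __init__(self):
        self._tools = {}

    def tool(self, name, description):
        """Decorator that registers a function as a named tool."""
        def register(fn):
            self._tools[name] = {"description": description, "fn": fn}
            return fn
        return register

    def list_tools(self):
        """What a client sees when it asks which tools are available."""
        return {name: t["description"] for name, t in self._tools.items()}

    def call_tool(self, name, **arguments):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name]["fn"](**arguments)

server = ToolServer()

@server.tool("detect_anomaly", "Flag values more than `threshold` away from the mean")
def detect_anomaly(values, threshold):
    mean = sum(values) / len(values)
    return [v for v in values if abs(v - mean) > threshold]

# The agent side discovers tools at runtime instead of importing them.
print(server.list_tools())
print(server.call_tool("detect_anomaly", values=[1, 1, 1, 9], threshold=3))
```

Adding a capability then means registering one more tool on the server side; the agent code that lists and calls tools does not change.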
Future Outlook
1. Evaluation Uncertainty
Assessing the system’s impact is difficult due to non‑deterministic user feedback and multi‑turn interactions. The team seeks community input on robust evaluation metrics for such weak‑signal, real‑world tasks.
2. Multi‑Agent Collaboration
Current centralized routing limits scalability across organizational teams. The team envisions a Unified Support Agent Framework where agents can hand off context to other agents, enabling cross‑team coordination and reducing “ticket bouncing.”
3. Ongoing Research
Explorations include advanced multi‑agent orchestration, dynamic SOP generation, and cost‑effective high‑order search methods to balance performance with API expense.
Overall, the system integrates intent detection, RAG, workflow routing, ReAct reasoning, reflection, tree‑search, and scalable MCP infrastructure to build a controllable, extensible AI‑driven operations support platform.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
