How Large Language Model AI Agents Transform Intelligent Operations and On‑Call Support

This article details the design and implementation of a large‑model‑driven intelligent operations dialogue system, covering intent recognition, routing, multi‑agent planning, RAG, workflow, ReAct, reflection, tree‑search techniques, evaluation challenges, and future multi‑agent collaboration for on‑call support.

DataFunSummit

Background Overview

Enterprise systems are becoming increasingly complex, demanding robust observability platforms. eBay's Sherlock team builds a company‑wide monitoring platform that ingests metrics, logs, traces, and events, providing alerting, anomaly detection, and root‑cause analysis. On‑call engineers handle repetitive, high‑frequency support requests such as API usage queries, PromQL construction, missing metrics, alert failures, and resource quota adjustments.

Challenges in On‑Call Support

High repetition: many requests can be answered from documentation.

High response cost: manual handling consumes developer time.

Low knowledge transfer efficiency: new users struggle to learn the platform.

AI Agent Solution

To address these challenges, the team introduces AI Agents powered by large language models (LLMs). Compared with traditional FAQ or rule‑based systems, AI Agents offer stronger semantic understanding, code analysis, and dynamic Q&A capabilities for complex support scenarios.

1. Retrieval‑Augmented Generation (RAG)

RAG splits internal documents, stores them in a vector database, and retrieves semantically relevant chunks as context for the LLM, improving answer accuracy for internal knowledge queries.
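
The retrieval step can be sketched as follows. This is a minimal illustration, not the team's implementation: the chunks are invented, and a word-count vector stands in for a real embedding model and vector database.

```python
import math
import re
from collections import Counter

# Hypothetical document chunks; real chunks would come from internal docs.
CHUNKS = [
    "Use PromQL rate() over a counter to compute per-second request rates.",
    "Resource quota adjustments are requested through the quota form.",
    "Missing metrics are often caused by a misconfigured scrape target.",
]

def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (stand-in for an embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(CHUNKS, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str) -> str:
    """Place the retrieved chunks in front of the question as LLM context."""
    context = "\n".join(retrieve(query, k=2))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Calling build_prompt with a user question yields the text that would be sent to the LLM; swapping embed for a real embedding model changes retrieval quality but not the overall flow.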

2. Workflow

Beyond RAG, a workflow mechanism maps user requests to predefined process graphs, routing events and metrics queries to appropriate knowledge bases or modules.
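
One way to picture a predefined process graph is a dictionary of nodes and successors. The node names and the keyword classifier below are assumptions made for illustration only.

```python
# Hypothetical process graph: the "classify" node branches by request category,
# every other node names its single successor (None ends the flow).
WORKFLOW = {
    "classify": {"events": "search_events_kb", "metrics": "search_metrics_kb"},
    "search_events_kb": "answer",
    "search_metrics_kb": "answer",
    "answer": None,
}

def classify(request: str) -> str:
    """Keyword stand-in for the real request classifier."""
    return "events" if "event" in request.lower() else "metrics"

def run_workflow(request: str) -> list[str]:
    """Walk the graph from the classifier to the end and return the visited nodes."""
    path = ["classify"]
    node = WORKFLOW["classify"][classify(request)]
    while node is not None:
        path.append(node)
        node = WORKFLOW[node]
    return path
```

Because the graph is fixed ahead of time, every request follows a path that can be audited, which is the main contrast with free-form agent planning.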

3. AI Agents

Agents are built on LangChain and extended with tool‑calling, planning, memory, and environment feedback. They support multi‑step reasoning, tool invocation, and dynamic execution.

4. ReAct (Reasoning and Acting)

ReAct introduces a “think‑then‑act” paradigm: the agent first performs explicit reasoning, outputs a reasoning trace, then executes the corresponding action, improving decision accuracy in multi‑step problems.
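
The loop below is a minimal sketch of the think-then-act cycle. The tools, their outputs, and the scripted_llm function are hypothetical; scripted_llm replaces a real model call so the example runs on its own.

```python
from typing import Callable

# Hypothetical tools; real ones would query the monitoring platform.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup_metric": lambda name: f"metric {name} last reported 2h ago",
    "check_scrape_target": lambda name: f"scrape target for {name} is down",
}

def scripted_llm(question: str, trace: list[str]) -> dict:
    """Stand-in for an LLM call: returns a thought plus an action or a final answer."""
    seen = sum(1 for line in trace if line.startswith("Observation"))
    if seen == 0:
        return {"thought": "Check when the metric last reported.",
                "action": "lookup_metric", "input": "cpu_usage"}
    if seen == 1:
        return {"thought": "It is stale, so inspect the scrape target.",
                "action": "check_scrape_target", "input": "cpu_usage"}
    return {"thought": "The target is down, which explains the gap.",
            "final": "cpu_usage is missing because its scrape target is down."}

def react(question: str, llm=scripted_llm, max_steps: int = 5) -> tuple[str, list[str]]:
    """Alternate reasoning and tool calls until the model emits a final answer."""
    trace: list[str] = []
    for _ in range(max_steps):
        step = llm(question, trace)
        trace.append(f"Thought: {step['thought']}")  # reasoning is recorded first
        if "final" in step:
            return step["final"], trace
        observation = TOOLS[step["action"]](step["input"])  # then the action runs
        trace.append(f"Action: {step['action']}({step['input']})")
        trace.append(f"Observation: {observation}")
    return "No answer within the step budget.", trace
```

The trace is fed back on every call, so each new thought can build on earlier observations; max_steps bounds the cost if the model never concludes.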

5. Reflection

After each task, the agent reviews its actions and learns from both failures (e.g., tool‑call errors) and successes. It stores effective execution traces for reuse, so its behavior shifts from static prompts to accumulated experience.
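
A simple experience store illustrates the idea. The class, its fields, and the task-type key are assumptions; the source does not describe how traces are actually stored or retrieved.

```python
from dataclasses import dataclass, field

@dataclass
class ExperienceStore:
    """Keeps tool sequences that worked and notes about attempts that failed."""
    successes: dict[str, list[str]] = field(default_factory=dict)
    failures: dict[str, list[str]] = field(default_factory=dict)

    def reflect(self, task_type: str, steps: list[str], succeeded: bool, note: str = "") -> None:
        """Record the outcome of a finished task."""
        if succeeded:
            self.successes[task_type] = steps
        else:
            self.failures.setdefault(task_type, []).append(note or "unspecified failure")

    def hints_for(self, task_type: str) -> str:
        """Build a prompt fragment from past experience with this task type."""
        lines = []
        if task_type in self.successes:
            lines.append("Previously successful steps: " + " -> ".join(self.successes[task_type]))
        for note in self.failures.get(task_type, []):
            lines.append("Avoid: " + note)
        return "\n".join(lines)
```

The string returned by hints_for would be prepended to the prompt for the next task of the same type, which is how past runs influence future behavior in this sketch.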

6. Tree of Thought (ToT)

ToT expands reasoning from a linear chain to a tree, so the agent can explore several solution paths in parallel. Self‑consistency voting across those paths then makes responses more stable.
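
A toy version of tree-structured reasoning is shown below, assuming a fixed thought tree and hand-written scores. In a real system the LLM would propose the child thoughts and an evaluator would score them.

```python
from collections import Counter

# Hypothetical thought tree: each thought maps to its possible next thoughts.
CHILDREN = {
    "root": ["metric never emitted", "metric dropped in pipeline"],
    "metric never emitted": ["answer: fix client instrumentation"],
    "metric dropped in pipeline": ["answer: raise ingestion quota", "answer: fix relabel rule"],
}
# Hypothetical evaluator scores in [0, 1].
SCORES = {
    "metric never emitted": 0.3,
    "metric dropped in pipeline": 0.8,
    "answer: fix client instrumentation": 0.4,
    "answer: raise ingestion quota": 0.9,
    "answer: fix relabel rule": 0.6,
}

def tree_of_thought(beam_width: int = 2) -> list[str]:
    """Expand level by level, keeping the best `beam_width` paths at each level."""
    frontier = [["root"]]
    finished = []
    while frontier:
        candidates = [path + [child] for path in frontier for child in CHILDREN.get(path[-1], [])]
        candidates.sort(key=lambda p: SCORES[p[-1]], reverse=True)
        frontier = []
        for path in candidates[:beam_width]:
            (finished if path[-1].startswith("answer:") else frontier).append(path)
    return max(finished, key=lambda p: SCORES[p[-1]])

def self_consistency(answers: list[str]) -> str:
    """Majority vote over answers from independently sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]
```

With the default beam width both first-level thoughts survive, and the highest-scoring answer under the stronger branch is returned.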

7. Monte Carlo Tree Search (MCTS)

MCTS integrates reinforcement‑learning style scoring into ToT, evaluating each node with a scoring model or reward function to dynamically optimize reasoning paths.
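
The following sketch applies UCB1-based selection and backpropagation to a small fixed tree. The tree, the reward table, and the exploration constant are assumptions, and a direct reward lookup replaces the random rollout used in classic MCTS.

```python
import math

# Hypothetical reasoning tree and leaf rewards; a scoring model would supply rewards.
CHILDREN = {
    "root": ["check ingestion", "check client"],
    "check ingestion": ["quota exceeded", "relabel rule drops series"],
    "check client": ["sdk misconfigured"],
}
REWARD = {"quota exceeded": 0.9, "relabel rule drops series": 0.5, "sdk misconfigured": 0.2}

class Node:
    def __init__(self, name, parent=None):
        self.name, self.parent = name, parent
        self.children, self.visits, self.total = [], 0, 0.0

    def ucb1(self, c=1.4):
        """Average reward plus an exploration bonus that shrinks with visits."""
        if self.visits == 0:
            return math.inf  # unvisited nodes are always tried first
        exploit = self.total / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore

def mcts(iterations: int = 50) -> str:
    """Return the first reasoning step that received the most visits."""
    root = Node("root")
    for _ in range(iterations):
        node = root
        # Selection and expansion: descend by UCB1 until a leaf of the tree.
        while node.name in CHILDREN:
            if not node.children:
                node.children = [Node(n, node) for n in CHILDREN[node.name]]
            node = max(node.children, key=Node.ucb1)
        reward = REWARD[node.name]  # evaluation step (no random rollout here)
        # Backpropagation: update statistics on the path back to the root.
        while node is not None:
            node.visits += 1
            node.total += reward
            node = node.parent
    return max(root.children, key=lambda n: n.visits).name
```

Each iteration costs one evaluation, which in an LLM setting means one or more model calls; that cost is the reason a cheaper strategy may be preferred in production.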

Solution Architecture

1. Intent Recognition

The system parses user inputs, handling domain‑specific terms (e.g., "sherlock", "Raptor", "platform metrics") using a private vocabulary and entity extraction, then enriches the query with RAG to resolve ambiguities.
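
A dictionary lookup is enough to show how a private vocabulary helps with domain terms. The vocabulary entries, intent names, and keyword rules below are illustrative assumptions; the RAG enrichment step is omitted.

```python
import re

# Hypothetical private vocabulary: surface form -> canonical entity name.
VOCABULARY = {
    "sherlock": "Sherlock",
    "raptor": "Raptor",
    "platform metrics": "platform metrics",
}
# Hypothetical keyword rules standing in for an LLM-based intent classifier.
INTENT_KEYWORDS = {
    "promql": "query_help",
    "quota": "quota_request",
    "missing": "missing_data",
}

def extract_entities(text: str) -> list[str]:
    """Return canonical names for vocabulary terms found as whole words or phrases."""
    lowered = text.lower()
    return [canonical for term, canonical in VOCABULARY.items()
            if re.search(rf"\b{re.escape(term)}\b", lowered)]

def recognize(text: str) -> dict:
    """Combine a coarse intent label with the extracted domain entities."""
    lowered = text.lower()
    intent = next((name for kw, name in INTENT_KEYWORDS.items() if kw in lowered),
                  "general_question")
    return {"intent": intent, "entities": extract_entities(text)}
```

The extracted entities can then be used as retrieval keys, so that an ambiguous word is resolved against internal documentation rather than its everyday meaning.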

2. Routing

After intent detection, requests are routed to specialized agents (e.g., Q&A Agent for documentation lookup, diagnostic agents for data loss). A two‑stage recall‑plus‑ranking approach first recalls candidate agents based on keywords, then uses few‑shot LLM prompting with chain‑of‑thought and self‑consistency to rank them.
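
The two stages can be sketched as below. The agent names, keyword sets, and fallback to the Q&A agent are assumptions, and keyword_overlap_sampler is a deterministic stand-in for a few-shot LLM call.

```python
from collections import Counter
from typing import Callable

# Hypothetical agent registry: agent name -> recall keywords.
AGENTS = {
    "qa_agent": {"how", "docs", "api", "promql"},
    "data_loss_agent": {"missing", "lost", "gap", "metrics"},
    "quota_agent": {"quota", "limit"},
}

def tokenize(query: str) -> set[str]:
    return set(query.lower().replace("?", " ").split())

def recall(query: str) -> list[str]:
    """Stage 1: keep agents sharing at least one keyword with the query."""
    words = tokenize(query)
    return [name for name, keywords in AGENTS.items() if words & keywords]

def rank(query: str, candidates: list[str],
         sample: Callable[[str, list[str]], str], n: int = 5) -> str:
    """Stage 2: sample the ranker n times and return the majority choice."""
    votes = Counter(sample(query, candidates) for _ in range(n))
    return votes.most_common(1)[0][0]

def keyword_overlap_sampler(query: str, candidates: list[str]) -> str:
    """Deterministic stand-in for a few-shot, chain-of-thought LLM call."""
    words = tokenize(query)
    return max(candidates, key=lambda name: len(words & AGENTS[name]))

def route(query: str) -> str:
    candidates = recall(query) or ["qa_agent"]  # assumed fallback when nothing is recalled
    return rank(query, candidates, keyword_overlap_sampler)
```

Recall keeps the prompt short by limiting how many agents the LLM has to compare, while repeated sampling reduces the chance that one unlucky generation misroutes the request.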

3. Planning

Initially the team attempted autonomous planning but found human‑crafted SOPs (Standard Operating Procedures) more reliable. SOP libraries encode common diagnostic flows; when no SOP matches, the LLM dynamically generates a plan.
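
The SOP-first policy with a generated fallback might look like this. The SOP contents and llm_plan_stub are invented for the example; exact key matching stands in for whatever matching the real system uses.

```python
from typing import Callable

# Hypothetical SOP library: symptom -> ordered diagnostic steps.
SOP_LIBRARY = {
    "missing_metric": ["verify metric name", "check scrape target", "check ingestion quota"],
    "alert_not_firing": ["validate alert rule", "check notification channel"],
}

def llm_plan_stub(symptom: str) -> list[str]:
    """Stand-in for an LLM-generated plan."""
    return [f"gather context for {symptom}", "propose next diagnostic step"]

def make_plan(symptom: str,
              generate: Callable[[str], list[str]] = llm_plan_stub) -> tuple[str, list[str]]:
    """Prefer a human-written SOP; generate a plan only when none matches."""
    if symptom in SOP_LIBRARY:
        return "sop", list(SOP_LIBRARY[symptom])
    return "generated", generate(symptom)
```

Returning the plan's origin alongside its steps lets later stages treat generated plans with more caution than reviewed SOPs.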

4. Controlled Execution

The final execution uses a single‑path best‑first strategy: at each step the LLM selects the most promising tool, guided by self‑consistency voting, avoiding costly tree‑search while maintaining effectiveness.
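
A minimal sketch of single-path execution with voting follows. The tools, their canned outputs, and the noisy selector are hypothetical; the selector imitates an LLM that occasionally proposes an outlier tool.

```python
from collections import Counter
from typing import Callable

# Hypothetical tools returning canned observations.
TOOLS: dict[str, Callable[[], str]] = {
    "check_target": lambda: "target up",
    "check_quota": lambda: "quota exceeded",
    "restart_agent": lambda: "agent restarted",
}

def make_noisy_selector() -> Callable[[list[str]], str]:
    """Stand-in for an LLM tool selector whose every fifth sample is an outlier."""
    calls = 0
    def select(state: list[str]) -> str:
        nonlocal calls
        calls += 1
        preferred = ["check_target", "check_quota", "finish"][min(len(state), 2)]
        return "restart_agent" if calls % 5 == 0 else preferred
    return select

def choose_tool(state: list[str], select: Callable[[list[str]], str], n: int = 5) -> str:
    """Sample the selector n times and commit to the majority vote."""
    return Counter(select(state) for _ in range(n)).most_common(1)[0][0]

def run(select: Callable[[list[str]], str], max_steps: int = 4) -> list[str]:
    """Follow one path only: pick the voted tool, run it, and never backtrack."""
    state: list[str] = []
    for _ in range(max_steps):
        tool = choose_tool(state, select)
        if tool == "finish":
            break
        state.append(f"{tool}: {TOOLS[tool]()}")
    return state
```

Here the outlier vote for restart_agent is outvoted at every step, so the run never executes it. The cost is n selector calls per step instead of a full tree of expansions.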

5. Scalability via MCP

The platform adopts an architecture based on the Model Context Protocol (MCP). MCP servers expose tools and handle their invocation, while agents connect to those servers as MCP clients. This plug‑and‑play design makes it easy to add new capabilities (e.g., metric analysis, anomaly detection) and decouples agents from specific tools.
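
The decoupling can be illustrated with a plain in-process registry. This is not the MCP SDK or its wire protocol: ToolServer, its methods, and the detect_anomaly tool are hypothetical names that only mirror the idea of listing and calling tools by name.

```python
from typing import Callable

class ToolServer:
    """In-process stand-in for a tool server; not the real MCP SDK or protocol."""

    def __init__(self) -> None:
        self._tools: dict[str, tuple[str, Callable[..., object]]] = {}

    def register(self, name: str, description: str):
        """Decorator that adds a function to the registry under a public name."""
        def decorator(fn):
            self._tools[name] = (description, fn)
            return fn
        return decorator

    def list_tools(self) -> dict[str, str]:
        """What an agent sees: tool names and descriptions, never the functions."""
        return {name: desc for name, (desc, _) in self._tools.items()}

    def call_tool(self, name: str, **arguments):
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        return self._tools[name][1](**arguments)

server = ToolServer()

@server.register("detect_anomaly", "Flag points far above the series mean")
def detect_anomaly(values: list[float], factor: float = 2.0) -> list[int]:
    """Return indexes of values greater than factor times the mean."""
    mean = sum(values) / len(values)
    return [i for i, v in enumerate(values) if v > factor * mean]
```

An agent that only uses list_tools and call_tool picks up a newly registered capability without any code change on the agent side.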

Future Outlook

1. Evaluation Uncertainty

Assessing the system’s impact is difficult due to non‑deterministic user feedback and multi‑turn interactions. The team seeks community input on robust evaluation metrics for such weak‑signal, real‑world tasks.

2. Multi‑Agent Collaboration

Current centralized routing limits scalability across organizational teams. The team envisions a Unified Support Agent Framework where agents can hand off context to other agents, enabling cross‑team coordination and reducing “ticket bouncing.”
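
Since the framework is still a proposal, the sketch below only illustrates what a context handoff could look like. The Handoff structure, the two agents, and their findings are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class Handoff:
    """Context that travels with the request from agent to agent."""
    question: str
    findings: list[str] = field(default_factory=list)
    visited: list[str] = field(default_factory=list)

def monitoring_agent(h: Handoff) -> Optional[str]:
    h.findings.append("metrics pipeline healthy")
    return "network_agent"  # not a monitoring problem, so hand off

def network_agent(h: Handoff) -> Optional[str]:
    h.findings.append("packet loss on collector subnet")
    return None  # resolved here

AGENTS: dict[str, Callable[[Handoff], Optional[str]]] = {
    "monitoring_agent": monitoring_agent,
    "network_agent": network_agent,
}

def resolve(question: str, start: str, max_hops: int = 5) -> Handoff:
    """Pass the same context along until an agent resolves the request."""
    h = Handoff(question)
    current: Optional[str] = start
    while current is not None and len(h.visited) < max_hops:
        if current in h.visited:
            break  # stop instead of bouncing back to an agent already tried
        h.visited.append(current)
        current = AGENTS[current](h)
    return h
```

Because findings accumulate in the handoff object, the second agent starts from what the first one already ruled out rather than asking the user again.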

3. Ongoing Research

Explorations include advanced multi‑agent orchestration, dynamic SOP generation, and cost‑effective high‑order search methods to balance performance with API expense.

Overall, the system integrates intent detection, RAG, workflow routing, ReAct reasoning, reflection, tree‑search, and scalable MCP infrastructure to build a controllable, extensible AI‑driven operations support platform.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
