Large Language Models Power Big Data SRE Knowledge & Root‑Cause Automation

Facing the growing complexity of big‑data platforms, the SRE team adopted large‑language‑model agents to automate knowledge management and root‑cause analysis, employing Retrieval‑Augmented Generation, a vector store, and the Model Context Protocol to enable intelligent, scalable, and efficient incident diagnosis and resolution.

Background

As big‑data platform components become increasingly complex and large‑scale, operations have evolved from manual to tool‑driven, achieving automated delivery and high repeatability. The next stage, “intelligent operations”, requires SREs to improve anomaly detection, root‑cause localization, and capacity forecasting to reduce incident rates and MTTR.

SRE Knowledge Management

The team accumulated many internal documents covering various components, but retrieval was difficult, writing styles were inconsistent, and the barrier to using them was high, leading to reliance on word-of-mouth and occasional production errors. A Retrieval‑Augmented Generation (RAG) system built on large language models addresses these problems. The pipeline has four stages:

1. Document processing and vectorization: convert documents into semantic vectors to build a knowledge map.

2. Vector storage: persist the vectors in a vector database.

3. Query vector search: embed the user's query and match it against the most relevant document fragments.

4. Context injection and answer generation: feed the matched fragments and context to the LLM with a prompt to generate a professional answer (a sketch of this query path follows the list).
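
A minimal sketch of this query path in Python. The embed_fn and llm_fn callables are hypothetical stand-ins for the platform's embedding model and the Qwen‑72B endpoint; the in-memory store is only for illustration.

from dataclasses import dataclass
from typing import Callable, List
import math


@dataclass
class Chunk:
    doc_id: str
    text: str
    vector: List[float]


def cosine(a: List[float], b: List[float]) -> float:
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: List[float], store: List[Chunk], top_k: int = 3) -> List[Chunk]:
    # Rank every stored chunk by similarity to the query vector.
    return sorted(store, key=lambda c: cosine(query_vec, c.vector), reverse=True)[:top_k]


def answer(question: str, store: List[Chunk],
           embed_fn: Callable[[str], List[float]],
           llm_fn: Callable[[str], str]) -> str:
    # 1) embed the query, 2) retrieve the closest chunks, 3) inject them as context.
    hits = retrieve(embed_fn(question), store)
    context = "\n\n".join(f"[{c.doc_id}] {c.text}" for c in hits)
    prompt = ("Answer using ONLY the context below. If the context is insufficient, "
              "say you do not know.\n\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm_fn(prompt)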

Specific Solution

The team chose the internal PowerAgent platform, which provides model compute and agent‑building capabilities, to quickly implement the RAG system.

Knowledge base organization

Big Data SRE Knowledge Base
├── flink
│   ├── Online on-call handling records
│   │   ├── 2025-08-08 xxxx issue handling
│   │   ├── ···
│   ├── Operations
│   │   ├── Monitoring links
│   │   ├── Management links
│   │   ├── ···
│   ├── SOP
│   │   ├── Standardized cluster scale-out procedure
│   │   ├── Standardized compute-node decommissioning procedure
│   │   ├── ops tool command usage
│   │   ├── ···
│   ├── Incident post-mortems
│   │   ├── ···
│   └── …
├── hdfs
│   └── …

Documents are uploaded, automatically chunked, and stored in the vector DB. Frequently updated operational docs are synchronized daily to keep the knowledge base fresh.
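
A sketch of what such a refresh pass can look like, assuming chunks get deterministic IDs so the daily sync overwrites stale vectors rather than duplicating them. The vector_db.upsert call and embed_fn are illustrative placeholders for the platform's vector store and embedding model.

import hashlib
from typing import Callable, List


def chunk_text(text: str, size: int = 800, overlap: int = 100) -> List[str]:
    # Fixed-size chunks with a small overlap so context is not cut mid-sentence.
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def sync_document(doc_id: str, text: str,
                  embed_fn: Callable[[str], List[float]],
                  vector_db) -> None:
    for n, piece in enumerate(chunk_text(text)):
        # Deterministic chunk ID: the daily sync overwrites the same keys
        # instead of accumulating duplicate vectors.
        chunk_id = hashlib.sha1(f"{doc_id}:{n}".encode()).hexdigest()
        vector_db.upsert(id=chunk_id,
                         vector=embed_fn(piece),
                         metadata={"doc_id": doc_id, "chunk": n, "text": piece})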

Agent configuration

The agent workflow connects to the vector DB and uses the Qwen‑72B model with a system prompt that forces the assistant to rely on provided context and to admit when information is insufficient.
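
A paraphrased sketch of such an agent configuration; the field names and prompt wording below are assumptions, not the team's exact PowerAgent settings.

AGENT_CONFIG = {
    "model": "Qwen-72B",
    "knowledge_base": "bigdata-sre-kb",  # vector DB collection bound to the agent
    "retrieval": {"top_k": 5, "score_threshold": 0.5},
    "system_prompt": (
        "You are the big-data SRE knowledge assistant. Answer strictly from the "
        "retrieved context passages provided to you, and cite the source document "
        "for each claim. If the context does not contain the answer, say that the "
        "knowledge base has no relevant record instead of guessing."
    ),
}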

Root‑Cause Analysis

The core idea is to link a large language model with observability systems via the Model Context Protocol (MCP), creating an AI‑driven root‑cause analysis module that ingests metrics and logs, performs multimodal analysis, and outputs hypotheses, evidence, and remediation suggestions.

Data Collection Layer

To minimize impact on existing monitoring, a unified data collection and storage scheme standardizes metric formats and labeling.
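
As a hypothetical illustration of such standardization, a collection-side normalizer could enforce one metric-name convention and a fixed set of required labels before storage; all field names here are assumptions.

from typing import Dict

REQUIRED_LABELS = ("component", "cluster", "instance")


def normalize_sample(name: str, value: float, labels: Dict[str, str]) -> Dict:
    # Reject samples that do not carry the agreed label set, and rewrite the
    # metric name into one shared convention before storage.
    missing = [key for key in REQUIRED_LABELS if key not in labels]
    if missing:
        raise ValueError(f"metric {name} is missing required labels: {missing}")
    return {
        "metric": name.lower().replace(".", "_"),
        "value": value,
        "labels": {key: labels[key] for key in REQUIRED_LABELS},
    }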

Model Context Protocol (MCP)

MCP is an open‑standard protocol that provides a bridge for LLMs to safely access external services. Two MCP servers are deployed: one for Elasticsearch logs and one for Prometheus metrics.

docker run -d --rm \
  --network=host \
  -e ES_URL="http://x.x.x.x:9200" \
  x.x.x/mcp/elasticsearch-mcp-server:v1.0 http

docker run -d --rm \
  --network=host \
  -e PROMETHEUS_URL="http://x.x.x.x:9190" \
  -e PROMETHEUS_MCP_BIND_HOST="0.0.0.0" \
  -e PROMETHEUS_MCP_BIND_PORT="8081" \
  -e PROMETHEUS_MCP_SERVER_TRANSPORT="http" \
  x.x.x/mcp/prometheus-mcp-server:v1.0

AI Analysis Layer

The LLM receives the metric and log results retrieved through MCP and, guided by prompt engineering, identifies abnormal patterns, causal relationships, and root causes.

The LLM connects to the MCP servers over streamable HTTP, which provides bidirectional communication and robust connection recovery.
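
A sketch of such a client using the MCP Python SDK's streamable HTTP transport. The endpoint path, port, and tool name below are assumptions for illustration; the actual values depend on the deployed server images.

import asyncio

from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client


async def query_prometheus_mcp() -> None:
    # Connect to the Prometheus MCP server started above (port 8081).
    async with streamablehttp_client("http://x.x.x.x:8081/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            tools = await session.list_tools()
            print("exposed tools:", [t.name for t in tools.tools])
            # Hypothetical tool call; the real tool name comes from list_tools().
            result = await session.call_tool("query_prometheus", {"query": "up"})
            print(result.content)


if __name__ == "__main__":
    asyncio.run(query_prometheus_mcp())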

Prompt example: the prompt assigns the role of a root‑cause analysis expert, instructs the model to retrieve data via monitor_mcp and log_mcp, and requires it to return only its findings and the supporting evidence.
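
An expanded sketch of what that prompt could look like; the wording is illustrative rather than the team's verbatim template.

RCA_SYSTEM_PROMPT = """You are a big-data SRE root-cause analysis expert.

Tools available to you:
- monitor_mcp: query Prometheus metrics for the affected component.
- log_mcp: search Elasticsearch logs for the affected time window.

Procedure:
1. Retrieve the relevant metrics and logs around the incident time via the tools.
2. Identify abnormal patterns and likely causal chains.
3. Return ONLY your findings: the suspected root cause, the supporting evidence
   (metric values, log excerpts), and a remediation suggestion.
Do not speculate beyond what the retrieved data supports.
"""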

Challenges and Optimizations

Repeated semantic parsing of metrics caused token bloat and performance bottlenecks. Two key optimizations were applied:

1. Build an indexed metric knowledge base and use RAG retrieval to inject only the most relevant metric definitions, cutting token usage by roughly 50% (see the sketch after this list).

2. Refine prompt templates and use dynamic context injection, reducing average analysis time by 20% and improving output consistency.
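
A simplified sketch of the first optimization, using an exact-name lookup in place of the team's indexed RAG retrieval; the metric names and definitions below are illustrative.

from typing import Dict, Iterable

# Illustrative entries; the real knowledge base holds one definition per
# standardized metric.
METRIC_DEFINITIONS: Dict[str, str] = {
    "namenode_rpc_queue_time": "Time HDFS NameNode RPC calls wait in queue (ms).",
    "flink_checkpoint_duration": "End-to-end duration of Flink checkpoints (ms).",
}


def build_metric_context(observed_metrics: Iterable[str]) -> str:
    # Inject only the definitions of metrics that actually appear in the
    # incident data, instead of the full metric dictionary every time.
    lines = [f"- {m}: {METRIC_DEFINITIONS[m]}"
             for m in observed_metrics if m in METRIC_DEFINITIONS]
    return "Metric definitions relevant to this incident:\n" + "\n".join(lines)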

Future Plans

Future work includes upgrading RAG to GraphRAG for multi‑entity reasoning, expanding MCP server support to more big‑data components, and enabling proactive monitoring with statistical models to achieve a “detect‑analyze‑alert” closed loop.
