How LLMs Accelerate Root‑Cause Diagnosis in Large‑Scale Microservices
This article abstracts a massive microservice system as a dynamic multi-layer graph and traces three stages of root-cause diagnosis: manual expert debugging, rule-based AIOps, and LLM-driven cognitive reasoning. It details practical workflows, context engineering, and real-world cases, and reports a 67 % reduction in MTTR with over 80 % root-cause accuracy on manually labeled samples.
Problem Abstraction
We model a large‑scale microservice system as a time‑varying multi‑layer graph G(t) = (V, E). Nodes V represent physical and logical entities (data‑center, host, container/pod, business service). Edges E capture two relationships:
Deployment: which container runs on which host.
Invocation: RPC/HTTP calls between services.
Each node and edge carries multimodal observation data M such as call counts, success rates, latency distributions, CPU, memory, load, error logs, exception stacks, configuration changes, and deployment records. The topology and state evolve over time due to container migration, scaling, and dependency changes.
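A minimal sketch of this graph model as a data structure may make the abstraction concrete. All class, field, and node names below are hypothetical and chosen for illustration; the source does not describe how the graph is actually stored.

```python
from dataclasses import dataclass, field
from enum import Enum


class Layer(Enum):
    DATA_CENTER = "data_center"
    HOST = "host"
    CONTAINER = "container"
    SERVICE = "service"


class EdgeKind(Enum):
    DEPLOYMENT = "deployment"   # e.g. a container runs on a host
    INVOCATION = "invocation"   # e.g. one service calls another over RPC/HTTP


@dataclass
class Node:
    node_id: str
    layer: Layer
    observations: dict = field(default_factory=dict)  # M: metrics, logs, changes


@dataclass
class Edge:
    src: str
    dst: str
    kind: EdgeKind
    observations: dict = field(default_factory=dict)  # e.g. call count, success rate


@dataclass
class GraphSnapshot:
    """One snapshot G(t) of the topology at timestamp t (hypothetical layout)."""
    t: float
    nodes: dict = field(default_factory=dict)  # node_id -> Node
    edges: list = field(default_factory=list)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def add_edge(self, edge):
        self.edges.append(edge)

    def neighbors(self, node_id, kind):
        """Follow outgoing edges of one kind from a node."""
        return [e.dst for e in self.edges if e.src == node_id and e.kind == kind]


g = GraphSnapshot(t=0.0)
g.add_node(Node("svc-a", Layer.SERVICE))
g.add_node(Node("svc-b", Layer.SERVICE, {"success_rate": 0.91}))
g.add_node(Node("pod-1", Layer.CONTAINER))
g.add_node(Node("host-1", Layer.HOST, {"cpu": 97}))
g.add_edge(Edge("svc-a", "svc-b", EdgeKind.INVOCATION, {"calls": 1000}))
g.add_edge(Edge("pod-1", "host-1", EdgeKind.DEPLOYMENT))
```

Keeping one snapshot per timestamp is only one way to represent the time dimension; it makes container migration and scaling visible as differences between consecutive snapshots.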
Root‑cause localization is defined as: given an anomaly symptom S (e.g., error‑rate spike at time t0), find the earliest faulty entity or relationship R in G(t) whose propagation can plausibly explain all observed anomalies.
Three major uncertainties make this problem hard:
Topology uncertainty: Traces are incomplete, sampled, or missing for legacy services, so we often have only a partial graph.
Propagation uncertainty: Downstream services may alert while upstream services appear normal (and vice‑versa); synchronous and asynchronous dependencies behave differently, each with its own timeout and anomaly thresholds.
Root‑cause ambiguity: The first failing entity or the bottom‑most node in the topology is not necessarily the true root cause; bottlenecks can generate false downstream symptoms.
Thus microservice root‑cause analysis is causal inference on an incomplete, multi‑layer, multimodal graph using limited topology, metrics, logs, and change data.
Method Evolution
Stage 1 – Manual Expert Diagnosis
Receive an alert and identify the initial symptom.
Drill horizontally along call relationships and vertically along deployment relationships, using traces when available or aggregating RPC metrics otherwise.
Correlate logs, change records, and domain experience to form a hypothesis.
Pros: Handles unseen edge cases.
Cons: High MTTR, heavily dependent on individual knowledge, and prone to information overload in cascade failures.
Stage 2 – Rule‑Driven AIOps
Encode expert experience into executable rule chains:
Summarize common fault patterns from historical analysis.
Translate diagnostic logic into rules (call‑chain drilling, resource correlation, time‑correlation, topology‑propagation weighting).
Run the rules automatically via a rule engine.
Limitations: High maintenance cost, exponential rule growth, difficulty covering edge cases, rule conflicts, and lack of genuine reasoning.
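To illustrate what a rule chain of this kind can look like, here is a small sketch. The three rules, their thresholds (600 s, 90 % CPU, 5 % error rate), and the context keys are all assumptions made for the example, not rules taken from the source.

```python
from typing import Callable, Optional

# A rule inspects the alert context and returns a verdict, or None if it does not apply.
Rule = Callable[[dict], Optional[str]]


def rule_recent_change(ctx):
    # Time correlation: a change within 600 s before the alert is suspicious.
    for change in ctx.get("changes", []):
        if 0 <= ctx["alert_time"] - change["time"] <= 600:
            return f"recent change on {change['service']}"
    return None


def rule_resource_saturation(ctx):
    # Resource correlation: any host at or above 90 % CPU.
    hot = [host for host, cpu in ctx.get("host_cpu", {}).items() if cpu >= 90]
    return f"CPU saturation on {', '.join(sorted(hot))}" if hot else None


def rule_downstream_errors(ctx):
    # Call-chain drilling: pick the callee with the highest error rate above 5 %.
    rates = ctx.get("callee_error_rate", {})
    if rates:
        worst = max(rates, key=rates.get)
        if rates[worst] > 0.05:
            return f"downstream failure in {worst}"
    return None


RULE_CHAIN = [rule_recent_change, rule_resource_saturation, rule_downstream_errors]


def diagnose(ctx):
    """Return the verdict of the first matching rule; list order encodes priority."""
    for rule in RULE_CHAIN:
        verdict = rule(ctx)
        if verdict is not None:
            return verdict
    return "no rule matched"
```

Even in this toy version the limitations show up: the fixed ordering means a recent change masks a saturated host, and every new fault pattern needs another hand-written rule.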
Stage 3 – LLM‑Based Cognitive Reasoning
Large language models (LLMs) provide semantic understanding of metrics, logs, and change records, can emulate senior engineers’ reasoning, and support zero‑shot/few‑shot inference for novel issues. In our microservice environment we have adopted this stage.
Real‑World LLM‑Powered Cases
1. Serialization / Deserialization Errors
Symptom: Return code –1 and vector‑type mismatch in logs.
LLM conclusion: JCE deserialization failure due to vector type mismatch.
2. Unknown Framework Errors
Symptom: Return code –99 and a Python stack trace in a Serverless script.
LLM conclusion: Failure originates from script logic, not a generic “unknown error”.
3. Infrastructure & Network Issues
Symptoms: Container overload/OOM, host crash, intermittent packet loss.
LLM analysis: Single host failure causing localized impact rather than a global network problem.
4. MySQL / HTTP Errors
The LLM draws on its general world knowledge to explain MySQL and HTTP error codes, so no handcrafted knowledge base is needed.
Architecture: Real‑Time Workflow + Post‑Mortem Agent
The initial design was a generic agent architecture: CodeAct drives the LLM to plan and invoke tools.
MCP Server provides data‑access capabilities (monitoring, call graphs, logs, aggregation).
The LLM iteratively calls MCP services to build an analysis path.
Real‑time challenge: The root cause must be identified within the alert window (≈30 s). Only ultra‑fast models (e.g., gemini‑2.0‑flash) meet this latency budget.
For a high‑frequency, fixed‑pattern scenario like microservice alert analysis, letting the LLM plan every step wastes resources.
We therefore introduced “Context Engineering” with two modes:
Workflow mode: Pre‑designed context acquisition; the LLM receives a complete snapshot and performs a single inference – suitable for real‑time alerts.
Agent mode: Dynamic context fetching; the LLM can call MCP services freely for deeper, post‑mortem investigations.
Real‑Time Workflow
Collect all core data in one shot:
Alert basics (service, interface, time window).
RPC call metrics (caller/callee dimensions, error‑code distribution, latency).
Downstream resource metrics (CPU, memory, load, container status, rack topology).
Deduplicated critical logs.
Basic health statistics of data‑center/network.
The model outputs a JSON object with a concise root‑cause summary and detailed evidence, enabling a “one‑question‑one‑answer” interaction that keeps latency low.
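The one-shot collection and single inference described above can be sketched as follows. The collector functions, their return values, the JSON keys root_cause and evidence, and call_llm are hypothetical placeholders; call_llm returns a canned answer so the sketch runs offline, and a real deployment would call the model there.

```python
import json
from concurrent.futures import ThreadPoolExecutor


# Hypothetical collectors; each stands in for one query against a monitoring backend.
def fetch_alert(alert_id):
    return {"service": "svc-a", "interface": "GetUser", "window": "12:00-12:05"}


def fetch_rpc_metrics(alert_id):
    return {"callee": "svc-b", "error_codes": {"-1": 120}, "p99_ms": 480}


def fetch_resources(alert_id):
    return {"host-1": {"cpu": 97, "mem": 62, "container_status": "running"}}


def fetch_logs(alert_id):
    return ["deserialize failed: vector type mismatch"]


def fetch_network_health(alert_id):
    return {"dc-1": "healthy"}


COLLECTORS = {
    "alert": fetch_alert,
    "rpc": fetch_rpc_metrics,
    "resources": fetch_resources,
    "logs": fetch_logs,
    "network": fetch_network_health,
}


def build_snapshot(alert_id, timeout_s=5):
    """Run all collectors in parallel and merge the results into one context dict."""
    with ThreadPoolExecutor(max_workers=len(COLLECTORS)) as pool:
        futures = {name: pool.submit(fn, alert_id) for name, fn in COLLECTORS.items()}
        # Wait up to timeout_s for each result; a slow collector raises TimeoutError.
        return {name: future.result(timeout=timeout_s) for name, future in futures.items()}


def call_llm(prompt):
    # Placeholder for the real model call (assumption); returns a canned JSON answer.
    return json.dumps({
        "root_cause": "svc-b returns -1: deserialization failure",
        "evidence": ["error code -1 seen 120 times", "log: vector type mismatch"],
    })


def analyze(alert_id):
    """Single inference over a complete snapshot: one question, one answer."""
    snapshot = build_snapshot(alert_id)
    prompt = ("Answer with a JSON object with keys root_cause and evidence.\n"
              + json.dumps(snapshot, sort_keys=True))
    answer = json.loads(call_llm(prompt))
    if not isinstance(answer, dict) or not {"root_cause", "evidence"} <= answer.keys():
        raise ValueError("model output is missing required keys")
    return answer
```

Collecting in parallel keeps the wall-clock cost close to that of the slowest collector, which matters when the whole analysis has to fit inside the alert window.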
Post‑Mortem Agent
If the workflow cannot determine a root cause, the Agent mode is triggered.
The Agent fetches longer‑term logs, compares multiple deployments, and can retrieve configuration or code from repositories.
It acts as a “second‑line expert” for thorough analysis.
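The hand-off from workflow to Agent mode might look like the sketch below. The tool names, the max_steps budget, and plan_next_tool are assumptions: in a real agent the LLM would choose the next tool, whereas the stub here simply walks the tools in order so the example runs without a model.

```python
def run_workflow(alert):
    # Stand-in for the single-shot workflow; None means "no root cause found".
    return alert.get("workflow_result")


# Hypothetical tools, loosely mirroring the capabilities listed above.
TOOLS = {
    "long_term_logs": lambda alert: f"7-day logs for {alert['service']}",
    "compare_deployments": lambda alert: f"diff of recent deployments of {alert['service']}",
    "fetch_config": lambda alert: f"configuration of {alert['service']}",
}


def plan_next_tool(evidence):
    # Stub planner: a real agent would ask the LLM which tool to call next.
    remaining = [name for name in TOOLS if name not in evidence]
    return remaining[0] if remaining else None


def run_agent(alert, max_steps=5):
    """Gather evidence tool by tool until the planner stops or the budget runs out."""
    evidence = {}
    for _ in range(max_steps):
        tool = plan_next_tool(evidence)
        if tool is None:
            break
        evidence[tool] = TOOLS[tool](alert)
    return {"mode": "agent", "evidence": evidence}


def diagnose(alert):
    """Try the fast workflow first and escalate only when it is inconclusive."""
    result = run_workflow(alert)
    if result is not None:
        return {"mode": "workflow", "root_cause": result}
    return run_agent(alert)
```

The step budget is a design choice of this sketch: post-mortem analysis has no 30 s deadline, but an unbounded tool loop is still worth guarding against.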
Feeding the LLM “Good Grain”
1. Log Deduplication
Massive duplicate logs are removed using classic string edit‑distance combined with regex filtering (e.g., stripping Base64). This is more efficient than embedding‑based similarity for structured logs.
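A minimal sketch of this deduplication, assuming a Levenshtein distance over normalized lines: the Base64 pattern, the number masking, and the 20 % similarity ratio are illustrative choices rather than the values used in production.

```python
import re

# Illustrative pattern: runs of 32+ Base64 characters, optionally padded with "=".
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{32,}={0,2}")
DIGITS = re.compile(r"\d+")


def normalize(line):
    """Mask payloads and numbers so that near-identical lines compare as equal."""
    line = BASE64_RUN.sub("<blob>", line)
    return DIGITS.sub("<n>", line)


def edit_distance(a, b):
    """Classic Levenshtein distance, keeping only one row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def dedupe(lines, max_ratio=0.2):
    """Group lines whose normalized forms differ by at most max_ratio of their length.

    Returns (first_original_line, count) for each group, in order of appearance.
    """
    kept = []  # entries are [normalized, original, count]
    for line in lines:
        norm = normalize(line)
        for entry in kept:
            limit = max_ratio * max(len(norm), len(entry[0]))
            if edit_distance(norm, entry[0]) <= limit:
                entry[2] += 1
                break
        else:
            kept.append([norm, line, 1])
    return [(original, count) for _, original, count in kept]


logs = [
    "timeout calling svc-b after 503 ms",
    "timeout calling svc-b after 498 ms",
    "payload=" + "QUJD" * 10 + "== rejected",
    "deserialize failed: vector type mismatch",
]
groups = dedupe(logs)
```

Passing the occurrence count along with each representative line lets the model see how dominant a message is without reading every copy.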
2. Handling Trace Gaps
When traces are incomplete, we intersect the available trace data with module‑level failure metrics; when they are absent altogether, we perform fuzzy matching based on interface similarity, traffic volume, time‑window alignment, and error‑type consistency.
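The fuzzy matching can be pictured as a weighted score over the four signals named above. The sketch below is one possible scoring scheme; the weights, the 0.6 cut-off, the record fields, and the use of difflib for interface similarity are assumptions, not the production algorithm.

```python
from difflib import SequenceMatcher


def interface_similarity(a, b):
    # String similarity of interface names, in [0, 1].
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def volume_similarity(a, b):
    # Ratio of the smaller to the larger call count; 1.0 means identical traffic.
    return min(a, b) / max(a, b) if max(a, b) > 0 else 0.0


def window_overlap(a, b):
    # Overlap of two (start, end) windows divided by the shorter window length.
    overlap = min(a[1], b[1]) - max(a[0], b[0])
    shortest = min(a[1] - a[0], b[1] - b[0])
    return max(0.0, overlap / shortest) if shortest > 0 else 0.0


def error_consistency(a, b):
    # Jaccard similarity of the observed error-code sets.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


WEIGHTS = {"interface": 0.3, "volume": 0.2, "window": 0.2, "errors": 0.3}  # illustrative


def match_score(caller, callee):
    parts = {
        "interface": interface_similarity(caller["interface"], callee["interface"]),
        "volume": volume_similarity(caller["calls"], callee["calls"]),
        "window": window_overlap(caller["window"], callee["window"]),
        "errors": error_consistency(caller["errors"], callee["errors"]),
    }
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def best_callee(caller, candidates, min_score=0.6):
    """Return the name of the most plausible callee, or None if nothing scores high enough."""
    if not candidates:
        return None
    score, name = max((match_score(caller, c), c["name"]) for c in candidates)
    return name if score >= min_score else None


caller = {"interface": "UserService.GetUser", "calls": 1000, "window": (100, 160), "errors": [-1]}
candidates = [
    {"name": "svc-b", "interface": "UserService.GetUser", "calls": 950,
     "window": (105, 165), "errors": [-1]},
    {"name": "svc-c", "interface": "OrderService.List", "calls": 40,
     "window": (300, 360), "errors": [-99]},
]
```

Returning None below the cut-off keeps a weak guess from being presented to the model as a confirmed call relationship.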
3. Dynamic “Behavioral Timeout” Thresholds
Instead of static config thresholds (e.g., 500 ms), we compute a “behavioral threshold” T from normal‑operation latency distributions (e.g., 99.95 % of requests ≤ T). During an alert, any significant shift beyond T is flagged as abnormal, even if still below the static limit.
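As a sketch, T can be taken as a nearest-rank quantile of baseline latencies, and an alert window flagged when far more requests exceed T than the baseline tail predicts. The factor of 10 and the function names are assumptions; the source does not say how a "significant shift" is quantified.

```python
import math


def behavioral_threshold(latencies_ms, quantile=0.9995):
    """Nearest-rank quantile: the smallest sample T with at least `quantile` of samples <= T."""
    if not latencies_ms:
        raise ValueError("need at least one baseline sample")
    ordered = sorted(latencies_ms)
    # round() absorbs floating-point noise before taking the ceiling.
    rank = math.ceil(round(quantile * len(ordered), 9))  # 1-based rank
    return ordered[max(rank, 1) - 1]


def is_abnormal(alert_latencies_ms, threshold_ms, quantile=0.9995, factor=10.0):
    """Flag the window if the share of requests above T exceeds `factor` times the expected tail."""
    if not alert_latencies_ms:
        return False
    over = sum(1 for x in alert_latencies_ms if x > threshold_ms)
    expected_tail = 1.0 - quantile
    return over / len(alert_latencies_ms) > factor * expected_tail


# Baseline: almost all requests take 20 ms, a small tail takes 80 ms.
baseline = [20] * 9990 + [80] * 10
T = behavioral_threshold(baseline)

# During the alert, 10 % of requests take 120 ms: far below a 500 ms static
# limit, yet well beyond what this service normally does.
alert_window = [20] * 90 + [120] * 10
```

In this example a static 500 ms limit would stay silent, while the behavioral threshold of 80 ms exposes the shift.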
Controlling Hallucinations
1. Floating‑Point Trap
LLMs struggle with precise floating‑point comparisons. We round numbers to integers before feeding them to the model, reducing comparison errors.
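A small helper along these lines can apply the rounding to a nested context before it is serialized into the prompt. The function name and the sample context are hypothetical.

```python
def round_numbers(value):
    """Recursively replace floats with rounded integers in nested dicts and lists."""
    if isinstance(value, bool):
        return value  # bool is a subclass of int in Python; leave it untouched
    if isinstance(value, float):
        return round(value)
    if isinstance(value, dict):
        return {key: round_numbers(item) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        return [round_numbers(item) for item in value]
    return value


context = {
    "cpu": 97.46,
    "p99_ms": 480.2,
    "error_rate_pct": 3.6,  # already scaled to percent before rounding
    "hosts": [{"load": 12.7}],
    "healthy": True,
}
rounded = round_numbers(context)
```

One caveat with this approach: ratios such as 0.034 round to 0, so they need to be scaled (to percent or per-mille, for example) before rounding or the signal is lost.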
2. Confusing Correlation with Causation
We changed the LLM persona from “SRE engineer” (prone to bias toward high load) to “Bayesian‑thinking statistician”, forcing the model to validate causality with evidence (e.g., failure distribution across nodes).
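The kind of evidence check the statistician persona is asked to make can also be expressed in code. The sketch below compares failure rates on high-load hosts against the remaining hosts; the 80 % load threshold, the lift of 3, and the record fields are assumptions for illustration, and the source does not state that such a check is computed outside the model.

```python
def failure_rate(stats):
    calls = sum(s["calls"] for s in stats)
    return sum(s["failures"] for s in stats) / calls if calls else 0.0


def load_explains_failures(hosts, load_threshold=80, min_lift=3.0):
    """Check whether failures concentrate on high-load hosts.

    Returns (supported, lift), where lift is the failure rate on high-load hosts
    divided by the failure rate on all other hosts.
    """
    hot = [h for h in hosts if h["cpu"] >= load_threshold]
    cold = [h for h in hosts if h["cpu"] < load_threshold]
    if not hot or not cold:
        return False, None  # no comparison group, so the evidence is inconclusive
    hot_rate, cold_rate = failure_rate(hot), failure_rate(cold)
    if cold_rate == 0:
        if hot_rate > 0:
            return True, float("inf")
        return False, None
    lift = hot_rate / cold_rate
    return lift >= min_lift, lift


# Failures cluster on the overloaded host: high load is a plausible cause.
clustered = [
    {"name": "host-1", "cpu": 95, "calls": 1000, "failures": 200},
    {"name": "host-2", "cpu": 40, "calls": 1000, "failures": 10},
    {"name": "host-3", "cpu": 35, "calls": 1000, "failures": 10},
]

# Failures are spread evenly: the busy host is merely correlated, not causal.
uniform = [
    {"name": "host-1", "cpu": 95, "calls": 1000, "failures": 200},
    {"name": "host-2", "cpu": 40, "calls": 1000, "failures": 200},
    {"name": "host-3", "cpu": 35, "calls": 1000, "failures": 200},
]
```

In the second dataset a load-biased reading would still blame host-1, while the distribution shows that idle hosts fail just as often.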
Results and Outlook
After iterative deployment, the system achieved:
29 % reduction in fault count.
67 % reduction in MTTR.
>80 % root‑cause accuracy on manually labeled samples.
Significantly fewer analysis steps and fewer switches between systems during diagnosis.
Outputs now include natural‑language reasoning rather than raw metrics, and visualizations lower engineers’ cognitive load.
Future work focuses on:
Speeding up the real‑time workflow (model selection, parallel data collection).
Enhancing the Agent’s intelligence (richer toolset, better RAG integration).
Further refining context‑engineering for both fast and deep analyses.
Huya Tech Engineering
Official Huya Tech account. Technical insights, engineering practice, and frontier innovation all in one place.
