How Microsoft Leverages LLMs to Auto‑Generate Cloud Incident Root Causes and Fixes
Microsoft researchers fine‑tuned GPT‑3.x models with LoRA on over 40,000 cloud incident records and evaluated them with six NLP metrics and interviews with incident owners. The fine‑tuned LLMs scored on par with BERT‑based baselines on the automatic metrics, were rated more useful by incident owners, and performed best on machine‑detected failures.
Background: Major cloud providers such as Amazon, Google and Microsoft suffer costly outages; an hour of downtime is estimated to cost Amazon about $100 million. Rapid, accurate incident resolution is therefore critical. Microsoft researchers applied large language models (LLMs) to automatically recommend root causes and mitigation steps for cloud incidents.
Method: Each historical incident record has four fields: title, abstract, root cause, and solution. The title and abstract are given to the model as input, and the model generates the root cause or the solution. Two encoder‑decoder baselines (RoBERTa, CodeBERT) and four GPT‑3.x models were evaluated: Curie (6.7 B parameters, natural‑language data), Codex (12 B, natural language + code), Davinci (175 B, natural language), and Code‑davinci‑002 (GPT‑3.5, 175 B, natural language + code). All GPT‑3.x models were fine‑tuned with LoRA for 2,000 steps (4 epochs). Training hardware: Curie and Codex on a single V100 GPU; Davinci on four V100 GPUs; Code‑davinci‑002 on four A100 GPUs.
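The input/output setup described above can be sketched as a small helper that turns one incident record into a prompt/completion pair. This is a minimal illustration, not the researchers' actual preprocessing: the field names, the prompt layout, and the sample incident are all assumptions made up for the example.

```python
import json


def build_example(incident, target_field):
    """Turn one incident record into a prompt/completion pair.

    `incident` is a dict with hypothetical keys: title, abstract,
    root_cause, solution. `target_field` selects what the model
    should generate ("root_cause" or "solution").
    """
    # Label for the field to be generated, e.g. "Root Cause".
    label = target_field.replace("_", " ").title()
    prompt = (
        f"Title: {incident['title'].strip()}\n"
        f"Abstract: {incident['abstract'].strip()}\n"
        f"{label}:"
    )
    # Leading space separates the completion from the trailing colon.
    completion = " " + incident[target_field].strip()
    return {"prompt": prompt, "completion": completion}


# Invented incident, for illustration only; not taken from the dataset.
incident = {
    "title": "Storage API latency spike in one region",
    "abstract": "Monitors reported elevated p99 latency on blob reads.",
    "root_cause": "A configuration change reduced the cache size.",
    "solution": "Rolled back the configuration change.",
}

example = build_example(incident, "root_cause")
print(json.dumps(example, indent=2))
```

Calling `build_example(incident, "solution")` would produce the pair for the solution‑generation task from the same record.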
Dataset and Evaluation: Incidents from Microsoft cloud services between 2018‑01‑01 and 2022‑07‑15 were collected. For root‑cause prediction: 35,820 training, 3,000 testing, 2,000 validation examples; for solution generation: 5,455 training, 2,000 testing, 500 validation examples. Six NLP metrics (BLEU‑4, ROUGE‑L, METEOR, BERTScore, BLEURT, NUBIA) measure the similarity between generated and reference texts. Additionally, the owners of 50 recent incidents manually reviewed the generated outputs and scored them for correctness and readability.
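To make the similarity metrics concrete, here is a minimal sketch of one of them, ROUGE‑L, which scores a generated text by its longest common subsequence (LCS) with the reference. It is a simplified version under stated assumptions: whitespace tokenization, lowercasing, no stemming, and a plain F1 combination of precision and recall. The study's actual metric implementation may differ.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 on lowercased whitespace tokens (simplified)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)  # share of generated tokens in the LCS
    recall = lcs / len(ref)      # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)


score = rouge_l_f1(
    "the fix was deployed to all regions",
    "fix deployed to all regions",
)
print(round(score, 3))
```

Because the metric rewards shared word sequences, a model that reproduces a common template sentence can score well without saying anything specific about the incident.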
Research Questions and Findings:
1) Can fine‑tuned GPT‑3.x models effectively generate root causes? On the automatic metrics, the GPT‑3.x models performed on par with the BERT‑based models. The baselines score well because many incident records contain generic sentences (e.g., “There is a bug in the code”) that they can reproduce verbatim, which inflates their similarity scores.
2) Can they generate effective solutions? The results mirror the root‑cause case: the BERT‑based models reproduce common solution templates (“the issue is self‑mitigated”, “fix deployed to all regions”) and so score similarly to GPT‑3.x.
3) How does fine‑tuning compare to zero‑shot? Zero‑shot GPT‑3.x achieved BLEU‑4 scores of 0.80–2.18, whereas fine‑tuned models reached 5.47–6.76, a substantial improvement.
4) Does multi‑task learning (joint root‑cause + solution training) improve performance? No significant gain was observed because the two tasks lack strong inter‑dependency.
5) If the root cause is known, can GPT‑3.x suggest better solutions? Yes. Supplying an accurate root cause as input leads to higher‑quality solution recommendations.
6) Are models better at machine‑detected or human‑detected incidents? Machine‑detected incidents (592 examples) yield higher scores than human‑detected ones (1,188 examples) because the former follow more regular patterns that models can capture.
7) How do incident owners evaluate the generated outputs? In interviews covering 50 incidents, engineers rated GPT‑3.x outputs as useful for both root causes and solutions, while RoBERTa and CodeBERT received lower scores.
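The machine‑detected versus human‑detected comparison in question 6 amounts to averaging a per‑incident metric score within each group. A minimal sketch, assuming each test incident carries a detection‑source label; the labels and the scores below are made up for illustration and are not results from the study:

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_source(records):
    """Average a per-incident metric score within each detection source.

    Each record is a (source, score) pair, where source is a label such
    as "machine" or "human" (hypothetical labels).
    """
    buckets = defaultdict(list)
    for source, score in records:
        buckets[source].append(score)
    return {source: mean(scores) for source, scores in buckets.items()}


# Made-up scores, for illustration only; not results from the study.
records = [
    ("machine", 0.5),
    ("machine", 0.25),
    ("human", 0.25),
    ("human", 0.125),
]
by_source = mean_score_by_source(records)
print(by_source)
```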
The study demonstrates that fine‑tuned LLMs can serve as practical assistants for cloud incident management, especially for routine, pattern‑driven failures, while highlighting the limited benefit of multi‑task learning when root causes and solutions are loosely coupled.
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
