How Large Language Models Can Transform Ops Fault Handling: A Practical Guide
This article outlines a typical operations incident workflow, identifies four key stages where large language models can assist, discusses implementation challenges, introduces the Ops framework and Copilot design, and shares practical examples and a real‑world case to help engineers adopt AI‑driven fault management.
This content originates from an internal sharing session and recent work summary.
1. Common Fault Handling Process
The diagram above shows a typical ops incident handling flow.
Key timestamps along the timeline are:
Fault occurrence
Fault detection
Fault response
Fault localization
Fault recovery
From occurrence to detection depends on metric collection and alert intervals (e.g., 15 s collection, 1 min detection). Detection to response varies by time of day; during off‑hours response may take hours, while in working hours it can be minutes.
Response to localization requires identifying the root cause; this depends heavily on the engineer’s experience. Newcomers may need hours, while seasoned ops can pinpoint issues in minutes.
Localization to recovery involves fixing the issue and restoring the service level objective (SLO). Some problems (e.g., application bugs) require developer involvement after ops have identified the cause.
These five stages form the core incident workflow; subsequent steps such as SLO observation, post‑mortem, optimization, and chaos testing are beyond the basic handling process.
2. Stages Where Large Models Can Contribute
Based on the timeline above, large models can intervene at four points: discovery, response, localization, and handling.
2.1 Discovery
When a fault is discovered, humans have not yet responded. An AI agent that automatically reacts to alerts could achieve the fastest response and significantly reduce the mean time to recovery (MTTR).
However, early intervention is difficult because it requires an AI agent that can automatically ingest alerts, collect metrics, call platform APIs, and even log into machines to attempt remediation.
Implementing such an agent is non‑trivial; underestimating the complexity of real‑world operations is a common mistake.
2.2 Response
During response, a large model that performs preliminary analysis can narrow the fault scope, accelerating subsequent localization.
Pre‑analysis relies on a well‑maintained knowledge base of past incidents; sufficient root‑cause data is essential for the model to be effective.
2.3 Localization
Observability data now includes events, metrics, logs, and traces, increasing the number of data sources to query.
A large model can shorten the time needed to query these sources and, based on keywords, retrieve relevant documentation and suggest remediation steps.
2.4 Handling
The model can also execute remediation actions such as restarting a Deployment, restarting Kubelet, adjusting routing, or moving a node—typically a single command or API call.
Automating these actions through the model saves considerable time.
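One common way to keep model-triggered remediation safe is to restrict it to a whitelist of known actions, each mapping to exactly one command. A minimal sketch, assuming a simple template map (the action names and commands here are illustrative, not part of the Ops project):

```python
# Map each remediation action the model may request to exactly one command.
# Anything outside the whitelist is rejected rather than executed.
REMEDIATIONS = {
    "restart-deployment": "kubectl -n {namespace} rollout restart deployment/{name}",
    "restart-kubelet": "systemctl restart kubelet",
    "drain-node": "kubectl drain {name} --ignore-daemonsets --delete-emptydir-data",
}

def build_command(action: str, **params: str) -> str:
    """Return the shell command for a whitelisted action, or raise."""
    template = REMEDIATIONS.get(action)
    if template is None:
        raise ValueError(f"action not whitelisted: {action}")
    return template.format(**params)
```

For example, `build_command("restart-deployment", namespace="ops-system", name="web")` yields a single `kubectl rollout restart` command; an unknown action raises instead of running anything.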
2.5 Summary
While large models can participate in many stages, earlier intervention is harder; later stages are easier to implement. A practical approach is to start with later stages, accumulate documentation and cases, and gradually move the AI agent’s involvement forward.
Early-stage faults involve a broad scope that large models struggle to capture, while human expertise still excels in flexibility and on-the-fly learning; hence the strategy of letting the model own the later stages first and shifting its involvement earlier as the knowledge base grows.
3. Challenges When Using Large Models for Fault Handling
3.1 Converting Text to Ops Actions
Large models typically output text, images, or video. Translating this output into concrete commands or operational actions is the first hurdle.
3.2 Unstable Information Extraction
Determinism is crucial for automation, yet large models are inherently nondeterministic. Common failure modes include misunderstood intent, incorrect output format, and missing parameters. Typical mitigations:
Prompt engineering
Retry mechanisms
Model fine‑tuning
Beyond these, application‑level design can also mitigate instability.
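A common application-level pattern combines strict output validation with retries: parse the model's reply, check it against the expected shape, and re-ask on failure. A minimal sketch, where `ask` stands in for a real model call:

```python
import json

def extract_with_retry(ask, required_keys, max_attempts=3):
    """Call `ask()` (a model invocation returning JSON text) until the
    reply parses and contains every required key, or attempts run out."""
    last_error = None
    for _ in range(max_attempts):
        try:
            reply = json.loads(ask())
        except json.JSONDecodeError as e:
            last_error = e  # malformed output: retry
            continue
        missing = [k for k in required_keys if k not in reply]
        if not missing:
            return reply
        last_error = KeyError(f"missing parameters: {missing}")
    raise RuntimeError(f"extraction failed after {max_attempts} attempts") from last_error
```

Bounding the attempts keeps a misbehaving model from looping forever, and the validation step converts nondeterministic text into a deterministic pass/fail signal.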
3.3 Rapid Scenario Integration
Fast validation and rapid iteration should be second nature to engineers. By abstracting atomic operations and composing them into pipelines, a small set of building blocks can cover countless scenarios, for example:
Alert handling
Daily ops assistance
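The idea of composing atomic operations into pipelines can be sketched as a list of named steps executed in order over a shared context; the step names and logic below are illustrative toys, not the project's actual tasks:

```python
def run_pipeline(steps, context):
    """Run atomic operations in order; each step reads and updates a
    shared context dict, so later steps can use earlier results."""
    for name, step in steps:
        context = step(context)
        print(f"step done: {name}")
    return context

# Two toy atomic operations composed into an alert-handling pipeline.
def fetch_alert(ctx):
    ctx["node"] = ctx["alert"].split(":")[-1]  # parse the node out of the alert
    return ctx

def clear_disk(ctx):
    ctx["result"] = f"cleared disk on {ctx['node']}"
    return ctx

pipeline = [("fetch-alert", fetch_alert), ("clear-disk", clear_disk)]
```

Because each step is atomic, the same building blocks can be recombined into new pipelines without writing new code for every scenario.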
4. Key Technology – Ops Overview
Each domain may need an "Ops" project that exposes its operational capabilities for large models to drive. Its main components:
OpsObject – stores operation objects via CRD, manages clusters and hosts.
Core – implements file distribution and script execution.
Task – packages and composes operations, providing lightweight orchestration.
Tools – offers three external entry points.
4.1 Example – Viewing Objects
The UI shows cluster node count, certificate expiry, node configuration, and GPU status.
4.2 Example – Opscli
shell – execute scripts on hosts.
file – transfer files between hosts, S3, or image registries.
task – orchestrate multiple shell/file operations.
With only a kubeconfig, Opscli can execute at node level; it also supports ServiceAccount authentication, as kubectl does.
4.3 Example – Web UI
Server – provides API endpoints.
Web – simple management UI.
4.4 Example – Task
Task defines a reusable template.
<code>apiVersion: crd.chenshaowen.com/v1
kind: Task
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  desc: cron to create clear disk
  selector:
    managed-by: ops
  typeRef: host
  steps:
    - name: clear > 100M log
      content: find /var/log -type f -name "*.log" -size +100M -exec rm -f {} \; 2>/dev/null || true
    - name: clear jfs cache
      content: |
        find /data/jfs/cache2/mem -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
        find /var/lib/jfs/cache -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
        find /var/lib/jfs/cache2 -maxdepth 1 -type d -atime +15 -exec rm -rf {} + 2>/dev/null || true
</code>
Running the task via a TaskRun:
<code>apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  ref: cron-clear-disk
</code>
5. Copilot Design
Copilot is the current production form for using large models to handle ops incidents. It interacts via dialogue, first tackling later stages of incident handling and gradually moving earlier.
5.1 Key Steps
Ops project provides operational capabilities to Copilot.
Pipeline system offers scenario integration.
Step 1: the model selects an appropriate pipeline. Step 2: the model extracts parameters from the incident context. This resembles a function_call, but it invokes a pipeline instead of a function.
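The two steps above can be sketched as: build an option list from the pipeline catalog, ask the model to pick exactly one, and reject anything outside the catalog. The catalog entries and the `ask` callable are stand-ins for the real pipeline CRs and model call:

```python
def select_pipeline(ask, pipelines, incident):
    """Step 1 of the function_call-like flow: the model picks one
    pipeline name from the catalog. `ask(prompt)` stands in for a
    real model invocation returning plain text."""
    options = "\n".join(f"- {name}({desc})" for name, desc in pipelines.items())
    prompt = (
        "Please select the most appropriate option to classify the "
        "intention of the user.\n"
        f"Must be one of the following options:\n{options}\n\n"
        f"Incident: {incident}"
    )
    choice = ask(prompt).strip()
    if choice not in pipelines:
        # Nondeterministic output outside the catalog is refused, not run.
        raise ValueError(f"model picked unknown pipeline: {choice}")
    return choice
```

Constraining the reply to a closed option list is what makes the selection step reliable enough to drive automation.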
5.2 Pipeline Design
The pipeline aims to be easy for the model to recognize, extensible to cover more scenarios, and composable so the model can assemble new pipelines.
We have defined 95 tasks and 20 pipelines, all as CR objects describable in YAML.
Example input to the model:
<code>Please select the most appropriate option to classify the intention of the user.
Don't ask any more questions, just select the option.
Must be one of the following options:
- xxx-es-log-analysis(...)
- xxx-grafana-alert-node-disk-pressure(...)
- cluster-clear-disk(...)
- ...
</code>
5.3 Variable Design
Variables include default values, descriptions, regex, required flag, enums, examples, and fixed values. Priority order: task fixed > pipeline fixed > runtime extracted.
Well‑designed variables improve parameter accuracy, increase task success rate, and protect sensitive information.
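The priority order (task fixed > pipeline fixed > runtime extracted) and the per-variable checks can be sketched as a merge-then-validate pass; the field names below are illustrative assumptions, not the project's actual schema:

```python
import re

def resolve_variables(spec, task_fixed, pipeline_fixed, runtime):
    """Merge variable sources by priority (task fixed > pipeline fixed >
    runtime extracted), then validate each value against its spec."""
    resolved = {}
    for name, rules in spec.items():
        value = None
        for source in (task_fixed, pipeline_fixed, runtime):
            if name in source:
                value = source[name]  # first source wins: highest priority
                break
        if value is None:
            value = rules.get("default")
        if value is None:
            if rules.get("required"):
                raise ValueError(f"missing required variable: {name}")
            continue
        if "regex" in rules and not re.fullmatch(rules["regex"], value):
            raise ValueError(f"variable {name} failed regex check: {value!r}")
        if "enum" in rules and value not in rules["enum"]:
            raise ValueError(f"variable {name} not in enum: {value!r}")
        resolved[name] = value
    return resolved
```

Fixed values shadow whatever the model extracts at runtime, which is also how sensitive parameters stay out of the model's hands.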
A sample definition sent to the model (a scheduled TaskRun whose crontab and ref are fixed values):
<code>apiVersion: crd.chenshaowen.com/v1
kind: TaskRun
metadata:
  name: cron-clear-disk
  namespace: ops-system
spec:
  crontab: 0 0 * * *
  ref: cron-clear-disk
</code>
6. Proactive Fault Discovery – Turning the Flywheel
Passive waiting for incidents leads to slow data accumulation; proactive inspection can surface potential issues before they cause outages.
Inspection covers device, driver, and system layers, and newly added nodes automatically join the inspection set.
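Following the Task schema shown earlier, a periodic inspection can be expressed as its own Task; the step contents below are illustrative assumptions, not the project's actual inspection checks:

```yaml
apiVersion: crd.chenshaowen.com/v1
kind: Task
metadata:
  name: inspect-node
  namespace: ops-system
spec:
  desc: periodic node inspection
  selector:
    managed-by: ops
  typeRef: host
  steps:
    - name: check disk usage
      content: df -h /
    - name: check gpu temperature
      content: nvidia-smi --query-gpu=temperature.gpu --format=csv 2>/dev/null || true
```

Paired with a crontab TaskRun like the one above, every node matching the selector is inspected on schedule, feeding the knowledge base before faults escalate.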
7. Typical Case
AI accelerator cards often overheat, causing frequent failures that traditionally require on‑site repair.
Now, simply @‑mention Copilot in IM to trigger remediation.
Resolution time dropped from tens of minutes to a few minutes, and security improved by avoiding manual AK/SK exposure.
8. Summary
Incident handling timeline and the stages where large models can participate.
Large models can engage in discovery, response, localization, and remediation.
Start with the stage closest to resolution and gradually move the AI agent earlier.
Explore the Ops project for practical implementation.
function_call is one approach; pipelines or workflows are equally viable.
Precise variable definitions are critical when building LLM‑driven applications.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.