Mastering Incident Management: Core Principles and Practical Methods
This guide outlines the essential principles of incident management: prioritizing business restoration and escalating in a timely manner. It then details practical methods such as restart, isolation, and degradation, and explains role responsibilities, handling by impact scope, and post-incident summarization for continuous improvement.
1. Fault Handling Principles
The principles of fault handling are twofold: prioritize business recovery and timely escalation.
Business recovery first
Timely escalation
1.1 Business Recovery First
Regardless of the situation or fault level, the immediate goal is to restore service, not to locate the root cause. For example, when Application A fails to call Application B, two approaches are possible:
Method 1: Diagnose the issue, identify the problematic link (e.g., HA connection failure), and restart or scale the component.
Method 2: Ping B from A's server; if the network and port are reachable, bind B's IP directly in A's hosts file. The second method is usually faster, especially across data centers, and can restore service immediately.
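The reachability check in Method 2 can be scripted so the responder gets an answer in seconds. This is a minimal sketch; the host and port are whatever Application B actually exposes, and the hosts-binding step itself is done separately:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Quick triage check: can A open a TCP connection to B's port?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# If B's port answers directly, bypassing the broken intermediate link
# (e.g. binding B's IP in A's hosts file) can restore service before
# the root cause is known.
```

If the check fails, the problem is in the network path or in B itself, and the diagnosis of Method 1 becomes unavoidable.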
1.2 Timely Escalation
When a fault occurs, its impact must be quickly reported to leadership to coordinate resources. Escalation is required if any of the following conditions are met:
Clear business impact (e.g., fluctuations in PV, UV, cart, order, or payment metrics).
Critical business alerts (e.g., core services, essential components).
Processing time exceeds defined thresholds.
Senior leaders, monitoring centers, or customer service have noticed the fault.
The issue is beyond the handler's capability.
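The escalation conditions above are an "any one triggers" rule, which can be encoded directly. A minimal sketch; the field names and the 30-minute threshold are illustrative, not taken from any real incident tooling:

```python
from dataclasses import dataclass

# Assumed threshold; each organization defines its own.
ESCALATION_THRESHOLD_MINUTES = 30

@dataclass
class Incident:
    business_metrics_affected: bool   # PV/UV/cart/order/payment fluctuating
    core_service_alert: bool          # critical service or component alerting
    minutes_elapsed: int              # time spent handling so far
    noticed_externally: bool          # leadership / monitoring / CS aware
    beyond_handler_skill: bool

def must_escalate(i: Incident) -> bool:
    """Escalate if ANY of the listed conditions holds."""
    return any([
        i.business_metrics_affected,
        i.core_service_alert,
        i.minutes_elapsed >= ESCALATION_THRESHOLD_MINUTES,
        i.noticed_externally,
        i.beyond_handler_skill,
    ])
```

The point of the `any` is that escalation never waits for multiple conditions to line up; one is enough.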
Note: Operations leaders must be the first to know about any incident; otherwise the handler is considered negligent.
2. Fault Handling Methodology
Incident handling is divided into three stages: before, during, and after the fault. This article focuses on methods used during the incident.
2.1 Operational Methods Based on Fault Service
The three most important actions for restoring service are restart, isolation, and degradation.
Restart: Includes service restart and OS restart. The typical order is the faulty object, then its upstream, then downstream components.
Example: If RabbitMQ fails to send messages, restart RabbitMQ first; if ineffective, restart the upstream producer; if still ineffective, restart the downstream consumer.
Do not delay restart while searching for root cause; the priority is service restoration.
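The restart order (faulty object, then upstream, then downstream) can be sketched as a simple loop. `restart` and `is_healthy` are placeholders for real tooling (a systemd, Kubernetes, or orchestration call), not a real API:

```python
def restart_in_order(faulty: str, upstream: str, downstream: str,
                     restart, is_healthy) -> str:
    """Restart components in the recommended order, stopping as soon
    as the faulty service recovers. Returns the component whose restart
    restored service, or "" if none did (time to escalate)."""
    for component in (faulty, upstream, downstream):
        restart(component)
        if is_healthy(faulty):
            return component
    return ""
```

In the RabbitMQ example, the call would be `restart_in_order("rabbitmq", "producer", "consumer", ...)`: the broker first, then the message producer, then the consumer.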
Isolation: Remove the faulty component from the cluster to stop it from providing service. Common methods are:
Set upstream weight to zero or stop the service if health checks exist.
Bypass the faulty component via hosts binding or routing changes.
Be careful to avoid an avalanche effect: the traffic shifted off the isolated component must not overload the remaining nodes.
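Weight-based isolation with an avalanche guard can be sketched as below. The node names, capacity figure, and traffic numbers are illustrative; a real setup would read them from the load balancer and monitoring:

```python
def isolate(weights: dict, faulty: str, per_node_capacity: float,
            total_traffic: float) -> dict:
    """Set the faulty node's weight to zero, but only if the surviving
    nodes can absorb its share of the traffic."""
    survivors = [n for n in weights if n != faulty and weights[n] > 0]
    if not survivors:
        raise RuntimeError("cannot isolate the only serving node")
    # Avalanche guard: would the per-node load after isolation exceed capacity?
    if total_traffic / len(survivors) > per_node_capacity:
        raise RuntimeError("isolation would overload survivors (avalanche risk)")
    return {n: (0 if n == faulty else w) for n, w in weights.items()}
```

The guard is the essential part: isolating without checking remaining capacity simply moves the outage from one node to the whole cluster.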
Degradation: Implement a fallback plan to prevent larger failures. Degradation is not the optimal user experience but ensures continuity (e.g., alternative payment channels).
Degradation requires coordination with product development; consider failure scenarios early in project planning.
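At its core, a degradation plan is a prepared fallback path. A minimal sketch of the payment-channel example; the channel functions are hypothetical stand-ins for real integrations:

```python
def pay(amount: float, primary, fallback) -> str:
    """Try the primary payment channel; on failure, degrade to the
    alternative channel so the order can still complete."""
    try:
        return primary(amount)
    except Exception:
        # Degraded path: not the best user experience, but the
        # business continues instead of failing outright.
        return fallback(amount)
```

The value of deciding this in project planning, as the text notes, is that the fallback is already built and tested before the incident, so "degrade" is a switch flip rather than emergency development.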
2.2 Handling Based on Impact Scope
Impacted parties are divided into external and internal users.
2.2.1 External Users
Try to reproduce the issue locally. If it reproduces, treat it as an internal problem. If not, ask other internal users to try, to rule out environment issues, and have the external user bind hosts to bypass possible DNS problems. If these steps still fail, collect the necessary external information (e.g., exit IP, client version) for further analysis.
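This triage flow is a short decision chain, sketched below. The boolean inputs are illustrative flags a responder would fill in manually, not the output of any real diagnostic tool:

```python
def triage_external(reproducible_locally: bool,
                    other_internal_users_affected: bool,
                    works_with_hosts_binding: bool) -> str:
    """Walk the external-user triage steps in order and return the
    next action for the responder."""
    if reproducible_locally or other_internal_users_affected:
        return "handle as internal problem"
    if works_with_hosts_binding:
        return "DNS issue: keep the binding, fix DNS"
    return "collect external info (exit IP, client version) and analyze"
```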
2.2.2 Internal Users
Issues raised by internal applications or staff follow the same procedures described in 2.1.
2.3 Organizational Structure During Incident Handling
Three roles typically act simultaneously:
Incident Responder – focuses on rapid business restoration.
Incident Investigator – steps in when responder methods fail or root‑cause analysis is needed.
Communicator – transmits effective information within the team and updates external stakeholders.
In practice, especially during off‑hours, only the responder may be active; investigators handle post‑incident analysis the next day.
3. Incident Summary
Each incident requires a thorough summary to address root causes, prevent recurrence, and apply PDCA for continuous improvement.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career.