Mastering Online Incident Management: From Detection to Prevention
This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.
Goal of Online Incident Handling
The primary objective is to restore service quickly (Jump), then identify and fix the root cause (Fill), and finally prevent recurrence (Avoid). The three phases are prioritized in that order.
Jump (Recover)
Rapidly bring the service back online or minimise impact. Service availability directly affects user experience and business revenue, so the first response must focus on restoration, even if only a partial mitigation is possible.
Fill (Root‑Cause Fix)
Conduct a thorough investigation to locate the underlying problem and eliminate it. Recovery and root‑cause analysis often run in parallel; temporary measures such as a service restart, degradation, or circuit breaking may be applied while the deeper cause is being pursued.
Avoid (Prevent)
After the incident is resolved, analyse the entire process, identify weak points in architecture, processes or policies, and implement corrective actions to avoid similar failures.
Incident Handling Process
Incident Detection
Incident Diagnosis
Incident Remediation
Incident Retrospective
The first three steps correspond to the "Jump" phase; the retrospective combines "Fill" and "Avoid". Teams from development, operations, testing and product should work concurrently.
Incident Detection
Signals can arrive through several channels, ordered by increasing severity:
Proactive discovery (log inspection, routine health checks)
System‑monitoring alerts (CPU, memory, I/O, TCP connections, disk, threads, GC, connection‑pool, etc.)
Business‑monitoring alerts (login‑failure rate, order backlog, etc.)
Upstream/downstream fault tracing
Production event reports from customers or support staff
After a signal is received, verify its authenticity by cross‑checking business metrics, event counts, reproducibility and server statistics.
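The cross‑check described above can be sketched as a small helper. This is only an illustration: the `Evidence` type, the source names, and the rule that two independent sources must agree are assumptions, not part of the original method.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """One independent observation about a suspected incident (hypothetical type)."""
    source: str     # e.g. "business_metric", "event_count", "reproduced", "server_stats"
    abnormal: bool  # True if this source independently shows a problem


def confirm_signal(evidence: list[Evidence], min_agreeing: int = 2) -> bool:
    """Treat a signal as real once enough distinct sources report a problem.

    The default of two agreeing sources is an assumed policy; tune it per team.
    """
    agreeing_sources = {e.source for e in evidence if e.abnormal}
    return len(agreeing_sources) >= min_agreeing
```

Counting distinct sources rather than raw observations keeps one noisy channel from confirming itself.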
Incident Diagnosis
Follow a loop of collect information → hypothesise → verify → test. Typical suspicion points include:
New release bugs
Latent bugs triggered by traffic spikes
Attack traffic (e.g., credential stuffing)
Upstream service changes
Downstream service failures
Network issues
Server resource exhaustion (CPU, memory, disk)
Application exceptions
Database outages
Key data to gather:
Recent release history
Error and stack‑trace logs
Request volume and throughput trends
Latency changes
TCP connection states (e.g., many CLOSE_WAIT sockets)
Server resource utilisation (CPU, memory, disk I/O)
Database or storage health metrics
Multiple factors often combine; therefore collect as much evidence as possible and test hypotheses in parallel.
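As one example of evidence gathering, TCP connection states can be tallied from a socket listing. The sketch below assumes the input is text whose first whitespace‑separated column is the connection state, as in the output of `ss -tan`; the function name is hypothetical.

```python
from collections import Counter


def count_tcp_states(listing: str) -> Counter:
    """Count connections per TCP state from a socket listing.

    Assumes the state is the first column of each row. Hyphens are
    normalised to underscores so "CLOSE-WAIT" and "CLOSE_WAIT" count together.
    """
    counts: Counter = Counter()
    for line in listing.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":  # skip blank lines and the header
            continue
        counts[fields[0].replace("-", "_")] += 1
    return counts
```

A large and growing CLOSE_WAIT count usually suggests the local application is not closing sockets after the peer has disconnected, which helps narrow the hypothesis list.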
Incident Remediation
When the cause is identified, apply one or more of the following actions:
Service degradation or isolation of the faulty component
Emergency scaling of resources (horizontal or vertical)
Rollback to a previously stable version
If the root cause is still unknown, use the same measures to reduce impact while investigation continues.
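Degradation and isolation are often implemented with a circuit breaker. The following is a minimal sketch under stated assumptions, not a production implementation: the class, its parameters (`max_failures`, `reset_after`) and the injectable `clock` are all illustrative names, and it is not thread‑safe.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency and serve a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable so the breaker can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()         # open: degrade without touching func
            self.opened_at = None         # half-open: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        self.failures = 0                 # success closes the circuit fully
        return result
```

Because the failure count is not reset when the circuit goes half‑open, a single failed trial call re‑opens it immediately, while a successful one restores normal operation.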
Incident Retrospective (Post‑mortem)
Review the entire detection‑diagnosis‑remediation flow, document the gaps found in process, architecture or policy, and define concrete corrective actions. The output should be a formal incident report that supports knowledge sharing and continuous improvement.
Supporting Infrastructure
Monitoring & Alerting
A robust monitoring system should provide real‑time alerts and historical trends for:
Server metrics (CPU, memory, disk, I/O, network)
Service health (availability, request latency, throughput)
Business metrics (traffic volume, error rates)
Alert thresholds must be dynamic and tuned to business context.
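One simple way to make a threshold dynamic is to derive it from recent history rather than hard‑coding it. The sketch below uses a mean‑plus‑k‑standard‑deviations rule; that rule, the default `k = 3.0`, and the function names are assumptions chosen for illustration, and real systems often also account for seasonality.

```python
from statistics import mean, stdev


def dynamic_threshold(history: list[float], k: float = 3.0) -> float:
    """Upper alert threshold computed from recent samples of a metric."""
    if len(history) < 2:
        raise ValueError("need at least two samples to estimate spread")
    return mean(history) + k * stdev(history)


def should_alert(history: list[float], current: float, k: float = 3.0) -> bool:
    """Alert when the current value exceeds the history-based threshold."""
    return current > dynamic_threshold(history, k)
```

Tuning to business context then means choosing the history window and `k` per metric, for example a tighter `k` for login‑failure rate than for raw traffic volume.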
Log Trace System
Effective log tracing requires:
Unique request identifiers that propagate across services
Automated collection of distributed logs
Ingestion latency preferably under 10 minutes
Rich query, filtering and aggregation capabilities
Time‑series visualisation
Open‑source stacks such as Logstash + Elasticsearch + Kibana are commonly used.
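The first requirement, a request identifier that propagates across services, can be sketched with the Python standard library. The `X-Request-ID` header name and the helper names here are assumptions for illustration; use whatever convention your services already share.

```python
import contextvars
import logging
import uuid

# Holds the identifier of the request currently being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True  # never drop the record, only annotate it


def start_request(incoming_id=None) -> str:
    """Reuse the caller's id if one was sent, otherwise mint a new one."""
    rid = incoming_id or uuid.uuid4().hex
    request_id_var.set(rid)
    return rid


def outgoing_headers() -> dict:
    """Headers to add to downstream calls so the id keeps propagating."""
    return {"X-Request-ID": request_id_var.get()}
```

With the filter added to a handler and `%(request_id)s` included in the log format, every line written while serving a request carries the same identifier, so the log system can group entries from different services by that field.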
Incident Handling Mechanism
Fast resolution depends on streamlined event routing and collaboration:
A dedicated “second‑line” team receives production tickets, routes them to the responsible owners, and convenes a joint troubleshooting session.
A technical lead (the “master”) coordinates the effort, makes decisions, and allocates resources.
Summary
The core of online incident handling is rapid service restoration, followed by systematic root‑cause analysis and preventive measures. Achieving this requires parallel information gathering, decisive leadership, and supporting infrastructure such as monitoring, logging, and efficient escalation processes.
Case study (large stack‑trace logs degrading service availability): http://www.cnblogs.com/daoqidelv/p/6786649.html
Reference (Google SRE concepts): http://www.jianshu.com/p/60cb877d9409
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
