
Mastering Online Incident Management: From Detection to Prevention

This article outlines a comprehensive methodology for handling large‑scale online service incidents, covering goals, the "jump‑fill‑avoid" framework, step‑by‑step processes for detection, diagnosis, remediation, and post‑mortem analysis, as well as essential monitoring, logging, and escalation infrastructure.

ITPUB

Goal of Online Incident Handling

The primary objective is to restore service quickly (Jump), then identify and fix the root cause (Fill), and finally prevent recurrence (Avoid). The three phases are prioritized in that order.

Jump (Recover)

Rapidly bring the service back online or minimise impact. Service availability directly affects user experience and business revenue, so the first response must focus on restoration, even if only a partial mitigation is possible.

Fill (Root‑Cause Fix)

Conduct a thorough investigation to locate the underlying problem and eliminate it. Recovery and root‑cause analysis often run in parallel: temporary measures such as a service restart, degradation, or circuit breaking may be applied while the deeper cause is being pursued.

Avoid (Prevent)

After the incident is resolved, analyse the entire process, identify weak points in architecture, processes or policies, and implement corrective actions to avoid similar failures.

Incident Handling Process

1. Incident Detection

2. Incident Diagnosis

3. Incident Remediation

4. Incident Retrospective

The first three steps correspond to the "Jump" phase; the retrospective combines "Fill" and "Avoid". Teams from development, operations, testing and product should work concurrently.

Incident Detection

Signals can arrive through several channels, ordered by increasing severity:

Proactive discovery (log inspection, routine health checks)

System‑monitoring alerts (CPU, memory, I/O, TCP connections, disk, threads, GC, connection‑pool, etc.)

Business‑monitoring alerts (login‑failure rate, order backlog, etc.)

Upstream/downstream fault tracing

Production event reports from customers or support staff

After a signal is received, verify its authenticity by cross‑checking business metrics, event counts, reproducibility and server statistics.
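The cross‑check above can be sketched as a small rule that treats a signal as real only when several independent indicators agree. This is a minimal illustration, not a prescribed policy: the `Evidence` fields, the `is_real_incident` name, and the default thresholds (`min_events`, `min_agreeing`) are all assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Hypothetical container for the four cross-checks named in the text."""
    business_metric_abnormal: bool  # e.g. order success rate has dropped
    event_count: int                # number of matching reports or alerts
    reproducible: bool              # the problem can be reproduced on demand
    server_stats_abnormal: bool     # CPU, memory, connections, etc. look wrong


def is_real_incident(e: Evidence, min_events: int = 3, min_agreeing: int = 2) -> bool:
    """Return True when at least `min_agreeing` indicators point to a real incident.

    Both thresholds are illustrative assumptions and should be tuned per service.
    """
    checks = [
        e.business_metric_abnormal,
        e.event_count >= min_events,
        e.reproducible,
        e.server_stats_abnormal,
    ]
    return sum(checks) >= min_agreeing
```

Requiring agreement between indicators trades a little detection speed for fewer false alarms; a stricter or looser `min_agreeing` shifts that balance.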

Incident Diagnosis

Follow a loop of collect information → hypothesise → verify → test. Typical suspects include:

New release bugs

Latent bugs triggered by traffic spikes

Attack traffic (e.g., credential stuffing)

Upstream service changes

Downstream service failures

Network issues

Server resource exhaustion (CPU, memory, disk)

Application exceptions

Database outages

Key data to gather:

Recent release history

Error and stack‑trace logs

Request volume and throughput trends

Latency changes

TCP connection states (e.g., many CLOSE_WAIT sockets)

Server resource utilisation (CPU, memory, disk I/O)

Database or storage health metrics
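As one example of gathering this data, the sketch below counts TCP connection states from text in the format printed by `netstat -tan` on Linux, which makes a pile‑up of CLOSE_WAIT sockets easy to spot. The function name `count_tcp_states` and the `SAMPLE` text are illustrative assumptions; the sample is made‑up input, not output from a real server.

```python
from collections import Counter


def count_tcp_states(netstat_output: str) -> Counter:
    """Count connection states in `netstat -tan`-style text (Linux column layout assumed)."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with tcp/tcp6 and carry the state in the sixth column;
        # header lines do not match and are skipped.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Made-up sample input, for illustration only.
SAMPLE = """\
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:8080            0.0.0.0:*               LISTEN
tcp        0      0 10.0.0.5:8080           10.0.0.9:51234          ESTABLISHED
tcp        1      0 10.0.0.5:8080           10.0.0.9:51240          CLOSE_WAIT
tcp        1      0 10.0.0.5:8080           10.0.0.9:51241          CLOSE_WAIT
"""

if __name__ == "__main__":
    for state, count in count_tcp_states(SAMPLE).most_common():
        print(state, count)
```

A steadily growing CLOSE_WAIT count usually means the application is not closing sockets after the peer has disconnected, which is worth checking against the release history.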

Multiple factors often combine; therefore collect as much evidence as possible and test hypotheses in parallel.
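Testing hypotheses in parallel can be sketched with a thread pool that runs one check per suspicion point and collects the verdicts. The `run_checks` helper and the check names in the usage example are hypothetical; real checks would query release history, logs, or metrics.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Optional


def run_checks(
    checks: Dict[str, Callable[[], bool]], max_workers: int = 4
) -> Dict[str, Optional[bool]]:
    """Run every hypothesis check concurrently.

    True means the hypothesis is supported, False means it is ruled out,
    and None means the check itself failed and is inconclusive.
    """
    results: Dict[str, Optional[bool]] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in checks.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception:
                # A crashed check proves nothing either way.
                results[name] = None
    return results


if __name__ == "__main__":
    # Placeholder checks; each lambda stands in for a real query.
    verdicts = run_checks({
        "recent_release": lambda: True,
        "disk_full": lambda: False,
    })
    print(verdicts)
```

Keeping inconclusive results separate from ruled‑out ones avoids discarding a hypothesis just because its probe was broken.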

Incident Remediation

When the cause is identified, apply one or more of the following actions:

Service degradation or isolation of the faulty component

Emergency scaling of resources (horizontal or vertical)

Rollback to a previously stable version

If the root cause is still unknown, use the same measures to reduce impact while investigation continues.
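Isolating a faulty component is often done with a circuit breaker: after repeated failures, calls to the dependency are skipped and a degraded fallback is returned until a cool‑down has passed. The class below is a minimal sketch under assumed defaults (`failure_threshold`, `reset_after`); it is not thread‑safe and is not taken from any particular library.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Minimal, single-threaded circuit breaker (illustrative only)."""

    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Open: skip the faulty dependency and serve the degraded result.
                return fallback()
            # Half-open: allow one trial call; a single failure re-opens the breaker.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # a success closes the breaker fully
        return result
```

A caller would wrap each request to the suspect dependency, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical downstream call and the empty list is the degraded response.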

Incident Retrospective (Post‑mortem)

Review the entire detection‑diagnosis‑remediation flow, document the gaps found in process, architecture or policy, and define concrete corrective actions. The output should be a formal incident report that supports knowledge sharing and continuous improvement.

Supporting Infrastructure

Monitoring & Alerting

A robust monitoring system should provide real‑time alerts and historical trends for:

Server metrics (CPU, memory, disk, I/O, network)

Service health (availability, request latency, throughput)

Business metrics (traffic volume, error rates)

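Alert thresholds should be dynamic rather than fixed, and tuned to the business context: a request rate that is normal at peak hours may signal a problem off‑peak.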
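One simple way to make a threshold dynamic is to derive it from a rolling baseline, alerting when a value exceeds the recent mean by some multiple of the standard deviation. The sketch below assumes that approach; the class name `DynamicThreshold` and the defaults (`window`, `k`, `min_samples`) are illustrative, and production systems often also account for seasonality.

```python
from collections import deque
from statistics import mean, pstdev


class DynamicThreshold:
    """Flag values above mean + k * stddev of a rolling window (illustrative)."""

    def __init__(self, window: int = 60, k: float = 3.0, min_samples: int = 10) -> None:
        self.samples = deque(maxlen=window)
        self.k = k
        self.min_samples = min_samples

    def observe(self, value: float) -> bool:
        """Return True if `value` breaches the threshold built from earlier samples."""
        breached = False
        # Stay silent until there is enough history to form a baseline.
        if len(self.samples) >= self.min_samples:
            limit = mean(self.samples) + self.k * pstdev(self.samples)
            breached = value > limit
        if not breached:
            # Breaching values are kept out so a spike does not inflate the baseline.
            self.samples.append(value)
        return breached
```

Excluding breaching values keeps the baseline stable during an incident, at the cost of continuing to alert if the metric settles at a legitimately higher level; that trade‑off is a design choice, not a requirement.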

Log Trace System

Effective log tracing requires:

Unique request identifiers that propagate across services

Automated collection of distributed logs

Ingestion latency preferably under 10 minutes

Rich query, filtering and aggregation capabilities

Time‑series visualisation

Open‑source stacks such as Logstash + Elasticsearch + Kibana (ELK) are commonly used.
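The first requirement, a request identifier that propagates across services, can be sketched with Python's standard `logging` and `contextvars` modules. The header name `X-Request-ID` is a common convention assumed here, and `build_logger` and `handle_request` are hypothetical helpers rather than part of any framework.

```python
import contextvars
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means no request.
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Attach the current request ID to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.request_id = request_id_var.get()
        return True


def build_logger(name: str = "app", stream=None) -> logging.Logger:
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s [%(request_id)s] %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger: logging.Logger, incoming_headers: dict) -> dict:
    """Reuse the upstream ID when present so all services log the same value."""
    rid = incoming_headers.get("X-Request-ID") or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        logger.info("order received")
        # Headers to forward on any downstream call.
        return {"X-Request-ID": rid}
    finally:
        request_id_var.reset(token)
```

With every service logging the same identifier, a single query on that value in the log store reconstructs the full path of one request.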

Incident Handling Mechanism

Fast resolution depends on streamlined event routing and collaboration:

A dedicated “second‑line” team receives production tickets, routes them to the responsible owners, and convenes a joint troubleshooting session.

A technical lead (the “master”) coordinates the effort, makes decisions, and allocates resources.

Summary

The core of online incident handling is rapid service restoration, followed by systematic root‑cause analysis and preventive measures. Achieving this requires parallel information gathering, decisive leadership, and supporting infrastructure such as monitoring, logging, and efficient escalation processes.

Case study (large stack‑trace logs degrading service availability): http://www.cnblogs.com/daoqidelv/p/6786649.html

Reference (Google SRE concepts): http://www.jianshu.com/p/60cb877d9409
