
Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates

This handbook presents a complete SRE on‑call process, refined over two years of use, that defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules, and post‑mortem procedures, and supplies ready‑to‑use configuration snippets, checklists, and templates to reduce MTTR and prevent repeat incidents.


Overview

On‑call duty is a core responsibility of an SRE team, yet many organizations suffer from chaotic alert handling, missing records, verbal‑only handoffs, and repeated incidents. This handbook provides a complete, end‑to‑end on‑call template that standardizes alert grading, response, escalation, shift handoff, and post‑mortem activities.

Alert grading standard

The alert pyramid defines four severity levels (P0–P3) with specific response windows and escalation triggers.

┌─────────────────────────────────────────────┐
│            Alert Grading Pyramid            │
├─────────────────────────────────────────────┤
│ P0  Major outage    – full site down           – respond in  5 min
│ P1  Severe fault    – core function impaired   – respond in 15 min
│ P2  General fault   – non‑core impact          – respond in 30 min
│ P3  Minor issue     – performance drop         – respond in  4 h
└─────────────────────────────────────────────┘

P0 – Critical failure

Definition: Entire site or core business unavailable, directly affecting revenue or reputation.

Full site outage

Payment system completely down

Core database crash

Major security incident

Response requirements:

Respond within 5 minutes

Escalate immediately to technical director and business owner

Form a War‑Room with all relevant personnel

P1 – Severe fault

Definition: Core functionality severely degraded, affecting many users.

Login service failure

Core API timeout

Data sync severe delay

Partial region outage

Response requirements:

Respond within 15 minutes

Escalate to technical manager after 30 minutes without progress

On‑call engineer leads the fix; support may be added

P2 – General fault

Definition: Non‑core feature abnormal, affecting a subset of users.

Backend admin panel error

Non‑critical service degradation

Monitoring system failure

Log collection interruption

Response requirements:

Respond within 30 minutes

Escalate to technical manager after 2 hours without resolution

On‑call engineer handles during work hours

P3 – Minor issue

Definition: Performance dip or potential risk, no impact on normal usage.

High CPU/memory usage

Disk space warning

SSL certificate nearing expiry

Acceptable performance degradation

Response requirements:

Respond within 4 hours (working hours)

No escalation needed; routine handling

Record ticket for later resolution
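
The severity that drives the routing below is just an alert label. A minimal sketch of Prometheus alerting rules that attach the P0–P3 label is shown here; the alert names, metric names, thresholds, and job labels are illustrative assumptions, not fixed by this handbook.

# Example Prometheus alerting rules attaching the severity label (P0–P3)
# Alert names, metrics, thresholds, and job labels below are assumptions for illustration.
groups:
  - name: oncall-severity-examples
    rules:
      - alert: SiteDown
        expr: sum(up{job="frontend"}) == 0          # all scraped frontend instances are down
        for: 1m
        labels:
          severity: P0                              # routed to phone + SMS + WeChat
        annotations:
          summary: "Full site outage: no frontend instances are up"
      - alert: CoreApiHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="order-api",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="order-api"}[5m])) > 0.10
        for: 5m
        labels:
          severity: P1                              # routed to SMS + WeChat
        annotations:
          summary: "Order API 5xx ratio above 10% for 5 minutes"
      - alert: DiskSpaceLow
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: P3                              # silent channel, ticket for later
        annotations:
          summary: "Less than 10% disk space remaining on {{ $labels.mountpoint }}"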

Alertmanager configuration example (Prometheus)

# Prometheus Alertmanager configuration example
# (severity-specific routes are nested under the top-level route; a 'default'
#  receiver catches alerts that carry no P0–P3 severity label)
route:
  receiver: 'default'
  routes:
    - match:
        severity: P0
      receiver: 'p0-critical'
      group_wait: 0s
      repeat_interval: 5m
    - match:
        severity: P1
      receiver: 'p1-high'
      group_wait: 30s
      repeat_interval: 15m
    - match:
        severity: P2
      receiver: 'p2-medium'
      group_wait: 1m
      repeat_interval: 30m
    - match:
        severity: P3
      receiver: 'p3-low'
      group_wait: 5m
      repeat_interval: 4h

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p0-critical'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/phone'   # phone call
      - url: 'http://alert-gateway/webhook/sms'     # SMS
      - url: 'http://alert-gateway/webhook/wechat'  # WeChat
  - name: 'p1-high'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/sms'
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p2-medium'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p3-low'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat-silent'
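
If amtool ships alongside Alertmanager, the routing tree can be sanity‑checked before rollout. A quick sketch, assuming the configuration above is saved as alertmanager.yml:

# Dry-run the routing tree: which receiver does a given severity land on?
amtool config routes test --config.file=alertmanager.yml severity=P0   # expect: p0-critical
amtool config routes test --config.file=alertmanager.yml severity=P3   # expect: p3-low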

On‑call response process

Response flow diagram

┌─────────────────────────────────────────┐
│           Alert Response Flow           │
├─────────────────────────────────────────┤
│ Alert triggers → Is the on‑call engineer available?
│     No  → escalate to the backup on‑call
│     Yes ↓
│ Confirm the alert (ACK) → Is it a false alarm?
│     Yes → optimize the rule / close
│     No  ↓
│ Assess impact → determine severity level
│     P0/P1 → build a War‑Room
│     P2/P3 → handle in parallel
│ Fix → verify → close the incident
└─────────────────────────────────────────┘

Alert response checklist

## Alert Response Checklist
### Step 1: Confirm alert (within 2 min)
- [ ] Receive notification
- [ ] ACK in alert system
- [ ] Notify team channel
- [ ] Record start time

### Step 2: Initial assessment (within 5 min)
- [ ] Check for false alarm
- [ ] Evaluate impact (users, business)
- [ ] Determine severity (P0‑P3)
- [ ] Decide on escalation or War‑Room

### Step 3: Locate the fault (see the triage sketch after this checklist)
- [ ] View monitoring dashboards
- [ ] Review recent changes
- [ ] Check service logs
- [ ] Consult relevant Runbook

### Step 4: Fault handling
- [ ] Perform mitigation (restore service)
- [ ] Identify root cause
- [ ] Apply fix
- [ ] Verify resolution

### Step 5: Incident closure
- [ ] Confirm metrics back to normal
- [ ] Business confirms functionality
- [ ] Fill in the incident report
- [ ] Close alert and ticket
- [ ] Notify team of recovery
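
For the fault‑location step, a small triage script kept next to the Runbooks can save minutes under pressure. A minimal sketch, assuming the services run on Kubernetes; the namespace and service names are placeholders:

#!/usr/bin/env bash
# quick-triage.sh — first-pass fault location for a suspect service
# Usage: ./quick-triage.sh <namespace> <service>   (both names are placeholders)
set -euo pipefail
NS="${1:-prod}"
SVC="${2:-order-service}"

echo "=== Pod status ==="
kubectl -n "$NS" get pods -l app="$SVC" -o wide

echo "=== Recent warning events ==="
kubectl -n "$NS" get events --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20

echo "=== Recent rollouts (check for change-related causes) ==="
kubectl -n "$NS" rollout history deployment/"$SVC" | tail -n 5

echo "=== Last error lines from the service log ==="
kubectl -n "$NS" logs deployment/"$SVC" --tail=500 | grep -iE 'error|exception|timeout' | tail -n 50 || true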

War‑Room setup and operation

Conditions for War‑Room:

P0‑level incident

P1 incident unresolved for >30 min

Cross‑team fault requiring multi‑party collaboration

Technical leader deems it necessary

Roles:

┌────────────────────────────────────────┐
│             War‑Room Roles             │
├────────────────────────────────────────┤
│ Incident Commander (IC)   – coordinates, final decisions
│ Technical Team            – investigates, fixes
│ Communication Coordinator – external updates
│ Recorder                  – logs timeline, actions
└────────────────────────────────────────┘

Operational guidelines:

## War‑Room Operation
### Start
1. IC announces War‑Room start.
2. Summarise current status and known info.
3. Confirm participants.
4. Set communication channel (main chat, voice).

### During
1. IC gives status update every 15 min.
2. Technical discussion in sub‑channel.
3. Important decisions announced in main channel.
4. Recorder logs timeline in real time.

### End
1. Confirm full recovery.
2. Recorder compiles complete timeline.
3. IC declares War‑Room closed.
4. Schedule post‑mortem.

### Communication templates
- Status update: "[time] [status] Current: xxx, Doing: xxx, ETA: xxx"
- Request support: "Need xxx team member to assist with xxx"
- Decision announcement: "Decided to execute xxx, expected effect xxx, owner xxx"

Escalation mechanism

Escalation paths

┌─────────────────────────────────────────┐
│             Escalation Path             │
├─────────────────────────────────────────┤
│ Level 1: On‑call engineer
│   → Level 2: Senior engineer    (P0/P1 unresolved after 15 min)
│   → Level 3: Technical manager  (unresolved after 30 min)
│   → Level 4: CTO / Director     (P0 >1 h, or customer/media impact)
└─────────────────────────────────────────┘

Trigger conditions

L1→L2: P0/P1 not resolved within 15 min – notify second‑line engineer via phone/IM.

L2→L3: No progress after 30 min or cross‑team coordination needed – notify technical manager via phone/meeting.

L3→L4: P0 persists >1 h with media attention – notify CTO via phone/report.
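
Escalation timers are easy to lose track of mid‑incident, so the reminder can be automated. A minimal sketch run from cron that nudges the team through the same alert‑gateway webhook used above; the incident‑marker file convention and the gateway's JSON payload format are assumptions, not part of the handbook:

#!/usr/bin/env bash
# escalate-check.sh — remind the team to escalate when a P0/P1 stays open too long
# Assumed convention: the on-call engineer creates /var/run/oncall/incident-<id>.P0 or .P1
# at ACK time and removes it on closure. The gateway payload format is also an assumption.
set -euo pipefail
GATEWAY="http://alert-gateway/webhook/sms"
THRESHOLD_MIN=15

for marker in /var/run/oncall/incident-*.P0 /var/run/oncall/incident-*.P1; do
  [ -e "$marker" ] || continue
  age_min=$(( ( $(date +%s) - $(stat -c %Y "$marker") ) / 60 ))
  if [ "$age_min" -ge "$THRESHOLD_MIN" ]; then
    curl -s -X POST "$GATEWAY" \
      -H 'Content-Type: application/json' \
      -d "{\"text\":\"$(basename "$marker") open for ${age_min} min - escalate to the next level\"}"
  fi
done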

Shift handoff process

Handoff schedule

┌────────────────────────────────────────┐
│         Shift Handoff Schedule         │
├────────────────────────────────────────┤
│ Weekday:
│   09:00‑09:30  Night → Morning handoff (sync alerts, status)
│   09:30‑18:00  Morning shift (handle alerts, stand‑up)
│   18:00‑18:30  Evening handoff (sync day status, pending items)
│   18:30‑09:00  Night shift (handle overnight alerts, log non‑urgent issues)
│ Weekend:
│   09:00‑21:00  Day shift
│   21:00‑09:00  Night shift
└────────────────────────────────────────┘

Handoff checklist

## Handoff Checklist
### 1. Alert status (can be pre‑filled; see the sketch after this checklist)
- [ ] Number of alerts generated during shift
- [ ] Open alerts and their progress
- [ ] Issues requiring next shift attention

### 2. Incident summary
- [ ] Incidents resolved
- [ ] Ongoing incidents
- [ ] Potential risks

### 3. Change records
- [ ] Production changes made
- [ ] Planned upcoming changes
- [ ] Issues caused by changes

### 4. To‑do items
- [ ] Tasks to continue
- [ ] Open tickets
- [ ] Other pending work

### 5. Special attention
- [ ] Services needing extra monitoring
- [ ] Upcoming major events (promotions, releases)
- [ ] Other notices
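
The alert‑status part of the checklist can be pre‑filled from Alertmanager's v2 API instead of being recalled from memory. A minimal sketch, assuming jq is installed and Alertmanager is reachable at the address below (the address is an assumption):

#!/usr/bin/env bash
# handoff-alerts.sh — list currently firing alerts for the handoff note
set -euo pipefail
AM="http://alertmanager.example.com:9093"   # assumed Alertmanager address

echo "## Open alerts at handoff ($(date '+%F %T'))"
curl -s "$AM/api/v2/alerts?active=true&silenced=false" \
  | jq -r '.[] | "- [\(.labels.severity // "P?")] \(.labels.alertname): \(.annotations.summary // "no summary")"'

echo
echo "Open alert count by severity:"
curl -s "$AM/api/v2/alerts?active=true&silenced=false" \
  | jq -r 'group_by(.labels.severity)[] | "\(.[0].labels.severity // "P?"): \(length)"'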

Post‑mortem process

Post‑mortem meeting flow

┌────────────────────────────────────────┐
│            Post‑mortem Flow            │
├────────────────────────────────────────┤
│ Preparation (within 24 h):
│   • Collect timeline, logs, metrics
│   • Draft initial report
│   • Invite participants
│
│ Meeting (within 48 h):
│   1. Incident recap (10 min)
│   2. Timeline walk‑through (15 min)
│   3. Root‑cause analysis (20 min, 5‑Why)
│   4. Improvement actions (15 min)
│   5. Summary & action items (10 min)
│
│ Follow‑up:
│   • Publish report
│   • Create improvement tickets
│   • Track progress regularly
└────────────────────────────────────────┘

Post‑mortem report template

# Incident Post‑mortem Report
## Basic Info
| Item | Details |
|------|---------|
| Title | [Brief description] |
| Severity | P0–P3 |
| Time | 2024‑01‑15 14:30 – 15:45 (1 h 15 min) |
| Impact | Site‑wide order failure, 80 % order drop, ~XX k loss |
| Owner(s) | Zhang San (lead), Li Si (support) |
| Review date | 2024‑01‑16 |
| Author | Wang Wu |
---
## Summary
[2‑3 sentences summarising cause, impact and resolution]
---
## Timeline
| Time | Event | Owner |
|------|-------|-------|
|14:30|Alert fired: order service 5xx >10 %|System|
|14:32|ACK by on‑call engineer|Zhang San|
|14:35|Initial check: DB connection timeout|Zhang San|
|14:40|Escalated to DBA|Zhang San|
|14:50|Root cause: connection‑pool max = 5 (should be 100)|Li Si|
|15:00|Decision to roll back config|Zhang & Li|
|15:10|Executed rollback|Li Si|
|15:20|Service recovery verification|Zhang San|
|15:45|Incident closed|Zhang San|
---
## Root‑Cause Analysis (5‑Why)
1. Why did orders fail? → Database connections were unavailable.
2. Why were connections unavailable? → The connection pool was exhausted.
3. Why was the pool exhausted? → The maximum connection count was set to 5.
4. Why was it mis‑set? → Ops edited the production config while testing.
5. Why was there no guard? → No environment isolation or change‑review process existed.
---
## Impact assessment
- Users affected: ~100 k
- Business impact: Order loss ~XX k, 23 complaints
- SLA impact: Availability dropped from 99.95 % to 99.87 % (75 min)
---
## Improvement actions
### Short‑term (≤1 week)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Verify config rollback safety | Li Si | 2024‑01‑16 | Done |
| Add DB connection pool alert | Zhang San | 2024‑01‑17 | In‑progress |
| List high‑risk config items | Ops team | 2024‑01‑18 | Planned |
### Long‑term (≤1 month)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Implement production/test isolation | Architecture | 2024‑02‑01 | Planned |
| Establish config change approval | Ops | 2024‑02‑15 | Planned |
| Enable staged rollout for config changes | Development | 2024‑02‑28 | Planned |
---
## Lessons learned
### What went well
- Alert acknowledged within 2 min.
- Quick escalation to DBA.
- Rollback executed cleanly.
### Areas to improve
- Strengthen config‑management process.
- Add dual‑approval for critical configs.
- Improve DB connection pool monitoring.
---
## Follow‑up
- 2024‑01‑20: Verify short‑term actions.
- 2024‑02‑01: Review environment isolation.
- 2024‑03‑01: Confirm all improvements deployed.
---

Runbook index (selected)

Server SSH connectivity troubleshooting

Disk space cleanup procedure

Server reboot workflow

Network connectivity troubleshooting

DNS resolution failure handling

Load‑balancer health‑check failure

MySQL high‑connection count handling

MySQL slow‑query investigation

MySQL replication lag resolution

Redis memory alert handling

Redis connection‑count high

Redis cluster node failure

Kafka consumer lag mitigation

Kafka disk alert

Kafka broker down handling

Order service 5xx error handling

Order service timeout troubleshooting

Payment callback failure handling

Payment channel exception handling

Reconciliation discrepancy handling

Common command shortcuts

# Service status shortcuts
alias k='kubectl'
alias kp='kubectl get pods'
alias kl='kubectl logs -f'
alias kd='kubectl describe'
# Quick SSH
alias prod01='ssh prod-app-01'
alias prod02='ssh prod-app-02'
# Log tailing
alias tailapp='tail -f /var/log/app/app.log'
alias tailnginx='tail -f /var/log/nginx/access.log'
# System overview
alias sysinfo='echo "=== CPU ===" && top -bn1 | head -5 && echo -e "\n=== Memory ===" && free -h && echo -e "\n=== Disk ===" && df -h'
# DB shortcuts
alias mysql-prod='mysql -h prod-db.example.com -u readonly -p'
alias redis-prod='redis-cli -h prod-redis.example.com -p 6379'
# Troubleshooting tools
alias httpstat='curl -w "@curl-format.txt" -o /dev/null -s'   # timing breakdown; needs curl-format.txt (example below)
alias tcpstat='ss -s'
alias netstat='netstat -tlnp'
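
The httpstat alias above reads its output template from a curl-format.txt file in the current directory. A minimal example of that file using curl's standard --write-out variables (any literal text in the file is printed as‑is):

     time_namelookup: %{time_namelookup}s
        time_connect: %{time_connect}s
     time_appconnect: %{time_appconnect}s
    time_pretransfer: %{time_pretransfer}s
       time_redirect: %{time_redirect}s
  time_starttransfer: %{time_starttransfer}s
                      ----------
          time_total: %{time_total}s\n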

Glossary

On‑Call : Rotating responsibility for handling alerts and incidents.

Alert : Notification generated by monitoring when an abnormal condition is detected.

ACK : Acknowledge – confirming receipt of an alert and beginning investigation.

War Room : Dedicated coordination space for major incidents.

Incident Commander (IC) : Person who leads the response and makes final decisions.

Mitigation : Immediate actions to restore service, not necessarily fixing the root cause.

Root Cause : Underlying reason that caused the incident.

Postmortem : Review meeting after incident resolution.

Runbook : Standardised operational playbook.

SLA : Service Level Agreement.

MTTR : Mean Time To Recovery.

MTTD : Mean Time To Detect.

Tags: operations, SRE, alert management, incident response, post‑mortem, on‑call, War Room, Runbook