Standardized SRE On‑Call Handbook: Alert Grading, Response Flow, and Handoff Templates
This handbook documents a complete SRE on‑call process refined over two years of use. It defines alert severity tiers, response requirements, escalation paths, War‑Room roles, handoff schedules and post‑mortem procedures, and provides ready‑to‑use configuration snippets, checklists and templates to reduce MTTR and repeat incidents.
Overview
On‑call duty is a core responsibility of any SRE team, yet many organizations suffer from chaotic alert handling, missing records, verbal‑only handoffs and repeat incidents. This handbook provides a complete, end‑to‑end on‑call template that standardises alert grading, response, escalation, shift handoff and post‑mortems.
Alert grading standard
The alert pyramid defines four severity levels (P0–P3) with specific response windows and escalation triggers.
┌────────────────────────────────────────────────────┐
│               Alert Grading Pyramid                │
├────────────────────────────────────────────────────┤
│ P0  Major outage – full site down – 5 min          │
│ P1  Severe fault – core function impaired – 15 min │
│ P2  General fault – non‑core impact – 30 min       │
│ P3  Minor issue – performance drop – 4 h           │
└────────────────────────────────────────────────────┘

P0 – Critical failure
Definition: Entire site or core business unavailable, directly affecting revenue or reputation.
Full site outage
Payment system completely down
Core database crash
Major security incident
Response requirements:
Respond within 5 minutes
Escalate immediately to technical director and business owner
Form a War‑Room with all relevant personnel
P1 – Severe fault
Definition: Core functionality severely degraded, affecting many users.
Login service failure
Core API timeout
Data sync severe delay
Partial region outage
Response requirements:
Respond within 15 minutes
Escalate to technical manager after 30 minutes without progress
On‑call engineer leads the fix; support may be added
P2 – General fault
Definition: Non‑core feature abnormal, affecting a subset of users.
Backend admin panel error
Non‑critical service degradation
Monitoring system failure
Log collection interruption
Response requirements:
Respond within 30 minutes
Escalate to technical manager after 2 hours without resolution
On‑call engineer handles during work hours
P3 – Minor issue
Definition: Performance dip or potential risk, no impact on normal usage.
High CPU/memory usage
Disk space warning
SSL certificate nearing expiry
Acceptable performance degradation
Response requirements:
Respond within 4 hours (working hours)
No escalation needed; routine handling
Record ticket for later resolution
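For severity routing to work, every alert must carry a matching `severity` label. A minimal Prometheus alerting‑rule sketch showing how the label is attached (the metric names, thresholds and expressions here are illustrative assumptions, not taken from a real environment):

```yaml
groups:
  - name: oncall-examples
    rules:
      - alert: PaymentServiceDown
        # Hypothetical expression: no payment instance is up.
        expr: sum(up{job="payment"}) == 0
        for: 1m
        labels:
          severity: P0          # routed to the p0-critical receiver
        annotations:
          summary: "Payment system completely down"
      - alert: DiskSpaceWarning
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.10
        for: 30m
        labels:
          severity: P3          # routine handling, ticket only
        annotations:
          summary: "Disk space below 10%"
```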
Alertmanager configuration example (Prometheus)
# Prometheus Alertmanager configuration example.
# Note: severity sub-routes must be nested under the top-level `route`
# block, and every referenced receiver (including 'default') must exist.
route:
  receiver: 'default'
  routes:
    - match:
        severity: P0
      receiver: 'p0-critical'
      continue: false
      group_wait: 0s
      repeat_interval: 5m
    - match:
        severity: P1
      receiver: 'p1-high'
      continue: false
      group_wait: 30s
      repeat_interval: 15m
    - match:
        severity: P2
      receiver: 'p2-medium'
      continue: false
      group_wait: 1m
      repeat_interval: 30m
    - match:
        severity: P3
      receiver: 'p3-low'
      continue: false
      group_wait: 5m
      repeat_interval: 4h
receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p0-critical'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/phone'   # phone
      - url: 'http://alert-gateway/webhook/sms'     # SMS
      - url: 'http://alert-gateway/webhook/wechat'  # WeChat
  - name: 'p1-high'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/sms'
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p2-medium'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p3-low'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat-silent'

On‑call response process
Response flow diagram
┌─────────────────────────────────────┐
│ Alert Response Flow │
├─────────────────────────────────────┤
│ Alert Trigger → On‑Call Engineer? │
│ ──No──► Escalate to backup │
│ │Yes │
│ ▼ │
│ Confirm Alert (ACK) │
│ ──Is false alarm?──► Optimize/Close│
│ │No │
│ ▼ │
│ Assess impact → Determine level │
│ ──P0/P1──► Build War‑Room │
│ ──P2/P3──► Parallel handling │
│ │ │
│ │ Fix → Verify → Close incident │
└─────────────────────────────────────┘

Alert response checklist
## Alert Response Checklist
### Step 1: Confirm alert (within 2 min)
- [ ] Receive notification
- [ ] ACK in alert system
- [ ] Notify team channel
- [ ] Record start time
### Step 2: Initial assessment (within 5 min)
- [ ] Check for false alarm
- [ ] Evaluate impact (users, business)
- [ ] Determine severity (P0‑P3)
- [ ] Decide on escalation or War‑Room
### Step 3: Fault location
- [ ] View monitoring dashboards
- [ ] Review recent changes
- [ ] Check service logs
- [ ] Consult relevant Runbook
### Step 4: Fault handling
- [ ] Perform mitigation (restore service)
- [ ] Identify root cause
- [ ] Apply fix
- [ ] Verify resolution
### Step 5: Incident closure
- [ ] Confirm metrics back to normal
- [ ] Business confirms functionality
- [ ] Fill incident report
- [ ] Close alert and ticket
- [ ] Notify team of recovery

War‑Room setup and operation
Conditions for War‑Room:
P0‑level incident
P1 incident unresolved for >30 min
Cross‑team fault requiring multi‑party collaboration
Technical leader deems it necessary
Roles:
┌────────────────────────────────────────────────────────┐
│                     War‑Room Roles                     │
├────────────────────────────────────────────────────────┤
│ Incident Commander (IC) – coordinates, final decisions │
│ Technical Team – investigates, fixes                   │
│ Communication Coordinator – external updates           │
│ Recorder – logs timeline, actions                      │
└────────────────────────────────────────────────────────┘

Operational guidelines:
## War‑Room Operation
### Start
1. IC announces War‑Room start.
2. Summarise current status and known info.
3. Confirm participants.
4. Set communication channel (main chat, voice).
### During
1. IC gives status update every 15 min.
2. Technical discussion in sub‑channel.
3. Important decisions announced in main channel.
4. Recorder logs timeline in real time.
### End
1. Confirm full recovery.
2. Recorder compiles complete timeline.
3. IC declares War‑Room closed.
4. Schedule post‑mortem.
### Communication templates
- Status update: "[time] [status] Current: xxx, Doing: xxx, ETA: xxx"
- Request support: "Need xxx team member to assist with xxx"
- Decision announcement: "Decided to execute xxx, expected effect xxx, owner xxx"

Escalation mechanism
Escalation paths
┌─────────────────────────────────────────────────────────┐
│                       Escalation                        │
├─────────────────────────────────────────────────────────┤
│ Level 1: On‑call engineer (15 min unresolved or P0)     │
│ → Level 2: Senior engineer (30 min unresolved)          │
│ → Level 3: Technical manager (1 h unresolved)           │
│ → Level 4: CTO / Director (P0 or customer/media impact) │
└─────────────────────────────────────────────────────────┘

Trigger conditions
L1→L2: P0/P1 not resolved within 15 min – notify second‑line engineer via phone/IM.
L2→L3: No progress after 30 min or cross‑team coordination needed – notify technical manager via phone/meeting.
L3→L4: P0 persists >1 h with media attention – notify CTO via phone/report.
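The trigger conditions above can be sketched as a small shell helper. This is a simplified sketch: the function name and the `media` flag are assumptions, and the thresholds simply mirror the ladder above.

```shell
#!/usr/bin/env bash
# escalation_level SEVERITY MINUTES [media]
# Prints the escalation level (1-4) implied by the ladder above.
escalation_level() {
  local severity=$1 minutes=$2 media=${3:-}
  local level=1
  # L1 -> L2: P0/P1 not resolved within 15 min
  if { [ "$severity" = P0 ] || [ "$severity" = P1 ]; } && [ "$minutes" -ge 15 ]; then
    level=2
  fi
  # L2 -> L3: still no progress after 30 min
  if [ "$level" -eq 2 ] && [ "$minutes" -ge 30 ]; then
    level=3
  fi
  # L3 -> L4: P0 persists beyond 1 h with media attention
  if [ "$severity" = P0 ] && [ "$minutes" -ge 60 ] && [ "$media" = media ]; then
    level=4
  fi
  echo "$level"
}

escalation_level P1 45   # prints 3
```

Wiring such a helper into the paging gateway keeps escalation decisions consistent instead of depending on whoever happens to be on call.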
Shift handoff process
Handoff schedule
┌──────────────────────────────────────────────────────────────────────────┐
│                          Shift Handoff Schedule                          │
├──────────────────────────────────────────────────────────────────────────┤
│ Weekday:
│ 09:00‑09:30 – Night → Morning handoff (sync alerts, status)
│ 09:30‑18:00 – Morning shift (handle alerts, stand‑up)
│ 18:00‑18:30 – Evening handoff (sync day status, pending items)
│ 18:30‑09:00 – Night shift (handle overnight alerts, log non‑urgent issues)
│ Weekend:
│ 09:00‑21:00 – Day shift
│ 21:00‑09:00 – Night shift
└──────────────────────────────────────────────────────────────────────────┘

Handoff checklist
## Handoff Checklist
### 1. Alert status
- [ ] Number of alerts generated during shift
- [ ] Open alerts and their progress
- [ ] Issues requiring next shift attention
### 2. Incident summary
- [ ] Incidents resolved
- [ ] Ongoing incidents
- [ ] Potential risks
### 3. Change records
- [ ] Production changes made
- [ ] Planned upcoming changes
- [ ] Issues caused by changes
### 4. To‑do items
- [ ] Tasks to continue
- [ ] Open tickets
- [ ] Other pending work
### 5. Special attention
- [ ] Services needing extra monitoring
- [ ] Upcoming major events (promotions, releases)
- [ ] Other notices

Post‑mortem process
Post‑mortem meeting flow
┌─────────────────────────────────────┐
│ Post‑mortem Flow │
├─────────────────────────────────────┤
│ Pre‑mortem (within 24 h):
│ • Collect timeline, logs, metrics
│ • Draft initial report
│ • Invite participants
│
│ Meeting (within 48 h):
│ 1. Incident recap (10 min)
│ 2. Timeline walk‑through (15 min)
│ 3. Root‑cause analysis (20 min, 5‑Why)
│ 4. Improvement actions (15 min)
│ 5. Summary & action items (10 min)
│
│ Follow‑up:
│ • Publish report
│ • Create improvement tickets
│ • Track progress regularly
└─────────────────────────────────────┘

Post‑mortem report template
# Incident Post‑mortem Report
## Basic Info
| Item | Details |
|------|---------|
| Title | [Brief description] |
| Severity | P0‑P3 |
| Time | 2024‑01‑15 14:30 – 15:45 (1 h 15 min) |
| Impact | Site‑wide order failure, 80 % order drop, ~XX k loss |
| Owner(s) | Zhang San (lead), Li Si (support) |
| Review date | 2024‑01‑16 |
| Author | Wang Wu |
---
## Summary
[2‑3 sentences summarising cause, impact and resolution]
---
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 14:30 | Alert fired: order service 5xx > 10 % | System |
| 14:32 | ACK by on‑call engineer | Zhang San |
| 14:35 | Initial check: DB connection timeout | Zhang San |
| 14:40 | Escalated to DBA | Zhang San |
| 14:50 | Root cause: connection‑pool max = 5 (should be 100) | Li Si |
| 15:00 | Decision to roll back config | Zhang & Li |
| 15:10 | Executed rollback | Li Si |
| 15:20 | Service recovery verification | Zhang San |
| 15:45 | Incident closed | Zhang San |
---
## Root‑Cause Analysis (5‑Why)
1. Why orders failed? → DB connections unavailable.
2. Why no connections? → Pool exhausted.
3. Why pool exhausted? → Max connections set to 5.
4. Why mis‑set? → Ops edited production config while testing.
5. Why no guard? → No environment isolation or change‑review process.
---
## Impact assessment
- Users affected: ~100 k
- Business impact: Order loss ~XX k, 23 complaints
- SLA impact: Availability dropped from 99.95 % to 99.87 % (75 min)
---
## Improvement actions
### Short‑term (≤1 week)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Verify config rollback safety | Li Si | 2024‑01‑16 | Done |
| Add DB connection pool alert | Zhang San | 2024‑01‑17 | In‑progress |
| List high‑risk config items | Ops team | 2024‑01‑18 | Planned |
### Long‑term (≤1 month)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Implement production/test isolation | Architecture | 2024‑02‑01 | Planned |
| Establish config change approval | Ops | 2024‑02‑15 | Planned |
| Enable staged rollout for config changes | Development | 2024‑02‑28 | Planned |
---
## Lessons learned
### What went well
- Alert acknowledged within 2 min.
- Quick escalation to DBA.
- Rollback executed cleanly.
### Areas to improve
- Strengthen config‑management process.
- Add dual‑approval for critical configs.
- Improve DB connection pool monitoring.
---
## Follow‑up
- 2024‑01‑20: Verify short‑term actions.
- 2024‑02‑01: Review environment isolation.
- 2024‑03‑01: Confirm all improvements deployed.
---

Runbook index (selected)
Server SSH connectivity troubleshooting
Disk space cleanup procedure
Server reboot workflow
Network connectivity troubleshooting
DNS resolution failure handling
Load‑balancer health‑check failure
MySQL high‑connection count handling
MySQL slow‑query investigation
MySQL replication lag resolution
Redis memory alert handling
Redis connection‑count high
Redis cluster node failure
Kafka consumer lag mitigation
Kafka disk alert
Kafka broker down handling
Order service 5xx error handling
Order service timeout troubleshooting
Payment callback failure handling
Payment channel exception handling
Reconciliation discrepancy handling
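To give a flavour of what these runbooks contain, here is a sketch of the first, read‑only step of the disk space cleanup procedure. The function name, default path and 7‑day threshold are illustrative assumptions; nothing is deleted.

```shell
#!/usr/bin/env bash
# disk_report DIR - dry run for the disk-space cleanup runbook:
# show the largest entries and list stale logs; deletes nothing.
disk_report() {
  local root=${1:-/var/log}
  echo "=== Largest entries (MB) ==="
  du -m "$root"/* 2>/dev/null | sort -rn | head -5
  echo "=== Log files older than 7 days ==="
  find "$root" -name '*.log' -mtime +7 -type f 2>/dev/null
}
```

Keeping the diagnostic step separate from the destructive step lets the on‑call engineer paste the report into the incident channel before anyone approves a cleanup.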
Common command shortcuts
# Service status shortcuts
alias k='kubectl'
alias kp='kubectl get pods'
alias kl='kubectl logs -f'
alias kd='kubectl describe'
# Quick SSH
alias prod01='ssh prod-app-01'
alias prod02='ssh prod-app-02'
# Log tailing
alias tailapp='tail -f /var/log/app/app.log'
alias tailnginx='tail -f /var/log/nginx/access.log'
# System overview
alias sysinfo='echo "=== CPU ===" && top -bn1 | head -5 && echo "=== Memory ===" && free -h && echo "=== Disk ===" && df -h'
# DB shortcuts
alias mysql-prod='mysql -h prod-db.example.com -u readonly -p'
alias redis-prod='redis-cli -h prod-redis.example.com -p 6379'
# Troubleshooting tools
alias httpstat='curl -w "@curl-format.txt" -o /dev/null -s'
alias tcpstat='ss -s'
alias netstat='netstat -tlnp'

Glossary
On‑Call : Rotating responsibility for handling alerts and incidents.
Alert : Notification generated by monitoring when an abnormal condition is detected.
ACK : Acknowledge – confirming receipt of an alert and beginning investigation.
War Room : Dedicated coordination space for major incidents.
Incident Commander (IC) : Person who leads the response and makes final decisions.
Mitigation : Immediate actions to restore service, not necessarily fixing the root cause.
Root Cause : Underlying reason that caused the incident.
Postmortem : Review meeting after incident resolution.
Runbook : Standardised operational playbook.
SLA : Service Level Agreement.
MTTR : Mean Time To Recovery.
MTTD : Mean Time To Detect.
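MTTR can be computed directly from incident records once they are logged consistently. A sketch, assuming a hypothetical record format of one "detected_epoch recovered_epoch" pair per line:

```shell
#!/usr/bin/env bash
# mttr_minutes FILE - mean time to recovery, in whole minutes,
# from lines of "detected_epoch recovered_epoch".
# Usage: mttr_minutes incidents.txt
mttr_minutes() {
  awk '{ total += ($2 - $1) } END { if (NR) printf "%d\n", total / NR / 60 }' "$1"
}
```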