How to Build a Standardized SRE On‑Call Process: From Alert Grading to Handoff Templates
This article presents a complete SRE on‑call handbook: alert severity definitions, a concrete Prometheus Alertmanager configuration, a step‑by‑step response flow, war‑room roles, escalation paths, handoff checklists, a post‑mortem workflow, and ready‑to‑use templates aimed at reducing MTTR and improving reliability.
Problem Statement
Many organizations suffer from chaotic on‑call processes: alerts go unanswered, handling steps are undocumented, and handoffs rely on word of mouth. The result is repeated incidents, long MTTR, and a steep onboarding curve for new engineers.
Solution Overview
A complete SRE on‑call template was built covering alert grading, response procedures, escalation paths, handoff standards and post‑mortem workflow. After two years of production use the team measured:
Average alert acknowledgement time reduced from 15 minutes to 3 minutes.
Repeat‑failure rate dropped by 60%.
Alert Grading Pyramid
┌─────────────────────────────────────────────────────────────────┐
│ Alert Grading Pyramid │
├─────────────────────────────────────────────────────────────────┤
│ ▲ P0 Major outage │
│ /│\ Full‑site outage, 5‑min response │
│ / │ \ Core‑business interruption │
│ / │ \ 5‑min response │
│ /────┼────\ │
│ / │ \ P1 Severe fault │
│ / │ \ Core function impaired │
│ / │ \ 15‑min response │
│ /─────────┼─────────\ │
│ / │ \ P2 General fault │
│ / │ \ Non‑core impact │
│ / │ \ 30‑min response │
│ /─────────────┼─────────────\ │
│ / │ \ P3 Minor issue │
│ / │ \ Performance drop │
│ / │ \ 4‑hour response │
└─────────────────────────────────────────────────────────────────┘
P0 – Major Outage
Definition: Entire site or core business unavailable, directly affecting revenue or reputation.
Typical scenarios: Full site down, payment system unavailable, core DB crash, major security breach.
Response requirement: Acknowledge within 5 minutes, escalate immediately to the technical director and business owner, and launch a war room.
P1 – Severe Fault
Definition: Core functionality severely degraded, affecting many users.
Typical scenarios: Login failure, core API timeout, data‑sync delay, regional outage.
Response requirement: Acknowledge within 15 minutes; if unresolved after 30 minutes, escalate to the technical manager.
P2 – General Fault
Definition: Non‑core feature abnormal, affecting a subset of users.
Typical scenarios: Admin panel error, non‑critical service degradation, monitoring failure, log collection interruption.
Response requirement: Acknowledge within 30 minutes; if unresolved after 2 hours, escalate to the technical manager.
P3 – Minor Issue
Definition: Performance drop or potential risk that does not affect normal usage.
Typical scenarios: High CPU/memory usage, disk space warning, SSL certificate nearing expiry, acceptable performance degradation.
Response requirement: Acknowledge within 4 hours (working hours); no escalation required, record a ticket for later handling.
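The routing configuration in the next section keys off a severity label carried by each alert, so every Prometheus alerting rule needs to set one. The rules file below is a minimal sketch of that labelling; the job names, thresholds, and the http_requests_total metric are illustrative assumptions, and the certificate check assumes TLS probes via the blackbox exporter (probe_ssl_earliest_cert_expiry).
# Example Prometheus alerting rules attaching the P0–P3 severity label (illustrative)
groups:
  - name: sre-oncall-severity
    rules:
      - alert: PaymentServiceDown              # hypothetical job name
        expr: up{job="payment-service"} == 0
        for: 1m
        labels:
          severity: P0
        annotations:
          summary: "Payment service instance {{ $labels.instance }} is down"
      - alert: ApiHighErrorRate                # assumes http_requests_total-style instrumentation
        expr: >
          sum(rate(http_requests_total{job="api", code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="api"}[5m])) > 0.05
        for: 5m
        labels:
          severity: P1
        annotations:
          summary: "API 5xx error rate above 5% for 5 minutes"
      - alert: SSLCertExpiringSoon             # assumes blackbox exporter TLS probes
        expr: (probe_ssl_earliest_cert_expiry - time()) < 14 * 24 * 3600
        for: 1h
        labels:
          severity: P3
        annotations:
          summary: "TLS certificate for {{ $labels.instance }} expires within 14 days"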
Alertmanager Configuration Example (Prometheus)
# Prometheus Alertmanager configuration example
route:
  receiver: 'default'            # fallback receiver for alerts matching no sub-route
  routes:
    - match:
        severity: P0
      receiver: 'p0-critical'
      continue: false
      group_wait: 0s             # page immediately
      repeat_interval: 5m        # re-notify every 5 minutes until resolved
    - match:
        severity: P1
      receiver: 'p1-high'
      continue: false
      group_wait: 30s
      repeat_interval: 15m
    - match:
        severity: P2
      receiver: 'p2-medium'
      continue: false
      group_wait: 1m
      repeat_interval: 30m
    - match:
        severity: P3
      receiver: 'p3-low'
      continue: false
      group_wait: 5m
      repeat_interval: 4h
receivers:
  - name: 'default'              # catches alerts without a matching severity label
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p0-critical'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/phone'   # phone call
      - url: 'http://alert-gateway/webhook/sms'     # SMS
      - url: 'http://alert-gateway/webhook/wechat'  # WeChat
  - name: 'p1-high'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/sms'
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p2-medium'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat'
  - name: 'p3-low'
    webhook_configs:
      - url: 'http://alert-gateway/webhook/wechat-silent'
Alert Response Process
The flow can be visualised as a decision tree (ASCII art kept for clarity):
┌─────────────────────────────────────────────────────────────────────┐
│ Alert Response Flow │
├─────────────────────────────────────────────────────────────────────┤
│ ┌──────────┐ │
│ │ Alert │ │
│ └────┬─────┘ │
│ │ │
│ ▼ │
│ ┌─────────────┐ No ┌─────────────┐ │
│ │ On‑call ├────────►│ Escalation │ │
│ │ 5‑min ACK │ │ backup │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ Yes │ │
│ ▼ ▼ │
│ ┌─────────────┐ Yes ┌─────────────┐ No │
│ │ Confirm │◄────────│ Is false‑ │◄───────────────────│
│ │ Alert │ │ alarm? │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ P0/P1 ┌─────────────┐ │
│ │ Level │──────────►│ Build War │ │
│ │ judgment │ │ Room │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────┐ P2/P3 ┌─────────────┐ │
│ │ Check Run‑ │◄──────────│ Continue │ │
│ │ book │ │ processing │ │
│ └──────┬──────┘ └──────┬──────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Fault handling: │ │ Fault handling: │ │
│ │ • Stop‑bleed │ │ • Root‑cause fix │ │
│ │ • Locate cause │ │ • Verify fix │ │
│ │ • Repair │ │ │ │
│ └───────┬─────────────┘ └───────┬─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Recovery confirm │ │ Recovery confirm │ │
│ └───────┬─────────────┘ └───────┬─────────────┘ │
│ │ │ │
│ ▼ ▼ │
│ ┌─────────────────────┐ ┌─────────────────────┐ │
│ │ Close ticket & │ │ Close ticket & │ │
│ │ write post‑mortem │ │ write post‑mortem │ │
│ └─────────────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────────────┘
Alert Response Checklist
## Alert Response Checklist
### Step 1: Confirm alert (within 2 min)
- [ ] Receive alert notification
- [ ] ACK the alert in the system
- [ ] Notify the team channel of takeover
- [ ] Record start time
### Step 2: Initial assessment (within 5 min)
- [ ] Verify if it is a false alarm
- [ ] Evaluate impact scope (users, business)
- [ ] Determine alert level (P0‑P3)
- [ ] Decide whether escalation or war‑room is needed
### Step 3: Fault location
- [ ] Review monitoring dashboards
- [ ] Check recent change logs
- [ ] Examine service logs
- [ ] Consult relevant runbook
### Step 4: Fault handling
- [ ] Perform stop‑bleed actions (restore service)
- [ ] Locate root cause
- [ ] Implement fix
- [ ] Verify remediation
### Step 5: Fault closure
- [ ] Confirm metrics have returned to normal
- [ ] Get business confirmation
- [ ] Fill fault handling report
- [ ] Close alert and ticket
- [ ] Notify team of recovery
War Room
When to Create a War Room
P0 fault.
P1 fault unresolved for >30 minutes.
Cross‑team incident requiring multi‑team collaboration.
Technical leader deems it necessary.
Role Distribution
Incident Commander (IC): Leads the response, makes final decisions, coordinates all participants.
Technical Handling Team: Performs root‑cause analysis and remediation.
Communication Coordinator: Handles external updates and status syncs.
Recorder: Maintains a real‑time timeline of actions and decisions.
Operating Guidelines
Meeting start: IC announces war‑room start, summarizes current status, confirms participants, sets communication channels (main chat + voice).
During the meeting (status sync every 15 min): technical discussion happens in a sub‑channel, major decisions are announced in the main channel, and the recorder keeps the timeline up to date.
Communication templates:
Status update: "[time] [status] Current: xxx, actions: xxx, ETA: xxx"
Support request: "Need xxx team member to assist with xxx"
Decision announcement: "Decide to execute xxx, expected effect xxx, owner xxx"
Meeting end: Confirm full recovery, recorder finalises timeline, IC declares war‑room closed, schedule post‑mortem.
Escalation Mechanism
Level 1 – Front‑line on‑call: Owner – current on‑call engineer. Duties – respond, initial handling, record. Escalate if no resolution in 15 min or P0 severity.
Level 2 – Secondary support: Owner – senior engineer / backup. Duties – handle complex issues, make technical decisions. Escalate if no resolution in 30 min or cross‑team needed.
Level 3 – Technical management: Owner – tech manager / architect. Duties – resource allocation, cross‑team coordination. Escalate if no resolution in 1 hour or major impact.
Level 4 – Executive level: Owner – CTO / technical director. Duties – business decisions, external communication. Triggered by P0 incident affecting customers or media.
Escalation Triggers (Concrete Examples)
L1 → L2: P0/P1 alert with no progress after 15 minutes. Target – secondary engineer. Method – phone + instant‑messaging.
L2 → L3: No progress after 30 minutes, requires cross‑team effort. Target – technical manager. Method – phone + conference call.
L3 → L4: P0 incident, no progress after 1 hour, media attention. Target – technical director. Method – phone + formal report.
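If the paging gateway supports declarative escalation policies, the ladder above can be captured as configuration. The sketch below is hypothetical: the alert‑gateway schema, field names, and target identifiers are assumptions, not the format of any specific tool.
# Hypothetical escalation-policy config for the alert-gateway referenced earlier (illustrative schema)
escalation_policies:
  - name: core-services
    levels:
      - level: 1                      # front-line on-call
        targets: ['oncall-primary']
        notify: ['phone', 'sms', 'wechat']
        escalate_after: 15m           # no ACK/progress within 15 min -> level 2
      - level: 2                      # secondary support / senior engineer
        targets: ['oncall-secondary']
        notify: ['phone', 'wechat']
        escalate_after: 30m
      - level: 3                      # technical manager / architect
        targets: ['tech-manager']
        notify: ['phone', 'conference-call']
        escalate_after: 1h
      - level: 4                      # CTO / technical director, P0 with external impact only
        targets: ['tech-director']
        notify: ['phone', 'formal-report']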
On‑Call Handoff Process
Handoff Timing (Typical Weekday Schedule)
09:00‑09:30 – Early shift handoff (night → morning).
09:30‑18:00 – Morning shift duty.
18:00‑18:30 – Evening shift handoff (morning → evening).
18:30‑09:00 – Night shift duty.
Weekend example: 09:00‑21:00 day shift, 21:00‑09:00 night shift.
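Where the rotation is managed as configuration, the same timetable can be declared once and reused by the paging tool. Again a hypothetical sketch; the schedule format, timezone, and engineer IDs are assumptions:
# Hypothetical on-call schedule matching the timetable above (illustrative schema)
schedules:
  - name: sre-weekday
    timezone: Asia/Shanghai           # assumption; use your own timezone
    shifts:
      - name: morning
        start: '09:30'
        end: '18:00'
        rotation: ['zhangsan', 'lisi', 'wangwu']   # weekly rotation
      - name: night
        start: '18:30'
        end: '09:00'                  # crosses midnight into the next day
        rotation: ['lisi', 'wangwu', 'zhangsan']
    handoff_windows:
      - '09:00-09:30'                 # night -> morning handoff
      - '18:00-18:30'                 # morning -> evening handoff
  - name: sre-weekend
    shifts:
      - name: day
        start: '09:00'
        end: '21:00'
      - name: night
        start: '21:00'
        end: '09:00'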
Handoff Checklist
Alert status: Number of alerts, unresolved alerts and their progress, items to be followed up.
Fault status: Resolved faults, ongoing incidents, potential risk warnings.
Change records: Production changes during shift, upcoming planned changes, issues caused by changes.
To‑do items: Tasks needing continued follow‑up, open tickets, other pending items.
Special attention: Services/monitors needing watch, upcoming major events (e.g., promotions), other notes.
Handoff Report Template (Markdown‑style for readability)
# On‑Call Handoff Report
**Handoff Time**: 2024-01-15 18:00
**From**: Zhang San (morning) → Li Si (evening)
---
## 1. Today’s Alert Summary
| Time | Alert Name | Level | Status | Result |
|------|------------|-------|--------|--------|
| 10:23 | MySQL CPU high | P2 | Closed | Optimised slow queries |
| 14:45 | API timeout | P1 | Closed | Upstream timeout resolved |
| 17:30 | Disk space warning | P3 | In‑progress | Log cleanup pending |
---
## 2. Ongoing Issues
### 1. Disk space cleanup
- **Server**: prod‑app‑01
- **Current status**: Identified log directories to clean
- **Action**: Execute cleanup script (link to runbook)
### 2. Monitoring alert optimisation
- **Problem**: MySQL CPU alert threshold too sensitive
- **Planned action**: Adjust threshold from 80 % to 85 % (ticket OPS‑2024‑0115)
---
## 3. Planned Changes
| Time | Change | Impact | Owner |
|------|--------|--------|-------|
| 20:00 | Order service release | Order module | Wang Wu |
| 22:00 | DB backup | Read‑only DB brief jitter | DBA |
---
## 4. Special Focus
- Payment service experienced a hiccup yesterday – monitor closely.
- Tomorrow morning big promotion expected to increase traffic by 50 %.
- SSL certificate for api.example.com expires next Friday – ticket opened.
---
## 5. Handoff Confirmation
- [ ] Alert system switched to evening on‑call.
- [ ] PagerDuty schedule updated.
- [ ] Handoff verbally confirmed.
**Handoff signer**: Zhang San
**Receiver signer**: Li Si
---
Post‑Mortem Process
A post‑mortem is conducted within 48 hours of incident resolution. The workflow covers preparation, a structured meeting, root‑cause analysis (5‑Why), improvement actions, and follow‑up tracking.
Post‑Mortem Report Template
# Incident Post‑Mortem Report
## Basic Info
| Item | Content |
|------|---------|
| Title | Order service outage |
| Level | P0 |
| Time | 2024‑01‑15 14:30‑15:45 (1 h 15 min) |
| Impact | Full‑site order failure |
| Metric impact | Order volume ↓80 %, loss ≈ XX M RMB |
| Handlers | Zhang San (lead), Li Si (assist) |
| Review date | 2024‑01‑16 |
| Author | Wang Wu |
---
## Incident Summary
The order service could not obtain a DB connection, causing all order requests to fail.
---
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 14:30 | Alert triggered: order service 5xx >10 % | System |
| 14:32 | ACK by on‑call engineer | Zhang San |
| 14:35 | Initial check: DB connection timeout | Zhang San |
| 14:40 | Escalated to DBA Li Si | Zhang San |
| 14:50 | Root cause: connection‑pool max‑conn set to 5 | Li Si |
| 15:00 | Decision to roll back config | Zhang San |
| 15:10 | Executed rollback | Li Si |
| 15:20 | Verified service recovery | Zhang San |
| 15:45 | Confirmed full recovery, closed alert | Zhang San |
---
## 5‑Why Root‑Cause Analysis
1. **Why** did the order service fail? → DB connections unavailable.
2. **Why** were DB connections unavailable? → Connection pool exhausted.
3. **Why** was the pool exhausted? → Max connections set to 5.
4. **Why** was max set to 5? → Mis‑applied config change intended for test env.
5. **Why** was test config applied to prod? → No environment isolation and missing change‑review.
**Direct cause**: Connection‑pool max‑conn mis‑configuration.
**Fundamental causes**:
- Production/test config not isolated.
- Change‑review workflow missing.
- No gray‑release for config changes.
---
## Impact Assessment
### User impact
- ~100 k users affected, order and payment unavailable.
- 23 user complaints logged.
### Business impact
- Estimated loss: XX M RMB.
- SLA dropped from 99.95 % to 99.87 % (75 min outage).
---
## Improvement Measures
### Short‑term (≤1 week)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Verify config rollback has no side effects | Li Si | 2024‑01‑16 | Done |
| Add connection‑pool alert (threshold 80 %) | Zhang San | 2024‑01‑17 | In‑progress |
| Compile high‑risk config list | Ops team | 2024‑01‑18 | Pending |
### Long‑term (≤1 month)
| Action | Owner | Due | Status |
|--------|-------|-----|--------|
| Implement prod/test config isolation | Architecture | 2024‑02‑01 | Pending |
| Establish change‑review process | Ops | 2024‑02‑15 | Pending |
| Enable gray‑release for config changes | Development | 2024‑02‑28 | Pending |
---
## Lessons Learned
### What went well
- Alert ACK within 2 min.
- Quick escalation to DBA.
- Controlled rollback restored service.
### Areas to improve
- Strengthen configuration‑management workflow.
- Require dual‑approval for core config changes.
- Improve connection‑pool monitoring.
---
## Follow‑up
- Verify short‑term actions by 2024‑01‑20.
- Review environment isolation progress by 2024‑02‑01.
- Ensure all improvements are in place by 2024‑03‑01.
---
Key Takeaways
Graded alerts enable appropriate response times and resource allocation.
Standardised checklists and runbooks reduce human error and speed up MTTR.
War‑room coordination provides a clear command structure for high‑severity incidents.
Escalation ladder ensures issues are handed to the right expertise level.
Structured handoff guarantees continuity across shift changes.
Post‑mortem analysis (5‑Why, improvement tickets) drives continuous reliability improvement.
Implementation Recommendations
Start with the alert grading pyramid and the response checklist; integrate them into your existing alerting system.
Adopt the escalation ladder and define clear on‑call rotation schedules.
Document critical runbooks for top‑priority services and store them in a searchable knowledge base; a minimal runbook skeleton is sketched after this list.
Run regular war‑room drills to validate the process and refine role responsibilities.
Schedule monthly post‑mortem reviews and track improvement tickets to close the loop.
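For those runbooks, here is a minimal skeleton in the same Markdown style as the templates above; the section names and example rows are suggestions to adapt per service, not a prescribed format.
# Runbook: <service name>
**Owner**: <team / on‑call rotation>
**Last reviewed**: <date>
## 1. Service overview
- What the service does and which business flows depend on it
## 2. Dashboards and logs
- Monitoring dashboard: <link>
- Log query entry point: <link or saved query>
## 3. Common alerts and first actions
| Alert | Likely cause | Stop‑bleed action |
|-------|--------------|-------------------|
| High 5xx rate | Bad release or dependency failure | Roll back the latest release |
| Connection pool exhausted | Config error or traffic spike | Restart with last known‑good config |
## 4. Escalation
- Secondary support: <name / schedule>
- Related teams: <DBA, network, security, ...>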
By following this end‑to‑end on‑call framework, teams can achieve faster MTTR, lower repeat‑failure rates, and a culture of continuous reliability improvement.