SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

MaGe Linux Operations

Oct 16, 2025

Introduction

In micro‑service architectures, a single point of failure can trigger a global avalanche within minutes. An e‑commerce platform once suffered a cascade when a slow database query in its coupon service caused more than 20 downstream services to time out, dropping peak QPS from 30 k to 200 and costing more than ¥8 million in GMV.

This article targets SREs and senior operations engineers, offering a complete post‑mortem of that avalanche and a systematic methodology covering alert response, root‑cause identification, emergency recovery, and capacity planning. Readers will obtain reusable diagnostic commands, layered SOPs, capacity‑baseline generation via load testing, and 20 best‑practice recommendations for high‑traffic distributed systems.

Technical Background

Nature and Propagation of Service Avalanches

Service Avalanche refers to a chain reaction where the failure of one service propagates through synchronous call chains, eventually rendering the entire system unavailable. Its core mechanisms are:

Resource Exhaustion Amplification : Upstream services accumulate requests because downstream responses are slow, filling thread/connection pools and blocking new traffic.

Failure Propagation Path : Synchronous RPC calls (gRPC/Dubbo/HTTP) without timeouts or circuit breakers spread the fault upstream.

Avalanche Triple‑Factor : High coupling (dependency depth ≥ 3), lack of isolation (shared thread pools), and traffic spikes (> 30 % above baseline).
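The resource-exhaustion mechanism can be made concrete with Little's law (average in-flight requests = arrival rate × time in system). The sketch below is illustrative only: the 1,000 QPS load and the 200-thread pool are hypothetical numbers, not figures from the incident.

```python
def required_threads(qps: float, latency_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x time in system."""
    return qps * latency_s


def pool_saturated(qps: float, latency_s: float, pool_size: int) -> bool:
    """True when steady-state concurrency would exceed the worker pool."""
    return required_threads(qps, latency_s) > pool_size


if __name__ == "__main__":
    # Hypothetical upstream service: 1,000 QPS, 200 worker threads.
    # Only the downstream latency changes between the three cases.
    for latency in (0.05, 0.2, 2.0):
        need = required_threads(1000, latency)
        full = pool_saturated(1000, latency, 200)
        print(f"downstream latency {latency}s -> ~{need:.0f} busy threads, saturated={full}")
```

The point of the sketch is that traffic does not need to grow for a pool to fill: a downstream slowdown alone multiplies the number of threads held at any moment.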

MTTR for avalanche incidents is reportedly 5‑10 times that of a single‑point failure, because the fault area is larger and root‑cause analysis requires cross‑service collaboration. The Google SRE Book treats this failure class in its chapter on addressing cascading failures.

Layered Alert Design

A standard alert system should cover four layers (based on Prometheus best practices):

L1‑Business : Metrics (success rate, latency, traffic); example threshold (P99 latency > 500 ms for 2 min); response within 5 min.

L2‑Application : Metrics (JVM heap, thread count, GC duration); example threshold (Old GC pause > 5 s or > 10 Old GCs/min); response within 10 min.

L3‑Middleware : Metrics (Redis connections, MySQL slow queries); example threshold (more than 10 % of queries slower than 100 ms); response within 15 min.

L4‑Infrastructure : Metrics (CPU, memory, disk I/O, network); example threshold (CPU iowait > 30 % for 5 min); response within 30 min.

Key concepts:

SLI (Service Level Indicator) : Quantifiable quality metric, e.g., availability = successful requests / total requests.

SLO (Service Level Objective) : Target value for an SLI, e.g., monthly availability ≥ 99.95 %.

Error Budget : The amount of unreliability the SLO permits, e.g., 21.6 min of downtime per 30‑day month at 99.95 % availability (43,200 min × 0.05 %).
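The error-budget arithmetic can be checked with a few lines of Python. This is a minimal sketch that assumes a 30-day window and treats the budget purely as downtime minutes; the function names are illustrative.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of downtime an availability SLO allows over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo)


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once it is overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    print(f"99.95% over 30 days allows {error_budget_minutes(0.9995):.1f} min of downtime")
    print(f"after 10.8 min of downtime, {budget_remaining(0.9995, 10.8):.0%} of the budget is left")
```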

Capacity Planning and Load‑Test Baseline

A capacity baseline defines the performance envelope under normal load, obtained through end‑to‑end load testing. Core metrics include:

QPS Upper Limit : Maximum stable requests per second while CPU < 70 % and P99 latency stays below threshold.

Concurrent Connections : Maximum TCP connections handled by Nginx/Tomcat.

Resource Watermarks : Safe thresholds for CPU, memory, and network bandwidth (typically 70 % triggers scaling alerts).
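One way to turn load-test output into a QPS upper limit is to walk the test steps in ascending order and stop at the first step that breaks a limit. The sketch below assumes a stepped load test; the LoadStep fields and sample numbers are hypothetical, and the 500 ms P99 limit is borrowed from the L1 alert example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class LoadStep:
    """One step of a stepped load test (hypothetical record layout)."""
    qps: int
    cpu_pct: float
    p99_ms: float


def max_stable_qps(steps: Iterable[LoadStep],
                   cpu_limit: float = 70.0,
                   p99_limit_ms: float = 500.0) -> Optional[int]:
    """Highest QPS reached before the first step that violates a limit."""
    best = None
    for step in sorted(steps, key=lambda s: s.qps):
        if step.cpu_pct >= cpu_limit or step.p99_ms >= p99_limit_ms:
            break  # everything above this load is considered unstable
        best = step.qps
    return best


if __name__ == "__main__":
    results = [
        LoadStep(10_000, 35.0, 120.0),
        LoadStep(20_000, 55.0, 210.0),
        LoadStep(30_000, 68.0, 380.0),
        LoadStep(40_000, 82.0, 900.0),
    ]
    print("QPS upper limit:", max_stable_qps(results))
```

Stopping at the first violation, rather than taking the maximum passing step, avoids crediting a lucky pass at a load level above one that already failed.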

Recommended load‑testing tools: JMeter 5.5 (Java apps), wrk2 (HTTP benchmark), Ali‑PTS (full‑link testing platform).

Core Content

Environment Prerequisites

The case study assumes the following production environment (adjust parameters for different versions):

Ubuntu 22.04 LTS (kernel 5.15.0‑91‑generic)

RHEL 9.2 (kernel 5.14.0‑284.el9.x86_64)

Kernel tuning (see code block below)

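# Show the current kernel version
uname -r

# Key kernel parameters (/etc/sysctl.conf); comments must sit on their own lines
# Accept (full-connection) queue limit
net.core.somaxconn = 32768
# SYN (half-open) queue limit
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
vm.swappiness = 10

# Apply the settings from /etc/sysctl.conf
sysctl -p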

Monitoring stack versions:

Prometheus 2.40.5

Grafana 9.5.2

Node Exporter 1.5.0

cAdvisor 0.47.0

AlertManager 0.25.0

Alert Rule Configuration

Example Prometheus rule file (/etc/prometheus/rules/service-avalanche.yml):

groups:
- name: service_avalanche_detection
  interval: 30s
  rules:
  # L1 business alert: error rate > 1%
  - alert: HighErrorRate
    expr: (sum(rate(http_requests_total{code=~"5.."}[2m])) by (service) /
           sum(rate(http_requests_total[2m])) by (service)) > 0.01
    for: 2m
    labels:
      severity: critical
      layer: business
    annotations:
      summary: "Service {{ $labels.service }} error rate exceeds 1%"
      runbook: "https://wiki.company.com/runbook/high-error-rate"

  # L2 application alert: thread‑pool exhaustion
  - alert: ThreadPoolExhausted
    expr: tomcat_threads_busy_threads / tomcat_threads_max_threads > 0.9
    for: 3m
    labels:
      severity: warning
      layer: application
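The interaction between the group interval (30s) and the for clause (2m) in the rules above is easy to misread. The following is a simplified model of the pending-to-firing transition, not Prometheus code: it assumes evenly spaced evaluations and ignores missing samples and other rule options.

```python
from typing import List


def alert_states(values: List[float],
                 threshold: float = 0.01,
                 interval_s: int = 30,
                 for_s: int = 120) -> List[str]:
    """State after each evaluation: 'inactive', 'pending' or 'firing'."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, value in enumerate(values):
        now = i * interval_s
        if value > threshold:
            if active_since is None:
                active_since = now
            held = now - active_since
            states.append("firing" if held >= for_s else "pending")
        else:
            active_since = None  # any dip below the threshold resets the clock
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Error rate stays at 2% for five consecutive 30-second evaluations.
    print(alert_states([0.02] * 5))
```

In this model the condition must hold across five consecutive evaluations (0 s to 120 s) before the alert fires, and a single healthy evaluation starts the wait over.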

Validate rules and reload Prometheus:

# Check syntax
promtool check rules /etc/prometheus/rules/service-avalanche.yml

# Reload without restart (requires Prometheus to be started with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload

# List firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

Full Fault‑Localization Workflow

Step 1 – Identify Affected Service

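# Count currently firing alerts (a sudden jump across many services suggests an avalanche)
curl -s http://prometheus:9090/api/v1/alerts | \
  jq '[.data.alerts[] | select(.state=="firing")] | length'
# Group firing alerts by their service label to see which services are affected
curl -s http://prometheus:9090/api/v1/alerts | \
  jq -r '.data.alerts[] | select(.state=="firing") | .labels.service // "unknown"' | sort | uniq -c | sort -rn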
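The same check can be scripted once the JSON from /api/v1/alerts is in hand. The sketch below works on an already-parsed response; the service label and the min_services cut-off of 3 are assumptions chosen for illustration, not values from the incident.

```python
from collections import Counter


def firing_by_service(payload: dict) -> Counter:
    """Count firing alerts per 'service' label in a parsed /api/v1/alerts response."""
    counts = Counter()
    for alert in payload.get("data", {}).get("alerts", []):
        if alert.get("state") == "firing":
            service = alert.get("labels", {}).get("service", "unknown")
            counts[service] += 1
    return counts


def looks_like_avalanche(payload: dict, min_services: int = 3) -> bool:
    """Heuristic: several distinct services firing at once (threshold is hypothetical)."""
    return len(firing_by_service(payload)) >= min_services


if __name__ == "__main__":
    sample = {"data": {"alerts": [
        {"state": "firing", "labels": {"service": "order-service"}},
        {"state": "firing", "labels": {"service": "coupon-service"}},
        {"state": "pending", "labels": {"service": "cart-service"}},
    ]}}
    print(firing_by_service(sample), looks_like_avalanche(sample))
```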

Step 2 – Access Faulty Node

# Get node IP of high‑latency pod
kubectl describe pod order-service-7d9f8b-xkz2p -n production | grep "Node:"

# SSH into node (example 10.0.1.23)
ssh <USER>@10.0.1.23

# Find Java process PID
ps aux | grep java | grep order-service

# Monitor resources
top -p <PID>

# Dump thread stack to locate blockage
sudo jstack <PID> | grep -A 20 "BLOCKED\|WAITING"

Emergency Recovery Operations

Operation 1 – Temporary Coupon Feature Rollback

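# Disable coupon feature via configuration center (Apollo Open API shown; Nacos uses a different API)
# Updating an existing item uses PUT; <OPERATOR> is the Apollo user recorded as the modifier
curl -X PUT 'http://apollo-config:8080/openapi/v1/envs/PRD/apps/order-service/clusters/default/namespaces/application/items/feature.coupon.enabled' \
  -H 'Authorization: <TOKEN>' \
  -H 'Content-Type: application/json;charset=UTF-8' \
  -d '{"key":"feature.coupon.enabled","value":"false","comment":"Avalanche emergency downgrade","dataChangeLastModifiedBy":"<OPERATOR>"}'

# Verify change (effective within about 10 s once the namespace change has been published)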
curl http://order-service:8080/actuator/env | jq '.propertySources[] | select(.name=="Apollo") | .properties."feature.coupon.enabled"'

Operation 2 – Restart Blocked Service

# Rolling restart in Kubernetes (avoid full restart)
kubectl rollout restart deployment/order-service -n production

# Watch rollout status
kubectl rollout status deployment/order-service -n production

# Verify P99 latency returns to normal (-G with --data-urlencode keeps the PromQL URL-safe)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[2m])))' | \
  jq '.data.result[0].value[1]'
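When reading the P99 value returned above, it helps to remember that histogram_quantile estimates the quantile by interpolating inside a bucket, so the result is only as precise as the bucket boundaries. The sketch below is a simplified re-implementation of that idea for classic cumulative buckets, written for illustration; the bucket bounds and counts are made up.

```python
import math
from typing import List, Tuple


def estimate_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by bound and end with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the +Inf bucket: report the last finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    print("estimated P95 (s):", round(estimate_quantile(0.95, sample), 3))
```

If most requests land in a single wide bucket, the estimate can sit far from the true latency, which is worth keeping in mind before declaring recovery on one number.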

Practical Case Study

Business Background and Bottlenecks

During a “Black Friday” promotion, a cross‑border e‑commerce platform expected three‑fold traffic (peak QPS ≈ 45 k). The main issues were:

No capacity baseline – only daily traffic estimates, no full‑link load test.

Static alert thresholds – e.g., CPU > 80 % ignored business‑level fluctuations.

Fragile dependencies – coupon service lacked circuit breaking and had unoptimized DB queries.

Design and Optimization Measures

Database Optimization

# Shell: summarize the 10 worst queries from the slow log
pt-query-digest --limit 10 /var/log/mysql/slow.log

-- Add targeted index
ALTER TABLE coupons
  ADD INDEX idx_user_status_expire (user_id, status, expire_time)
  ALGORITHM=INPLACE, LOCK=NONE;

-- Verify index usage
EXPLAIN SELECT * FROM coupons WHERE user_id=12345 AND status='active'\G

Cache Warm‑up and Degradation Switch

#!/bin/bash
# Redis cache warm-up script (run 1 h before promotion)
set -euo pipefail

REDIS_HOST="redis-cluster.prod"
MYSQL_HOST="mysql-primary.prod"

# -N drops the header row, -B prints tab-separated rows; <DATABASE> is a placeholder
mysql -h "$MYSQL_HOST" -uapp -p<PASSWORD> -N -B -D <DATABASE> -e \
  "SELECT sku_id, name, price, stock FROM products ORDER BY sales DESC LIMIT 1000" | \
while IFS=$'\t' read -r sku name price stock; do
  # Cache each product for 1 h; assumes names contain no quotes or backslashes
  redis-cli -h "$REDIS_HOST" SETEX "product:${sku}" 3600 \
    "{\"name\":\"$name\",\"price\":$price,\"stock\":$stock}"
done

Nginx Rate‑Limiting

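# Both zones are declared in the http {} context
# Limit per-IP QPS
limit_req_zone $binary_remote_addr zone=perip:10m rate=50r/s;

# Limit per-URI QPS; $uri excludes the query string, so all calls to one path share a bucket
limit_req_zone $uri zone=peruri:10m rate=5000r/s;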

server {
  location /api/order/create {
    limit_req zone=perip burst=10 nodelay;
    limit_req zone=peruri burst=1000 nodelay;
    limit_req_status 429;
    proxy_pass http://order-service;
  }
}
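How rate, burst, and nodelay interact in the configuration above is easier to see in a small model. The following is a simplified leaky-bucket sketch loosely modeled on that behavior; it is not nginx's implementation, and the class and method names are invented for illustration.

```python
from typing import Optional


class LeakyBucket:
    """Simplified per-key limiter: 'excess' drains at `rate` requests per second."""

    def __init__(self, rate_per_s: float, burst: int) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.excess = 0.0
        self.last: Optional[float] = None

    def allow(self, now: float) -> bool:
        """Return True to pass the request immediately (nodelay), False to reject."""
        if self.last is None:
            self.last = now  # first request for this key starts with no excess
            return True
        drained = self.rate * (now - self.last)
        excess = max(self.excess - drained, 0.0) + 1.0
        if excess > self.burst:
            return False  # rejected requests do not update the bucket
        self.excess, self.last = excess, now
        return True


if __name__ == "__main__":
    bucket = LeakyBucket(rate_per_s=50, burst=10)
    passed = sum(bucket.allow(0.0) for _ in range(15))
    print("requests passed out of 15 arriving at once:", passed)
    print("request one second later passes:", bucket.allow(1.0))
```

In this model a client can send a short spike above the steady rate, but once the burst allowance is used up, further requests are rejected until the bucket drains.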

Best‑Practice Checklist

Pre‑Incident (10 items)

Establish capacity baseline with quarterly full‑link load tests.

Implement four‑layer alert system with distinct SLA response times.

Use dynamic thresholds (Prometheus predict_linear) to avoid static false alarms.

Configure circuit breakers and degradation switches for all external dependencies.

Map service call topology, annotate timeout and retry policies.

Automate load testing in CI/CD pipelines.

Standardize canary releases (1 % → 10 % → 50 % → 100 %).

Enforce change‑window policy (core services only during low‑traffic periods).

Maintain runbook documentation for each alert.

Conduct monthly chaos‑engineering drills.
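The dynamic-threshold item in the checklist above relies on linear extrapolation: fit a line to recent samples and alert on where it is heading rather than where it is now. The sketch below shows the idea with an ordinary least-squares fit; it is an illustration rather than a reproduction of Prometheus's predict_linear, and the sample data is made up.

```python
from typing import List, Tuple


def extrapolate_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Least-squares line through (timestamp_s, value) samples, evaluated
    `seconds_ahead` after the last sample."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return slope * (samples[-1][0] + seconds_ahead) + intercept


if __name__ == "__main__":
    # Hypothetical disk usage (%) sampled once a minute.
    usage = [(0, 70.0), (60, 72.0), (120, 74.5), (180, 76.0)]
    projected = extrapolate_linear(usage, seconds_ahead=600)
    print(f"projected usage in 10 min: {projected:.1f}%")
```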

During Incident (5 items)

Tiered response: P0 (avalanche) within 5 min, P1 within 15 min, P2 within 30 min.

Stop‑gap first – degrade, rate‑limit, or restart; root‑cause later.

Multi‑dimensional diagnosis: Prometheus, ELK logs, Jaeger tracing, DB slow‑query.

Freeze non‑emergency changes.

Provide progress updates to business owners every 30 min.

Post‑Incident (5 items)

Hold a blameless post‑mortem within 72 h, produce 5‑Why analysis and improvement actions.

Retain all monitoring data for at least 90 days.

Automate manual steps (auto‑scaling, auto‑degradation).

Update runbooks and wiki; incorporate case into new‑hire training.

Adjust release cadence based on error‑budget consumption.

Summary and Outlook

The core challenge of service avalanches is that fault propagation outpaces human response. This guide demonstrates a full lifecycle: pre‑emptive capacity baselines and layered alerts, rapid in‑flight mitigation through degradation and restarts, and systematic post‑mortem to eliminate repeat failures.

Future directions include AI‑driven self‑healing, regular chaos engineering, eBPF‑based kernel observability, and full service‑mesh adoption (Istio) for unified traffic control, circuit breaking, and retries.

FAQ

Q1: How to set reasonable circuit‑breaker thresholds?

Use historical percentile values: set the break threshold at 1.5 × P99 latency from the past 30 days; error‑rate threshold at three times the daily average. Validate in a load‑test environment to avoid false trips.
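The rule of thumb above translates directly into a small calculation. This sketch assumes the 30 days of latency samples and the daily error rates have already been exported from monitoring; the nearest-rank percentile and the function names are illustrative choices.

```python
from typing import Dict, Sequence


def p99_nearest_rank(values: Sequence[float]) -> float:
    """99th percentile by the nearest-rank method."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


def breaker_thresholds(latencies_ms: Sequence[float],
                       daily_error_rates: Sequence[float]) -> Dict[str, float]:
    """Latency trip point = 1.5 x P99; error-rate trip point = 3 x daily average."""
    avg_error = sum(daily_error_rates) / len(daily_error_rates)
    return {
        "latency_ms": 1.5 * p99_nearest_rank(latencies_ms),
        "error_rate": 3.0 * avg_error,
    }


if __name__ == "__main__":
    # Hypothetical history: latencies of 1..100 ms and three daily error rates.
    print(breaker_thresholds(list(range(1, 101)), [0.001, 0.002, 0.003]))
```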

Q2: During an avalanche, which features should be degraded?

Prioritize by business value and technical complexity: keep order/payment, downgrade coupons to a fixed discount, and fully disable recommendation/comments.

Q3: How to avoid impacting production when load‑testing?

Use an isolated test environment with a sanitized data copy. If testing in production, run during low‑traffic hours, limit load to < 10 % of live traffic, and tag requests with a special header so that backends can identify test traffic and discard its writes.

Q4: How to roll back a faulty Kubernetes version quickly?

kubectl rollout undo deployment/order-service --to-revision=3 -n production

Keep at least the last ten revisions and their images available (the Deployment default for revisionHistoryLimit is 10) so that a rollback can complete in under three minutes.

Q5: Best practices for distributed tracing?

Sample 1 % of traces under normal conditions and automatically raise the rate to 100 % when an alert fires. Record business‑level fields in spans for easy filtering. Alert automatically when the P99 latency of traced requests exceeds the threshold.
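The sampling policy can be sketched as a deterministic head-sampling decision keyed on the trace ID. This is a minimal illustration that assumes integer trace IDs; real tracing SDKs expose their own sampler interfaces, which are not shown here.

```python
def sample_rate(alert_firing: bool, base_rate: float = 0.01) -> float:
    """1 % of traces normally, 100 % while an alert is firing."""
    return 1.0 if alert_firing else base_rate


def should_sample(trace_id: int, rate: float) -> bool:
    """Keep a trace when its ID falls into the sampled share of 10,000 slots.

    Deriving the decision from the trace ID means every service that sees
    the same trace reaches the same answer.
    """
    slots = int(round(rate * 10_000))
    return (trace_id % 10_000) < slots


if __name__ == "__main__":
    normal = sum(should_sample(i, sample_rate(False)) for i in range(10_000))
    incident = sum(should_sample(i, sample_rate(True)) for i in range(10_000))
    print("kept out of 10,000 traces:", normal, "normally;", incident, "during an alert")
```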

Q6: How to evaluate fault‑handling effectiveness?

Key metrics: MTTD < 2 min, MTTR < 15 min, impact < 5 % of traffic, repeat‑failure rate ≈ 0. Track trends to gauge team improvement.
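These metrics are simple to compute once each incident record carries three timestamps. The sketch below assumes MTTD is measured from fault start to detection and MTTR from fault start to recovery; teams that measure MTTR from detection would need to adjust it. The Incident record is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record with the three timestamps needed."""
    started: datetime
    detected: datetime
    resolved: datetime


def mttd_minutes(incidents: List[Incident]) -> float:
    """Mean time to detect: fault start to first alert."""
    total = sum((i.detected - i.started).total_seconds() for i in incidents)
    return total / len(incidents) / 60


def mttr_minutes(incidents: List[Incident]) -> float:
    """Mean time to recover: fault start to full recovery."""
    total = sum((i.resolved - i.started).total_seconds() for i in incidents)
    return total / len(incidents) / 60


if __name__ == "__main__":
    history = [
        Incident(datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 2), datetime(2025, 1, 1, 10, 20)),
        Incident(datetime(2025, 2, 1, 9, 0), datetime(2025, 2, 1, 9, 4), datetime(2025, 2, 1, 9, 10)),
    ]
    print(f"MTTD {mttd_minutes(history):.1f} min, MTTR {mttr_minutes(history):.1f} min")
```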
