
How to Master Service Avalanche Recovery: A Complete SRE Playbook from Alert to Restoration

This guide walks SRE and senior operations engineers through a real-world service‑avalanche incident, detailing alert hierarchy design, fault‑location commands, emergency SOPs, capacity‑baseline building, and post‑mortem best practices to dramatically reduce MTTR in distributed micro‑service environments.


Introduction

In micro‑service architectures a single failure can cascade into a full‑system outage within minutes. One e‑commerce platform experienced a service avalanche when a slow query in the coupon service stalled its upstream callers, dropping peak QPS from 30 000 to 200 and causing multi‑million‑yuan losses. The incident exposed three core gaps: delayed alerts, unclear fault‑location paths, and the lack of standardized recovery procedures.

Technical Background

Service Avalanche Mechanism

Failure propagation occurs through synchronous call chains (gRPC/Dubbo/HTTP) that lack timeouts or circuit‑breakers. Key factors are:

Resource‑exhaustion amplification – upstream services accumulate requests while downstream responses are slow, filling thread/connection pools.

Fault‑propagation path – synchronous RPC calls spread the error upstream.

Avalanche triad – deep dependency (≥ 3 layers), no isolation, and traffic spikes > 30 % above baseline.

Google SRE notes that MTTR for avalanche incidents is typically 5‑10× that of isolated failures because multiple services must be analyzed.

Layered Alert Design

A four‑layer alert hierarchy (Prometheus best practice) is recommended:

L1‑Business: monitor success rate, latency, and traffic; trigger when P99 latency > 500 ms for 2 min; respond within 5 min.

L2‑Application: monitor JVM heap, thread count, and GC duration; trigger when an old‑gen GC takes > 5 s or fires > 10 times/min; respond within 10 min.

L3‑Middleware: monitor Redis connections and MySQL slow queries; trigger when queries slower than 100 ms exceed 10 % of all queries; respond within 15 min.

L4‑Infrastructure: monitor CPU, memory, disk I/O, and network; trigger when CPU iowait > 30 % for 5 min; respond within 30 min.

Key concepts: SLI (the indicator you measure, e.g., request success rate), SLO (the target for that indicator, e.g., 99.9 % availability), and error budget (the allowed shortfall, 1 − SLO, which paces release velocity).
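
As a concrete illustration, here is a minimal sketch of an L4 rule matching the iowait threshold above. It assumes node_exporter metrics and a rules directory like the one used later in this guide; adjust the path and labels to your setup.

# Hypothetical L4 infrastructure rule (node_exporter metrics assumed)
cat > /etc/prometheus/rules/l4-infrastructure.yml <<'EOF'
groups:
- name: l4_infrastructure
  rules:
  - alert: HighCpuIowait
    # fraction of CPU time spent in iowait, averaged across cores per node
    expr: avg by (instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) > 0.3
    for: 5m
    labels:
      severity: warning
      layer: infrastructure
EOF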

Capacity Planning & Load‑Testing Baseline

Establish a performance envelope under normal load via full‑link load testing. Track the QPS ceiling, maximum concurrent connections, and resource watermarks (keep CPU/memory/network at roughly 70 % usage). Recommended tools: JMeter 5.5, wrk2, Ali‑PTS.
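
For example, a capacity probe with wrk2 might look like the following; the gateway URL and connection counts are illustrative. wrk2's -R flag holds a constant request rate, so the reported latency percentiles avoid coordinated-omission bias.

# Drive a constant 30 000 req/s for 5 minutes against a hypothetical endpoint
wrk -t16 -c1024 -d300s -R30000 --latency http://gateway.prod/api/order/create
# Watch CPU, memory, and network watermarks on the target nodes during the run
sar -u -r -n DEV 5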

Environment Prerequisites

Operating Systems: Ubuntu 22.04 LTS (kernel 5.15.0‑91‑generic), RHEL 9.2 (kernel 5.14.0‑284.el9.x86_64).

Kernel tuning (add to /etc/sysctl.conf):

net.core.somaxconn = 32768          # TCP listen backlog
net.ipv4.tcp_max_syn_backlog = 8192 # SYN half‑open backlog
net.ipv4.ip_local_port_range = 1024 65535
net.ipv4.tcp_tw_reuse = 1
vm.swappiness = 10
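
Apply the settings without a reboot and spot-check that they took effect:

sudo sysctl -p /etc/sysctl.conf
sysctl net.core.somaxconn net.ipv4.tcp_tw_reuse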

Monitoring stack versions:

Prometheus 2.40.5

Grafana 9.5.2

Node Exporter 1.5.0

cAdvisor 0.47.0

AlertManager 0.25.0

Alert Rule Configuration

Example Prometheus rule file (/etc/prometheus/rules/service-avalanche.yml):

groups:
- name: service_avalanche_detection
  interval: 30s
  rules:
    # L1 business alert – error rate > 1%
    - alert: HighErrorRate
      expr: |
        (sum(rate(http_requests_total{code=~"5.."}[2m])) by (service) /
         sum(rate(http_requests_total[2m])) by (service)) > 0.01
      for: 2m
      labels:
        severity: critical
        layer: business
      annotations:
        summary: "Service {{ $labels.service }} error rate exceeds 1%"
        runbook: "https://wiki.company.com/runbook/high-error-rate"
    # L2 application alert – thread‑pool exhaustion
    - alert: ThreadPoolExhausted
      expr: |
        tomcat_threads_busy_threads / tomcat_threads_max_threads > 0.9
      for: 3m
      labels:
        severity: warning
        layer: application

Validate and reload:

# Syntax check
promtool check rules /etc/prometheus/rules/service-avalanche.yml
# Reload without restart (requires Prometheus to run with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# List currently firing alerts
curl http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

Fault‑Location Process

Step 1 – Identify Affected Service

# Count currently firing alerts; a sudden spike suggests an avalanche
curl -s http://prometheus:9090/api/v1/alerts | \
  jq '[.data.alerts[] | select(.state=="firing")] | length'

Step 2 – Analyze Faulty Node

# Find node IP of a high‑latency pod
kubectl describe pod order-service-7d9f8b-xkz2p -n production | grep "Node:"
# SSH to the node (example 10.0.1.23)
ssh <user>@10.0.1.23
# Locate Java process PID
ps aux | grep java | grep order-service
# Real‑time resource view
top -p PID
# Dump thread stack to find BLOCKED/WAITING frames
sudo jstack PID | grep -A 20 "BLOCKED\|WAITING"
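
If the stack dump shows threads blocked on database calls, cross-check the DB layer directly. A sketch, assuming MySQL and an illustrative host name:

# List queries running longer than 5 s (often the root of BLOCKED frames)
mysql -h mysql-primary.prod -uapp -p -e \
  "SELECT id, time, state, LEFT(info, 80) AS query
   FROM information_schema.processlist
   WHERE command = 'Query' AND time > 5 ORDER BY time DESC;"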

Emergency Recovery Operations

Operation 1 – Temporary Coupon Degradation

# Disable coupon feature via configuration center (Apollo/Nacos)
curl -X POST \
  'http://apollo-config:8080/openapi/v1/envs/PRD/apps/order-service/clusters/default/namespaces/application/items/feature.coupon.enabled' \
  -H 'Authorization: TOKEN' \
  -H 'Content-Type: application/json;charset=UTF-8' \
  -d '{"key":"feature.coupon.enabled","value":"false","comment":"avalanche emergency downgrade"}'
# Verify change (within ~10 s)
curl http://order-service:8080/actuator/env | jq '.propertySources[] | select(.name=="Apollo") | .properties."feature.coupon.enabled"'

Operation 2 – Rolling Restart of Blocked Service

# Kubernetes rolling restart (replaces pods gradually per the rolling-update strategy)
kubectl rollout restart deployment/order-service -n production
# Monitor rollout status
kubectl rollout status deployment/order-service -n production
# Verify P99 latency returns to normal (use --data-urlencode so the PromQL survives the URL)
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="order-service"}[2m])) by (le))' | \
  jq '.data.result[0].value[1]'

Practical Case Study

Business Load & Bottlenecks

During a “Black Friday” promotion the platform expected three times its normal daily traffic (peak QPS ≈ 45 000). Identified gaps:

No full‑link capacity baseline.

Static alert thresholds (e.g., CPU > 80 %) ignored business spikes.

The coupon service lacked a circuit breaker and contained unoptimized DB queries.
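
For reference, one way to close that gap is a declarative breaker on the coupon client. A minimal sketch, assuming Resilience4j on Spring Boot (the instance name and thresholds are illustrative, not the platform's actual values):

cat > application-resilience.yml <<'EOF'
resilience4j:
  circuitbreaker:
    instances:
      couponService:
        slidingWindowSize: 100          # judge health over the last 100 calls
        failureRateThreshold: 50        # open when >50 % of calls fail
        slowCallDurationThreshold: 1s   # calls slower than 1 s count as slow
        slowCallRateThreshold: 50       # open when >50 % of calls are slow
        waitDurationInOpenState: 30s    # half-open probe after 30 s
EOF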

Optimizations

Database Indexing

# Analyse the top slow queries from the MySQL slow log
pt-query-digest /var/log/mysql/slow.log | head -50

-- Add a composite index to accelerate coupon lookups (online DDL)
ALTER TABLE coupons ADD INDEX idx_user_status_expire (user_id, status, expire_time), ALGORITHM=INPLACE, LOCK=NONE;
-- Verify the index is used
EXPLAIN SELECT * FROM coupons WHERE user_id=12345 AND status='active'\G

Cache Warm‑up Script

#!/bin/bash
REDIS_HOST="redis-cluster.prod"
MYSQL_HOST="mysql-primary.prod"
# Load the top-1000 SKUs and cache each for 1 hour
# (-N drops the header row, -B forces tab-separated batch output)
mysql -h "$MYSQL_HOST" -uapp -pPASSWORD -N -B \
  -e "SELECT sku_id, name, price, stock FROM products ORDER BY sales DESC LIMIT 1000" | \
while IFS=$'\t' read -r sku name price stock; do
  redis-cli -h "$REDIS_HOST" SETEX "product:${sku}" 3600 \
    "{\"name\":\"$name\",\"price\":$price,\"stock\":$stock}"
done

Nginx Rate Limiting

# Per‑IP QPS limit
limit_req_zone $binary_remote_addr zone=perip:10m rate=50r/s;
# Per-endpoint QPS cap (one bucket per distinct request URI)
limit_req_zone $request_uri zone=peruri:10m rate=5000r/s;
server {
  location /api/order/create {
    limit_req zone=perip burst=10 nodelay;
    limit_req zone=peruri burst=1000 nodelay;
    limit_req_status 429;
    proxy_pass http://order-service;
  }
}
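
Validate and apply the configuration with the standard nginx commands:

nginx -t          # syntax-check the limit_req changes
nginx -s reload   # graceful reload without dropping connections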

Best‑Practice Checklist

Pre‑Incident (10 items)

Quarterly full‑link load tests to establish capacity baseline (QPS ceiling, latency percentiles, resource watermarks).

Four‑layer alert hierarchy with distinct SLA response times.

Dynamic thresholds using predict_linear in Prometheus (a sketch follows this list).

Circuit‑breakers and degradation switches for all external dependencies.

Service call topology map with timeout and retry policies.

Automate load testing in CI/CD pipelines.

Standardized canary release flow (1 % → 10 % → 50 % → 100 %).

Schedule core‑service changes during low‑traffic windows.

Maintain runbook documentation for each alert type.

Monthly chaos‑engineering drills.
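
For item 3 above, a hedged sketch of a dynamic threshold: predict_linear extrapolates a gauge one hour ahead, so the alert fires before a static ceiling is actually hit. Metric names assume mysqld_exporter; the file path is illustrative.

cat > /etc/prometheus/rules/dynamic-threshold.yml <<'EOF'
groups:
- name: dynamic_thresholds
  rules:
  - alert: ConnPoolExhaustionForecast
    # fire if the 1 h linear forecast crosses 90 % of max_connections
    expr: predict_linear(mysql_global_status_threads_connected[30m], 3600) > 0.9 * mysql_global_variables_max_connections
    for: 10m
    labels:
      severity: warning
EOF
promtool check rules /etc/prometheus/rules/dynamic-threshold.yml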

During‑Incident (5 items)

Tiered response: P0 (avalanche) within 5 min, P1 within 15 min, P2 within 30 min.

Stop the bleeding first (degrade, rate‑limit, restart), then pursue root‑cause analysis.

Multi‑dimensional diagnosis using Prometheus, ELK logs, Jaeger tracing, and DB slow‑query logs.

Freeze non‑essential changes for the incident duration.

Provide status updates to business stakeholders every 30 min.

Post‑Incident (5 items)

Blameless post‑mortem within 72 h; produce a 5‑Whys analysis and tracked action items.

Retain all monitoring data for at least 90 days.

Automate manual steps (auto‑scaling, auto‑degradation).

Update runbooks and internal wiki; incorporate the case into onboarding.

Adjust error‑budget allocation and release cadence based on monthly error‑budget consumption.

Future Directions

Potential evolution includes AI‑driven self‑healing (ML‑based fault prediction and automated remediation), regular chaos‑engineering experiments, eBPF‑based kernel observability to fill monitoring blind spots, and full service‑mesh adoption (e.g., Istio) for unified traffic governance, circuit‑breaking, and retries.

FAQ

How to set circuit‑breaker thresholds?

Use historical percentiles: set the threshold at 1.5 × P99 latency over the past 30 days and error‑rate threshold at three times the daily average. Validate in a load‑test environment to avoid false trips.
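
One hedged way to compute that latency threshold from history, reusing the histogram metric from earlier sections (the service label is illustrative):

# 1.5 × the 30-day P99 latency for the coupon service
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=1.5 * histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{service="coupon-service"}[30d])) by (le))' | \
  jq '.data.result[0].value[1]'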

Which features to degrade during an avalanche?

Prioritize by business value and technical complexity: keep order‑payment always on, downgrade coupons to a fixed discount, and fully disable recommendation/comment features. Define degradation switches and fallback logic in the configuration center beforehand.

How to avoid production impact when load‑testing?

Run tests in an isolated environment with a production‑snapshot (masked) database. If production testing is unavoidable, schedule during low‑traffic windows, limit load to <10 % of live traffic, and tag test requests with a special header for immediate discard.

How to roll back a faulty Kubernetes version quickly?

kubectl rollout undo deployment/order-service --to-revision=3 -n production
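
To pick the right target, list the recorded revisions first:

kubectl rollout history deployment/order-service -n production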

Keep the last ten revisions available for rollback (e.g., revisionHistoryLimit: 10 on the Deployment); a rollback should complete in under three minutes.

Best practices for distributed tracing?

Sample 1 % of traces in normal operation; automatically increase to 100 % when an alert fires. Record business‑level fields in spans for easy filtering. Auto‑trigger alerts when trace P99 latency exceeds the defined threshold.

How to evaluate incident handling effectiveness?

Key metrics: MTTD < 2 min, MTTR < 15 min, impact scope < 5 % of traffic, and repeat‑incident rate ≈ 0. Track trends to assess improvements in response capability.
