Tagged articles
9 articles
MaGe Linux Operations
Oct 16, 2025 · Operations

SRE Playbook: From Alert to Full Recovery of Service Avalanches

This comprehensive SRE guide walks through a real-world service avalanche incident, detailing alert triggering, root‑cause analysis, step‑by‑step recovery, capacity baseline creation, layered alert design, automated scripts, and post‑mortem best practices to help engineers prevent and resolve large‑scale outages.

Alerting · SRE · Service Avalanche
20 min read
Efficient Ops
Oct 23, 2023 · Operations

Why Redis Failed: Jedis Misconfigurations That Spark Service Avalanches

This article examines a Redis 3.x cluster failure caused by a master‑slave switch, detailing how improper Jedis timeout and retry settings triggered a service avalanche, and provides step‑by‑step analysis of the incident, code paths, and recommended configuration adjustments to prevent recurrence.

Jedis · Service Avalanche · connection timeout
12 min read
Sohu Tech Products
Aug 23, 2023 · Backend Development

Analysis of Service Avalanche Caused by Jedis Parameter Misconfiguration During Redis Cluster Failover

During a Redis 3.x cluster master‑slave failover, the default Jedis connection timeout of two seconds, combined with six automatic retries, caused the Redis calls in each request to accumulate up to sixty seconds of latency. This triggered Nginx timeouts and a service avalanche, which was resolved by lowering the timeout and retry settings.

Cluster Failover · Connection Retry · Jedis
13 min read
vivo Internet Technology
Jul 19, 2023 · Databases

Analysis of Service Avalanche Caused by Misconfigured Jedis Parameters During Redis Cluster Master‑Slave Switch

A service‑wide avalanche occurred when a Redis 3.x master‑slave failover coincided with Jedis’ default 2‑second connection timeout and six retry attempts, causing latencies of up to 60 seconds. Setting connectionTimeout and soTimeout to 100 ms and reducing maxAttempts to two limited latency to about one second and prevented cascading failures.

Cluster · Connection Retry · Jedis
13 min read
macrozheng
Nov 12, 2020 · Operations

Red Cliffs Battle: Lessons on Service Avalanche and Circuit Breakers

Using the historic Red Cliffs battle as a metaphor, this article explains how linked services can cause a cascading failure, known as a service avalanche, in microservice architectures. It then details prevention techniques such as rate limiting, isolation, and especially circuit breakers, including their principles and recovery algorithms.

Service Avalanche · circuit breaker · system reliability
13 min read
Wukong Talks Architecture
Oct 28, 2020 · Operations

From the Battle of Red Cliffs to Service Avalanche: Understanding Circuit Breaker and Resilience in Microservices

This article uses the historic Battle of Red Cliffs as an analogy to explain service avalanche in micro‑service architectures, analyzes its causes, presents real‑world scenarios, and details circuit‑breaker concepts, algorithms, recovery strategies, and practical mitigation techniques.

Resilience · Service Avalanche · circuit breaker
10 min read