How to Prevent Avalanche Failures in Large‑Scale Microservice Systems
This article explains how Baidu's SRE team identified the root causes of avalanche failures in massive microservice architectures, modeled system limits with Little’s Law, and implemented engineering practices such as retry budgets, queue throttling, and global TTL controls to achieve self‑healing and eliminate avalanche incidents.
At SREcon25 in Dublin, Baidu's Cloud Operations and Search Architecture teams jointly presented "Preventing Avalanche Failures in Large‑Scale Microservice Systems," showcasing their work on microservice stability, system‑wide crash‑prevention mechanisms, and resilient architecture design, work that earned international recognition.
1. From Flexibility to Fragility: Avalanche Risk in Complex Microservices
Unpredictable boundary behavior: Coupled mechanisms interact in ways that are hard to predict under burst traffic.
Cascading capacity risk: A single service failure can amplify along the call chain.
Side effects of high‑availability mechanisms: Features such as retries, added to improve availability, can multiply load in extreme cases.
2. An Avalanche Is Not Sudden – It Is the Inevitable Result of a Non‑Steady State
System enters a non‑steady state: Surface metrics appear normal while the system drifts toward a critical point.
Disturbance triggers the avalanche: Minor disturbances (traffic jitter, network jitter, cache misses, small faults) push the system past the critical point into an irreversible death spiral.
Avalanche development: Positive feedback loops (availability drops → retries → load rises → availability drops further) accelerate the collapse; the sketch after this list simulates the loop.
Complete avalanche: Effective throughput plummets, and the system cannot recover without external intervention.
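To see why the loop is irreversible, consider a minimal simulation (the capacity, demand, and retry figures are assumptions for illustration, not Baidu's measurements): a one‑tick burst pushes offered load past capacity, and a fixed‑count retry policy keeps it there.

```python
CAPACITY = 1000.0        # requests/s the service can complete (assumed)
DEMAND = 950.0           # steady fresh traffic, just under capacity
RETRIES_PER_FAILURE = 2  # naive fixed retry policy

retry_load = 0.0
for tick in range(8):
    burst = 200.0 if tick == 2 else 0.0          # one transient disturbance
    offered = DEMAND + burst + retry_load        # fresh + burst + retry traffic
    success_rate = min(1.0, CAPACITY / offered)  # excess load simply fails
    failed = offered * (1.0 - success_rate)
    retry_load = failed * RETRIES_PER_FAILURE    # failures return as retries
    print(f"t={tick}: offered={offered:6.0f}/s  success={success_rate:6.1%}")
```

The disturbance lasts a single tick, yet offered load keeps climbing after it ends; the retry feedback, not the original burst, is what collapses the service. This is exactly the behavior the retry budget in Section 5 is designed to bound.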
3. Theoretical Model: System Throughput Limit
Using Little’s Law, Baidu built a throughput‑constraint model in which each microservice’s maximum request rate is determined by its thread concurrency and per‑request latency. When downstream latency rises while threads are saturated, the achievable RPS ceiling drops, pushing the entire chain into an unstable feedback zone. The model extends to a three‑layer structure (request queue + worker threads + backend dependencies) that applies to deep scheduling chains.
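In standard Little’s Law form (a textbook identity; the numbers below are illustrative, not from the talk), a service with N worker threads and mean request latency W can sustain at most:

```latex
% Little's Law: L = \lambda W, where L is in-flight requests, \lambda is
% throughput, and W is latency. With L capped by N worker threads:
\lambda_{\max} = \frac{N}{W}
% Example: N = 200 threads, W = 50\,\mathrm{ms} \Rightarrow \lambda_{\max} = 4000~\mathrm{RPS}.
% If a slow dependency raises W to 200\,\mathrm{ms}, \lambda_{\max} falls to
% 1000~\mathrm{RPS}; unchanged inbound traffic then accumulates as queueing and timeouts.
```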
4. Microscopic View of the Avalanche Process
In a typical call chain (gateway → Service A → Service B → Service C), a latency increase in Service C causes simultaneous thread‑utilization spikes in A and B, queue buildup, and a worsening feedback loop that drives the system to complete collapse within seconds.
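A back‑of‑the‑envelope application of the same Little’s Law identity (hypothetical traffic rates and pool sizes) shows why A and B saturate together even though only C slowed down:

```python
# Busy threads at each hop = arrival rate x end-to-end latency seen at that
# hop (Little's Law). All numbers are illustrative, not from the talk.
RPS = 2000.0                                        # requests/s through the chain
POOL = 400                                          # worker threads per service (assumed)
own_latency = {"A": 0.010, "B": 0.010, "C": 0.030}  # seconds of local work

def busy_threads(c_latency):
    """Threads occupied at A, B, C when C's total latency is c_latency seconds."""
    lat_c = c_latency
    lat_b = own_latency["B"] + lat_c   # B holds a thread while waiting on C
    lat_a = own_latency["A"] + lat_b   # A holds a thread while waiting on B
    return {"A": RPS * lat_a, "B": RPS * lat_b, "C": RPS * lat_c}

for c in (0.030, 0.100, 0.200):        # C slows from 30 ms to 200 ms
    usage = busy_threads(c)
    report = ", ".join(f"{s}={v:4.0f}/{POOL}" for s, v in usage.items())
    print(f"C latency {c * 1000:3.0f} ms -> busy threads: {report}")
```

At 200 ms, A and B exceed their 400‑thread pools at the same moment without any change of their own; C’s latency propagates upstream as thread occupancy, and the overflow becomes queue buildup.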
5. Anti‑Avalanche Engineering – Let the System Self‑Heal
Early Warning: Detecting Non‑Steady State
Multi‑layer monitoring tracks full‑link failure counts, latency distribution, queue length, thread usage, and P95/P99 latency with second‑level granularity, feeding an anomaly‑detection model for "seconds‑level detection, minutes‑level response".
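As one illustration of the detection layer, a second‑granularity tail‑latency check might look like the sketch below. The window size, threshold, and structure are assumptions for this example, not Baidu’s implementation.

```python
import math
from collections import deque

WINDOW_S = 60          # keep the last 60 one-second buckets (assumed)
P99_LIMIT_MS = 500.0   # alert threshold for tail latency (assumed)

buckets = deque(maxlen=WINDOW_S)  # each entry: latencies (ms) seen in one second

def p99(samples):
    """P99 by nearest-rank over the window."""
    ordered = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(ordered)) - 1)
    return ordered[idx]

def on_second_tick(latencies_ms):
    """Feed one second of latencies; return True if the system looks non-steady."""
    buckets.append(latencies_ms)
    window = [x for bucket in buckets for x in bucket]
    return bool(window) and p99(window) > P99_LIMIT_MS
```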
Core Interventions
Retry Budget: A global retry‑budget pool distinguishes direct from indirect retries; once the budget is exhausted, requests fail fast, turning potentially exponential retry traffic into linear growth (see the sketch after this list).
Queue Throttling: Prioritized request queues keep only high‑priority tasks during congestion, combined with adaptive rate limiting and timeout‑based clearing of stale entries.
Global TTL Control: Each request carries a TTL that is decremented along the call chain; when it expires, further processing is aborted, so no service wastes work on a request the client has already given up on (see the TTL sketch below).
Multi‑Dimensional Intervention: When key metrics (P99 latency, failure rate, thread usage) breach thresholds, automatic actions are triggered, such as cross‑IDC traffic shifting, internal traffic clipping, service policy trimming, and dynamic timeout reduction, achieving "second‑level decision + automatic execution".
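A minimal retry‑budget sketch follows; the 10% ratio and the token‑bucket shape are common industry practice and assumptions here, not necessarily Baidu’s exact design.

```python
import threading

class RetryBudget:
    """Admit retries only in proportion to recent first-attempt traffic.

    Illustrative sketch: the ratio and cap are assumed values, not
    Baidu's production parameters.
    """

    def __init__(self, ratio=0.1):
        self.ratio = ratio     # retries allowed per original request
        self.budget = 0.0
        self.lock = threading.Lock()

    def record_request(self):
        """Called on every first attempt; grows the budget."""
        with self.lock:
            self.budget = min(self.budget + self.ratio, 100.0)  # cap the pool

    def try_acquire_retry(self):
        """Return True if a retry may be sent; otherwise the caller fails fast."""
        with self.lock:
            if self.budget >= 1.0:
                self.budget -= 1.0
                return True
            return False  # budget exhausted: surface the error immediately
```

Because at most `ratio` retries are admitted per first attempt, retry traffic stays a bounded fraction of real traffic at every hop instead of compounding, which is what turns the exponential curve into a linear one.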
The design philosophy is to control, not eliminate, feedback intensity.
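As one concrete instance of that philosophy, the TTL mechanism can be pictured as a deadline that travels with the request. The sketch below is illustrative: the header name and call signature are invented for this example, not Baidu’s API.

```python
import time

TTL_HEADER = "x-request-ttl-ms"  # hypothetical header name, for illustration

def handle(request_headers, downstream_calls):
    """Serve a request, decrementing its TTL before every downstream hop."""
    ttl_ms = float(request_headers.get(TTL_HEADER, 1000.0))
    start = time.monotonic()
    for call in downstream_calls:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        remaining = ttl_ms - elapsed_ms
        if remaining <= 0:
            # Upstream has already timed out; finishing this work would only
            # add load that nobody is waiting for, so abort the chain here.
            raise TimeoutError("request TTL expired; aborting call chain")
        call({TTL_HEADER: str(remaining)})  # propagate the shrunken TTL
```

Every hop therefore works only on requests whose answer someone still wants, removing the zombie load that would otherwise feed the avalanche.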
6. Conclusion
Through systematic governance, Baidu Search has dramatically improved the stability of its massive microservice ecosystem, eliminating avalanche incidents over multiple quarters. The SREcon25 talk highlighted the team’s research outcomes and methodological innovations, positioning Baidu as a leader in large‑scale system reliability.
Future work will continue to explore autonomous operations, stability modeling, and intelligent self‑healing mechanisms in collaboration with the global SRE community.