How Baidu Netdisk Prevents Service Avalanches: Dynamic Circuit Breaking & Queue Control
This article analyzes Baidu Netdisk's anti‑avalanche architecture, explaining how avalanche cascades occur in high‑concurrency services and detailing practical prevention, blocking, and mitigation techniques such as dynamic circuit breaking, traffic isolation, request‑validity checks, and socket‑level detection to maintain system reliability.
Background
Baidu Netdisk serves over a billion users with daily page views exceeding one hundred billion, operating more than 600,000 instances and thousands of modules. At this scale and concurrency, even a brief anomaly can trigger a cascading failure (an avalanche) that degrades the user experience and does not recover on its own.
How an Avalanche Happens
An avalanche starts when a service cannot handle incoming requests, causing failures that upstream services retry, amplifying the load and creating a feedback loop. The process can be visualized as:
Root cause → Service exception → Local avalanche → Global avalanche
Two stages are identified:
Initial stage: various root causes lead to service overload.
Loop stage: upstream retries cause the service to keep processing invalid requests.
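A typical source of invalid requests is the TCP accept queue: a connection can remain queued on the server even after the client has disconnected, so the server later spends capacity processing a request that nobody is waiting for.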
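A rough model helps show why retries turn an overload into a loop. The sketch below is illustrative only and is not taken from Baidu's data: it assumes every failed attempt is retried up to `max_retries` times and that each attempt fails independently with probability `failure_rate`. The function name and parameters are hypothetical.

```python
def offered_load(base_qps: float, failure_rate: float, max_retries: int) -> float:
    """Expected attempts per second reaching the service, retries included."""
    # Attempt k (0-based) is only made if the previous k attempts all failed.
    attempts_per_request = sum(failure_rate ** k for k in range(max_retries + 1))
    return base_qps * attempts_per_request


# A healthy service sees only the original traffic.
healthy = offered_load(1000, failure_rate=0.0, max_retries=3)
# A fully failing service sees every request four times (1 try + 3 retries).
failing = offered_load(1000, failure_rate=1.0, max_retries=3)
```

Under these assumptions, the more the service fails, the more traffic it receives. If two layers in the call chain each retry three times, the multipliers compound to 16x at the bottom layer, which is why an overloaded service rarely recovers without intervention.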
Traditional Solutions
Three sub‑directions are commonly used:
1. Prevention
Techniques such as hotspot mitigation, tail‑risk handling, tiered operations, and capacity guarantees aim to avoid the initial overload stage.
2. Blocking
When an avalanche is likely, mechanisms are employed to stop it from entering the loop stage.
Retry‑rate control: Stop retrying when the retry count exceeds a certain percentage of total requests.
Queue control: Application‑level queues track request waiting time and discard requests that have already timed out upstream.
Rate limiting: Set a static QPS threshold at the entry layer; excess traffic is dropped.
Each method has drawbacks, such as difficulty setting appropriate thresholds and the need for continuous tuning.
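Retry‑rate control can be sketched as a per-window retry budget. This is a minimal illustration, not Baidu Netdisk's implementation: the class name, the `max_retry_ratio` parameter, and the fixed-window reset are all assumptions made for the example.

```python
class RetryBudget:
    """Allow retries only while they stay below a fraction of total requests."""

    def __init__(self, max_retry_ratio: float = 0.1):
        self.max_retry_ratio = max_retry_ratio
        self.requests = 0  # first attempts seen in the current window
        self.retries = 0   # retries granted in the current window

    def record_request(self) -> None:
        self.requests += 1

    def can_retry(self) -> bool:
        # Grant the retry only if it keeps retries within the allowed ratio.
        if self.retries + 1 > self.requests * self.max_retry_ratio:
            return False
        self.retries += 1
        return True

    def reset(self) -> None:
        """Call at the start of each time window, for example once per second."""
        self.requests = 0
        self.retries = 0
```

The threshold problem mentioned above is visible here: `max_retry_ratio` has to be picked by hand, and a value that is safe for one service may be too strict or too loose for another.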
3. Damage Limitation
After an avalanche, actions like aggressive rate limiting or service restarts are used to recover, but they often prolong downtime.
Dynamic Circuit Breaking (Reducing Overload Traffic)
Static rate limiting suffers from inflexibility. A dynamic circuit‑breaker monitors downstream success rates and adjusts the forwarding rate accordingly:
1. Request arrives.
2. Circuit breaker checks state:
- Closed: allow request.
- Open: reject request.
- Half‑Open: allow limited requests for health check.
3. On success, reset failure count (or close half‑open).
4. On failure, increment count; if threshold exceeded, open circuit.
5. After cooldown, transition to half‑open for testing.
The implementation in Baidu Netdisk randomly drops a configurable percentage X of requests, monitors downstream health, and adjusts X in fixed increments (X = X ± Step), raising it while the downstream is unhealthy and lowering it as the downstream recovers, until stability is restored.
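The X ± Step adjustment can be sketched as follows. This is a simplified illustration under stated assumptions rather than the production logic: the `target_success` threshold, the `max_drop` cap (kept below 100% so that some traffic still reaches the downstream and its health can be observed), and all names are hypothetical.

```python
import random


class DynamicBreaker:
    """Randomly drops a fraction X of requests and tunes X from downstream health."""

    def __init__(self, step=0.05, target_success=0.95, max_drop=0.95,
                 rng=random.random):
        self.drop_ratio = 0.0          # X: fraction of requests to drop
        self.step = step               # Step: size of each adjustment
        self.target_success = target_success
        self.max_drop = max_drop
        self.rng = rng                 # injectable for deterministic tests

    def allow(self) -> bool:
        # rng() is uniform in [0, 1), so a request passes with probability 1 - X.
        return self.rng() >= self.drop_ratio

    def adjust(self, success_rate: float) -> float:
        """Call once per monitoring interval with the observed success rate."""
        if success_rate < self.target_success:
            # Downstream is unhealthy: shed more traffic.
            self.drop_ratio = min(self.max_drop, self.drop_ratio + self.step)
        else:
            # Downstream is healthy: let more traffic through.
            self.drop_ratio = max(0.0, self.drop_ratio - self.step)
        return self.drop_ratio
```

Unlike a static QPS limit, nothing here encodes the backend's capacity; the drop ratio converges toward whatever level keeps the observed success rate at the target.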
Traffic Isolation (Reducing Overload Traffic)
Requests are labeled by priority (high‑priority vs. low‑priority) and routed via a gateway or service mesh. High‑priority traffic is insulated from spikes in low‑priority traffic, ensuring critical services remain unaffected.
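The isolation idea can be illustrated with separate bounded queues per priority label. In the article the routing happens in a gateway or service mesh; the in-process class below is only a hypothetical stand-in for that behavior, and the labels and capacities are made up for the example.

```python
from collections import deque


class IsolatedQueues:
    """Keeps one bounded queue per priority label so classes cannot starve each other."""

    def __init__(self, capacity: dict):
        self.capacity = capacity                       # label -> max queued requests
        self.queues = {label: deque() for label in capacity}

    def submit(self, label: str, request) -> bool:
        queue = self.queues[label]
        if len(queue) >= self.capacity[label]:
            return False  # shed load only within the overloaded class
        queue.append(request)
        return True


pools = IsolatedQueues({"high": 2, "low": 2})
# A spike of low-priority traffic fills only the low-priority queue.
low_results = [pools.submit("low", i) for i in range(5)]
# High-priority traffic is still accepted.
high_result = pools.submit("high", "critical-request")
```

Because each class has its own capacity, a burst of low‑priority requests is rejected at its own limit instead of consuming resources reserved for high‑priority traffic.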
Request‑Validity Checks (Reducing Invalid Requests)
Request latency is broken down into client send time, network transfer, downstream queue wait time, and processing time. By tracking the time a request spends in the downstream queue, the system can discard requests that have already exceeded their effective deadline.
Absolute timestamps suffer from clock drift between machines, while relative timeouts ignore the time a request spends waiting in the queue. Baidu Netdisk combines both using a Service Mesh (UFC) that converts relative time to absolute time on the downstream side, providing transparent deadline enforcement.
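The relative-to-absolute conversion can be sketched as two small functions. This is an assumption-laden illustration, not UFC's actual interface: the function names are hypothetical, the upstream is assumed to send its remaining budget in milliseconds, and network transfer time is ignored. The key point is that the deadline is computed and checked against the same local clock, so clock drift between machines does not matter.

```python
import time
from typing import Optional


def to_local_deadline(timeout_ms: float, received_at: Optional[float] = None) -> float:
    """On arrival at the downstream side: turn a relative budget into a local deadline."""
    if received_at is None:
        received_at = time.monotonic()
    return received_at + timeout_ms / 1000.0


def is_still_valid(deadline: float, now: Optional[float] = None) -> bool:
    """When a worker dequeues the request: check the deadline before doing any work."""
    if now is None:
        now = time.monotonic()
    return now < deadline
```

A request stamped on arrival with a 500 ms budget and dequeued 750 ms later fails `is_still_valid` and can be discarded without processing, since the upstream has already given up on it.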
Socket‑Level Validity (Reducing Invalid Requests)
When a client closes a TCP connection before the server reads the request, the server may still process stale data. Detecting the FIN packet (read returns 0) allows the server to abort processing.
Implementation examples:
BRPC (C++): Use IsCanceled() to detect client disconnect.
Go HTTP server: Check r.Context().Done(), or read directly from the connection captured in the ConnContext callback.
Benchmarks show BRPC detects disconnects faster than Go due to Go's separate goroutine for socket reads.
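The zero-byte read check can be sketched at the socket level in Python. This is not the BRPC or Go code referred to above; it is a minimal illustration with a hypothetical helper name, and it relies on `MSG_DONTWAIT`, which is available on Linux and macOS but not on Windows. It is meant to be called after the request body has been read, since unread data in the buffer hides a pending FIN.

```python
import socket


def peer_has_closed(conn: socket.socket) -> bool:
    """Return True if the peer has closed and no unread data remains."""
    try:
        # Peek so no data is consumed, and do not block if nothing is pending.
        data = conn.recv(1, socket.MSG_PEEK | socket.MSG_DONTWAIT)
    except BlockingIOError:
        return False  # connection is open, nothing to read yet
    except ConnectionResetError:
        return True   # peer aborted the connection
    return data == b""  # a zero-byte read means the peer sent FIN
```

A worker can call this just before starting expensive processing and skip the request if it returns True.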
Summary
Baidu Netdisk's anti‑avalanche architecture consists of two main parts:
Traffic limiting: a front‑door layer handling DDoS and a dynamic circuit‑breaker that forwards only the amount of traffic the backend can handle.
Traffic processing: request‑validity checks that discard ineffective requests early, preventing wasteful processing.
This design has significantly reduced the frequency of avalanche incidents, improving overall service availability.
