
Avalanche Prevention Architecture in Baidu Netdisk: Practices and Solutions

Baidu Netdisk engineers protect its billion‑user service from cascading failures by deploying dynamic circuit‑breaker overload control, priority‑based traffic isolation, request‑validity filtering, socket‑level disconnect detection, and unified timestamp handling, a combination that dramatically reduces avalanche incidents and boosts overall availability.

Baidu Tech Salon

This article discusses how Baidu Netdisk engineers design and implement anti‑avalanche mechanisms to protect large‑scale services from cascading failures.

Background : Baidu Netdisk serves over 1 billion users with more than 600k instances and thousands of modules. In high‑concurrency scenarios, a brief anomaly can trigger a system‑wide avalanche, in which downstream services become overloaded and cannot recover on their own.

Snowball (avalanche) phenomenon : A request overload causes a service to reject traffic; upstream services retry, amplifying the load and creating a feedback loop. The article illustrates the two stages—initial overload and the recursive retry loop—using diagrams.

Traditional solutions are divided into three sub‑areas:

Prevention: hotspot mitigation, long‑tail handling, tiered operations, capacity guarantees.

Blocking: retry‑rate control, queue control, static rate limiting.

Mitigation: dynamic throttling, service restarts.

These approaches have limitations, especially under dynamic load and DDoS attacks.

Advanced practices in Baidu Netdisk :

Dynamic circuit‑breaker based overload control : The circuit breaker tracks request successes and failures, switches between the Closed, Open, and Half‑Open states, and waits out a cooling period before testing whether the service has recovered. The flow is described step by step further below.

Traffic isolation : Requests are labeled by priority (high‑priority vs. low‑priority) and routed via a gateway or service mesh, ensuring low‑priority spikes do not affect critical traffic.
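
A minimal Go sketch of the isolation idea, assuming each request has already been labeled at the gateway. The `Priority` type, the `Isolator` struct, and the slot counts are hypothetical names chosen for illustration and do not come from Baidu Netdisk's implementation. The technique shown is a per‑priority concurrency limit (a bulkhead), which is one simple way to keep a low‑priority spike from using capacity reserved for critical traffic.

```go
package main

import "errors"

// Priority is a hypothetical label attached to each request upstream.
type Priority int

const (
	High Priority = iota
	Low
)

// ErrRejected is returned when a priority class has no free slot.
var ErrRejected = errors.New("priority pool exhausted")

// Isolator gives every priority class its own concurrency budget,
// so one class filling up cannot block another.
type Isolator struct {
	slots map[Priority]chan struct{}
}

// NewIsolator reserves highSlots concurrent requests for High traffic
// and lowSlots for Low traffic (both values are illustrative).
func NewIsolator(highSlots, lowSlots int) *Isolator {
	return &Isolator{slots: map[Priority]chan struct{}{
		High: make(chan struct{}, highSlots),
		Low:  make(chan struct{}, lowSlots),
	}}
}

// Do runs fn if the class of p has a free slot and rejects immediately
// otherwise, instead of queueing behind other traffic.
func (i *Isolator) Do(p Priority, fn func() error) error {
	pool, ok := i.slots[p]
	if !ok {
		return ErrRejected
	}
	select {
	case pool <- struct{}{}: // take a slot from this class only
		defer func() { <-pool }() // give the slot back when fn returns
		return fn()
	default:
		return ErrRejected
	}
}
```

With `NewIsolator(1, 1)`, a second concurrent `Low` request is rejected while a `High` request issued at the same moment still runs. In a gateway or service mesh the same separation might instead be achieved by routing each class to its own instance pool.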

Request validity filtering : By measuring the time a request spends in the downstream queue and its processing latency, the system discards requests that have already timed out.
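
The check described above can be sketched in Go as follows. This is an illustration under stated assumptions, not the article's implementation: the `QueuedRequest` struct, the propagated `Timeout` field, and the estimated processing latency passed to `Filter` are all hypothetical.

```go
package main

import "time"

// QueuedRequest is a hypothetical queue entry. Timeout is the caller's
// timeout budget, assumed to be propagated along with the request.
type QueuedRequest struct {
	EnqueuedAt time.Time
	Timeout    time.Duration
}

// StillValid reports whether a request that has already waited `queued`
// and needs roughly `estProcessing` more can finish inside `timeout`.
func StillValid(queued, estProcessing, timeout time.Duration) bool {
	return queued+estProcessing < timeout
}

// Filter keeps the requests that can still be answered in time and
// counts the ones whose caller has, by this estimate, already given up.
func Filter(reqs []QueuedRequest, now time.Time, estProcessing time.Duration) (keep []QueuedRequest, dropped int) {
	for _, r := range reqs {
		if StillValid(now.Sub(r.EnqueuedAt), estProcessing, r.Timeout) {
			keep = append(keep, r)
		} else {
			dropped++
		}
	}
	return keep, dropped
}
```

Dropping such requests at dequeue time means an overloaded service spends its remaining capacity only on work that a caller is still waiting for.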

Socket‑level validity detection : Using low‑level socket APIs (e.g., read() returning 0 for FIN) to detect client disconnects and avoid processing stale requests. Implementations for brpc (IsCanceled) and Go (r.Context().Done()) are described.
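
For the Go case, a hedged sketch of how a handler might use `r.Context().Done()`: Go's net/http cancels the request context when the client's connection closes, so checking it between stages avoids finishing work nobody will read. The `runStages` helper, the `handleList` handler, and its placeholder stages are hypothetical and only illustrate the pattern.

```go
package main

import (
	"fmt"
	"net/http"
)

// clientGone reports, without blocking, whether done has been closed.
func clientGone(done <-chan struct{}) bool {
	select {
	case <-done:
		return true
	default:
		return false
	}
}

// runStages executes stages in order and stops as soon as the client is
// gone. It returns how many stages actually ran.
func runStages(done <-chan struct{}, stages []func()) int {
	ran := 0
	for _, stage := range stages {
		if clientGone(done) {
			break
		}
		stage()
		ran++
	}
	return ran
}

// handleList is a hypothetical handler with two placeholder stages.
func handleList(w http.ResponseWriter, r *http.Request) {
	stages := []func(){
		func() { /* e.g. permission check (placeholder) */ },
		func() { /* e.g. metadata query (placeholder) */ },
	}
	if runStages(r.Context().Done(), stages) < len(stages) {
		return // client disconnected; skip the response entirely
	}
	fmt.Fprintln(w, "ok")
}
```

The handler could be registered with `http.HandleFunc("/list", handleList)`. Passing `r.Context()` into downstream calls extends the same cancellation to them.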

Code example – Dynamic circuit‑breaker flow :

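Flow description
1 Request start: The system receives an external request.
2 Circuit‑breaker state check:
- Closed: The breaker lets the request through and execution continues.
- Open: The breaker blocks the request, which fails immediately or returns a predefined response.
- Half‑Open: The breaker lets a portion of requests through to test whether the service has recovered.
3 Request execution:
- Request succeeds:
    - In the Closed state, reset the failure count.
    - In the Half‑Open state, close the breaker and resume normal operation.
- Request fails:
    - Increment the failure count.
    - If the failure count exceeds the preset threshold, open (trip) the breaker.
4 Cooling period: After opening, the breaker waits for a cooling period and then enters the Half‑Open state.
5 Half‑Open test:
- Request succeeds: Close the breaker and resume normal operation.
- Request fails: Reopen the breaker and wait for another cooling period.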
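
The flow above can be sketched in Go roughly as follows. This is a minimal illustration, not Baidu Netdisk's code: the `Breaker` type, its field names, and the choice to admit a single probe request at a time in the Half‑Open state are assumptions made for the example.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when the breaker blocks a request.
var ErrOpen = errors.New("circuit breaker open")

// Breaker is a hypothetical, minimal three-state circuit breaker.
type Breaker struct {
	mu        sync.Mutex
	state     State
	failures  int
	threshold int           // trip once failures exceed this count
	cooldown  time.Duration // time spent Open before a probe is allowed
	openedAt  time.Time
	probing   bool // a Half-Open probe is in flight (assumed: one at a time)
}

func NewBreaker(threshold int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, cooldown: cooldown}
}

// Allow implements steps 2 and 4: the state check and the cooling period.
func (b *Breaker) Allow(now time.Time) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.state {
	case Open:
		if now.Sub(b.openedAt) < b.cooldown {
			return false // still cooling down
		}
		b.state, b.probing = HalfOpen, true
		return true
	case HalfOpen:
		if b.probing {
			return false // a probe is already testing the service
		}
		b.probing = true
		return true
	}
	return true // Closed
}

// Record implements steps 3 and 5: handling the outcome of a request.
func (b *Breaker) Record(success bool, now time.Time) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open {
		return // late result from before the trip; ignore it
	}
	if success {
		// Closed: reset the count. Half-Open: close the breaker.
		b.state, b.failures, b.probing = Closed, 0, false
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures > b.threshold {
		b.state, b.openedAt, b.probing = Open, now, false
	}
}

// Do wraps a call with the breaker using the wall clock.
func (b *Breaker) Do(fn func() error) error {
	if !b.Allow(time.Now()) {
		return ErrOpen
	}
	err := fn()
	b.Record(err == nil, time.Now())
	return err
}
```

A caller would typically create one breaker per downstream dependency, for example `NewBreaker(5, 10*time.Second)`, and route every call through `Do`. `Allow` and `Record` take the current time explicitly so the state machine can be exercised without sleeping.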

The article also explains how Baidu Netdisk combines absolute and relative timestamps via the UFC Service Mesh to convert client‑side timeouts into absolute deadlines, ensuring consistent request expiration across machines.
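
One way to picture the conversion is the Go sketch below. It rests on assumptions rather than on details of the UFC Service Mesh: the relative timeout is assumed to travel between machines, each hop turns it into an absolute deadline on its own clock, and the remaining budget is converted back to a relative value before forwarding. The `Deadline` type and its methods are hypothetical.

```go
package main

import (
	"context"
	"time"
)

// Deadline is a hypothetical absolute expiry time on the local clock.
type Deadline struct {
	At time.Time
}

// FromTimeout converts the relative timeout that arrived with a request
// into an absolute deadline, using the local arrival time.
func FromTimeout(arrival time.Time, timeout time.Duration) Deadline {
	return Deadline{At: arrival.Add(timeout)}
}

// Remaining is the budget left at `now`. This relative value is what
// would be forwarded to the next hop, so clock skew between machines
// does not distort it.
func (d Deadline) Remaining(now time.Time) time.Duration {
	return d.At.Sub(now)
}

// Expired reports whether the request should be discarded at `now`.
func (d Deadline) Expired(now time.Time) bool {
	return !now.Before(d.At)
}

// Context attaches the deadline to a context so local work is
// cancelled automatically once the caller's budget runs out.
func (d Deadline) Context(parent context.Context) (context.Context, context.CancelFunc) {
	return context.WithDeadline(parent, d.At)
}
```

Every component on the same machine then compares against the same `At` value, and a request that has used up its budget is dropped at whichever stage notices first.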

Finally, the summary emphasizes that a combination of traffic limiting (dynamic circuit breaking, isolation) and traffic processing (validity checks) has significantly reduced avalanche incidents, improving overall service availability.

Tags: Backend Architecture, traffic isolation, service reliability, avalanche prevention, circuit breaker, request validity