Understanding High Availability: Lessons from the Bilibili Outage
This article analyzes Bilibili's recent service disruption, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region active‑active deployments to improve system reliability.
Background
On July 13‑14, 2021, Bilibili experienced a major outage where the website returned 404 errors, the mobile app failed to load data, and various services reported 502 errors, with gradual recovery starting after 02:00.
Cause
Bilibili announced that a failure in several server data centers caused the outage; the issue was resolved by the technical team, and rumors of a fire were denied.
What is High Availability
High Availability (HA) is the ability of a system to remain operational with minimal downtime. It is measured using metrics such as Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), with availability calculated as MTBF/(MTBF+MTTR)×100%.
Typical availability targets are expressed as “nines”; for example, five‑nine availability (99.999%) allows only about 5.26 minutes of downtime per year.
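The "nines" targets above follow directly from the availability formula. A minimal sketch (Python is used for illustration throughout; the function name is my own):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def allowed_downtime_minutes(nines: int) -> float:
    """Annual downtime budget for a given number of nines of availability."""
    availability = 1 - 10 ** (-nines)   # e.g. 3 nines -> 0.999
    return MINUTES_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.2f} minutes/year")
```

Running this shows how sharply the budget tightens: three nines allow about 525.6 minutes per year, while five nines allow only about 5.26.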
Quantitative Analysis of Bilibili
The Bilibili incident lasted roughly an hour. A single one-hour outage works out to (8760 − 1)/8760 ≈ 99.989% availability on an annual basis, i.e., about three nines, but only 23/24 ≈ 95.8% for that day, short of even two nines.
How to Achieve High Availability
Common techniques include failover, timeout control, rate limiting, isolation, circuit breaking, and degradation.
Rate Limiting
Controls the rate of incoming requests, letting only a portion through when traffic exceeds capacity. Common algorithms include fixed window, sliding window, leaky bucket, and token bucket.
Fixed Window
Counts requests within a fixed time interval; once the count exceeds the threshold, further requests are rejected until the window resets.
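A minimal fixed-window counter might look like the following sketch (class and parameter names are my own, not from any particular library):

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window` seconds; the counter resets each window."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.window_start = time.monotonic()
        self.count = 0

    def allow(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.window:
            self.window_start = now   # new window: reset the counter
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False                  # over the threshold: reject
```

Note the known weakness: a burst straddling two adjacent windows can briefly admit up to twice the limit, which motivates the sliding window below.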
Sliding Window
Counts requests over a moving time window, avoiding the boundary bursts that a fixed window permits.
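One way to sketch a sliding window is to keep the timestamps of recent requests and discard those that have slid out of range (again, an illustrative implementation of my own):

```python
import time
from collections import deque

class SlidingWindowLimiter:
    """Allow at most `limit` requests in any trailing `window` seconds."""
    def __init__(self, limit: int, window: float):
        self.limit, self.window = limit, window
        self.timestamps = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have slid out of the trailing window.
        while self.timestamps and now - self.timestamps[0] >= self.window:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```

Production implementations usually approximate this with sub-window counters to avoid storing one timestamp per request.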
Leaky Bucket
Outputs requests at a constant rate, buffering bursts.
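The constant-output behavior can be sketched as a "leaky bucket as meter": each request adds water, the bucket drains at a fixed rate, and a full bucket sheds traffic (an illustrative sketch, not a production implementation):

```python
import time

class LeakyBucket:
    """Bucket holding up to `capacity` units of water, draining at `rate` units/second."""
    def __init__(self, capacity: int, rate: float):
        self.capacity, self.rate = capacity, rate
        self.water = 0.0
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Leak water out at the constant rate since the last check.
        self.water = max(0.0, self.water - (now - self.last) * self.rate)
        self.last = now
        if self.water + 1 <= self.capacity:
            self.water += 1           # the request fits in the bucket
            return True
        return False                  # bucket full: shed the request
```

A queue-based variant instead buffers requests and dequeues them at the fixed rate; both enforce the same steady output.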
Token Bucket
Issues tokens at a fixed rate; each request consumes a token.
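Unlike the leaky bucket, the token bucket permits short bursts up to the number of saved-up tokens. A minimal sketch (names are my own):

```python
import time

class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends one."""
    def __init__(self, capacity: float, rate: float):
        self.capacity, self.rate = capacity, rate
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens accrued since the last call, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Setting a large `capacity` with a modest `rate` allows bursts while still bounding the long-run throughput to `rate` requests per second.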
Isolation
Resources such as threads and connections are partitioned per service or dependency (the "bulkhead" pattern), so the failure of one service cannot drain resources needed by the others; tools such as Sentinel and Hystrix implement this.
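A bare-bones sketch of thread-pool bulkheads, assuming hypothetical "video" and "comments" dependencies: each gets its own small pool, so a hung comments service cannot exhaust the threads serving video.

```python
from concurrent.futures import ThreadPoolExecutor

# One dedicated pool per downstream dependency (sizes are illustrative).
pools = {
    "video":    ThreadPoolExecutor(max_workers=10),
    "comments": ThreadPoolExecutor(max_workers=2),
}

def call_isolated(service: str, fn, *args):
    """Run a dependency call on that dependency's own pool."""
    return pools[service].submit(fn, *args)
```

Libraries like Hystrix additionally reject calls immediately when a pool is saturated, rather than queueing them.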
Failover
Two modes: active‑active (peer nodes share load) and active‑passive (master‑standby). When the master fails, consensus algorithms such as Paxos or Raft are used to elect a new leader.
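Leader election itself is involved, but the client-side half of failover is simple to sketch: try the primary, and on failure fall over to a replica (endpoint names are hypothetical):

```python
def call_with_failover(endpoints, request_fn):
    """Try endpoints in priority order; return the first successful response."""
    last_err = None
    for endpoint in endpoints:
        try:
            return request_fn(endpoint)
        except Exception as err:   # this node is unhealthy: try the next one
            last_err = err
    raise last_err                 # every node failed: surface the last error
```

Real systems pair this with health checks and backoff so that a flapping primary is not retried on every request.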
Timeout Control
Limits how long a request may wait; without timeouts, slow calls pile up, exhaust threads and connections, and can trigger a cascading failure (the "avalanche" or snowball effect).
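A small sketch of bounding a dependency call with a timeout and a fallback, using Python's standard `concurrent.futures` (function names are my own):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout: float, fallback):
    """Wait at most `timeout` seconds for `fn`; otherwise return `fallback`."""
    future = executor.submit(fn)
    try:
        return future.result(timeout=timeout)
    except TimeoutError:
        return fallback   # give up waiting so the caller's thread is freed
```

Note a caveat: the abandoned call keeps running in its worker thread, which is why timeout control is usually combined with the isolation and circuit-breaking techniques described here.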
Circuit Breaking
When a downstream service fails repeatedly, the circuit opens and calls return degraded data immediately instead of waiting on the failing dependency; after a cooldown the circuit half‑opens to probe whether the dependency has recovered.
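The open/half-open behavior can be sketched as follows (a simplified consecutive-failure breaker of my own; libraries such as Sentinel and Hystrix use richer error-rate statistics):

```python
import time

class CircuitBreaker:
    """Open after `threshold` consecutive failures; while open, fail fast with a fallback."""
    def __init__(self, threshold: int, reset_after: float):
        self.threshold, self.reset_after = threshold, reset_after
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback        # circuit open: skip the downstream call
            self.opened_at = None      # half-open: let one probe call through
            self.failures = 0
        try:
            result = fn()
            self.failures = 0          # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            return fallback
```

Failing fast keeps callers from queueing behind a dead dependency, which is exactly the avalanche scenario timeouts alone cannot fully prevent.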
Degradation
During traffic spikes, non‑critical services can be deliberately downgraded to return simple fallback (degraded) data, freeing capacity for core features.
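In practice degradation is often a feature flag that swaps an expensive path for canned data. A hypothetical sketch (the flag store, function names, and "recommendations" feature are all my own illustration):

```python
def personalized_recommendations(user_id: int) -> list:
    # Stand-in for the real, expensive path (model inference, downstream calls).
    return [f"personal-{user_id}-1", f"personal-{user_id}-2"]

DEGRADED = {"recommendations": True}   # hypothetical feature-flag store

def get_recommendations(user_id: int) -> list:
    # Degradation: during a spike, skip the expensive path and return canned data.
    if DEGRADED.get("recommendations"):
        return ["top-1", "top-2"]
    return personalized_recommendations(user_id)
```

The key property is that the flag can be flipped at runtime by an operator, without a deploy, when load demands it.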
Geographic Multi‑Active Deployments
Deploying services across multiple data centers (same‑city active‑active, cross‑city active‑active, and multi‑region active‑active) improves resilience but introduces data synchronization challenges. Solutions include shared databases, master‑slave replication, and asynchronous replication via message queues.
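The asynchronous-replication option can be sketched with an in-memory queue standing in for a cross-region message queue such as Kafka (all names here are illustrative):

```python
import queue

replication_q = queue.Queue()   # stand-in for a cross-region message queue
local_db, remote_db = {}, {}    # stand-ins for the two regions' databases

def write_local(key, value):
    """Write locally, then ship the change asynchronously to the other region."""
    local_db[key] = value
    replication_q.put((key, value))

def replicate_once():
    """Run in the remote region: apply one replicated change."""
    key, value = replication_q.get()
    remote_db[key] = value
```

The trade-off is eventual consistency: between the local write and the remote apply, the two regions briefly disagree, which multi-active designs must tolerate or route around.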
Conclusion
The Bilibili outage illustrates the importance of designing for high availability and provides a practical checklist for engineers.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.