
Understanding High Availability: Lessons from the Bilibili Outage

This article analyzes Bilibili's recent service disruption, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region active‑active deployments to improve system reliability.

Wukong Talks Architecture

Background

On July 13‑14, 2021, Bilibili experienced a major outage where the website returned 404 errors, the mobile app failed to load data, and various services reported 502 errors, with gradual recovery starting after 02:00.

Cause

Bilibili announced that a failure in several server data centers caused the outage; the issue was resolved by the technical team, and rumors of a fire were denied.

What is High Availability

High Availability (HA) is the ability of a system to remain operational with minimal downtime. It is measured using metrics such as Mean Time Between Failures (MTBF) and Mean Time To Repair (MTTR), with availability calculated as MTBF/(MTBF+MTTR)×100%.

Availability targets are typically expressed as "nines"; for example, five-nine availability (99.999%) allows no more than about 5.26 minutes of downtime per year.
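The downtime budget implied by each "nines" level follows directly from the availability formula. A minimal sketch (the helper name is my own, not from the article):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def annual_downtime_minutes(availability: float) -> float:
    """Maximum downtime per year, in minutes, for a given availability target."""
    return SECONDS_PER_YEAR * (1 - availability) / 60

# Five nines (0.99999) leaves a budget of roughly 5.26 minutes per year;
# three nines (0.999) leaves roughly 525.6 minutes, i.e. ~8.8 hours.
```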

Quantitative Analysis of Bilibili

The Bilibili incident lasted about one hour. Measured against a full year, that corresponds to roughly three-nine availability (≈99.989%); measured against that single day, availability drops to about 95.8%, short of even two nines.
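Plugging the outage duration into the MTBF/MTTR formula above makes the annual-vs-daily gap concrete. A quick sketch:

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# A ~1 hour outage measured against a year (8760 h) vs. a single day (24 h):
annual = availability(8760 - 1, 1)   # ≈ 0.99989, i.e. three nines
daily = availability(24 - 1, 1)      # ≈ 0.9583, short of two nines
```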

How to Achieve High Availability

Common techniques include failover, timeout control, rate limiting, isolation, circuit breaking, and degradation.

Rate Limiting

Controls request flow, allowing only a portion of requests through when traffic exceeds capacity. Common algorithms include fixed window, sliding window, leaky bucket, and token bucket.

Fixed Window

Counts requests in a fixed interval and rejects any beyond the threshold until the next interval begins. Its main weakness is the window boundary: a burst straddling two adjacent windows can let through up to twice the limit.
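The fixed-window counter can be sketched in a few lines. This is a minimal illustration, not the article's implementation; the clock is passed in as `now` so the behavior is deterministic:

```python
class FixedWindowLimiter:
    """Fixed-window rate limiter: at most `limit` requests per `window_s` seconds."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        # Reset the counter whenever a new window begins.
        if now - self.window_start >= self.window_s:
            self.window_start = now
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

In production, `now` would come from `time.monotonic()` and the state would live in a shared store such as Redis.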

Sliding Window

Counts requests over a window that slides with time, smoothing out the boundary spikes of the fixed-window approach.
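A sliding-window log keeps the timestamps of recent requests and evicts those that fall out of the window. A minimal sketch, again with an injected clock:

```python
from collections import deque

class SlidingWindowLimiter:
    """Sliding-window log: at most `limit` requests in any `window_s`-second span."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Evict timestamps that have fallen out of the window.
        while self.timestamps and now - self.timestamps[0] >= self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False
```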

Leaky Bucket

Outputs requests at a constant rate, buffering bursts.
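One common way to model this is the "leaky bucket as meter" variant, which rejects requests once the bucket is full rather than queueing them; the sketch below takes that liberty for brevity:

```python
class LeakyBucket:
    """Leaky bucket (meter variant): water drains at a constant `rate` per second;
    each request adds one unit, and requests are rejected when the bucket is full."""

    def __init__(self, capacity: int, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.water = 0.0
        self.last = 0.0

    def allow(self, now: float) -> bool:
        # Water leaks out at the constant rate since the last check.
        self.water = max(0.0, self.water - (now - self.last) * self.rate)
        self.last = now
        if self.water < self.capacity:
            self.water += 1
            return True
        return False
```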

Token Bucket

Issues tokens at a fixed rate; each request consumes a token.

Isolation

Each service runs independently so failures do not cascade; tools such as Sentinel and Hystrix are commonly used.
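One isolation pattern those libraries support is the bulkhead: cap the concurrency granted to each dependency so one slow service cannot exhaust the shared thread pool. A minimal semaphore-based sketch (not Sentinel's or Hystrix's actual implementation):

```python
import threading

class Bulkhead:
    """Bulkhead isolation: at most `max_concurrent` in-flight calls to a dependency;
    excess calls are rejected immediately with a fallback instead of queueing."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, *args, fallback=None):
        if not self._sem.acquire(blocking=False):
            return fallback  # fail fast rather than tie up another thread
        try:
            return fn(*args)
        finally:
            self._sem.release()
```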

Failover

Failover comes in two flavors: active-active (peer nodes that share traffic equally) and active-passive (a standby that takes over when the master fails). In active-passive setups, consensus-based leader election algorithms such as Paxos and Raft detect master failure and elect a replacement.
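The client-facing half of failover is simple: try the next replica when one fails. A deliberately minimal sketch (real systems add health checks, retries with backoff, and leader discovery):

```python
def call_with_failover(replicas, request):
    """Try each replica in order; fail over to the next on error.
    Raises the last error only if every replica fails."""
    last_err = None
    for replica in replicas:
        try:
            return replica(request)
        except Exception as err:
            last_err = err  # remember the failure and move on
    raise last_err
```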

Timeout Control

Limits how long a request may wait on a downstream call. Without sensible timeouts, slow calls pile up, exhaust threads and connections, and failures cascade upstream (the "avalanche" effect).
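A timeout bound can be sketched with the standard library's futures; this is one illustrative approach (note that a truly hung call still occupies its worker thread, which is why isolation matters too):

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout_s, fallback):
    """Run `fn` but wait at most `timeout_s` seconds for its result;
    return `fallback` if the deadline passes."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fn)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return fallback
```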

Circuit Breaking

When a downstream service fails repeatedly, the circuit opens and subsequent calls immediately return degraded data instead of waiting on the failing dependency; after a cool-down period, the circuit lets a trial request through to probe for recovery.
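The open/half-open cycle can be captured in a small state machine. A minimal sketch with an injected clock; production breakers (Sentinel, Hystrix) track error rates rather than a bare failure count:

```python
class CircuitBreaker:
    """Opens after `threshold` consecutive failures; while open, returns the
    fallback immediately; after `reset_s` seconds, allows a trial call through."""

    def __init__(self, threshold: int, reset_s: float):
        self.threshold = threshold
        self.reset_s = reset_s
        self.failures = 0
        self.opened_at = None

    def call(self, fn, now: float, fallback):
        if self.opened_at is not None and now - self.opened_at < self.reset_s:
            return fallback  # circuit open: fail fast, don't touch the dependency
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = now  # trip the breaker
            return fallback
        self.failures = 0  # success closes the circuit again
        self.opened_at = None
        return result
```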

Degradation

During traffic spikes, non-critical services can be deliberately downgraded to return simple fallback responses, i.e., degraded data, preserving capacity for core features.
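The shape of a degradation switch is simple; all names below are illustrative, with a stub standing in for the expensive path:

```python
def expensive_personalized_feed(user_id):
    # Stand-in for a costly recommendation computation.
    return [f"video-{user_id}-{i}" for i in range(3)]

def get_recommendations(user_id, overloaded: bool):
    """When the system is overloaded, deliberately skip the expensive
    personalized path and return simple degraded data instead."""
    if overloaded:
        return {"items": ["popular-1", "popular-2"], "degraded": True}
    return {"items": expensive_personalized_feed(user_id), "degraded": False}
```

In practice the `overloaded` flag would be driven by a load metric or a manually flipped configuration switch.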

Geographic Multi‑Active Deployments

Deploying services across multiple data centers (same‑city active‑active, cross‑city active‑active, and multi‑region active‑active) improves resilience but introduces data synchronization challenges. Solutions include shared databases, master‑slave replication, and asynchronous replication via message queues.

Conclusion

The Bilibili outage illustrates the importance of designing for high availability and provides a practical checklist for engineers.

Tags: Distributed Systems, High Availability, Rate Limiting, Circuit Breaker, MTBF, MTTR, HA
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
