
What Bilibili’s Outage Teaches About Achieving True High Availability

The article analyzes Bilibili’s recent service outage, explains why high availability matters, introduces key metrics like MTBF and MTTR, and outlines practical strategies such as redundancy, rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to build resilient systems.

21CTO

1. Background

Consider a failure that actually happened: at around 23:00 on 2021‑07‑13, Bilibili’s website went down and the homepage returned a 404 error.

The mobile app could not load data.

At 23:30 Bilibili displayed a degraded page that redirected the 404 to a friendlier error page.

Refreshing the page again caused another 404.

At 23:35 the homepage loaded data, but clicking the Updates (动态) tab still returned a 502 error.

Clicking a video directly returned a 404.

After 02:00 on 2021‑07‑14 Bilibili gradually recovered.

2. Cause

At 02:00 Bilibili announced that a failure in some of its server rooms had caused the outage, and that its technical team had investigated and restored service. Shanghai fire officials denied the rumor of a fire in Bilibili’s building.

It appears Bilibili’s high‑availability is not satisfactory. The following sections explore what high‑availability means and how to design cross‑region deployments.

3. What Is High Availability?

After a two‑hour outage, does Bilibili’s system qualify as highly available?

High availability (HA) is a relative term that describes a system’s ability to run with minimal failures.

High availability is commonly achieved with a primary‑replica architecture: when the primary node fails, traffic is switched quickly to the replica, which takes over.

Typical HA solutions include primary‑secondary setups such as SQL Server replication or Redis master‑slave, allowing continuous service even if a server crashes.

Quantitative HA is measured using MTBF (Mean Time Between Failures) and MTTR (Mean Time To Repair).

MTBF : the average interval between consecutive failures; longer intervals indicate higher stability.

MTTR : the average time to recover from a failure; shorter times reduce user complaints.

Availability formula : MTBF / (MTBF + MTTR) × 100 %.
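The formula above can be sketched in a few lines of Python. The function name and the sample figures (a failure every 999 hours on average, one hour to repair) are illustrative assumptions, not measurements from Bilibili.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return availability as a fraction in [0, 1]: MTBF / (MTBF + MTTR)."""
    if mtbf_hours < 0 or mttr_hours < 0 or mtbf_hours + mttr_hours == 0:
        raise ValueError("MTBF and MTTR must be non-negative and not both zero")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical service: fails every 999 hours on average, takes 1 hour to repair.
print(f"{availability(999, 1):.3%}")  # prints 99.900%
```

Because MTTR sits in the denominator, halving repair time improves availability about as much as doubling the time between failures, which is why fast recovery matters as much as failure prevention.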

Many systems aim for “nines” of availability. Our project targets less than five minutes of annual downtime, i.e., five‑nine availability.

Analyzing Bilibili’s outage (23:00 – 02:00), the downtime exceeded one hour. Even if this were the only incident of the year, annual availability would reach only three nines, and availability for that day falls below 96 %, short of even two nines and far from the desired level.

3.1 Qualitative HA

High availability is a qualitative description of fault‑tolerant systems.

3.2 Quantitative HA

Key concepts: MTBF and MTTR, as described above.

3.3 Bilibili’s Quantitative HA

Based on the outage duration, Bilibili reached only three‑nine availability on an annual basis and less than two‑nine availability on the day of the outage.

3.4 One‑Nine and Two‑Nine

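One nine (90 %) tolerates about 2.4 hours of downtime per day, and two nines (99 %) about 14.4 minutes per day. Most online services can easily stay under roughly 15 minutes of daily downtime.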

3.5 Three‑Nine and Four‑Nine

Achieving higher “nines” requires improvements in architecture, code quality, operations, and incident response. The operations team plays a crucial role: for major incidents, the operation lead must intervene directly.

During emergencies, manual degradation or feature toggles can be used to limit functionality and preserve core services.

3.6 Five‑Nine

Five‑nine (≤ 5 minutes annual downtime) is extremely hard to achieve with manual response; automated operations are required so that servers self‑recover.

3.7 Six‑Nine

Six‑nine (≈ 32 seconds annual downtime) is an even stricter standard, typically reserved for critical systems.
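The downtime budgets quoted for each level follow directly from the availability percentage. A minimal sketch, assuming a 365‑day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def annual_downtime_minutes(nines: int) -> float:
    """Allowed downtime per year for `nines` nines of availability (2 means 99%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # e.g. 3 nines -> 0.001
    return MINUTES_PER_YEAR * unavailability


for n in range(1, 7):
    print(f"{n} nines: {annual_downtime_minutes(n):,.2f} minutes/year")
```

Five nines works out to about 5.26 minutes per year and six nines to about 31.5 seconds, matching the figures above.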

4. How to Achieve High Availability

Common HA techniques include failover, timeout control, rate limiting, isolation, circuit breaking, and degradation.

For more details see the article “Double 11 Traffic‑Control Soup”.

4.1 Rate Limiting

Rate limiting allows only a portion of requests to pass, preventing overload.

Common algorithms: fixed window, sliding window, token bucket, and leaky bucket.

4.1.1 Fixed Window

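A fixed window counts the total requests in a fixed interval; once the count exceeds the threshold, further requests are blocked until the next window starts. See the referenced article for details.

Drawback : it cannot limit bursts that straddle a window boundary. Requests arriving at the end of one window and the start of the next can briefly reach twice the threshold.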
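A minimal in‑process sketch of a fixed‑window counter follows. The class name and the injectable `clock` parameter are assumptions made for illustration and testing; a production limiter would normally keep its counter in shared storage.

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per window of `window_seconds`."""

    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock          # injectable so the behavior can be tested
        self.window_index = None    # which window the counter belongs to
        self.count = 0

    def allow(self):
        index = int(self.clock() // self.window_seconds)
        if index != self.window_index:
            # A new window has started: reset the counter.
            self.window_index = index
            self.count = 0
        if self.count < self.limit:
            self.count += 1
            return True
        return False
```

With a limit of 2 per second, two requests at t = 0.9 s and two more at t = 1.0 s are all accepted, which illustrates the boundary‑burst weakness of this algorithm.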

4.1.2 Leaky Bucket

Requests are released at a constant rate; excess requests are buffered, which can increase latency.
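One way to sketch a leaky bucket is as a bounded queue that is drained at a fixed rate. The class and method names below are hypothetical, and the caller is assumed to invoke `drain()` periodically.

```python
import time
from collections import deque


class LeakyBucket:
    """Buffer up to `capacity` requests and release them at a constant rate."""

    def __init__(self, capacity, leak_rate_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.leak_interval = 1.0 / leak_rate_per_sec
        self.clock = clock
        self.queue = deque()
        self.next_release = clock()

    def offer(self, request):
        """Buffer a request; return False when the bucket is full."""
        if len(self.queue) >= self.capacity:
            return False
        if not self.queue:
            # An idle bucket earns no credit: the schedule restarts no earlier than now.
            self.next_release = max(self.next_release, self.clock())
        self.queue.append(request)
        return True

    def drain(self):
        """Return the buffered requests whose release time has arrived."""
        now = self.clock()
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.leak_interval
        return released
```

A request that arrives behind others waits for its turn in the queue, which is the added latency mentioned above.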

4.1.3 Token Bucket

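Allows N requests per second on average: one token is added to the bucket every 1/N seconds, up to the bucket’s capacity, and each request consumes a token. Unlike the leaky bucket, saved‑up tokens permit short bursts. In distributed environments Redis can serve as the token store.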
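The refill logic can be sketched in process as follows. This is a single‑node illustration only: the state lives in Python attributes rather than in Redis, and the class name and `clock` parameter are assumptions.

```python
import time


class TokenBucket:
    """Refill `rate` tokens per second up to `capacity`; each request costs one."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        # Lazily add the tokens earned since the last call, capped at capacity.
        earned = (now - self.last_refill) * self.rate
        self.tokens = min(self.capacity, self.tokens + earned)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

Refilling lazily on each call avoids a background timer. A shared‑store version would need the read, refill, and decrement to happen atomically.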

4.2 Isolation

Each service runs as an independent system; a failure in one does not affect others.

Common tools: Sentinel and Hystrix.
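Within a single process, one common form of isolation is the bulkhead: each dependency gets its own limited pool of concurrent slots, so a slow dependency cannot exhaust resources needed by others. The sketch below is a simplified illustration using a semaphore; it is not the API of Sentinel or Hystrix, and all names are hypothetical.

```python
import threading


class BulkheadFullError(RuntimeError):
    """Raised when a dependency's concurrency budget is exhausted."""


class Bulkhead:
    """Cap the number of concurrent calls made to one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so callers never pile up.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name}: no free slots")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means that saturating one of them leaves the others' slots untouched.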

4.3 Failover

Two types of failover:

Peer‑to‑peer failover: all nodes are equal and can take over traffic.

Primary‑secondary failover: a standby node takes over when the primary fails.

Once a primary failure is detected, for example through missed heartbeats, consensus algorithms such as Paxos and Raft are used to elect a new primary.
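On the client side, primary‑secondary failover can be sketched as trying the active node first and moving to the next one when a call fails. This simplified example detects failure through a raised `ConnectionError` rather than through Paxos or Raft, and the class name is hypothetical.

```python
class FailoverClient:
    """Send requests to the active node; switch to the next node on failure."""

    def __init__(self, nodes):
        self.nodes = list(nodes)  # callables; index 0 is the primary
        self.active = 0

    def call(self, request):
        last_error = None
        for offset in range(len(self.nodes)):
            index = (self.active + offset) % len(self.nodes)
            try:
                result = self.nodes[index](request)
            except ConnectionError as exc:
                last_error = exc  # node unreachable: try the next one
                continue
            self.active = index   # remember the node that worked
            return result
        raise RuntimeError("all nodes failed") from last_error
```

Remembering the working node avoids paying the failed‑call cost on every request while the primary is down. Failing back to the recovered primary is left out of this sketch.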

4.4 Timeout Control

Modules must set reasonable request timeouts. Excessive timeouts (e.g., 30 s) can cause thread blockage and cascade failures, known as a “snowball” effect.

Example cascade: inventory service timeout → product service blocked → order service blocked → client retries → further overload.

Reasonable timeouts should be set for inter‑service calls, database queries, cache accesses, and third‑party APIs.
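A caller‑side timeout with a fallback can be sketched using only the standard library. The helper name, pool size, and fallback value are assumptions. Note that the worker thread keeps running after the timeout, so real clients should also set timeouts on the underlying socket or driver.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool; its bounded size also limits how many slow calls can pile up.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_seconds, fallback):
    """Run func in the pool; return fallback if it does not finish in time."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        future.cancel()  # only takes effect if the call has not started yet
        return fallback


def slow_inventory_lookup():
    time.sleep(0.3)  # stands in for a slow downstream service
    return "stock=5"


print(call_with_timeout(slow_inventory_lookup, 0.05, "stock=unknown"))
```

The caller gets an answer after 50 ms instead of waiting on the slow dependency, which stops blocked threads from accumulating upstream.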

4.5 Circuit Breaking

When a downstream service repeatedly fails or times out, the caller opens the circuit (circuit protection) and returns fallback data instead of calling the service, preventing cascade failures.

Further details can be found in the “Service Snowball” article.
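A circuit breaker is usually modeled with three states: closed (calls pass through), open (calls fail fast), and half‑open (one trial call is allowed after a cooldown). The sketch below is a minimal single‑threaded illustration with hypothetical names and thresholds, not the implementation used by any particular library.

```python
import time


class CircuitBreaker:
    """Open after consecutive failures; allow a trial call after a cooldown."""

    def __init__(self, failure_threshold, reset_timeout, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()  # open: fail fast without calling downstream
            # Cooldown elapsed: half-open, let this one call through as a trial.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open the downstream service receives no traffic from this caller, which gives it room to recover.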

4.6 Degradation

During traffic spikes, non‑critical services or pages can be deliberately degraded, returning simple fallback responses (e.g., “Server busy, please try later”) to preserve core functionality.

Similarity between circuit breaking and degradation : both aim to keep the majority of services available.

Difference : circuit breaking is triggered by a failing downstream service; degradation is a proactive, system‑wide decision.
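Proactive degradation is often implemented as a feature toggle checked before the real handler runs. A minimal sketch, with hypothetical feature names and an in‑memory switch set standing in for a configuration center:

```python
BUSY_MESSAGE = "Server busy, please try later"

# In-memory set of switched-off features; a real system would read this
# from a configuration service so operators can flip it at runtime.
degraded_features = set()


def degrade(feature):
    degraded_features.add(feature)


def restore(feature):
    degraded_features.discard(feature)


def handle(feature, handler):
    """Serve the fallback for degraded features, the real handler otherwise."""
    if feature in degraded_features:
        return BUSY_MESSAGE
    return handler()


degrade("comments")  # operator decision during a traffic spike
print(handle("comments", lambda: "42 comments"))
print(handle("video_playback", lambda: "stream started"))
```

Here the non‑critical comments feature returns the fallback while core video playback continues to be served normally.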

5. Multi‑Region Active‑Active Deployment

5.1 Multi‑Data‑Center Deployment

Deploy multiple service instances across different geographic data centers while sharing the same business data.

If one service fails, traffic can be switched to another region.

The database can be shared or separate. A shared‑database architecture is simpler; separate databases require synchronization between regions.

Shared database across data centers.

Separate databases per region with asynchronous replication.

Cross‑region data transfer latency varies: same‑city dedicated line 1‑3 ms, inter‑city 50 ms, cross‑country ~200 ms.

5.2 Same‑City Dual‑Active

To avoid calls between data centers, the services, caches, and databases that handle a request are kept within the same data center, using service‑registry groups and primary‑secondary patterns.

5.3 Inter‑Region Active‑Active

Same‑city dual‑active cannot provide disaster recovery at the city level, so inter‑region active‑active is needed.

Two approaches for data synchronization:

Primary‑secondary replication provided by MySQL or Redis (may suffer performance issues with large data).

Asynchronous replication via message queues: data changes are published as messages and consumed by remote regions.
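The message‑queue approach can be simulated in process. In this sketch `queue.Queue` stands in for a real message broker, the region names are hypothetical, and the per‑key version number is an added assumption so that stale or duplicate messages can be ignored.

```python
import queue


class Region:
    """A region's local data store plus the version last applied per key."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.versions = {}


def write(region, outbox, key, value, version):
    """Apply a write locally, then publish the change for remote regions."""
    region.data[key] = value
    region.versions[key] = version
    outbox.put({"key": key, "value": value, "version": version})


def apply_pending(region, inbox):
    """Consume queued changes; skip any that are not newer than local state."""
    applied = 0
    while True:
        try:
            message = inbox.get_nowait()
        except queue.Empty:
            return applied
        if message["version"] > region.versions.get(message["key"], 0):
            region.data[message["key"]] = message["value"]
            region.versions[message["key"]] = message["version"]
            applied += 1
```

Until the remote region consumes the messages its data lags behind, so this design gives eventual consistency rather than strong consistency across regions.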

5.4 Two‑Region Three‑Center Architecture

This model combines a local data center, a same‑city data center, and a remote data center, implementing both same‑city dual‑active and inter‑region active‑active strategies.

Through Bilibili’s incident, I learned many lessons; this article is intended to spark discussion.
