
What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage

The article examines B‑Station’s July 2021 outage, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to achieve resilient systems.

ITPUB

Background

On the night of July 13‑14, 2021, B‑Station (Bilibili) suffered a sudden outage: the homepage returned 404, the mobile app could not load data, and various pages intermittently returned 502 or 404. Service gradually recovered over the following hours.

B‑Station outage screenshot

Root Cause

At 02:00 on July 14, the company announced that the incident had been caused by a server‑room failure and that the technical team had quickly diagnosed and repaired it. The outage was not caused by a fire.

Incident announcement

High Availability Overview

Definition

High Availability (HA) describes a system’s ability to remain operational with minimal downtime. A typical HA design uses a primary‑secondary (master‑slave) architecture so that if the primary node fails, the secondary can take over quickly.

Quantitative Metrics

MTBF (Mean Time Between Failures) – average interval between successive failures; larger MTBF indicates higher stability.

MTTR (Mean Time To Repair) – average time required to restore service after a failure; smaller MTTR reduces user impact.

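Availability is calculated as MTBF / (MTBF + MTTR) × 100%. Industry practice expresses availability as “nines” (e.g., 99.9% = three nines, which allows about 8.76 hours of downtime per year). Taking the B‑Station incident as roughly one hour of downtime, availability is about 99.99% when measured over a year (1 h out of 8,760 h, just short of four nines) but only about 95.8% when measured over that single day (1 h out of 24 h, below two nines).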
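The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch: the function names are illustrative, and it assumes a single failure in the measurement period, so that MTBF is approximated by the period minus the downtime.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime allowed in a period for a target availability."""
    return (1.0 - target) * period_hours


# One hour of downtime, measured over a year and over a single day.
yearly = availability(HOURS_PER_YEAR - 1, 1)
daily = availability(24 - 1, 1)
print(f"yearly: {yearly:.4%}")  # yearly: 99.9886%
print(f"daily:  {daily:.4%}")   # daily:  95.8333%

# Yearly downtime budgets for common targets.
for label, target in [("two nines", 0.99), ("three nines", 0.999), ("four nines", 0.9999)]:
    print(f"{label}: {downtime_budget_hours(target):.2f} h/year")
```

The measurement window matters: the same one-hour incident looks far worse over a day than over a year, so an availability figure is only meaningful together with its period.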

Availability calculation diagram

Techniques for Achieving High Availability

Common Techniques

Rate Limiting – controls request flow (fixed window, sliding window, token bucket, leaky bucket) to prevent overload.

Isolation – treats each service as an independent system; failures in one do not affect others. Tools: Sentinel, Hystrix.

Failover – redirects traffic from a failed node to a healthy one. Can be symmetric (peer‑to‑peer) or asymmetric (primary‑backup); in the asymmetric case the new primary is usually chosen through a consensus protocol such as Paxos or Raft.

Timeout Control – sets reasonable request timeouts (e.g., 30s) to avoid cascading “snowball” effects when downstream services become slow or unavailable.

Circuit Breaking – opens the circuit when a downstream service repeatedly fails or times out, returning fallback data instead of waiting.

Degradation – during spikes or incidents, non‑essential features are disabled and predefined fallback responses are returned to preserve core functionality.

Rate Limiting Algorithms

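Fixed Window – counts requests within a fixed interval and rejects requests beyond the limit. It does not smooth bursts at window boundaries: a client can send a full quota at the end of one window and another at the start of the next, briefly reaching up to twice the intended rate.

Sliding Window – counts requests over the trailing interval that ends at the current moment, so the window moves continuously and the boundary burst of the fixed window is avoided.

Leaky Bucket – queues incoming requests and releases them at a constant rate. Queued requests wait, which increases latency, and requests arriving when the queue is full are dropped.

Token Bucket – tokens are added to the bucket at a fixed rate (N per second) up to the bucket's capacity, and each request consumes one token; when the bucket is empty, requests are rejected. This permits short bursts up to the capacity while bounding the average rate. Distributed implementations often keep the bucket state in Redis.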
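Of the four algorithms, the token bucket is probably the easiest to illustrate. The following is a minimal single-process sketch, not a production limiter: the class name, the injectable clock parameter, and the values used are assumptions made for the example, and it is not thread-safe. A distributed version would keep the counters in shared storage such as Redis.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity      # start full so an initial burst is allowed
        self._last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if enough are available."""
        now = self._clock()
        # Credit the tokens earned since the last call, capped at capacity.
        self._tokens = min(self.capacity, self._tokens + (now - self._last) * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False


# Demo with a fake clock so the result does not depend on real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=3, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(4)]   # the bucket holds 3 tokens
fake_now[0] += 1.0                           # one second later, 2 tokens are back
later = [bucket.allow() for _ in range(3)]
print(burst)  # [True, True, True, False]
print(later)  # [True, True, False]
```

Refilling lazily on each call avoids a background timer; the trade-off is that the bucket state is only accurate at the moment `allow` is invoked.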

Token bucket diagram

Isolation Tools

Sentinel and Hystrix provide circuit‑breaker and fallback capabilities to isolate failures.
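One isolation mechanism these tools offer is capping the number of concurrent calls to a dependency, often called a bulkhead. The sketch below uses only the Python standard library to show the idea; it is not the API of Sentinel or Hystrix, and the `Bulkhead` class and the inventory example are hypothetical.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent: int) -> None:
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        # Non-blocking acquire: if every slot is busy, fail fast with the fallback.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return func()
        finally:
            self._slots.release()


inventory_bulkhead = Bulkhead(max_concurrent=1)


def slow_inventory_call() -> str:
    # While this call holds the only slot, a second call is rejected at once.
    nested = inventory_bulkhead.call(lambda: "ok", fallback=lambda: "rejected")
    return f"outer ok, nested {nested}"


result = inventory_bulkhead.call(slow_inventory_call, fallback=lambda: "rejected")
print(result)  # outer ok, nested rejected
```

Giving each downstream dependency its own small pool of slots means a slow dependency can exhaust only its own slots, not the threads shared by the whole service.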

Sentinel and Hystrix

Failover Strategies

Symmetric failover uses identical nodes that can each handle full traffic; any node can replace another.

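Asymmetric failover uses a primary node with one or more standby nodes (hot or cold). Failure of the primary is typically detected through heartbeats or health checks, and a consensus protocol such as Paxos or Raft is used so that the remaining nodes agree on a single new primary before traffic is switched.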
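The simplest form of asymmetric failover happens on the client side: try the primary first, then fall back to standbys in order. The sketch below illustrates only that part, under assumed names (`call_with_failover`, `fake_send`, and the node labels are made up for the example). It does not implement leader election; a real cluster would rely on a consensus protocol so that only one node acts as primary at a time.

```python
from typing import Callable, List, Sequence


class AllNodesFailed(Exception):
    """Raised when neither the primary nor any standby could serve the call."""


def call_with_failover(nodes: Sequence[str], send: Callable[[str], str]) -> str:
    """Try nodes in priority order (primary first) and return the first success."""
    errors: List[str] = []
    for node in nodes:
        try:
            return send(node)
        except ConnectionError as exc:
            errors.append(f"{node}: {exc}")  # remember why this node was skipped
    raise AllNodesFailed("; ".join(errors))


# Hypothetical transport: the primary is down, the standby answers.
def fake_send(node: str) -> str:
    if node == "primary":
        raise ConnectionError("connection refused")
    return f"handled by {node}"


reply = call_with_failover(["primary", "standby-1"], fake_send)
print(reply)  # handled by standby-1
```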

Timeout‑Induced Cascading Failures

If a downstream service (e.g., inventory) becomes unresponsive, upstream services wait, exhausting thread pools and causing a chain reaction that can bring the entire system down.
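A caller-side timeout is the usual first defence against this chain reaction. The following sketch uses Python's `concurrent.futures`; the helper name, the pool size, and the simulated inventory lookup are assumptions for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s: float, fallback):
    """Wait at most `timeout_s` for func(); return `fallback` if it is too slow."""
    future = _pool.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # cancel() only stops a task that has not started; a running one continues.
        future.cancel()
        return fallback


def slow_inventory_lookup() -> int:
    time.sleep(0.5)  # stands in for a stalled downstream service
    return 42


stock = call_with_timeout(slow_inventory_lookup, timeout_s=0.05, fallback=-1)
print(stock)  # -1
```

Note that the caller is released after 50 ms, but the worker thread stays busy until the slow call returns. For that reason a caller-side timeout is normally combined with timeouts on the network client itself, so that the underlying resource is freed as well.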

Cascading failure diagram

Circuit Breaking

When a service repeatedly fails, the circuit opens and calls return fallback data, preventing the failure from propagating.
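A circuit breaker can be sketched as a small state machine: closed (calls pass through), open (calls are short-circuited), and half-open (one trial call after a cool-down). The version below is a simplified illustration rather than the behaviour of any particular library; the thresholds, the injectable clock, and the names are assumptions, and it is not thread-safe.

```python
import time
from typing import Callable, List, Optional


class CircuitBreaker:
    """Opens after `max_failures` consecutive errors; retries after `reset_after` seconds."""

    def __init__(self, max_failures: int, reset_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: str) -> str:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                return fallback  # open: skip the downstream call entirely
            # Half-open: the cool-down has elapsed, so let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.max_failures:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback
        self._failures = 0
        self._opened_at = None  # a success closes the circuit
        return result


fake_now = [0.0]
calls: List[str] = []


def failing() -> str:
    calls.append("attempt")
    raise RuntimeError("downstream unavailable")


breaker = CircuitBreaker(max_failures=2, reset_after=10, clock=lambda: fake_now[0])
results = [breaker.call(failing, fallback="cached") for _ in range(5)]
print(len(calls))  # 2: only the first two calls reached the downstream service

fake_now[0] = 11.0  # cool-down elapsed, the next call is a trial
recovered = breaker.call(lambda: "fresh", fallback="cached")
print(recovered)   # fresh
```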

Circuit breaker illustration

Degradation

During high load, non‑critical services can be temporarily disabled, returning simple fallback responses (e.g., “service busy, please try later”) to preserve core functionality.
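In code, degradation often comes down to a switch consulted before doing non-essential work. The sketch below assumes a hypothetical feature registry and a load signal supplied by monitoring; the feature names and the 0.8 threshold are made up for the example.

```python
from typing import Callable, Dict

# Hypothetical feature registry: True marks a feature as core (never switched off).
FEATURES: Dict[str, bool] = {"checkout": True, "recommendations": False, "comments": False}
BUSY_MESSAGE = "service busy, please try later"


class Degrader:
    """Switches non-core features to a canned response when load crosses a threshold."""

    def __init__(self, load_threshold: float) -> None:
        self.load_threshold = load_threshold
        self.current_load = 0.0  # e.g. CPU utilisation or queue depth, fed in by monitoring

    def is_degraded(self, feature: str) -> bool:
        return not FEATURES[feature] and self.current_load >= self.load_threshold

    def handle(self, feature: str, handler: Callable[[], str]) -> str:
        if self.is_degraded(feature):
            return BUSY_MESSAGE  # skip the real work to protect core features
        return handler()


degrader = Degrader(load_threshold=0.8)
degrader.current_load = 0.95  # simulate a traffic spike
print(degrader.handle("recommendations", lambda: "top picks"))  # service busy, please try later
print(degrader.handle("checkout", lambda: "order placed"))      # order placed
```

Deciding in advance which features are core, and what each non-core feature returns when switched off, is what makes degradation predictable during an incident.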

Degradation flow

Multi‑Region Active‑Active Deployment

Cross‑Region Architecture

Deploy multiple service instances in different data centers that share the same business data. If one region fails, traffic can be shifted to another region.

Database strategies:

Shared database across regions.

Separate databases per region with synchronization (more complex).

Shared database diagram
Separate databases with sync

Typical network latencies:

Intra‑city dedicated line: 1‑3 ms.

Inter‑city dedicated line: ~50 ms.

Cross‑country: ~200 ms.
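These numbers matter most when a single user request triggers several calls in sequence. As a rough worked example, the sketch below treats each figure as the cost of one call and takes the 3 ms upper bound for intra-city links; the count of 10 sequential calls is an assumption chosen for illustration.

```python
# Per-call latencies quoted above, in milliseconds (intra-city uses the 3 ms upper bound).
LATENCY_MS = {"intra_city": 3, "inter_city": 50, "cross_country": 200}


def added_latency_ms(link: str, sequential_calls: int) -> int:
    """Network time added when a request makes N sequential calls over one link type."""
    return LATENCY_MS[link] * sequential_calls


for link in LATENCY_MS:
    print(f"{link}: {added_latency_ms(link, 10)} ms for 10 sequential calls")
# intra_city: 30 ms for 10 sequential calls
# inter_city: 500 ms for 10 sequential calls
# cross_country: 2000 ms for 10 sequential calls
```

Ten sequential calls are barely noticeable within a city but add half a second between cities, which is why the designs that follow try to keep a request's calls inside one data center.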

Same‑City Dual‑Active

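Run two data centers in the same city, both serving traffic. Group services, caches, and databases per data center so that RPC and cache accesses stay within the data center that received the request, avoiding cross‑data‑center calls on the request path.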

Active‑Active Across Regions

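Because latency between regions is far higher than within a city, services cannot afford cross‑region calls on the request path, so each region needs its own copy of the data and keeping those copies synchronized becomes the critical problem. Two common approaches: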

Master‑slave replication (e.g., MySQL, Redis) – simple but may become a performance bottleneck at large scale.

Asynchronous replication via message queues – changes are published as messages and consumed by remote regions.
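The message-queue approach can be sketched as follows. An in-process `queue.Queue` stands in for a real message broker, and the per-key version number used to discard duplicate or stale messages is an assumption of this example rather than something the article prescribes.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class ChangeEvent:
    key: str
    value: str
    version: int  # increases with every write to the key, assigned by the writing region


class RegionStore:
    """Key-value store for one region that applies replicated changes idempotently."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}
        self.versions: Dict[str, int] = {}

    def apply(self, event: ChangeEvent) -> bool:
        # Ignore duplicates and out-of-order deliveries: only newer versions win.
        if event.version <= self.versions.get(event.key, 0):
            return False
        self.data[event.key] = event.value
        self.versions[event.key] = event.version
        return True


channel: "queue.Queue[ChangeEvent]" = queue.Queue()  # stands in for a real message queue
local, remote = RegionStore(), RegionStore()


def write_local(key: str, value: str, version: int) -> None:
    event = ChangeEvent(key, value, version)
    local.apply(event)  # commit in the local region first
    channel.put(event)  # then publish the change for other regions


write_local("user:1", "alice", 1)
write_local("user:1", "alice-v2", 2)
channel.put(ChangeEvent("user:1", "alice", 1))  # simulate a redelivered, stale message

while not channel.empty():  # the remote region consumes at its own pace
    remote.apply(channel.get())

print(remote.data)  # {'user:1': 'alice-v2'}
```

Until the remote region has consumed the queue, it serves older data, so this approach gives eventual rather than immediate consistency between regions.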

Two‑Site‑Three‑Center Model

Combines a production data center, a second data center in the same city, and a remote disaster‑recovery data center in another city (two sites, three centers), extending the dual‑active concept to three locations for greater resilience.

Two‑site‑three‑center diagram

Achieving true high availability requires quantitative metrics, robust architectural patterns, and disciplined operational practices.
