What Makes a System Truly High‑Availability? Lessons from B‑Station’s Outage
The article examines B‑Station’s July 2021 outage, explains the concept and quantitative metrics of high availability, and outlines practical techniques such as rate limiting, isolation, failover, timeout control, circuit breaking, degradation, and multi‑region deployment to achieve resilient systems.
Background
On the night of July 13‑14, 2021, B‑Station (Bilibili) suffered a sudden outage: the homepage returned 404, the mobile app could not load data, and various pages intermittently returned 502 or 404. Service gradually recovered over the following hours.
Root Cause
At 02:00 on July 14 the company announced that a server‑room failure caused the incident. The technical team performed rapid troubleshooting and restoration; the outage was not due to a fire.
High Availability Overview
Definition
High Availability (HA) describes a system’s ability to remain operational with minimal downtime. A typical HA design uses a primary‑secondary (master‑slave) architecture so that if the primary node fails, the secondary can take over quickly.
Quantitative Metrics
MTBF (Mean Time Between Failures) – average interval between successive failures; larger MTBF indicates higher stability.
MTTR (Mean Time To Repair) – average time required to restore service after a failure; smaller MTTR reduces user impact.
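Availability is calculated as MTBF / (MTBF + MTTR) * 100%. Industry practice expresses availability as “nines” (e.g., 99.9% = three nines). Taking the B‑Station incident as roughly a 1‑hour outage, one hour of downtime in a year (8,760 hours) gives about 99.989% availability, which is three nines and just short of four. Measured over the single day of the incident, one hour out of 24 gives about 95.8%, which is only one nine.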
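The arithmetic behind the formula can be checked with a short sketch. The helper names `availability` and `nines` are illustrative, and the one-hour downtime figure is the approximation used above, not a measured value.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Availability as a fraction: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def nines(avail: float) -> int:
    """Count whole leading nines (0.999 -> 3). Capped at 9 for simplicity."""
    count = 0
    while count < 9 and avail >= 1 - 1 / (10 ** (count + 1)):
        count += 1
    return count


HOURS_PER_YEAR = 365 * 24  # 8760

# One hour of downtime measured over a year, and over a single day.
yearly = availability(HOURS_PER_YEAR - 1, 1)
daily = availability(24 - 1, 1)

print(f"yearly: {yearly:.4%}, nines = {nines(yearly)}")
print(f"daily:  {daily:.4%}, nines = {nines(daily)}")
```

Counting only whole nines, one hour of downtime is three nines over a year but a single nine over the day it happened, which shows how strongly the result depends on the measurement window.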
Techniques for Achieving High Availability
Common Techniques
Rate Limiting – controls request flow (fixed window, sliding window, token bucket, leaky bucket) to prevent overload.
Isolation – treats each service as an independent system; failures in one do not affect others. Tools: Sentinel, Hystrix.
Failover – redirects traffic from a failed node to a healthy one. Can be symmetric (peer‑to‑peer) or asymmetric (primary‑backup); in the asymmetric case the new primary is typically chosen by leader election built on a consensus protocol such as Paxos or Raft.
Timeout Control – sets reasonable request timeouts (e.g., 30s) to avoid cascading “snowball” effects when downstream services become slow or unavailable.
Circuit Breaking – opens the circuit when a downstream service repeatedly fails or times out, returning fallback data instead of waiting.
Degradation – during spikes or incidents, non‑essential features are disabled and predefined fallback responses are returned to preserve core functionality.
Rate Limiting Algorithms
Fixed Window – counts requests within a fixed interval; excess requests are rejected. Does not limit short bursts: a burst that straddles a window boundary can admit up to twice the configured limit.
Sliding Window – moves the time window continuously, providing smoother rate control.
Leaky Bucket – emits requests at a constant rate; excess requests are queued, which can increase latency.
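Token Bucket – tokens are added to the bucket at a fixed rate (N per second) and each request consumes one; a request that finds the bucket empty is rejected or delayed, while bursts up to the bucket's capacity are allowed. Distributed implementations often use Redis.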
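As an illustration of the last algorithm, here is a minimal single-process token bucket. It is a sketch under assumptions: the class name, the `clock` parameter (injected so the behavior can be tested without sleeping), and the choice to reject rather than queue are all illustrative. It is not thread-safe and is not a distributed (Redis-backed) limiter.

```python
import time
from typing import Callable


class TokenBucket:
    """In-process token bucket: refills at `rate` tokens/second up to `capacity`."""

    def __init__(self, rate: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity        # start full so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost: float = 1.0) -> bool:
        """Return True and consume `cost` tokens if enough are available."""
        now = self.clock()
        elapsed = now - self.last_refill
        self.last_refill = now
        # Lazy refill: add the tokens earned since the previous call.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5, capacity=10)
    results = [bucket.allow() for _ in range(12)]
    print(results.count(True), "allowed,", results.count(False), "rejected")
```

The capacity bounds the burst size and the rate bounds the sustained throughput, so the two can be tuned independently.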
Isolation Tools
Sentinel and Hystrix provide circuit‑breaker and fallback capabilities to isolate failures.
Failover Strategies
Symmetric failover uses identical nodes that can each handle full traffic; any node can replace another.
Asymmetric failover uses a primary node with one or more standby nodes (hot or cold). Heartbeats or health checks detect primary failure, and leader election (e.g., via Paxos or Raft) decides which standby is promoted.
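The primary-backup switch can be sketched as follows. This is a deliberately simplified, single-checker illustration: `Node`, `FailoverGroup`, and the `is_healthy` probe are hypothetical names, and there is no consensus protocol here, so it does not protect against split-brain the way a Paxos- or Raft-based election would.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Node:
    name: str
    is_healthy: Callable[[], bool]   # hypothetical health probe


class FailoverGroup:
    """Primary-backup group: traffic goes to one active node at a time."""

    def __init__(self, nodes: List[Node]) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        self.nodes = nodes                       # index 0 is the preferred primary
        self.active: Optional[Node] = nodes[0]

    def check(self) -> Optional[Node]:
        """Keep the active node while it is healthy; otherwise promote
        the first healthy node in priority order (None if all are down)."""
        if self.active is not None and self.active.is_healthy():
            return self.active
        self.active = next((n for n in self.nodes if n.is_healthy()), None)
        return self.active
```

Note the design choice: a recovered primary does not automatically take traffic back, which avoids flapping between nodes when the primary is unstable.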
Timeout‑Induced Cascading Failures
If a downstream service (e.g., inventory) becomes unresponsive, upstream services wait, exhausting thread pools and causing a chain reaction that can bring the entire system down.
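One way to keep a slow dependency from tying up callers is to bound every downstream call with a deadline. The sketch below uses `asyncio.wait_for` from the Python standard library; `call_inventory` is a stand-in for a real inventory client, and the 0.2 s default is an arbitrary value chosen for the demo.

```python
import asyncio


async def call_inventory(delay: float) -> str:
    """Stand-in for a downstream inventory call; `delay` simulates slowness."""
    await asyncio.sleep(delay)
    return "in stock"


async def get_stock(delay: float, timeout: float = 0.2) -> str:
    """Call the downstream service, giving up after `timeout` seconds."""
    try:
        # wait_for cancels the inner call once the deadline passes,
        # so the caller is released instead of waiting indefinitely.
        return await asyncio.wait_for(call_inventory(delay), timeout=timeout)
    except asyncio.TimeoutError:
        return "unknown (inventory timed out)"


async def main() -> None:
    print(await get_stock(delay=0.01))   # fast dependency
    print(await get_stock(delay=5.0))    # slow dependency, cut off at the deadline


if __name__ == "__main__":
    asyncio.run(main())
```

The timeout should sit just above the dependency's normal tail latency; a value that is too generous still lets waiting requests pile up during an incident.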
Circuit Breaking
When a service repeatedly fails, the circuit opens and calls return fallback data, preventing the failure from propagating.
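A minimal circuit breaker might look like the sketch below. It is not the Sentinel or Hystrix implementation: the class, its thresholds, and the injected `clock` are assumptions made for illustration. It counts consecutive failures, fails fast while open, and lets a single trial call through after `reset_timeout` seconds (the usual "half-open" state).

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0                          # consecutive failures
        self.opened_at: Optional[float] = None     # None means the circuit is closed

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()                  # open: fail fast, skip the call
            # Otherwise half-open: fall through and try one real call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()      # open, or re-open after a failed trial
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each downstream request, for example `breaker.call(fetch_recommendations, lambda: [])`, where `fetch_recommendations` is a hypothetical function that raises on failure or timeout.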
Degradation
During high load, non‑critical services can be temporarily disabled, returning simple fallback responses (e.g., “service busy, please try later”) to preserve core functionality.
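Degradation is often driven by a switch that operators (or an automated load monitor) can flip per feature. The following sketch assumes an in-memory switch; in practice the flags would usually live in a configuration service. All names, including the `recommendations` feature in the usage note, are illustrative.

```python
from typing import Callable, Set

BUSY_MESSAGE = "service busy, please try later"


class DegradationSwitch:
    """Tracks which non-critical features are currently turned off."""

    def __init__(self) -> None:
        self._degraded: Set[str] = set()

    def degrade(self, feature: str) -> None:
        self._degraded.add(feature)

    def restore(self, feature: str) -> None:
        self._degraded.discard(feature)

    def run(self, feature: str, handler: Callable[[], str],
            fallback: str = BUSY_MESSAGE) -> str:
        """Run the handler unless the feature is degraded; then return the fallback."""
        if feature in self._degraded:
            return fallback          # the handler is skipped, so it consumes no resources
        return handler()
```

For example, calling `switch.degrade("recommendations")` during a traffic spike makes `switch.run("recommendations", load_recommendations)` return the busy message while core paths that are never degraded keep running normally.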
Multi‑Region Active‑Active Deployment
Cross‑Region Architecture
Deploy multiple service instances in different data centers that share the same business data. If one region fails, traffic can be shifted to another region.
Database strategies:
Shared database across regions.
Separate databases per region with synchronization (more complex).
Typical network latencies:
Intra‑city dedicated line: 1‑3 ms.
Inter‑city dedicated line: ~50 ms.
Cross‑country: ~200 ms.
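To see why these figures matter, consider a request that makes several sequential calls across a link. The sketch below simply multiplies the quoted latency by the number of calls; the 2 ms intra-city value is an assumed midpoint of the 1‑3 ms range, and the call count of 20 is a made-up example.

```python
# Per-call latency in milliseconds, based on the typical figures above.
# 2 ms is an assumed midpoint of the 1-3 ms intra-city range.
LATENCY_MS = {"intra_city": 2, "inter_city": 50, "cross_country": 200}


def added_latency_ms(link: str, sequential_calls: int) -> int:
    """Network latency added to one request by N sequential calls over a link."""
    return LATENCY_MS[link] * sequential_calls


for link in LATENCY_MS:
    print(f"{link}: {added_latency_ms(link, 20)} ms for 20 sequential calls")
```

Under these assumptions, twenty sequential calls add 40 ms within a city but a full second between cities, which is why chatty RPC and cache traffic is best kept inside one location.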
Same‑City Dual‑Active
Group services, caches, and databases per city so that RPC and cache accesses stay within the same data center, avoiding cross‑region calls.
Active‑Active Across Regions
Because latency is higher, data synchronization becomes critical. Two common approaches:
Master‑slave replication (e.g., MySQL, Redis) – simple but may become a performance bottleneck at large scale.
Asynchronous replication via message queues – changes are published as messages and consumed by remote regions.
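The message-queue approach can be sketched with an in-memory queue standing in for a real broker. This is an assumption-heavy illustration: `Region`, `Change`, `write`, and `drain` are hypothetical names, a single region is the only writer, and per-key version numbers are used so that duplicate or out-of-order deliveries are ignored.

```python
import queue
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Change:
    key: str
    value: str
    version: int      # increases by one per key at the writing region


class Region:
    def __init__(self, name: str) -> None:
        self.name = name
        self.data: Dict[str, str] = {}
        self.versions: Dict[str, int] = {}

    def apply(self, change: Change) -> bool:
        """Apply a change unless the same or a newer version is already stored."""
        if change.version <= self.versions.get(change.key, 0):
            return False              # duplicate or stale message: ignore
        self.data[change.key] = change.value
        self.versions[change.key] = change.version
        return True


def write(region: Region, mq: "queue.Queue[Change]", key: str, value: str) -> None:
    """Write locally first, then publish the change for remote regions."""
    change = Change(key, value, region.versions.get(key, 0) + 1)
    region.apply(change)
    mq.put(change)


def drain(mq: "queue.Queue[Change]", target: Region) -> int:
    """Consume all pending changes into the target region; return how many applied."""
    applied = 0
    while True:
        try:
            change = mq.get_nowait()
        except queue.Empty:
            return applied
        applied += int(target.apply(change))
```

Until `drain` runs, the remote region serves stale data; that window is the replication lag that asynchronous designs accept in exchange for not blocking local writes on cross-region latency.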
Two‑Site‑Three‑Center Model
Combines a local data center, a same‑city data center, and a remote data center, extending dual‑active concepts to three locations for greater resilience.
Achieving true high availability requires quantitative metrics, robust architectural patterns, and disciplined operational practices.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
