
Bilibili CDN/SLB Outage Analysis and Cloud‑Edge Coordination Strategies

The August 4, 2023 Bilibili outage, triggered by automatic back-origin and domain-disable policies that flooded the BFS load balancer with traffic, caused widespread white screens. It was mitigated within the 1-5-10 framework through rapid CDN switching, rate-limit enhancements, storage backup, and client-side fallback, and it illustrates the need for tighter cloud-edge coordination.

Bilibili Tech

Background

On August 4, 2023, from 21:00 to 21:20, a CDN service failure from a cloud provider caused a sudden surge of back‑origin traffic. The overload of the BFS SLB (Bilibili File Service Load Balancer) led to widespread white‑screen and video playback failures for users relying on image, JS, and CSS static resources.

The incident was triggered by two automatic domain‑offline policies:

Back‑origin resolution: when a domain is taken offline, it is resolved directly to the origin server.

Domain disable: the domain is set to a "disabled" state; CDN acceleration stops and the CDN returns a 403 response.

Both policies generated massive traffic spikes and 403 responses, causing service paralysis.

Event Analysis

1. North‑South Traffic Flow

Client devices (APP, Web, OTT) request Bilibili images through CDN → SLB (layer-7 load balancer) → BFS (internal small-file storage). The architecture includes:

CDN: multiple third‑party providers.

BFS-dedicated SLB: layer-4 load balancer serving as the CDN back-origin.

BFS service: small‑file storage with image processing.

Cache cluster: built with Apache Traffic Server, which handles massive numbers of small files better than NGINX Proxy Cache.

Image service: provides scaling, cropping, rotation, format conversion.

Storage service: internal object storage (BOSS).

2. Root‑Cause Analysis

Back‑origin policy: normal CDN hit rate is 98%; when the domain went offline, all traffic was forced to origin, instantly generating hundreds of gigabytes of requests.

SLB overload: despite multi-AZ deployment, CPU reached 100% within seconds and the SLB returned 5XX errors. No rate limiting was configured because the CDN was assumed to absorb most traffic.

Failed rate‑limit deployment: both AZs hit CPU 100% simultaneously, preventing configuration updates.

Node expansion failure: new nodes were added but immediately marked unhealthy, unable to handle the surge.

Domain‑disable policy: CDN returned 403, causing white‑screen or failed page loads.

APP retry logic: the network library performs domain‑level fallback (A → B) but does not retry on 403, as 403 is treated as a security block.
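The back-origin amplification above is worth making explicit: at a 98% cache hit rate the origin normally serves only 2% of traffic, so forcing everything back to origin multiplies origin load by roughly 50. A minimal sketch of that calculation (illustrative numbers only):

```go
package main

import "fmt"

// originAmplification returns how many times origin load grows when the CDN
// stops absorbing traffic, given the steady-state cache hit rate.
func originAmplification(hitRate float64) float64 {
	// Normally the origin serves only the miss fraction (1 - hitRate);
	// when the domain resolves directly to origin, it serves 100%.
	return 1.0 / (1.0 - hitRate)
}

func main() {
	// With the 98% hit rate from the incident, origin load jumps ~50x.
	fmt.Printf("amplification at 98%% hit rate: %.0fx\n", originAmplification(0.98))
}
```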

3. Emergency Recovery Phase

Bilibili follows the GOC (Global Operations Center) 1‑5‑10 framework:

1 minute: fault detection.

5 minutes: fault localization.

10 minutes: fault recovery.

The target for a global incident is 1 + 5 + 10 minutes, i.e., total recovery within 16 minutes.

Execution details:

CDN: alerted within ~3 min, identified the provider, switched traffic to a backup CDN.

SLB: CPU alarm at ~2 min, node expansion performed within 5 min, followed by gradual rate‑limit release.

BFS: back‑origin alarm at ~3 min, response actions within 5 min, concurrent backend scaling.

4. Cloud‑Edge Collaboration

4.1 Cloud – SLB Optimization

Added global connection limits and BBR‑style adaptive rate limiting (discard traffic when CPU > 80%).

Reserved core resources for the control plane to keep configuration updates possible.

Implemented multi‑cluster isolation so that a failure only impacts static‑resource back‑origin, not the entire service.
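The adaptive rate limiting above can be sketched as follows, loosely inspired by the Kratos BBR limiter: a request is shed only when CPU is over the threshold and in-flight requests exceed an estimated capacity. The CPU source and the fixed in-flight cap here are stand-ins (the real algorithm estimates capacity from a maxPass × minRT window):

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// ErrLimitExceed is returned when the limiter sheds a request.
var ErrLimitExceed = errors.New("request dropped by adaptive limiter")

// Limiter is a minimal BBR-style adaptive limiter: it sheds load only when
// CPU usage exceeds the threshold AND in-flight requests exceed capacity.
type Limiter struct {
	cpuThreshold int64        // e.g. 800 = 80.0% CPU
	cpu          func() int64 // current CPU usage source (stand-in here)
	inflight     int64        // requests currently being served
	maxInflight  int64        // stand-in for the maxPass*minRT estimate
}

// Allow admits or sheds a request; on admit, call done() when finished.
func (l *Limiter) Allow() (done func(), err error) {
	if l.cpu() >= l.cpuThreshold && atomic.LoadInt64(&l.inflight) >= l.maxInflight {
		return nil, ErrLimitExceed
	}
	atomic.AddInt64(&l.inflight, 1)
	return func() { atomic.AddInt64(&l.inflight, -1) }, nil
}

func main() {
	// Simulated 90% CPU with capacity for two in-flight requests.
	lim := &Limiter{cpuThreshold: 800, cpu: func() int64 { return 900 }, maxInflight: 2}
	for i := 0; i < 3; i++ {
		if done, err := lim.Allow(); err != nil {
			fmt.Println("request", i, "shed")
		} else {
			defer done() // hold admitted requests in flight for the demo
			fmt.Println("request", i, "admitted")
		}
	}
	// Prints: requests 0 and 1 admitted, request 2 shed.
}
```

The key design point is that a CPU spike alone never drops traffic; shedding only starts once the in-flight count also exceeds the capacity estimate, which avoids over-throttling during short bursts.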

4.2 Cloud – BFS Optimization

Backed up all static files to third-party cloud storage; in total-outage scenarios, the CDN falls back to this storage.

Isolated the review platform’s image traffic into a dedicated SLB cluster.

Separated internal middleware (SLB management platform, release platform) onto independent physical machines to break circular dependencies.

4.3 Edge – Image CDN Intelligent Scheduling

The system consists of three sub-systems: intelligent alarm → root-cause analysis → automatic scheduling. It achieves ~95% automated fault handling at the ISP level, but global incidents still rely on manual decisions.

DNS Scheduling

LDNS-based traffic steering is the industry standard for image/CDN routing. Bilibili combines LDNS with HttpDNS, but the granularity is coarse and the switching latency is high.

Client‑Side Scheduling

Since late 2022, Bilibili has deployed client‑side fallback for images and APIs. The workflow includes:

Receive global split‑traffic policy (e.g., domain ratios 4:3:3 for i0:i1:i2).

Compute bucket using device ID (buvid).

Map bucket to a prioritized domain list, e.g., [i2, i0, i1].

All requests initially use the first domain; on failure, retry the next domain up to two times.

Failure handling covers connection errors, certificate errors, and HTTP 5xx responses (e.g., 504). A successful fallback updates the default domain until the next cold start.
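The bucketing workflow above can be sketched as follows. The FNV hash, the ten-bucket layout, and the bare i0/i1/i2 names are assumptions for illustration, not Bilibili's actual implementation:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// domainOrder maps a device ID (buvid) to a prioritized domain list under a
// 4:3:3 traffic split across i0/i1/i2. The first element is the default
// domain; the rest are fallbacks tried on failure.
func domainOrder(buvid string) []string {
	h := fnv.New32a()
	h.Write([]byte(buvid))
	bucket := h.Sum32() % 10 // buckets 0-3 -> i0, 4-6 -> i1, 7-9 -> i2

	var first string
	switch {
	case bucket < 4:
		first = "i0"
	case bucket < 7:
		first = "i1"
	default:
		first = "i2"
	}

	// Remaining domains become the fallback order, e.g. [i2, i0, i1].
	order := []string{first}
	for _, d := range []string{"i0", "i1", "i2"} {
		if d != first {
			order = append(order, d)
		}
	}
	return order
}

func main() {
	fmt.Println(domainOrder("example-buvid"))
}
```

Because the bucket is a pure function of the device ID, every cold start reproduces the same split without any server round-trip, and the global ratio can still be retuned by shipping a new policy.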

AutoFallback SDK

The SDK intercepts image error events, calls AutoFallback.getNext(img.src) to obtain the next resource URL, and respects a final‑fallback flag to avoid unnecessary calculations.
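The lookup at the heart of that SDK can be sketched as below (in Go, for consistency with the other examples; the real SDK runs in the client, and the per-URL attempt tracking and method shape here are assumptions):

```go
package main

import (
	"fmt"
	"net/url"
)

// AutoFallback tracks, per failing URL, how far down the fallback domain
// list we have gone. getNext rewrites the host and reports whether the
// returned URL is the final fallback, so callers can stop retrying.
type AutoFallback struct {
	domains []string       // prioritized fallback domains
	tries   map[string]int // attempts seen per original URL
}

func NewAutoFallback(domains []string) *AutoFallback {
	return &AutoFallback{domains: domains, tries: map[string]int{}}
}

// getNext returns the next candidate URL; isFinal is true once the last
// fallback domain has been handed out (the final-fallback flag).
func (a *AutoFallback) getNext(src string) (next string, isFinal bool) {
	i := a.tries[src]
	if i >= len(a.domains)-1 {
		i = len(a.domains) - 1
		isFinal = true
	}
	a.tries[src] = i + 1

	u, err := url.Parse(src)
	if err != nil {
		return src, true // unparseable URL: give up on fallback
	}
	u.Host = a.domains[i]
	return u.String(), isFinal
}

func main() {
	// The primary domain i2 has failed; i0 and i1 remain as fallbacks.
	fb := NewAutoFallback([]string{"i0", "i1"})
	fmt.Println(fb.getNext("https://i2/bfs/archive/cover.jpg")) // https://i0/bfs/archive/cover.jpg false
	fmt.Println(fb.getNext("https://i2/bfs/archive/cover.jpg")) // https://i1/bfs/archive/cover.jpg true
}
```

Once isFinal is true, the caller stops re-registering the error handler, which is what lets the SDK skip unnecessary calculations after the last fallback.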

5. Summary

Previous incident responses focused on the stability of individual components while neglecting end-to-end coordination. By unifying rate-limit policies, establishing layered degradation plans, and leveraging client-side scheduling, overall resilience can be greatly improved.

Future work will deepen cloud‑edge integration, push traffic‑dispatch intelligence to the edge, and continue to apply the 1‑5‑10 reliability governance model to raise SLA guarantees across all link layers.

References

"Mastering Ultra-Large-Scale Distributed Storage Architecture in Thirteen Days: A Brief Look at Bilibili Object Storage (BOSS)" (WeChat article)

"A Survey of Alibaba's 1-5-10 Stability Metrics" (WeChat article)

Kratos BBR rate-limit source-code analysis (cnblogs)

Evolution of Alibaba's mobile network library (Aliyun Developer)
