
How Bilibili Overcame a Massive CDN Outage: Cloud‑Edge Incident Response Lessons

This article details the August 2023 Bilibili CDN failure, analyzes its root causes, describes the 1‑5‑10 emergency recovery framework, and presents cloud‑side SLB/BFS optimizations and edge‑side scheduling and fallback strategies that together restored service and improved future resilience.

ITPUB

Background

On August 4, 2023, a CDN failure at one cloud provider caused a sudden surge of back‑origin traffic that overloaded the SLB in front of BFS. Image, JS, and CSS resources failed to load, and users saw widespread white screens and video playback problems.

The failure triggered two automatic domain‑downgrade policies. "Origin back‑routing" (domains resolve directly to the origin) sent massive traffic to the origin, and "domain disable" (domains return 403) produced user‑visible errors.

Incident Analysis

North‑South Traffic Analysis

Clients (APP, Web, OTT) request Bilibili image services along the path CDN → SLB (Layer 7 load balancer) → BFS (internal small‑file storage). The architecture consists of:

CDN provided by multiple third‑party vendors.

BFS‑dedicated SLB clusters acting as the back‑origin target.

BFS service offering small‑file storage and image processing.

Cache cluster built with Apache Traffic Server, better suited for massive small‑file caching than NGINX.

Image service handling scaling, cropping, format conversion, etc.

Storage service backed by BOSS object storage.

Root‑Cause Analysis

Origin back‑routing flooded the origin with hundreds of gigabytes of traffic because the CDN’s cache hit rate (≈98%) dropped to 0.

SLB CPU hit 100% in multiple AZs and the nodes returned 5xx errors; no rate limiting had been configured, on the assumption that CDN caching would prevent overload.

Rate‑limit configuration failed to publish because both AZs’ CPUs were saturated.

Node scaling was ineffective; newly added nodes were immediately killed, and the limited number of backend machines could not handle the multiplied traffic.

BFS received no real back‑origin requests initially because traffic was stuck at the overloaded SLB.

The "domain disable" policy returned 403 to users, causing white‑screens and failed JS loads.

APP network library performed domain‑level downgrade retries for generic network failures but deliberately ignored 403 responses, so no automatic fallback occurred.
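The first root cause is easy to quantify. The following is a minimal sketch of the arithmetic, assuming client demand stays constant and looking only at the share of traffic served by the failed vendor; the function names are illustrative and do not come from the source.

```python
def origin_share(hit_ratio: float) -> float:
    """Fraction of edge requests that reach the origin for a given cache hit ratio."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    return 1.0 - hit_ratio


def amplification(hit_before: float, hit_after: float) -> float:
    """How many times origin load grows when the hit ratio falls."""
    before = origin_share(hit_before)
    if before == 0.0:
        raise ValueError("origin saw no traffic before; amplification is undefined")
    return origin_share(hit_after) / before


# Hit ratio from the text: about 98% before the failure, 0 after it.
factor = amplification(0.98, 0.0)
print(f"origin load grows about {factor:.0f}x")  # origin load grows about 50x
```

At a 98% hit ratio the origin serves only 2% of requests, so capacity sized for normal back‑origin load has roughly a fiftyfold gap to close once caching disappears. This is consistent with the observation that adding a few nodes did not help.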

Emergency Recovery Phase

Bilibili follows a GOC (Goal‑Oriented‑Control) model based on the 1‑5‑10 principle: discover the fault within 1 minute, locate it within 5 minutes, and restore service within 10 minutes, aiming for a total recovery time under 16 minutes.

Fault prevention: pre‑check potential issues.

Fast recovery: rapid remediation of non‑preventable faults.

Fault improvement: avoid repeat incidents.

During the incident:

CDN: abnormal error‑rate alerts fired in ~3 minutes; the team identified the faulty vendor and began switching traffic to alternative providers.

SLB received CPU alerts in ~2 minutes, performed node scaling within 5 minutes, and applied gradual rate‑limiting to restore storage throughput.

BFS received back‑origin alerts in ~3 minutes, responded within 5 minutes, and scaled backend resources to handle post‑SLB traffic rebound.
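The 1‑5‑10 budgets can be checked mechanically against an incident timeline. The sketch below assumes the three budgets are consecutive durations (which is what makes the total 16 minutes); the `PhaseTimes` type and the locate and restore figures in the example are hypothetical, and only the ~3 minute detection time comes from the text.

```python
from dataclasses import dataclass

# Budgets in minutes, taken from the 1-5-10 principle.
BUDGETS = {"discover": 1.0, "locate": 5.0, "restore": 10.0}


@dataclass
class PhaseTimes:
    discover: float  # minutes from fault start to first alert
    locate: float    # minutes from alert to identified cause
    restore: float   # minutes from identified cause to recovery


def over_budget(times: PhaseTimes) -> dict[str, float]:
    """Return the phases that exceeded their budget, with the overrun in minutes."""
    overruns = {}
    for phase, budget in BUDGETS.items():
        spent = getattr(times, phase)
        if spent > budget:
            overruns[phase] = spent - budget
    return overruns


# Only discover=3.0 reflects the text; the other two values are made up.
example = PhaseTimes(discover=3.0, locate=4.0, restore=9.0)
print(sum(BUDGETS.values()))  # 16.0
print(over_budget(example))   # {'discover': 2.0}
```

Under these assumptions, a ~3 minute detection already misses the 1‑minute discovery target even if the later phases finish on time.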

Cloud‑Edge Coordination

Cloud: SLB Optimization

SLB is the entry guard of the data center; defenses such as WAF and request rate limiting are configured per core domain. Traffic is split across isolated clusters by type, which confined this fault to static‑resource back‑origin traffic:

Online API data forwarding.

Static resource forwarding (image/html/css/js).

Offline log reporting.

During the outage, the control plane lost connectivity to the SLB nodes, so configuration changes could not be pushed, and newly added nodes could not absorb the surge. Mitigations applied:

Global connection‑count limits tuned via load testing.

Reserved core resources for control plane to avoid CPU exhaustion.

BBR‑style adaptive rate‑limiting that discards traffic when load exceeds 80 %.
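A minimal sketch of how such adaptive shedding can work, assuming "load" means CPU utilisation and that capacity is estimated as peak throughput times minimum latency (the idea behind BBR‑style limiters). The class, its fields, and the sampling interface are hypothetical and are not the SLB's actual implementation; a real limiter would use sliding windows rather than all‑time extremes.

```python
from typing import Optional


class AdaptiveLimiter:
    """Sheds requests only when the node is hot AND already at estimated capacity."""

    def __init__(self, cpu_threshold: float = 0.80):
        self.cpu_threshold = cpu_threshold
        self.inflight = 0                          # requests currently being served
        self.max_pass_per_sec = 0.0                # best observed throughput
        self.min_rt_sec: Optional[float] = None    # best observed latency

    def observe(self, passed_per_sec: float, rt_sec: float) -> None:
        """Record a throughput/latency sample (windowing omitted for brevity)."""
        self.max_pass_per_sec = max(self.max_pass_per_sec, passed_per_sec)
        if self.min_rt_sec is None or rt_sec < self.min_rt_sec:
            self.min_rt_sec = rt_sec

    def capacity(self) -> float:
        """Little's law estimate of how many requests can be in flight at once."""
        if self.min_rt_sec is None:
            return float("inf")  # no samples yet, so never shed
        return self.max_pass_per_sec * self.min_rt_sec

    def try_acquire(self, cpu_load: float) -> bool:
        """Admit and count the request, or return False if it should be dropped."""
        if cpu_load > self.cpu_threshold and self.inflight >= self.capacity():
            return False
        self.inflight += 1
        return True

    def release(self) -> None:
        """Call when a request finishes."""
        self.inflight = max(0, self.inflight - 1)


# Toy numbers: 40 req/s at 0.25 s latency gives room for 10 requests in flight.
limiter = AdaptiveLimiter()
limiter.observe(passed_per_sec=40, rt_sec=0.25)
admitted = [limiter.try_acquire(cpu_load=0.95) for _ in range(12)]
print(admitted.count(True), admitted.count(False))  # 10 2
```

Requiring both conditions means a busy but healthy node keeps serving, while a saturated node drops the excess cheaply instead of queueing it until every request fails.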

Cloud: BFS Optimization

BFS stores site‑wide static assets. The outage impacted:

HTML/CSS/JS loading failures → white‑screens.

Thumbnail unavailability → degraded user experience.

Internal review platform delays.

Developer tooling unable to load resources.

Optimizations included:

Backing up all static files to third‑party cloud storage and configuring CDN fallback to that storage when the origin is unavailable.

Separating the review platform’s image traffic into an independent SLB cluster.

Isolating internal middleware (SLB control platform, release platform) onto dedicated physical machines and enabling dual‑write to both original and new BFS clusters.

The revised BFS architecture is shown in the accompanying diagram.

Edge: Image CDN Intelligent Scheduling

The existing LDNS+HttpDNS scheduling suffers from coarse granularity and latency. By empowering the client side with scheduling capabilities, Bilibili introduced a new edge‑cloud collaborative system that improves accuracy and timeliness.

Client‑side retry strategy (A → B → C domains) mitigates single‑vendor failures but can cause cache‑stampede on the secondary domains.

Edge: APP Strategy

APP implements device‑level traffic splitting, multi‑domain coordination, and error‑code‑based retries.

Receive global split ratios (e.g., i0:i1:i2 = 4:3:3).

Compute bucket using buvid (device ID).

Map to a rule such as [i2, i0, i1] and route all requests to i2.

If i2 fails with a connection error, a certificate error, or a 5xx status (for example 502 or 504), the client retries on the next domain in the rule, up to two retries. A domain that succeeds on fallback becomes the default until the next cold start.
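The steps above can be sketched as follows. This is an illustration under stated assumptions, not the APP's actual code: the MD5‑based bucketing, the rotation used to build the rule, the error labels, and every function name are hypothetical. Only the 4:3:3 ratio, the `[i2, i0, i1]` rule shape, the retry triggers, and the two‑retry limit come from the text.

```python
import hashlib
from typing import Callable, Optional

# Split config in the i0:i1:i2 = 4:3:3 shape from the text.
SPLIT = [("i0", 4), ("i1", 3), ("i2", 3)]
MAX_RETRIES = 2

FetchResult = tuple[Optional[int], Optional[str]]  # (HTTP status, error label)


def bucket_for(buvid: str, total: int) -> int:
    """Stable bucket in [0, total) derived from the device ID (hash choice is assumed)."""
    digest = hashlib.md5(buvid.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % total


def domain_rule(buvid: str, split=SPLIT) -> list[str]:
    """Ordered domains for this device: the primary first, then the rest in rotation."""
    names = [name for name, _ in split]
    total = sum(weight for _, weight in split)
    bucket = bucket_for(buvid, total)
    upper = 0
    for index, (_, weight) in enumerate(split):
        upper += weight
        if bucket < upper:
            return names[index:] + names[:index]
    raise AssertionError("unreachable: bucket is always below the total weight")


def should_retry(status: Optional[int], error: Optional[str]) -> bool:
    """Retry on connection or certificate errors and on 5xx; not on 4xx such as 403."""
    if error in ("connection", "certificate"):
        return True
    return status is not None and 500 <= status <= 599


def fetch_with_fallback(path: str, rule: list[str],
                        fetch: Callable[[str, str], FetchResult]):
    """Try domains in order, allowing at most MAX_RETRIES retries after the first attempt."""
    domain, status = None, None
    for domain in rule[: MAX_RETRIES + 1]:
        status, error = fetch(domain, path)
        if not should_retry(status, error):
            break
    return domain, status


def fake_fetch(domain: str, path: str) -> FetchResult:
    """Stand-in for the network layer: i2 is down, the other domains are healthy."""
    return (502, None) if domain == "i2" else (200, None)


rule = ["i2", "i0", "i1"]  # the example rule from the text
print(fetch_with_fallback("/bfs/cover.png", rule, fake_fetch))  # ('i0', 200)
```

The caller can store the returned domain as the new default for the rest of the session. Because the bucket depends only on the device ID, a device keeps hitting the same primary domain across requests, which keeps that vendor's cache warm for it.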

Edge: Web Strategy

Web follows a similar flow: receive domain ratio config, compute bucket via buvid, map to a default domain, and initially route all requests to that domain.

AutoFallback SDK

The SDK works with any image component. When an image fails to load, AutoFallback.getNext(img.src) returns either the next resource info or null. A null result means no further fallback is possible. Otherwise the SDK switches the image source to the returned NextInfo, and when next.strategy == 2 it marks the fallback as final to avoid extra calculations.
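The lookup logic can be sketched as below, written in Python for illustration rather than in the browser JavaScript the SDK presumably uses. The domain list, the `NextInfo` fields, the non‑final strategy value 1, and the `on_image_error` handler are all assumptions; the source only states that `getNext` returns the next info or null and that strategy 2 marks the fallback as final.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Hypothetical fallback order; the real SDK's config is not shown in the source.
DOMAINS = ["i0.example.com", "i1.example.com", "i2.example.com"]
FINAL = 2  # strategy value that marks the last usable fallback, per the text


@dataclass
class NextInfo:
    src: str
    strategy: int


def get_next(src: str, domains=DOMAINS) -> Optional[NextInfo]:
    """Return the next URL to try for a failed image, or None when nothing is left."""
    parts = urlsplit(src)
    if parts.netloc not in domains:
        return None  # unknown host: nothing to fall back to
    position = domains.index(parts.netloc)
    if position + 1 >= len(domains):
        return None  # already on the last domain
    next_host = domains[position + 1]
    is_last = position + 2 >= len(domains)
    strategy = FINAL if is_last else 1  # 1 for "more fallbacks remain" is assumed
    return NextInfo(urlunsplit(parts._replace(netloc=next_host)), strategy)


def on_image_error(img: dict) -> None:
    """Error handler sketch: swap the source, and skip lookups once marked final."""
    if img.get("final"):
        return
    nxt = get_next(img["src"])
    if nxt is None:
        return
    img["src"] = nxt.src
    img["final"] = nxt.strategy == FINAL


img = {"src": "https://i0.example.com/bfs/cover.png"}
on_image_error(img)
print(img["src"])  # https://i1.example.com/bfs/cover.png
```

Marking the last hop as final lets the handler return immediately on any later error instead of recomputing a fallback that cannot exist.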

Logic Chain

Remote switch (kv config) and local switch (cookie flag) determine whether fallback logic is active. Various combinations of remote/local switch states and cookie presence dictate whether the default domain is replaced, retained, or left unchanged.

Summary

Previous incident handling focused on individual service stability without a holistic view of upstream‑downstream interactions, leading to suboptimal downgrade and recovery measures. By leveraging the inherent distributed resilience of CDN, SLB, and BFS, standardizing rate‑limit policies, establishing layered downgrade playbooks, and exposing client‑side scheduling, overall link resilience is greatly improved.

Future work aims to deepen cloud‑edge integration, push traffic‑scheduling intelligence to the client, and reinforce the 1‑5‑10 governance model to raise component SLAs and support business growth.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
