
Client‑Side DCDN Disaster‑Recovery Drills and Automated Testing at Bilibili

Bilibili ran client-side DCDN disaster-recovery drills using a self-built HTTPDNS service to simulate DNS, CDN, and SSL faults. Automated scripts on Android, iOS, and Web injected errors, measured rendering latency, and validated the immediate downgrade to commercial HTTPDNS services. The findings were used to refine fallback strategies, and a later real network incident had near-zero user impact.

Bilibili Tech

Background – The quality of network requests underpins all interactions in Internet products. For a high‑traffic service like Bilibili, fast and reliable distribution of static and dynamic resources is essential for a good user experience. Static assets are cached at edge nodes, while dynamic resources rely on intelligent routing. The stability of DCDN (Dynamic CDN) therefore becomes critical.

Exploration Objectives – To evaluate the role of DNS in DCDN acceleration, the team replaced the default carrier‑provided LocalDNS with HTTPDNS. HTTPDNS bypasses LocalDNS and sends queries directly to a resolution service, which selects the optimal edge node based on the client's carrier and region. The team deployed a self‑built HTTPDNS service so that DNS responses, IPs, and certificates could be manipulated in a controlled way during disaster‑recovery testing.

Exploration Scenarios

DNS cache & TTL – checking local DNS cache existence and TTL expiration.

HTTPDNS – traffic distribution between self‑built and commercial HTTPDNS services, service health.

DCDN – node service status and IP validity.

Client‑side domain downgrade strategy – conditions that trigger downgrade, duration, affected domains, and post‑downgrade stability.
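
The first scenario, checking whether a local DNS cache entry exists and whether its TTL has expired, can be illustrated with a small in-memory cache. This is only a sketch under assumptions: the class name DnsCache, its methods, and the injectable clock are hypothetical and are not taken from Bilibili's network SDK.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class CacheEntry:
    ips: List[str]
    expires_at: float  # clock value (seconds) after which the entry is stale


class DnsCache:
    """Tiny in-memory DNS cache keyed by hostname (hypothetical, for illustration)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so a test can fast-forward time
        self._entries: Dict[str, CacheEntry] = {}

    def put(self, host: str, ips: List[str], ttl_seconds: float) -> None:
        """Store the resolved IPs together with their absolute expiry time."""
        self._entries[host] = CacheEntry(list(ips), self._clock() + ttl_seconds)

    def get(self, host: str) -> Optional[List[str]]:
        """Return cached IPs, or None if the host is missing or its TTL expired."""
        entry = self._entries.get(host)
        if entry is None:
            return None
        if self._clock() >= entry.expires_at:
            del self._entries[host]  # drop the stale entry so it is re-resolved
            return None
        return list(entry.ips)
```

Passing a fake clock lets a drill script force a TTL expiration without waiting for real time to pass.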

The team first used mock interception to tamper with HTTPDNS responses (status codes, v4/v6 IPs, etc.) and then built a full‑stack test environment with automated scripts that drive the client SDK to issue network requests.
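
As a rough illustration of the mock-interception step, the sketch below rewrites an HTTPDNS response to simulate a few of the faults mentioned. The response schema, the fault names, and the function tamper are all assumptions made for this example; the article does not describe the real service's format. All addresses come from ranges reserved for documentation.

```python
import copy
from typing import Any, Dict

# Hypothetical shape of an HTTPDNS response; the real schema is not given
# in the article.
HEALTHY_RESPONSE: Dict[str, Any] = {
    "status": 200,
    "host": "api.example.com",
    "ttl": 60,
    "ipv4": ["203.0.113.10", "203.0.113.11"],
    "ipv6": ["2001:db8::10"],
}


def tamper(response: Dict[str, Any], fault: str) -> Dict[str, Any]:
    """Return a modified copy of `response` that simulates the named fault."""
    mocked = copy.deepcopy(response)  # never mutate the healthy template
    if fault == "server_error":
        mocked["status"] = 500
        mocked["ipv4"], mocked["ipv6"] = [], []
    elif fault == "client_error":
        mocked["status"] = 403
        mocked["ipv4"], mocked["ipv6"] = [], []
    elif fault == "bad_ipv4":
        # Stand-in for an unreachable node (192.0.2.0/24 is a documentation range).
        mocked["ipv4"] = ["192.0.2.1"]
    elif fault == "no_ipv6":
        mocked["ipv6"] = []
    elif fault == "ttl_expired":
        mocked["ttl"] = 0
    else:
        raise ValueError(f"unknown fault: {fault}")
    return mocked
```

Returning a copy keeps each injected fault isolated, so one test case cannot leak its tampering into the next.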

Overall Design

Automated scripts trigger requests to core domains via the network SDK.

Client configuration is modified to route DNS queries to the self‑built HTTPDNS service.

Various fault types are simulated, including service errors, CDN node failures, SSL certificate errors, domain resolution failures, blocking, and hijacking.

Connections are established to the resolved IPs; the rendering process is recorded, the recording is split into frames, and each stage’s latency, idle windows, and continuous‑loading periods are analyzed.

Cache and connection optimizations are applied, and key logs from both client and HTTPDNS are analyzed.
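
The per-stage latency analysis in the design above could look roughly like the following once the recording has been split into labeled intervals. The stage labels, the tuple layout, and the function stage_latencies are hypothetical; the sketch assumes latency means the time from the start of the recording until a stage first appears.

```python
from typing import Dict, List, Tuple

# (stage label, start timestamp in ms, end timestamp in ms) per stable interval.
Interval = Tuple[str, int, int]


def stage_latencies(intervals: List[Interval]) -> Dict[str, int]:
    """Time from the start of the recording until each stage first appears."""
    if not intervals:
        return {}
    ordered = sorted(intervals, key=lambda iv: iv[1])  # sort by start time
    origin = ordered[0][1]
    latencies: Dict[str, int] = {}
    for label, start_ms, _end_ms in ordered:
        # setdefault keeps only the first occurrence of each stage label.
        latencies.setdefault(label, start_ms - origin)
    return latencies


if __name__ == "__main__":
    recording = [
        ("blank", 0, 400),
        ("skeleton", 450, 900),
        ("content", 1000, 3000),
    ]
    print(stage_latencies(recording))
```

Comparing these numbers between a healthy run and a fault-injected run shows how much delay a given fault adds to each rendering stage.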

Practical Execution – Scripts emulate real user behavior across Android, iOS, and Web clients, intercepting core domain requests and injecting faults such as 4xx/5xx responses, CDN node anomalies, IPv4/IPv6 errors, SSL errors, and TTL expirations. The client’s rendering pipeline is automatically split into stable intervals using key‑frame markers. A pre‑trained SVM classifier groups these intervals, and timestamps are used to compute stage‑wise latency.

To split the recording, the difference between consecutive frames is computed with PSNR or SSIM. With SSIM, the deviation metric is 1 − SSIM. The deviation is then compared against a threshold (with an optional offset) to decide whether to start a new stable interval.
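
A minimal sketch of this splitting step follows. It makes several assumptions: frames are flattened grayscale pixel lists of equal length, SSIM is computed once over the whole frame rather than over sliding windows as in production implementations, the offset is simply added to the threshold, and the default threshold of 0.05 is an arbitrary illustrative value. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

# Standard SSIM stabilizing constants for 8-bit images (K1=0.01, K2=0.03, L=255).
C1 = (0.01 * 255) ** 2
C2 = (0.03 * 255) ** 2


def global_ssim(x: Sequence[float], y: Sequence[float]) -> float:
    """Simplified SSIM over two whole frames (no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    numerator = (2 * mu_x * mu_y + C1) * (2 * cov + C2)
    denominator = (mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2)
    return numerator / denominator


def split_stable_intervals(
    frames: List[Sequence[float]], threshold: float = 0.05, offset: float = 0.0
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame-index pairs of stable intervals."""
    if not frames:
        return []
    intervals: List[Tuple[int, int]] = []
    start = 0
    for i in range(1, len(frames)):
        deviation = 1.0 - global_ssim(frames[i - 1], frames[i])
        if deviation > threshold + offset:
            # A large visual change closes the current interval.
            intervals.append((start, i - 1))
            start = i
    intervals.append((start, len(frames) - 1))
    return intervals
```

For example, two identical dark frames followed by three identical bright frames yield two intervals, one per visual state.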

Results & Benefits

Validated that the client immediately downgrades from the self‑built HTTPDNS to commercial HTTPDNS services when the self‑built service returns 4xx/5xx errors.

Characterized how the client handles CDN node failures, recovers from SSL errors, and falls back when domain resolution fails.

Discovered that static‑resource CDN domains should remain constant within a cold‑start lifecycle.

Improved domain‑downgrade configurations and retry/timeout strategies to reduce user‑visible impact.

During a real network incident, traffic was smoothly shifted between data centers, error rates dropped quickly, and users experienced virtually no degradation.
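
The downgrade behavior validated above can be sketched as an ordered resolver chain. This is an illustrative assumption about the control flow, not the SDK's actual logic: the resolver signature, the names, and the rule that any non-2xx status or empty IP list triggers the next resolver are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# A resolver takes a hostname and returns (HTTP status, list of IPs).
Resolver = Callable[[str], Tuple[int, List[str]]]


def resolve_with_fallback(
    host: str, resolvers: Sequence[Tuple[str, Resolver]]
) -> Tuple[str, List[str]]:
    """Try each resolver in order; return (name, ips) from the first usable answer."""
    for name, resolver in resolvers:
        try:
            status, ips = resolver(host)
        except OSError:
            continue  # network-level failure: move on to the next resolver
        if 200 <= status < 300 and ips:
            return name, ips
        # 4xx/5xx or an empty answer: fall through to the next resolver
    raise LookupError(f"no resolver returned a usable answer for {host}")


if __name__ == "__main__":
    def self_built(host: str) -> Tuple[int, List[str]]:
        return 503, []  # simulated fault on the self-built service

    def commercial(host: str) -> Tuple[int, List[str]]:
        return 200, ["203.0.113.20"]

    chain = [("self_built", self_built), ("commercial", commercial)]
    print(resolve_with_fallback("api.example.com", chain))
```

Returning the resolver name along with the IPs makes it easy for a drill script to assert which service actually answered.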

Conclusion & Outlook – The drill provided a clear view of the client‑side request acceleration chain and highlighted optimization opportunities such as dynamic IP ranking based on success/failure metrics and adaptive domain ordering based on availability. Continued refinement of these techniques will further stabilize first‑load performance for users across regions.
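
One possible shape for the dynamic IP ranking mentioned in the outlook is sketched below. The article does not say how ranking would be implemented; the Laplace-smoothed success rate and the function rank_ips are assumptions chosen for illustration.

```python
from typing import Dict, List, Tuple


def rank_ips(stats: Dict[str, Tuple[int, int]]) -> List[str]:
    """Order IPs by smoothed success rate, best first.

    `stats` maps each IP to (successes, failures). Laplace smoothing,
    (s + 1) / (s + f + 2), keeps an IP with little data near 0.5 instead
    of letting a single request decide its rank.
    """

    def score(ip: str) -> float:
        successes, failures = stats[ip]
        return (successes + 1) / (successes + failures + 2)

    return sorted(stats, key=score, reverse=True)


if __name__ == "__main__":
    observed = {
        "203.0.113.10": (98, 2),
        "203.0.113.11": (5, 5),
        "203.0.113.12": (1, 9),
    }
    print(rank_ips(observed))
```

The same scoring idea could be applied to domains rather than IPs to get the availability-based domain ordering the outlook also mentions.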
