Tagged articles
1 article
dbaplus Community
Jul 24, 2025 · Operations

How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details how Bilibili handles explosive growth in its server count: it classifies faults, identifies the shortcomings of manual processes, and implements an automated end‑to‑end workflow covering detection, rule‑based alerting, and repair. The workflow combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.

Data center · fault detection · in‑band
17 min read