dbaplus Community
Jul 24, 2025 · Operations
How Bilibili Scales Server Fault Management with Automated Detection and Repair
This article describes how Bilibili copes with rapid growth in its server fleet: it classifies faults, identifies the shortcomings of manual handling, and builds an automated end‑to‑end workflow for detection, rule‑based alerting, and repair. The workflow combines in‑band and out‑of‑band data collection to achieve near‑perfect coverage and accuracy.
Data center · fault detection · in‑band
17 min read
