How Bilibili Scaled Server Fault Management with Automated Detection and Repair
This article details Bilibili's evolving server fault management architecture, covering fault classification, the shortcomings of manual processes, and the design of an automated detection and repair system that combines in‑band and out‑of‑band data collection, rule‑based alerts, and end‑to‑end repair automation.
1. Background
With Bilibili’s rapid business growth, the server fleet has expanded explosively, and fault management has become increasingly difficult at scale:
Low efficiency of manual handling: traditional manual troubleshooting cannot keep up with the massive server fleet.
Fragmented toolchains: diverse hardware requires different tools, increasing complexity and time cost.
Efficient server fault management is therefore critical for platform stability and user experience.
2. Server Faults
2.1 Fault Classification
Faults are divided into soft faults (software errors, service exceptions) and hard faults (disk, NIC, GPU, and other hardware failures). They can also be categorized by repair mode: online repairs (no service impact) and offline repairs (downtime required).
We currently focus on detecting and handling hard faults and file‑system faults.
2.2 Shortcomings of Traditional Fault Management
Delayed fault discovery: reliance on manual checks or user feedback.
Low investigation efficiency: manual root‑cause analysis is time‑consuming.
High communication cost during business migration: manual notifications increase response time.
Insufficient automation in the repair process: lack of systematic audit and traceability.
Thus we decided to fully automate fault detection and repair.
2.3 Goals
Eliminate delayed fault discovery and low investigation efficiency by implementing an automated detection pipeline (information collection → rule matching → alert).
Reduce communication cost and improve process automation through an automated repair workflow.
3. Automated Fault Detection Solution
The overall architecture consists of five core components: an in‑band Agent, a log platform, a detection service, a rule database, and a fault‑management platform.
Agent: lightweight component on each server that collects hardware status (disk, NIC, GPU) and reports to the detection service.
Log platform: receives system logs forwarded via rsyslog for analysis (see the configuration sketch after this list).
Detection service: includes in‑band modules (processing Agent data and dmesg logs) and out‑of‑band modules (processing SNMP traps and Redfish API data).
DB: stores detection rules and generated alerts.
Fault‑management platform: visualizes alerts for users.
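As a small illustration of the rsyslog path, a forwarding rule like the following (the file path, host name, and port are placeholders, not our production values) would ship kernel messages to the log platform over TCP:

```
# /etc/rsyslog.d/90-detect.conf  (hypothetical path and target)
# Forward kernel messages (the dmesg stream) to the log platform over TCP.
kern.*    @@log-platform.internal:514
```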
3.1 Information Collection Methods
We identified two main collection methods: in‑band and out‑of‑band.
In‑band Collection
Relies on the operating system and software tools, providing rich system data and fine‑grained monitoring, but stops working when the server crashes.
Out‑of‑band Collection
Uses independent management hardware (BMC) to gather data even when the server is down, focusing on hardware‑level metrics.
Combining both methods yields comprehensive monitoring in large data centers.
3.1.1 In‑band Collection Details
Disk Module monitors storage devices (NVMe, SSD, HDD) and controllers, collecting health metrics such as remaining life percentage and bad block count. The Agent invokes tools like dmidecode and lspci, structures the data, and reports it to detection and asset services.
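A minimal Go sketch of the collect-and-report step, assuming a hypothetical detection-service endpoint and payload schema (the real Agent’s interface is internal):

```go
package main

import (
	"bytes"
	"encoding/json"
	"net/http"
	"os/exec"
	"time"
)

// DiskReport is an illustrative payload, not the production schema.
type DiskReport struct {
	Hostname  string `json:"hostname"`
	Device    string `json:"device"`
	LifeLeft  int    `json:"life_left_pct"` // remaining life percentage
	BadBlocks int    `json:"bad_blocks"`
	Timestamp int64  `json:"timestamp"`
}

func main() {
	// Invoke a system tool (the article names dmidecode and lspci);
	// parsing of its output is elided for brevity.
	raw, _ := exec.Command("lspci", "-nn").Output()
	_ = raw // a real agent would extract controller info here

	report := DiskReport{
		Hostname:  "node-01",
		Device:    "/dev/nvme0n1",
		LifeLeft:  97,
		BadBlocks: 0,
		Timestamp: time.Now().Unix(),
	}
	body, _ := json.Marshal(report)

	// "detect.internal" is a placeholder for the detection service.
	resp, err := http.Post("http://detect.internal/api/v1/disk",
		"application/json", bytes.NewReader(body))
	if err == nil {
		resp.Body.Close()
	}
}
```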
GPU Module monitors core and memory utilization, ECC errors, power consumption, etc. The Agent wraps DCGM and DCMI interfaces to support NVIDIA, Huawei, and other GPUs, normalizing health data for the detection service.
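For NVIDIA devices, similar data can also be gathered without DCGM bindings by shelling out to nvidia-smi; the Go sketch below parses its CSV output (the normalized output format is illustrative):

```go
package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	// nvidia-smi prints one CSV line per GPU with these query flags.
	out, err := exec.Command("nvidia-smi",
		"--query-gpu=utilization.gpu,memory.used,power.draw",
		"--format=csv,noheader,nounits").Output()
	if err != nil {
		fmt.Println("nvidia-smi not available:", err)
		return
	}
	for i, line := range strings.Split(strings.TrimSpace(string(out)), "\n") {
		f := strings.Split(line, ", ")
		if len(f) != 3 {
			continue
		}
		// Normalize into the shape the detection service expects
		// (field names here are illustrative).
		fmt.Printf("gpu=%d util_pct=%s mem_mib=%s power_w=%s\n",
			i, f[0], f[1], f[2])
	}
}
```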
3.1.2 Out‑of‑band Collection Details
When a server crashes, out‑of‑band collection gathers critical fault information via Redfish API or SNMP traps.
SNMP Trap vs. Redfish
We use SNMP traps as the primary out‑of‑band channel because they offer:
Higher accuracy: OID‑based mapping provides fine‑grained fault classification, reducing false positives.
Higher flexibility: trap subscription mechanisms allow selective monitoring per vendor and model.
Because OID definitions vary across vendors, however, we supplement SNMP traps with Redfish health checks for full coverage; a minimal probe sketch follows.
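In this Go sketch, the BMC host, credentials, and system member path ("1") are placeholders; the member path in particular varies by vendor:

```go
package main

import (
	"crypto/tls"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// BMCs commonly present self-signed certificates; verification is
	// skipped in this sketch only.
	client := &http.Client{Transport: &http.Transport{
		TLSClientConfig: &tls.Config{InsecureSkipVerify: true},
	}}

	req, _ := http.NewRequest("GET",
		"https://bmc.example.internal/redfish/v1/Systems/1", nil)
	req.SetBasicAuth("monitor", "secret") // placeholder credentials

	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("redfish unreachable:", err)
		return
	}
	defer resp.Body.Close()

	// Standard Redfish resources report health under Status.Health
	// ("OK", "Warning", or "Critical").
	var sys struct {
		Status struct {
			Health string `json:"Health"`
		} `json:"Status"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&sys); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("system health:", sys.Status.Health)
}
```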
3.2 Fault Rule Management
A unified rule library standardizes fault identification, impact assessment, and handling procedures. Key fields include fault code, description, type (e.g., NIC, NVMe, GPU), severity level (P0‑P2), and rule expression (log keywords, metric thresholds).
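A minimal Go sketch of such a rule record and a keyword match (the field names and sample rule are illustrative, not the production schema):

```go
package main

import (
	"fmt"
	"strings"
)

// FaultRule mirrors the key fields described above; names are illustrative.
type FaultRule struct {
	Code     string // e.g. "NVME-002"
	Desc     string
	Type     string // NIC, NVMe, GPU, ...
	Severity string // P0, P1, P2
	Keyword  string // log keyword in the rule expression
}

// match returns the first rule whose keyword appears in the log line.
func match(rules []FaultRule, logLine string) *FaultRule {
	for i := range rules {
		if strings.Contains(logLine, rules[i].Keyword) {
			return &rules[i]
		}
	}
	return nil
}

func main() {
	rules := []FaultRule{
		{Code: "NVME-002", Desc: "NVMe media error", Type: "NVMe",
			Severity: "P1", Keyword: "critical medium error"},
	}
	if r := match(rules, "nvme0: critical medium error, dev nvme0n1"); r != nil {
		fmt.Printf("alert: %s (%s, %s)\n", r.Code, r.Type, r.Severity)
	}
}
```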
4. Automated Repair Solution
4.1 Business Up/Down Automation
Previously, fault notifications relied on manual WeChat messages, leading to delays. The new automated workflow triggers repair tasks, generates callbacks, and closes the repair loop, improving efficiency.
The process runs fault detection → task generation → callback confirmation → repair execution, and supports both online (no downtime) and offline (downtime required) repairs, as sketched below.
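A simplified Go sketch of that loop as a state machine; the state names are illustrative, and the offline path’s extra drain-confirmation step corresponds to the callback described above:

```go
package main

import "fmt"

type State string

const (
	Detected  State = "detected"
	Created   State = "task_created"
	Confirmed State = "callback_confirmed" // business side confirms drain
	Repairing State = "repairing"
	Verified  State = "health_checked"
	Closed    State = "closed"
)

// advance returns the next state. Offline repairs wait for a drain
// callback before hardware work starts; online repairs skip that step.
func advance(s State, offline bool) State {
	switch s {
	case Detected:
		return Created
	case Created:
		if offline {
			return Confirmed
		}
		return Repairing
	case Confirmed:
		return Repairing
	case Repairing:
		return Verified
	case Verified:
		return Closed
	}
	return s
}

func main() {
	// Walk an offline repair task through the full loop.
	for s := Detected; s != Closed; s = advance(s, true) {
		fmt.Println("state:", s)
	}
}
```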
4.2 Repair Process Automation
Key functions:
Email or API notifications to relevant parties after task creation.
Automatic asset updates after hardware replacement (disk, NIC, GPU, etc.).
Automated server status transitions (e.g., "repair requested", "delivered").
Post‑repair health checks to verify restored operation.
5. Summary and Outlook
5.1 Summary
The architecture integrates hardware monitoring, log analysis, and fault management to achieve full‑lifecycle server fault handling. Combined in‑band and out‑of‑band monitoring yields the following results (metric definitions are sketched after the list):
Coverage: 99%
Accuracy: 99%
Recall: 95%
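Assuming the standard confusion-matrix definitions (TP/TN/FP/FN = true/false positives/negatives), accuracy and recall here would be:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$

Coverage, by contrast, refers to the fraction of the fleet under monitoring.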
Real‑time monitoring, proactive alerts, and periodic inspections ensure most faults are detected and resolved promptly.
5.2 Outlook
Intelligent monitoring: Apply machine learning and AI to detect early fault signs and provide predictive alerts.
More efficient fault localization: New technologies will streamline diagnosis and accelerate resolution, reducing maintenance costs.
Enhanced security and reliability: Future maintenance will emphasize hardware/software security to prevent data leaks and unauthorized access.