How We Automated Server Fault Detection and Repair at Scale
This article explains the challenges of managing rapidly growing server fleets, outlines a systematic classification of hardware and software faults, and details an end‑to‑end automated solution that combines in‑band and out‑of‑band data collection, rule‑based detection, and fully automated repair workflows to improve fault coverage, accuracy, and recovery speed.
1. Background
As Bilibili’s business expands, its server fleet has grown rapidly to support massive user traffic and increasingly complex scenarios. At this scale, manual fault handling becomes inefficient and the toolchain fragmented, driving up operational costs:
Manual troubleshooting is slow and error-prone.
Tooling is scattered across different hardware types, with no unified workflow.
Efficient server fault management is therefore critical for platform stability and user experience.
2. Server Faults
2.1 Fault Classification
Faults fall into two broad categories: soft faults (system software errors, service anomalies, and the like) and hard faults (disk, NIC, GPU, and other hardware failures). Orthogonally, faults can be classified by repair mode: online repairs, which do not impact the running service, and offline repairs, which require downtime.
We currently focus on hard faults and file‑system faults.
2.2 Limitations of Traditional Fault Management
Faults are discovered late because detection relies on manual checks or user reports.
Investigation is inefficient: engineers correlate logs and metrics by hand.
Communication overhead is high, with repair coordination handled manually over enterprise WeChat.
The repair process itself is only partially automated.
Thus we decided to fully automate fault detection and repair.
2.3 Goals
Resolve late detection and low investigation efficiency by introducing an automated detection pipeline (information collection → rule matching → alert).
Reduce communication cost and improve process automation through an automated repair solution.
3. Automated Fault Detection Solution
The architecture consists of six core components: the in-band Agent, the out-of-band BMC, the log platform, the detection service, the rule database, and the fault-management platform.
Agent: lightweight in-band component on each server that collects hardware status (disk, NIC, GPU) and reports it to the detection service.
BMC: out-of-band management controller that reports hardware events via SNMP traps and the Redfish API, even when the host OS is down.
Log platform: receives system logs via rsyslog for analysis.
Detection service: in-band modules (processing Agent data and dmesg) and out-of-band modules (processing SNMP traps and Redfish API data); a minimal sketch follows this list.
Rule database: stores detection rules and the alerts they generate.
Fault-management platform: visualizes alerts for users.
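To make the data flow concrete, here is a minimal sketch of the detection service's core loop: events from the in-band Agent and the out-of-band BMC are normalized into a common shape, matched against rules from the rule database, and emitted as alerts. All type and field names here are illustrative, not the actual internal API.

```python
# Skeleton of the detection pipeline: normalized event -> rule match -> alert.
from dataclasses import dataclass

@dataclass
class Event:
    host: str
    source: str     # "agent", "dmesg", "snmp_trap", or "redfish"
    component: str  # "disk", "nic", "gpu", ...
    message: str

@dataclass
class Rule:
    fault_code: str
    component: str
    severity: str   # "P0" | "P1" | "P2"
    keyword: str    # trigger expression: a log keyword in this sketch

def detect(event: Event, rules: list[Rule]) -> list[dict]:
    """Return one alert per rule whose trigger matches the event."""
    return [
        {"host": event.host, "fault_code": r.fault_code,
         "severity": r.severity, "detail": event.message}
        for r in rules
        if r.component == event.component and r.keyword in event.message
    ]
```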
3.1 Information Collection Methods
Two main methods are used:
3.1.1 In‑band Collection
Relies on the operating system and software tools to gather detailed system data. While it provides fine‑grained information, it stops working when the server crashes.
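As an example of what in-band collection looks like in practice, the sketch below has the Agent shell out to smartctl (from smartmontools) for disk health and report the result to the detection service. The endpoint URL, payload shape, and the simple "PASSED" check are assumptions for illustration.

```python
# In-band collection sketch: query disk SMART health, report as JSON.
import json
import subprocess
import urllib.request

def disk_health(device: str) -> dict:
    # smartctl -H prints an overall health verdict for the drive.
    out = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    ).stdout
    return {"device": device, "healthy": "PASSED" in out, "raw": out.strip()}

def report(payload: dict,
           endpoint: str = "http://detector.internal/api/v1/events"):
    # Hypothetical detection-service endpoint.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# Usage: report({"source": "agent", "component": "disk",
#                **disk_health("/dev/sda")})
```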
3.1.2 Out‑of‑band Collection
Uses independent management hardware (BMC) to collect data even when the server is down, focusing on hardware health. It complements in‑band collection for comprehensive monitoring.
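A corresponding out-of-band sketch polls the BMC's Redfish API for system health; because the BMC runs independently of the host, this works even when the OS is down. The `/redfish/v1` layout follows the DMTF Redfish standard, but the host, credentials, and resource path below are placeholders that vary by vendor.

```python
# Out-of-band collection sketch: read system health from the BMC via Redfish.
import requests

def bmc_system_health(bmc_host: str, user: str, password: str) -> str:
    resp = requests.get(
        f"https://{bmc_host}/redfish/v1/Systems/1",  # path varies by vendor
        auth=(user, password),
        verify=False,   # many BMCs ship self-signed certificates
        timeout=10,
    )
    resp.raise_for_status()
    # Status.Health is "OK", "Warning", or "Critical" per the Redfish schema.
    return resp.json()["Status"]["Health"]
```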
3.2 Fault Rule Management
A unified rule library defines fault codes, descriptions, types (e.g., NIC, NVMe, GPU), severity levels (P0‑P2), and trigger expressions (log keywords, metric thresholds). This enables rapid fault identification and guided remediation.
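To illustrate the shape of such a rule library, here are a few hypothetical entries covering the fields described above; the fault codes, patterns, and thresholds are invented for illustration and would in reality live in the rule database.

```python
# Hypothetical rule entries: fault code, description, type, severity, trigger.
RULES = [
    {
        "fault_code": "NVME-0001",
        "description": "NVMe media error logged by the kernel",
        "type": "NVMe",
        "severity": "P1",
        "trigger": {"kind": "log_keyword", "pattern": "nvme.*media error"},
    },
    {
        "fault_code": "GPU-0003",
        "description": "GPU fell off the PCIe bus (Xid 79)",
        "type": "GPU",
        "severity": "P0",
        "trigger": {"kind": "log_keyword", "pattern": "NVRM: Xid.*79"},
    },
    {
        "fault_code": "NIC-0002",
        "description": "Sustained NIC CRC error rate above threshold",
        "type": "NIC",
        "severity": "P2",
        "trigger": {"kind": "metric_threshold",
                    "metric": "rx_crc_errors_per_min", "op": ">", "value": 10},
    },
]
```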
4. Automated Repair Solution
4.1 Business Up/Down Automation
Previously, fault notifications were sent via enterprise WeChat, requiring manual analysis and back-and-forth coordination. The new automated workflow creates repair tasks, interacts with business-side callbacks for confirmation (for example, to drain traffic before taking a node down), and closes the loop without human intervention, as sketched below.
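The following sketch shows one plausible shape of that closed loop, assuming each service registers a callback URL with the fault-management platform. All endpoints, states, and field names are illustrative, not Bilibili's actual API.

```python
# Closed-loop up/down workflow sketch: drain -> confirm -> repair -> restore.
import time
import requests

def run_repair_task(host: str, fault_code: str, callback_url: str):
    # 1. Ask the owning service to drain traffic from the faulty host.
    requests.post(callback_url, json={
        "host": host, "fault_code": fault_code, "action": "drain",
    }, timeout=10)

    # 2. Poll until the business side confirms the host carries no traffic.
    while True:
        state = requests.get(f"{callback_url}/status",
                             params={"host": host}, timeout=10).json()
        if state.get("drained"):
            break
        time.sleep(30)

    # 3. Hand the host to the repair pipeline (section 4.2), then notify
    #    the business to restore traffic once post-repair checks pass.
    ...
```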
4.2 Repair Process Automation
Key functions include:
Email/API notifications to relevant parties after task generation.
Automatic asset updates when hardware (disk, NIC, GPU) is replaced.
Server status auto‑transition (e.g., "repair", "delivered") driven by task progress.
Post‑repair health checks to verify restored operation (a minimal sketch follows this list).
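As a concrete example of the last item, the sketch below runs a few host-level checks before the task can advance the server's status back to "delivered". The specific checks and the check-registration style are assumptions for illustration.

```python
# Post-repair health check sketch: all checks must pass before delivery.
import subprocess

def check_disk_present(device: str) -> bool:
    # lsblk exits non-zero if the block device does not exist.
    return subprocess.run(["lsblk", device],
                          capture_output=True).returncode == 0

def check_link_up(iface: str) -> bool:
    out = subprocess.run(["ip", "link", "show", iface],
                         capture_output=True, text=True).stdout
    return "state UP" in out

def post_repair_health_ok(checks: list) -> bool:
    """Run every registered check; the task only advances if all pass."""
    return all(check() for check in checks)

# Usage: post_repair_health_ok([lambda: check_disk_present("/dev/nvme0n1"),
#                               lambda: check_link_up("eth0")])
```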
5. Summary and Outlook
5.1 Summary
The presented architecture integrates hardware monitoring, log collection, fault detection, and automated repair, achieving 99% coverage, 99% accuracy, and 95% recall across the server fleet.
5.2 Outlook
Intelligent Monitoring: Apply machine learning and AI to predict failures and trigger proactive repairs.
More Efficient Fault Localization: New technologies will streamline diagnosis and reduce maintenance costs.
Enhanced Security and Reliability: Strengthen hardware and software security to prevent data leaks and unauthorized access.