How We Automated Server Fault Detection and Repair at Scale
This article explains the challenges of managing rapidly growing server fleets, outlines a systematic classification of hardware and software faults, and details an end‑to‑end automated solution that combines in‑band and out‑of‑band data collection, rule‑based detection, and fully automated repair workflows to improve fault coverage, accuracy, and recovery speed.
1. Background
As Bilibili’s business expands, its server fleet has grown rapidly to support massive user traffic and increasingly complex scenarios. At this scale, manual fault handling becomes inefficient and the toolchain fragmented, driving up operational costs:
Manual troubleshooting is slow and error-prone.
Tooling is scattered across different hardware types, with no unified workflow.
Efficient server fault management is therefore critical for platform stability and user experience.
2. Server Faults
2.1 Fault Classification
Faults fall into two broad categories: soft faults (system software errors, service anomalies, and the like) and hard faults (disk, NIC, GPU, and other hardware failures). Orthogonally, faults can be classified by repair mode: online repairs, which do not impact the running service, and offline repairs, which require downtime.
We currently focus on hard faults and file‑system faults.
2.2 Limitations of Traditional Fault Management
Faults are discovered late because detection relies on manual checks or user reports.
Investigation is inefficient: engineers correlate logs and metrics by hand.
Communication overhead is high, with repair coordination handled manually over enterprise WeChat.
The repair process itself is only partially automated.
Thus we decided to fully automate fault detection and repair.
2.3 Goals
Resolve late detection and low investigation efficiency by introducing an automated detection pipeline (information collection → rule matching → alert).
Reduce communication cost and improve process automation through an automated repair solution.
3. Automated Fault Detection Solution
The architecture consists of six core components: the in-band Agent, the out-of-band BMC, the log platform, the detection service, the rule database, and the fault-management platform.
Agent: lightweight in-band component on each server that collects hardware status (disk, NIC, GPU) and reports it to the detection service.
BMC: out-of-band management controller that reports hardware events via SNMP traps and the Redfish API, even when the host OS is down.
Log platform: receives system logs via rsyslog for analysis.
Detection service: in-band modules (processing Agent data and dmesg) and out-of-band modules (processing SNMP traps and Redfish API data); a minimal sketch follows this list.
Rule database: stores detection rules and the alerts they generate.
Fault-management platform: visualizes alerts for users.
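To make the data flow concrete, here is a minimal sketch of the detection service's core loop: events from the in-band Agent and the out-of-band BMC are normalized into a common shape, matched against rules from the rule database, and emitted as alerts. All type and field names here are illustrative, not the actual internal API.

```python
# Skeleton of the detection pipeline: normalized event -> rule match -> alert.
from dataclasses import dataclass

@dataclass
class Event:
    host: str
    source: str     # "agent", "dmesg", "snmp_trap", or "redfish"
    component: str  # "disk", "nic", "gpu", ...
    message: str

@dataclass
class Rule:
    fault_code: str
    component: str
    severity: str   # "P0" | "P1" | "P2"
    keyword: str    # trigger expression: a log keyword in this sketch

def detect(event: Event, rules: list[Rule]) -> list[dict]:
    """Return one alert per rule whose trigger matches the event."""
    return [
        {"host": event.host, "fault_code": r.fault_code,
         "severity": r.severity, "detail": event.message}
        for r in rules
        if r.component == event.component and r.keyword in event.message
    ]
```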
3.1 Information Collection Methods
Two main methods are used:
3.1.1 In‑band Collection
Relies on the operating system and software tools to gather detailed system data. While it provides fine‑grained information, it stops working when the server crashes.
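As an example of what in-band collection looks like in practice, the sketch below has the Agent shell out to smartctl (from smartmontools) for disk health and report the result to the detection service. The endpoint URL, payload shape, and the simple "PASSED" check are assumptions for illustration.

```python
# In-band collection sketch: query disk SMART health, report as JSON.
import json
import subprocess
import urllib.request

def disk_health(device: str) -> dict:
    # smartctl -H prints an overall health verdict for the drive.
    out = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True, check=False,
    ).stdout
    return {"device": device, "healthy": "PASSED" in out, "raw": out.strip()}

def report(payload: dict,
           endpoint: str = "http://detector.internal/api/v1/events"):
    # Hypothetical detection-service endpoint.
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=5)

# Usage: report({"source": "agent", "component": "disk",
#                **disk_health("/dev/sda")})
```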
3.1.2 Out‑of‑band Collection
Uses independent management hardware (BMC) to collect data even when the server is down, focusing on hardware health. It complements in‑band collection for comprehensive monitoring.
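A corresponding out-of-band sketch polls the BMC's Redfish API for system health; because the BMC runs independently of the host, this works even when the OS is down. The `/redfish/v1` layout follows the DMTF Redfish standard, but the host, credentials, and resource path below are placeholders that vary by vendor.

```python
# Out-of-band collection sketch: read system health from the BMC via Redfish.
import requests

def bmc_system_health(bmc_host: str, user: str, password: str) -> str:
    resp = requests.get(
        f"https://{bmc_host}/redfish/v1/Systems/1",  # path varies by vendor
        auth=(user, password),
        verify=False,   # many BMCs ship self-signed certificates
        timeout=10,
    )
    resp.raise_for_status()
    # Status.Health is "OK", "Warning", or "Critical" per the Redfish schema.
    return resp.json()["Status"]["Health"]
```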
3.2 Fault Rule Management
A unified rule library defines fault codes, descriptions, types (e.g., NIC, NVMe, GPU), severity levels (P0‑P2), and trigger expressions (log keywords, metric thresholds). This enables rapid fault identification and guided remediation.
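To illustrate the shape of such a rule library, here are a few hypothetical entries covering the fields described above; the fault codes, patterns, and thresholds are invented for illustration and would in reality live in the rule database.

```python
# Hypothetical rule entries: fault code, description, type, severity, trigger.
RULES = [
    {
        "fault_code": "NVME-0001",
        "description": "NVMe media error logged by the kernel",
        "type": "NVMe",
        "severity": "P1",
        "trigger": {"kind": "log_keyword", "pattern": "nvme.*media error"},
    },
    {
        "fault_code": "GPU-0003",
        "description": "GPU fell off the PCIe bus (Xid 79)",
        "type": "GPU",
        "severity": "P0",
        "trigger": {"kind": "log_keyword", "pattern": "NVRM: Xid.*79"},
    },
    {
        "fault_code": "NIC-0002",
        "description": "Sustained NIC CRC error rate above threshold",
        "type": "NIC",
        "severity": "P2",
        "trigger": {"kind": "metric_threshold",
                    "metric": "rx_crc_errors_per_min", "op": ">", "value": 10},
    },
]
```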
4. Automated Repair Solution
4.1 Business Up/Down Automation
Previously, fault notifications were sent via enterprise WeChat, requiring manual analysis and back-and-forth coordination. The new automated workflow creates repair tasks, interacts with business-side callbacks for confirmation (for example, to drain traffic before taking a node down), and closes the loop without human intervention, as sketched below.
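The following sketch shows one plausible shape of that closed loop, assuming each service registers a callback URL with the fault-management platform. All endpoints, states, and field names are illustrative, not Bilibili's actual API.

```python
# Closed-loop up/down workflow sketch: drain -> confirm -> repair -> restore.
import time
import requests

def run_repair_task(host: str, fault_code: str, callback_url: str):
    # 1. Ask the owning service to drain traffic from the faulty host.
    requests.post(callback_url, json={
        "host": host, "fault_code": fault_code, "action": "drain",
    }, timeout=10)

    # 2. Poll until the business side confirms the host carries no traffic.
    while True:
        state = requests.get(f"{callback_url}/status",
                             params={"host": host}, timeout=10).json()
        if state.get("drained"):
            break
        time.sleep(30)

    # 3. Hand the host to the repair pipeline (section 4.2), then notify
    #    the business to restore traffic once post-repair checks pass.
    ...
```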
4.2 Repair Process Automation
Key functions include:
Email/API notifications to relevant parties after task generation.
Automatic asset updates when hardware (disk, NIC, GPU) is replaced.
Server status auto‑transition (e.g., "repair", "delivered") driven by task progress.
Post‑repair health checks to verify restored operation (a minimal sketch follows this list).
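As a concrete example of the last item, the sketch below runs a few host-level checks before the task can advance the server's status back to "delivered". The specific checks and the check-registration style are assumptions for illustration.

```python
# Post-repair health check sketch: all checks must pass before delivery.
import subprocess

def check_disk_present(device: str) -> bool:
    # lsblk exits non-zero if the block device does not exist.
    return subprocess.run(["lsblk", device],
                          capture_output=True).returncode == 0

def check_link_up(iface: str) -> bool:
    out = subprocess.run(["ip", "link", "show", iface],
                         capture_output=True, text=True).stdout
    return "state UP" in out

def post_repair_health_ok(checks: list) -> bool:
    """Run every registered check; the task only advances if all pass."""
    return all(check() for check in checks)

# Usage: post_repair_health_ok([lambda: check_disk_present("/dev/nvme0n1"),
#                               lambda: check_link_up("eth0")])
```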
5. Summary and Outlook
5.1 Summary
The presented architecture integrates hardware monitoring, log collection, fault detection, and automated repair, achieving 99% coverage, 99% accuracy, and 95% recall across the server fleet.
5.2 Outlook
Intelligent Monitoring: Apply machine learning and AI to predict failures and trigger proactive repairs.
More Efficient Fault Localization: New technologies will streamline diagnosis and reduce maintenance costs.
Enhanced Security and Reliability: Strengthen hardware and software security to prevent data leaks and unauthorized access.