How Bilibili Scales Server Fault Management with Automated Detection and Repair

This article details how Bilibili handled explosive growth in server count: it classifies faults, identifies the shortcomings of manual processes, and implements an automated end-to-end workflow for detection, rule-based alerting, and repair. The workflow combines in-band and out-of-band data collection and reports 99% coverage and 99% detection accuracy.

dbaplus Community

Jul 24, 2025

Background

As the server count of this large-scale video platform grew rapidly, fault management became a serious challenge: manual inspection was slow, toolchains were fragmented, and communication overhead was high. End-to-end automation of fault detection and repair was needed.

Fault Classification

Faults are divided into:

Soft faults : software errors, service anomalies.

Hard faults : hardware failures (disk, NIC, GPU).

Either kind of fault can be handled online (without impacting service) or offline (requiring service downtime).

Limitations of Traditional Management

Late fault discovery – reliance on manual checks or user reports.

Low investigation efficiency – long resolution times.

High communication cost – manual notifications (e.g., enterprise WeChat).

Insufficient automation – no systematic audit or traceability.

Automation Goals

Accelerate fault discovery and pinpointing via an information‑collection → rule‑matching → alert pipeline.

Eliminate manual communication and fully automate the repair lifecycle (task generation, callback confirmation, closure).

Automated Fault Detection Architecture

The system consists of five core components:

Agent : a lightweight on‑server daemon that gathers hardware metrics (disk health, NIC status, GPU health) using OS utilities (e.g., dmidecode, lspci) and vendor‑specific tools, then reports to the detection service.

Log Platform : receives system logs via rsyslog for parsing and correlation.

Detection Service : hosts two detection modules:

In‑band detection – processes Agent data and kernel dmesg logs.

Out‑of‑band detection – consumes SNMP Traps and Redfish API data from BMCs.

Database : stores fault‑detection rules and generated alerts.

Fault Management Platform : visualizes alerts and provides operators with a UI for monitoring and triage.

In‑band Information Collection

The Agent invokes OS tools and vendor SDKs to collect fine‑grained metrics:

Disk module : reports remaining SSD life (%), bad‑block count, and other SMART attributes for NVMe/SSD/HDD devices. Data is structured and sent to both the detection service and an asset inventory service.

GPU module : wraps NVIDIA DCGM, DCMI, and Huawei GPU interfaces to expose utilization, ECC error counts, power draw, and XID error events. The module normalizes vendor‑specific fields into a unified health model.
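The normalization step in the GPU module can be sketched as follows. This is a minimal illustration, not the Agent's actual code: the `GpuHealth` schema, the `classify` policy, and every key in the raw dictionaries are hypothetical stand-ins for whatever the vendor tools really return.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    OK = "ok"
    WARNING = "warning"
    CRITICAL = "critical"


@dataclass(frozen=True)
class GpuHealth:
    """Vendor-neutral view of one accelerator (hypothetical schema)."""
    vendor: str
    index: int
    utilization_pct: float
    ecc_errors: int
    power_watts: float
    health: Health


def classify(ecc_errors: int, fatal_events: int) -> Health:
    # Assumed policy: fatal events are critical, ECC errors alone are a warning.
    if fatal_events > 0:
        return Health.CRITICAL
    if ecc_errors > 0:
        return Health.WARNING
    return Health.OK


def from_nvidia(raw: dict) -> GpuHealth:
    # Keys are illustrative; they are not real DCGM field identifiers.
    ecc = raw["ecc_sbe"] + raw["ecc_dbe"]
    return GpuHealth(
        vendor="nvidia",
        index=raw["gpu_id"],
        utilization_pct=float(raw["gpu_util"]),
        ecc_errors=ecc,
        power_watts=float(raw["power_w"]),
        health=classify(ecc, len(raw["xid_events"])),
    )


def from_huawei(raw: dict) -> GpuHealth:
    # Keys are illustrative; this sample is assumed to report power in milliwatts.
    return GpuHealth(
        vendor="huawei",
        index=raw["card_id"],
        utilization_pct=float(raw["utilization"]),
        ecc_errors=raw["ecc_count"],
        power_watts=raw["power_mw"] / 1000.0,
        health=classify(raw["ecc_count"], raw["fault_count"]),
    )
```

Keeping one adapter per vendor means the detection rules only ever see `GpuHealth`, so adding another accelerator type should not require touching the rule library.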

Out‑of‑band Information Collection

When a server crashes, in-band data may be unavailable. The out-of-band collector therefore gathers data from the BMC through two channels:

Redfish API : standard RESTful interface for hardware health, power state, and inventory.

SNMP Traps : the BMC pushes real-time alerts identified by vendor-defined OIDs (e.g., temperature thresholds, fan failures). SNMP Traps offer high accuracy; Redfish serves as a fallback for conditions that no trap OID covers.
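A Redfish health poll might look like the sketch below. It relies only on properties defined by the Redfish ComputerSystem schema (`PowerState`, `Status.Health`, `Status.State`). The function names, the use of HTTP Basic authentication, and the default system ID `"1"` are assumptions; the member ID under `/redfish/v1/Systems/` differs between BMC vendors.

```python
import base64
import json
import ssl
import urllib.request


def fetch_system(bmc_host: str, user: str, password: str, system_id: str = "1") -> dict:
    """GET one ComputerSystem resource from a BMC (system_id is vendor-dependent)."""
    url = f"https://{bmc_host}/redfish/v1/Systems/{system_id}"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    # Default context verifies the BMC certificate; BMCs with self-signed
    # certificates need their CA added to the trust store.
    context = ssl.create_default_context()
    with urllib.request.urlopen(request, timeout=10, context=context) as response:
        return json.load(response)


def summarize_system(payload: dict) -> dict:
    """Reduce a ComputerSystem payload to the fields a detector would need."""
    status = payload.get("Status", {})
    health = status.get("Health")
    return {
        "power_state": payload.get("PowerState"),
        "health": health,
        "state": status.get("State"),
        # Redfish health values are "OK", "Warning", and "Critical".
        "needs_alert": health in ("Warning", "Critical"),
    }
```

Separating the fetch from the summary keeps the parsing logic testable without a live BMC, for example `summarize_system({"PowerState": "On", "Status": {"Health": "Critical", "State": "Enabled"}})`.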

Fault Rule Management

A unified rule library defines each fault with the following fields:

Fault code (unique identifier)

Description

Component type (NIC, NVMe, GPU, etc.)

Severity level (P0‑P2)

Rule expression – either a log‑keyword match or a metric‑threshold condition (e.g., disk_life_percent < 5 or gpu_xid_error_count > 0).
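The rule fields above map naturally onto a small data structure. The following is a minimal sketch, assuming a rule carries either a single `metric operator number` expression or a log keyword; the `FaultRule` class, the fault codes, and the severity assignments are hypothetical, and the expression grammar is deliberately narrower than whatever the real rule library supports.

```python
import operator
import re
from dataclasses import dataclass
from typing import List, Optional

_OPS = {
    "<": operator.lt, "<=": operator.le,
    ">": operator.gt, ">=": operator.ge,
    "==": operator.eq, "!=": operator.ne,
}
# Two-character operators come first so "<=" is not read as "<".
_EXPR = re.compile(r"^\s*(\w+)\s*(<=|>=|==|!=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


@dataclass(frozen=True)
class FaultRule:
    code: str                          # unique fault code
    description: str
    component: str                     # NIC, NVMe, GPU, ...
    severity: str                      # "P0" to "P2"
    metric_expr: Optional[str] = None  # e.g. "disk_life_percent < 5"
    log_keyword: Optional[str] = None  # substring to look for in a log line

    def matches(self, metrics: dict, log_line: str = "") -> bool:
        if self.metric_expr is not None:
            parsed = _EXPR.match(self.metric_expr)
            if parsed is None:
                raise ValueError(f"unsupported expression: {self.metric_expr!r}")
            name, op, threshold = parsed.groups()
            # A metric that was not collected never triggers the rule.
            if name in metrics and _OPS[op](metrics[name], float(threshold)):
                return True
        return self.log_keyword is not None and self.log_keyword in log_line


def evaluate(rules: List[FaultRule], metrics: dict, log_line: str = "") -> List[str]:
    """Return the fault codes of every rule that fires, in rule order."""
    return [rule.code for rule in rules if rule.matches(metrics, log_line)]
```

Parsing the expression with a regular expression and an operator table, rather than calling `eval`, keeps rules stored in the database from executing arbitrary code in the detection service.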

Automated Repair Workflow

The repair process consists of two phases:

Fault detection & task generation : an alert triggers automatic creation of a repair ticket and loads business‑specific callback configuration.

Callback confirmation : the system contacts the business side via a predefined API to confirm fault details, initiate service downtime (if needed), and close the loop after repair.

Two repair modes are supported:

Online repair : lightweight issues (e.g., disk‑life exhaustion) are fixed without service interruption.

Offline repair : hardware replacement requiring downtime, typically completed within 48 hours.
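The two phases and two repair modes can be sketched as a small ticket workflow. This is an illustration under stated assumptions: the `Alert` and `RepairTicket` types, the state names, and the per-business callback table are all hypothetical, and each callback is a plain function standing in for the business side's confirmation API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict


class RepairMode(Enum):
    ONLINE = "online"
    OFFLINE = "offline"


class TicketState(Enum):
    AWAITING_CONFIRMATION = "awaiting confirmation"
    CONFIRMED = "confirmed"
    CLOSED = "closed"


@dataclass
class Alert:
    hostname: str
    fault_code: str
    business: str         # which business line owns the server
    needs_downtime: bool


@dataclass
class RepairTicket:
    alert: Alert
    mode: RepairMode
    state: TicketState = TicketState.AWAITING_CONFIRMATION


# Returns True when the business side accepts the repair plan.
Callback = Callable[[Alert, RepairMode], bool]


def open_ticket(alert: Alert, callbacks: Dict[str, Callback]) -> RepairTicket:
    """Phase 1 and 2: create the ticket, then ask the owning business to confirm."""
    mode = RepairMode.OFFLINE if alert.needs_downtime else RepairMode.ONLINE
    ticket = RepairTicket(alert, mode)
    confirm = callbacks[alert.business]
    if confirm(alert, mode):
        ticket.state = TicketState.CONFIRMED
    return ticket


def close_ticket(ticket: RepairTicket) -> None:
    """Close the loop after repair; unconfirmed tickets cannot be closed."""
    if ticket.state is not TicketState.CONFIRMED:
        raise ValueError("cannot close a ticket the business side has not confirmed")
    ticket.state = TicketState.CLOSED
```

Looking the callback up by business line mirrors the "business-specific callback configuration" described above: each team decides for itself when downtime is acceptable.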

Repair Process Automation

Automatic email or API notifications to stakeholders with fault details and attached logs.

Asset inventory updates after component replacement (disk, NIC, GPU, etc.).

Server status transitions (e.g., “under repair”, “delivered”) driven by the system workflow.

Post‑repair health checks to verify that the repaired component meets normal thresholds.
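The status transitions and the post-repair health check can be combined so that a server is only marked "delivered" once its metrics are back within limits. The sketch below assumes a hypothetical `ServerStatus` enum (the "in service" state is an addition for illustration) and a limits table of `(low, high)` ranges per metric.

```python
from enum import Enum
from typing import Dict, Tuple


class ServerStatus(Enum):
    IN_SERVICE = "in service"      # assumed state, not named in the article
    UNDER_REPAIR = "under repair"
    DELIVERED = "delivered"


Limits = Dict[str, Tuple[float, float]]


def within_limits(metrics: Dict[str, float], limits: Limits) -> bool:
    """True only if every limited metric was collected and lies in its range."""
    for name, (low, high) in limits.items():
        value = metrics.get(name)
        if value is None or not (low <= value <= high):
            return False
    return True


def start_repair(status: ServerStatus) -> ServerStatus:
    if status is ServerStatus.UNDER_REPAIR:
        raise ValueError("server is already under repair")
    return ServerStatus.UNDER_REPAIR


def finish_repair(status: ServerStatus, metrics: Dict[str, float], limits: Limits) -> ServerStatus:
    """Deliver the server only when the post-repair health check passes."""
    if status is not ServerStatus.UNDER_REPAIR:
        raise ValueError("only a server under repair can be delivered")
    if within_limits(metrics, limits):
        return ServerStatus.DELIVERED
    return ServerStatus.UNDER_REPAIR
```

Treating a missing metric as a failed check is a conservative choice: a replaced component that reports nothing stays under repair instead of being handed back silently.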

Results and Outlook

Combined in‑band and out‑of‑band monitoring achieved:

Coverage: 99 %

Detection accuracy: 99 %

Recall: 95 %

High detection accuracy means that few alerts are false positives, and high recall means that few real faults go undetected. Future work includes applying machine-learning models for predictive health monitoring, refining fault localization, and strengthening the security and reliability of the maintenance pipeline.
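To relate the accuracy and recall figures above to raw counts, the following sketch shows how such ratios are usually computed. It assumes that "detection accuracy" refers to precision (the share of alerts that correspond to real faults), which the article does not state explicitly, and the counts in the usage note are hypothetical, not Bilibili's data.

```python
def precision(true_positives: int, false_positives: int) -> float:
    """Share of raised alerts that were real faults."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives: int, false_negatives: int) -> float:
    """Share of real faults that raised an alert."""
    return true_positives / (true_positives + false_negatives)
```

For example, with 95 correctly detected faults, 1 false alert, and 5 missed faults, `precision(95, 1)` is about 0.99 and `recall(95, 5)` is 0.95.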
