How Bilibili Scales Server Fault Management with Automated Detection and Repair
This article describes how Bilibili handled rapid growth in its server fleet: classifying faults, identifying the shortcomings of manual processes, and building an automated, end-to-end workflow for detection, rule-based alerting, and repair. The workflow combines in-band and out-of-band data collection and reports 99% coverage and 99% detection accuracy.
Background
Rapid growth in server count at a large-scale video platform created severe fault-management challenges: manual inspection was slow, toolchains were fragmented, and communication overhead was high. End-to-end automation of fault detection and repair was required.
Fault Classification
Faults are divided into:
Soft faults : software errors, service anomalies.
Hard faults : hardware failures (disk, NIC, GPU).
Each fault is also classed as online (it can be handled without impacting service) or offline (repair requires service downtime).
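The two classification axes above can be captured in a small data type. The following is a minimal sketch in Python; the names `FaultKind`, `RepairMode`, and `Fault` are illustrative assumptions and do not come from Bilibili's system.

```python
from dataclasses import dataclass
from enum import Enum


class FaultKind(Enum):
    SOFT = "soft"  # software errors, service anomalies
    HARD = "hard"  # hardware failures (disk, NIC, GPU)


class RepairMode(Enum):
    ONLINE = "online"    # handled without impacting service
    OFFLINE = "offline"  # repair requires service downtime


@dataclass(frozen=True)
class Fault:
    """One detected fault, tagged along both classification axes."""
    host: str
    component: str
    kind: FaultKind
    mode: RepairMode

    @property
    def needs_downtime(self) -> bool:
        return self.mode is RepairMode.OFFLINE
```

Keeping the two axes as separate enums lets later stages (alerting, ticketing) branch on downtime requirements without re-deriving them from the fault description.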
Limitations of Traditional Management
Late fault discovery – reliance on manual checks or user reports.
Low investigation efficiency – long resolution times.
High communication cost – manual notifications (e.g., enterprise WeChat).
Insufficient automation – no systematic audit or traceability.
Automation Goals
Accelerate fault discovery and pinpointing via an information‑collection → rule‑matching → alert pipeline.
Eliminate manual communication and fully automate the repair lifecycle (task generation, callback confirmation, closure).
Automated Fault Detection Architecture
The system consists of five core components:
Agent : a lightweight on‑server daemon that gathers hardware metrics (disk health, NIC status, GPU health) using OS utilities (e.g., dmidecode, lspci) and vendor‑specific tools, then reports to the detection service.
Log Platform : receives system logs via rsyslog for parsing and correlation.
Detection Service : hosts two detection modules:
In‑band detection – processes Agent data and kernel dmesg logs.
Out‑of‑band detection – consumes SNMP Traps and Redfish API data from BMCs.
Database : stores fault‑detection rules and generated alerts.
Fault Management Platform : visualizes alerts and provides operators with a UI for monitoring and triage.
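To illustrate the in-band path, the sketch below scans kernel log lines for fault keywords. It is an assumption-laden example, not the detection service itself: the three patterns are common Linux kernel messages chosen for illustration, and `scan_dmesg` and `LogAlert` are hypothetical names.

```python
import re
from typing import Iterable, List, NamedTuple


class LogAlert(NamedTuple):
    component: str
    line: str


# Illustrative patterns only; a production rule set would be far larger.
KEYWORD_RULES = [
    ("disk", re.compile(r"I/O error", re.IGNORECASE)),
    ("gpu", re.compile(r"NVRM: Xid")),
    ("nic", re.compile(r"NIC Link is Down")),
]


def scan_dmesg(lines: Iterable[str]) -> List[LogAlert]:
    """Return one alert per log line that matches a keyword rule."""
    alerts = []
    for line in lines:
        for component, pattern in KEYWORD_RULES:
            if pattern.search(line):
                alerts.append(LogAlert(component, line.strip()))
                break  # first matching rule wins for this line
    return alerts
```

In practice the lines would arrive from the Agent or the log platform; here they are passed in as plain strings so the matching logic can be tested in isolation.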
In‑band Information Collection
The Agent invokes OS tools and vendor SDKs to collect fine‑grained metrics:
Disk module : reports remaining SSD life (%), bad‑block count, and other SMART attributes for NVMe/SSD/HDD devices. Data is structured and sent to both the detection service and an asset inventory service.
GPU module : wraps NVIDIA DCGM, DCMI, and Huawei GPU interfaces to expose utilization, ECC error counts, power draw, and XID error events. The module normalizes vendor‑specific fields into a unified health model.
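The normalization step in the GPU module might look like the sketch below. All raw field names (`gpu_util`, `ecc_dbe_total`, `power_mw`, and so on) are hypothetical placeholders, not real DCGM or DCMI field names, and the unit conversion for the Huawei sample is an assumption made to show why normalization is needed.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass
class GpuHealth:
    """Vendor-neutral health record (illustrative model)."""
    vendor: str
    utilization_pct: float
    ecc_errors: int
    power_watts: float
    healthy: bool


def _from_nvidia(raw: Dict[str, Any]) -> GpuHealth:
    # Hypothetical field names for an NVIDIA-side sample.
    ecc = raw["ecc_dbe_total"]
    healthy = ecc == 0 and not raw.get("xid_events")
    return GpuHealth("nvidia", raw["gpu_util"], ecc, raw["power_usage_w"], healthy)


def _from_huawei(raw: Dict[str, Any]) -> GpuHealth:
    # Hypothetical field names; power is assumed to arrive in milliwatts.
    ecc = raw["ecc_error_count"]
    healthy = ecc == 0 and raw.get("health_code", 0) == 0
    return GpuHealth("huawei", raw["utilization_rate"], ecc,
                     raw["power_mw"] / 1000.0, healthy)


NORMALIZERS: Dict[str, Callable[[Dict[str, Any]], GpuHealth]] = {
    "nvidia": _from_nvidia,
    "huawei": _from_huawei,
}


def normalize(vendor: str, raw: Dict[str, Any]) -> GpuHealth:
    """Map a vendor-specific sample onto the unified health model."""
    if vendor not in NORMALIZERS:
        raise ValueError(f"unsupported GPU vendor: {vendor}")
    return NORMALIZERS[vendor](raw)
```

With a unified record, downstream rules can reference one field such as `ecc_errors` regardless of which vendor tool produced the sample.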
Out‑of‑band Information Collection
When a server crashes, in‑band data may be unavailable. The out‑of‑band collector queries the BMC via:
Redfish API : standard RESTful interface for hardware health, power state, and inventory.
SNMP Traps : the BMC pushes real-time alerts identified by vendor-defined OIDs (e.g., temperature thresholds, fan failures). Traps give accurate, immediate signals; Redfish polling serves as a fallback for conditions that no trap OID covers.
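As a sketch of the Redfish side, the function below summarizes a ComputerSystem resource, which a BMC typically serves under `/redfish/v1/Systems/{id}`. The `Status.Health`, `Status.HealthRollup`, and `PowerState` properties are part of the standard Redfish schema; the function name, the output shape, and the decision to treat a missing status as needing attention are assumptions. The HTTP request and BMC authentication are deliberately left out.

```python
from typing import Any, Dict


def summarize_system(payload: Dict[str, Any]) -> Dict[str, Any]:
    """Reduce a Redfish ComputerSystem payload to the fields a detector needs."""
    status = payload.get("Status") or {}
    # HealthRollup reflects subordinate components; fall back to Health.
    health = status.get("HealthRollup") or status.get("Health") or "Unknown"
    return {
        "id": payload.get("Id"),
        "power_state": payload.get("PowerState", "Unknown"),
        "health": health,
        # Anything other than "OK" (including "Unknown") is flagged for review.
        "needs_attention": health != "OK",
    }
```

Because the function takes an already-decoded JSON object, the same logic applies whether the payload was polled on a schedule or fetched in response to a trap.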
Fault Rule Management
A unified rule library defines each fault with the following fields:
Fault code (unique identifier)
Description
Component type (NIC, NVMe, GPU, etc.)
Severity level (P0‑P2)
Rule expression – either a log‑keyword match or a metric‑threshold condition (e.g., disk_life_percent < 5 or gpu_xid_error_count > 0).
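A rule record with these fields, together with a small evaluator for the metric-threshold form, could be sketched as follows. The fault codes, the `Rule` class, and the restricted `metric op number` grammar are assumptions; the source does not specify how expressions are parsed. The log-keyword form is not covered here.

```python
import operator
import re
from dataclasses import dataclass
from typing import Dict, List

_OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt,
        ">=": operator.ge, "==": operator.eq}
# Accepts only "<metric> <op> <number>", which avoids calling eval().
_EXPR = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


@dataclass(frozen=True)
class Rule:
    code: str         # unique fault code (hypothetical format)
    description: str
    component: str    # NIC, NVMe, GPU, ...
    severity: str     # "P0", "P1", or "P2"
    expression: str   # e.g. "disk_life_percent < 5"

    def matches(self, metrics: Dict[str, float]) -> bool:
        m = _EXPR.match(self.expression)
        if m is None:
            raise ValueError(f"unsupported expression: {self.expression!r}")
        name, op, threshold = m.group(1), m.group(2), float(m.group(3))
        if name not in metrics:
            return False  # metric not reported, so the rule cannot fire
        return _OPS[op](metrics[name], threshold)


def evaluate(rules: List[Rule], metrics: Dict[str, float]) -> List[Rule]:
    """Return the rules that fire, most severe (P0) first."""
    hits = [r for r in rules if r.matches(metrics)]
    return sorted(hits, key=lambda r: r.severity)
```

Restricting the grammar to a single comparison keeps rules auditable and easy to store in a database column; compound conditions would need a richer parser.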
Automated Repair Workflow
The repair process consists of two phases:
Fault detection & task generation : an alert triggers automatic creation of a repair ticket and loads business‑specific callback configuration.
Callback confirmation : the system contacts the business side via a predefined API to confirm fault details, initiate service downtime (if needed), and close the loop after repair.
Two repair modes are supported:
Online repair : lightweight issues (e.g., disk‑life exhaustion) are fixed without service interruption.
Offline repair : hardware replacement requiring downtime, typically completed within 48 hours.
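The ticket lifecycle described above (task generation, callback confirmation, closure) can be sketched as follows. The status names, the `RepairTicket` fields, and the idea of modeling the business-side API as a Python callable keyed by host are all assumptions made for illustration.

```python
import itertools
from dataclasses import dataclass, field
from typing import Callable, Dict, List

_ticket_ids = itertools.count(1)


@dataclass
class RepairTicket:
    ticket_id: int
    host: str
    fault_code: str
    mode: str  # "online" or "offline"
    status: str = "open"
    history: List[str] = field(default_factory=list)


# Stand-in for the business-side API: returns True once the owner has
# confirmed the fault (and drained traffic, for offline repairs).
Callback = Callable[[RepairTicket], bool]


def open_ticket(host: str, fault_code: str, needs_downtime: bool) -> RepairTicket:
    mode = "offline" if needs_downtime else "online"
    ticket = RepairTicket(next(_ticket_ids), host, fault_code, mode)
    ticket.history.append("open")
    return ticket


def confirm(ticket: RepairTicket, callbacks: Dict[str, Callback]) -> RepairTicket:
    callback = callbacks.get(ticket.host)
    approved = callback(ticket) if callback else False
    ticket.status = "confirmed" if approved else "pending_confirmation"
    ticket.history.append(ticket.status)
    return ticket


def close(ticket: RepairTicket) -> RepairTicket:
    if ticket.status != "confirmed":
        raise RuntimeError("cannot close an unconfirmed ticket")
    ticket.status = "closed"
    ticket.history.append("closed")
    return ticket
```

Recording each status in `history` gives the audit trail that the manual process lacked.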
Repair Process Automation
Automatic email or API notifications to stakeholders with fault details and attached logs.
Asset inventory updates after component replacement (disk, NIC, GPU, etc.).
Server status transitions (e.g., “under repair”, “delivered”) driven by the system workflow.
Post‑repair health checks to verify that the repaired component meets normal thresholds.
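Workflow-driven status transitions gated by a post-repair health check could be sketched like this. Only "under repair" and "delivered" come from the text; the other states, the transition table, and the minimum-threshold check are assumptions.

```python
from typing import Dict, Mapping, Set

# Allowed server status transitions (illustrative).
TRANSITIONS: Dict[str, Set[str]] = {
    "in_service": {"under_repair"},
    "under_repair": {"pending_check"},
    "pending_check": {"delivered", "under_repair"},
    "delivered": {"in_service"},
}


def transition(current: str, target: str) -> str:
    """Return the new state, rejecting moves the workflow does not allow."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target


def post_repair_check(metrics: Mapping[str, float],
                      minimums: Mapping[str, float]) -> bool:
    """True when every required metric is present and at or above its minimum."""
    return all(name in metrics and metrics[name] >= low
               for name, low in minimums.items())


def finish_repair(state: str, metrics: Mapping[str, float],
                  minimums: Mapping[str, float]) -> str:
    """Run the health check and either deliver the server or send it back."""
    state = transition(state, "pending_check")
    target = "delivered" if post_repair_check(metrics, minimums) else "under_repair"
    return transition(state, target)
```

Routing a failed check back to "under repair" keeps a server from being delivered with a component that still violates its thresholds.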
Results and Outlook
Combined in‑band and out‑of‑band monitoring achieved:
Coverage: 99 %
Detection accuracy: 99 %
Recall: 95 %
These figures suggest that few alerts are false positives and that few faults go undetected. Future work includes applying machine-learning models for predictive health monitoring, refining fault localization, and strengthening the security and reliability of the maintenance pipeline.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
dbaplus Community