
Database Failure Management: Types, Mitigation Strategies, and Bilibili’s Practices

The article outlines common database and cache failures—such as instance outages, replication lag, data corruption, and cache avalanches—while detailing Bilibili’s mitigation strategies including high‑availability architectures, scaling, multi‑active designs, proxy controls, slow‑query alerts, fault‑injection drills, and ongoing resilience improvements.

Bilibili Tech

In March this year, GitHub experienced multiple service outages lasting 2–5 hours each, affecting up to 73 million developers. GitHub’s senior engineering VP Keith Ballinger explained that the root cause was resource contention in the “MySQL1” cluster during peak load, which impacted many services.

Database failures can severely affect enterprise systems. Drawing on Bilibili’s own experience, this article shares insights on handling database faults.

1. What is a Database Failure?

There is no strict academic definition; companies usually quantify failures by their impact on applications. The common failure categories are outlined below.

2. Common Database Failures

2.1 MySQL

2.1.1 Instance Unavailability

Instances may become unavailable due to hardware failures (CPU, memory, disk, motherboard), system bugs (OS or DB bugs), network issues (device or dedicated line failures), or resource mis‑allocation such as OOM.

2.1.2 Data Latency

This includes master‑slave replication lag and delays in binlog‑based subscription services.

2.1.2.1 Replication Lag

Causes include large transactions on the master, high write frequency, DDL on large tables, low‑performance slave hardware, high slave load, MDL locks, network problems, and replication bugs (especially multithreaded replication).

2.1.2.2 Subscription Service Delay

Caused by upstream slave lag, Kafka bottlenecks, or performance limits of canal‑type components.

2.1.3 Data Corruption

Causes include the InnoDB doublewrite buffer being disabled when a crash occurs, lack of semi‑synchronous replication, data loss introduced by online DDL tools (e.g., pt‑osc, gh‑ost), and human error (accidental or malicious deletions).

2.1.4 Performance Degradation

This manifests as an increase in slow queries, performance jitter, or overload.

2.1.4.1 Increase in Slow Queries

Root causes: inefficient new‑business SQL or missing indexes, changed business scenarios, data skew, low InnoDB buffer pool hit rate, optimizer bugs.

2.1.4.2 Performance Jitter

Occasional spikes due to internal DB behaviors such as dirty page flushing, batch jobs, or scheduled tasks.

2.1.4.3 Overload

Triggered by sudden traffic spikes, upstream cache failures, or large‑scale events exceeding capacity.

2.2 Cache Failures

2.1 Cache Penetration

Occurs when queries for non‑existent data miss both cache and persistence layers, often due to mismatched keys or malicious attacks.
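A common mitigation is to cache the miss itself, so repeated lookups for a non-existent key stop reaching the database. A minimal sketch in Python, using plain dicts to stand in for Redis and the persistence layer (class name, TTLs, and the `backing_db` parameter are illustrative, not Bilibili's actual code):

```python
import time

class NullCachingStore:
    """Sketch of null-value caching to blunt cache penetration."""

    _MISS = object()  # sentinel stored for keys that are absent from the DB

    def __init__(self, backing_db, null_ttl=60, hit_ttl=300):
        self.db = backing_db      # dict standing in for MySQL
        self.null_ttl = null_ttl  # short TTL so newly inserted rows surface quickly
        self.hit_ttl = hit_ttl
        self.cache = {}           # key -> (value, expires_at)
        self.db_hits = 0          # how often a request fell through to the DB

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[1] > time.time():
            value = entry[0]
            return None if value is self._MISS else value
        self.db_hits += 1
        value = self.db.get(key)
        if value is None:
            # Cache the miss itself: further queries for the same
            # non-existent key no longer touch the database.
            self.cache[key] = (self._MISS, time.time() + self.null_ttl)
            return None
        self.cache[key] = (value, time.time() + self.hit_ttl)
        return value
```

For keyspaces subject to malicious probing, a Bloom filter in front of the cache is the usual complement, rejecting keys that provably do not exist.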

2.2.2 Cache Breakdown

Happens when hot keys (e.g., flash‑sale items) cause massive concurrent rebuilds, overwhelming the backend.
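The standard defense is per-key locking ("single flight"): when a hot key expires, one request rebuilds it while the rest wait, instead of all of them stampeding the database. A hedged sketch, with names chosen for illustration:

```python
import threading

class SingleFlightCache:
    """Per-key mutex so only one caller rebuilds an expired hot key."""

    def __init__(self, loader):
        self.loader = loader           # slow function that hits the DB
        self.cache = {}
        self.locks = {}
        self.guard = threading.Lock()  # protects the per-key lock table
        self.loads = 0                 # number of actual DB rebuilds

    def get(self, key):
        if key in self.cache:
            return self.cache[key]
        with self.guard:
            lock = self.locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have rebuilt the value
            # while we were waiting for the lock.
            if key not in self.cache:
                self.loads += 1
                self.cache[key] = self.loader(key)
        return self.cache[key]
```

With a distributed cache the same idea is implemented with a short-lived lock key (e.g., Redis SET NX with an expiry) rather than an in-process mutex.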

2.2.3 Cache Avalanche

Results from many keys expiring simultaneously or Redis failures, leading to a flood of requests to the database and possible system collapse.
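Avalanches caused by synchronized expiry are usually avoided by jittering TTLs, so a batch of keys written together does not expire together. A minimal sketch (function names and the ±20% spread are illustrative):

```python
import random

def jittered_ttl(base_ttl, spread=0.2, rng=random):
    """Return base_ttl (seconds) perturbed by up to +/- spread (fraction)."""
    jitter = rng.uniform(-spread, spread)
    return max(1, int(base_ttl * (1 + jitter)))

def ttls_for_batch(keys, base_ttl=3600, rng=random):
    """Assign an independently jittered TTL to every key cached at once."""
    return {k: jittered_ttl(base_ttl, rng=rng) for k in keys}
```

Jitter addresses only the expiry-synchronization case; surviving a full Redis outage additionally needs the rate limiting and circuit breaking discussed below.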

3. Bilibili’s Database Fault Governance Practices

3.1 High Availability

Designs incorporate HA components (MGR, PXC, Orchestrator, MySQL Replication Manager, MHA, etc.) to minimize downtime when instances become unavailable.

3.2 Scaling

Vertical scaling (increasing buffer pool, CPU quotas) and horizontal scaling (read‑only pools, multi‑zone replicas, TiDB with Kubernetes HPA) are employed. Bilibili heavily uses TiDB for elastic workloads.

3.3 Multi‑Active Architecture

Includes remote disaster recovery, same‑city dual‑active, cross‑region active‑active, and multi‑center designs. Traffic routing is handled via database proxies, SDK metadata, or application‑level switches (CDN, SLB, API gateway).

3.4 Database Proxy

Proxies provide read/write separation, request interception, rate limiting, and circuit breaking, allowing fine‑grained control over problematic SQL patterns.
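The rate-limiting piece is commonly a token bucket applied per SQL fingerprint. A toy sketch of the mechanism (a sketch of the general technique, not Bilibili's proxy code; parameter names are illustrative):

```python
class TokenBucket:
    """Token-bucket limiter: steady rate with a bounded burst."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate          # tokens replenished per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = now

    def allow(self, now):
        # Refill proportionally to elapsed time, then spend one token.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A proxy keyed on normalized SQL text can throttle one problematic query pattern while leaving the rest of the traffic untouched.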

3.5 Slow Query Alerting

A slow‑query alert system collects logs, performs streaming analysis, and triggers alerts based on multi‑dimensional thresholds (e.g., 7‑day baseline, growth trends). It also supports self‑healing actions.
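The thresholding logic can be sketched as follows: alert only when today's slow-query count both clears an absolute floor and grows sharply against the 7-day baseline. The specific numbers below are illustrative assumptions, not Bilibili's production thresholds:

```python
from statistics import mean

def should_alert(today_count, last7_counts, abs_min=50, growth=2.0):
    """Multi-dimensional slow-query alert decision (sketch)."""
    if today_count < abs_min:
        return False               # too few slow queries to matter
    baseline = mean(last7_counts) if last7_counts else 0
    if baseline == 0:
        return True                # a brand-new slow-query pattern
    # Alert only on significant growth over the 7-day baseline.
    return today_count >= growth * baseline
```

Combining an absolute floor with a relative-growth check keeps small, noisy workloads from paging anyone while still catching regressions on busy clusters.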

3.6 Case Study: Replication Bug

Problem: Appointment creation failed because the replica did not return data. Investigation with show global variables like '%gtid%' revealed a stagnant gtid_binlog_pos, indicating a MariaDB multithreaded replication bug.

Resolution: Traffic was switched to the master via proxy configuration, and the canal component was pointed to the master, quickly containing the impact. Post‑mortem actions added a heartbeat table and automated failover logic based on HA metrics.
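The value of a heartbeat table is that it measures real apply lag even when replica-side status variables (such as a stuck gtid_binlog_pos) are misleading: the master writes a timestamp row every second, and the replica's copy shows how far replication has actually applied. A sketch of the resulting check (function names and the 10-second threshold are assumptions for illustration):

```python
def replication_lag_seconds(master_heartbeat_ts, replica_heartbeat_ts):
    """Lag = timestamp the master last wrote minus the timestamp the
    replica has actually applied (both Unix epoch seconds)."""
    return max(0.0, master_heartbeat_ts - replica_heartbeat_ts)

def should_fail_away(lag_seconds, threshold=10.0):
    """Route reads back to the master once apply lag passes the threshold."""
    return lag_seconds > threshold
```

Automated failover logic can poll this check and reconfigure the proxy, which is the kind of self-healing action the post-mortem added.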

4. Database Fault Drills

Fault injection follows chaos engineering principles, simulating scenarios such as node crashes, CPU spikes, network latency, packet loss, disk saturation, I/O bursts, and OOM. Drills are integrated with application‑level tests.
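In spirit, such drills wrap a dependency and inject failures or extra latency with a configured probability. A toy sketch of the idea (not an actual chaos-engineering tool; all names are illustrative):

```python
import random
import time

class ChaosProxy:
    """Wraps a callable and probabilistically injects faults or delay."""

    def __init__(self, target, error_rate=0.0, extra_latency=0.0, rng=random):
        self.target = target
        self.error_rate = error_rate        # probability of an injected error
        self.extra_latency = extra_latency  # seconds of injected delay
        self.rng = rng
        self.injected_errors = 0

    def call(self, *args, **kwargs):
        if self.rng.random() < self.error_rate:
            self.injected_errors += 1
            raise ConnectionError("injected fault")
        if self.extra_latency:
            time.sleep(self.extra_latency)
        return self.target(*args, **kwargs)
```

Pointing application-level tests at such a wrapper verifies that retries, timeouts, and failover paths behave as designed before a real outage exercises them.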

Example: In Bilibili’s G‑zone architecture, a drill simulated an entire IDC outage, redirecting traffic via CDN to another data center to verify that the database, cache, and services could handle full load.

5. Summary and Outlook

Fault governance aims to prevent repeat incidents by proactively addressing root causes. Bilibili will continue to standardize fault definitions, regularize drills, and refine detection and remediation mechanisms to enhance system resilience.

Tags: Cache, Database, MySQL, High Availability, Bilibili, Failure Management
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.