Database Failure Management: Types, Mitigation Strategies, and Bilibili’s Practices
The article outlines common database and cache failures—such as instance outages, replication lag, data corruption, and cache avalanches—while detailing Bilibili’s mitigation strategies including high‑availability architectures, scaling, multi‑active designs, proxy controls, slow‑query alerts, fault‑injection drills, and ongoing resilience improvements.
In March this year, GitHub experienced multiple service outages lasting 2–5 hours each, affecting up to 73 million developers. GitHub’s senior engineering VP Keith Ballinger explained that the root cause was resource contention in the “MySQL1” cluster during peak load, which impacted many services.
Database failures can severely affect enterprise systems. Drawing on Bilibili’s own experience, this article shares insights on handling database faults.
1. What is a Database Failure?
There is no strict academic definition; companies usually quantify failures by their impact on applications. The common categories are described in the sections that follow.
2. Common Database Failures
2.1 MySQL
2.1.1 Instance Unavailability
Instances may become unavailable due to hardware failures (CPU, memory, disk, motherboard), system bugs (OS or DB bugs), network issues (device or dedicated line failures), or resource mis‑allocation such as OOM.
2.1.2 Data Latency
Includes master‑slave replication lag and binlog‑based subscription service delay.
2.1.2.1 Replication Lag
Causes include large transactions on the master, high write frequency, DDL on large tables, low‑performance slave hardware, high slave load, MDL locks, network problems, and replication bugs (especially multithreaded replication).
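When a replica falls behind for any of these reasons, a common operational response is to stop routing reads to it. The sketch below is a hypothetical read-routing helper, not Bilibili's actual implementation; the replica metadata shape and the 5-second threshold are illustrative assumptions.

```python
# Hypothetical sketch: route reads away from lagging replicas.
# A replica whose replication thread has stopped reports lag as None.

def pick_read_source(replicas, max_lag_seconds=5):
    """Return the least-lagged replica within the threshold,
    falling back to the master when no replica qualifies."""
    healthy = [r for r in replicas
               if r["lag"] is not None and r["lag"] <= max_lag_seconds]
    if not healthy:
        return "master"
    return min(healthy, key=lambda r: r["lag"])["name"]

replicas = [
    {"name": "replica-1", "lag": 2},
    {"name": "replica-2", "lag": 120},   # e.g., replaying a large transaction
    {"name": "replica-3", "lag": None},  # replication thread stopped
]
print(pick_read_source(replicas))  # replica-1
```

Routing every read to the master when all replicas lag trades read scalability for consistency, which is usually the right call during an incident.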
2.1.2.2 Subscription Service Delay
Caused by upstream slave lag, Kafka bottlenecks, or performance limits of canal‑type components.
2.1.3 Data Corruption
Reasons include missing double‑write parameters during crashes, lack of semi‑synchronous replication, DDL tools (e.g., pt‑osc, gh‑ost) losing data, and human error (accidental or malicious deletions).
2.1.4 Performance Degradation
Manifested as increased slow queries, performance jitter, or overload.
2.1.4.1 Increase in Slow Queries
Root causes: inefficient new‑business SQL or missing indexes, changed business scenarios, data skew, low InnoDB buffer pool hit rate, optimizer bugs.
2.1.4.2 Performance Jitter
Occasional spikes due to internal DB behaviors such as dirty page flushing, batch jobs, or scheduled tasks.
2.1.4.3 Overload
Triggered by sudden traffic spikes, upstream cache failures, or large‑scale events exceeding capacity.
2.2 Cache Failures
2.2.1 Cache Penetration
Occurs when queries for non‑existent data miss both cache and persistence layers, often due to mismatched keys or malicious attacks.
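A standard countermeasure for penetration is to cache the fact that a key does not exist, so repeated lookups stop reaching the database. The following is a minimal sketch of null-object caching, assuming an in-process dictionary as the cache; the TTL values and function names are illustrative, not part of the original article.

```python
import time

_MISS = object()  # sentinel marking "known to be absent"
cache = {}        # key -> (value, expires_at)

def get_user(key, db_lookup, miss_ttl=30, hit_ttl=300):
    """Look up a key, caching misses briefly so repeated queries
    for non-existent data do not hit the database."""
    entry = cache.get(key)
    if entry and entry[1] > time.time():
        value = entry[0]
        return None if value is _MISS else value
    value = db_lookup(key)  # falls through to the database
    if value is None:
        cache[key] = (_MISS, time.time() + miss_ttl)  # cache the miss
        return None
    cache[key] = (value, time.time() + hit_ttl)
    return value

calls = []
def fake_db(key):
    calls.append(key)
    return None  # the key exists nowhere

get_user("ghost", fake_db)
get_user("ghost", fake_db)
print(len(calls))  # 1: the database was queried only once
```

A short TTL on the miss entry bounds the memory cost of malicious key floods; a Bloom filter in front of the cache is the heavier-weight alternative.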
2.2.2 Cache Breakdown
Happens when a hot key (e.g., a flash‑sale item) expires and a burst of concurrent requests all try to rebuild it at once, overwhelming the backend.
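The usual fix is to let exactly one caller rebuild an expired hot key while the others wait, often called a mutex or "singleflight" pattern. Below is a hedged in-process sketch using a per-key lock with a double-check; the names and the thread count are illustrative.

```python
import threading

cache = {}
locks = {}
locks_guard = threading.Lock()
rebuilds = 0

def get_hot(key, rebuild):
    """Return the cached value, allowing only one thread to
    rebuild a missing key while concurrent callers wait."""
    if key in cache:
        return cache[key]
    with locks_guard:
        lock = locks.setdefault(key, threading.Lock())
    with lock:                # one rebuilder per key
        if key in cache:      # another thread may have filled it meanwhile
            return cache[key]
        cache[key] = rebuild(key)
        return cache[key]

def expensive_rebuild(key):
    global rebuilds
    rebuilds += 1             # counts trips to the backend
    return key.upper()

threads = [threading.Thread(target=get_hot, args=("sku-42", expensive_rebuild))
           for _ in range(20)]
for t in threads: t.start()
for t in threads: t.join()
print(rebuilds)  # 1: twenty concurrent readers, one backend rebuild
```

In a distributed cache the same idea uses a short-TTL lock key (e.g., Redis SET NX) instead of a process-local mutex.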
2.2.3 Cache Avalanche
Results from many keys expiring simultaneously or Redis failures, leading to a flood of requests to the database and possible system collapse.
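The simultaneous-expiry half of the avalanche problem is commonly mitigated by adding random jitter to TTLs so a batch of keys written together does not expire together. A minimal sketch, where the 10% spread is an assumption:

```python
import random

def jittered_ttl(base_ttl_seconds, spread=0.10):
    """Return the base TTL plus a random offset of up to +/- spread,
    so keys cached in the same batch expire at staggered times."""
    delta = base_ttl_seconds * spread
    return base_ttl_seconds + random.uniform(-delta, delta)

# 1000 keys cached with a nominal 1-hour TTL land between 54 and 66 minutes.
ttls = [jittered_ttl(3600) for _ in range(1000)]
print(min(ttls) >= 3240, max(ttls) <= 3960)  # True True
```

Jitter addresses coordinated expiry; the Redis-failure half of the avalanche scenario instead needs the high-availability and multi-active measures described later in the article.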
3. Bilibili’s Database Fault Governance Practices
3.1 High Availability
Designs incorporate HA components (MGR, PXC, Orchestrator, MySQL Replication Manager, MHA, etc.) to minimize downtime when instances become unavailable.
3.2 Scaling
Vertical scaling (increasing buffer pool, CPU quotas) and horizontal scaling (read‑only pools, multi‑zone replicas, TiDB with Kubernetes HPA) are employed. Bilibili heavily uses TiDB for elastic workloads.
3.3 Multi‑Active Architecture
Includes remote disaster recovery, same‑city dual‑active, cross‑region active‑active, and multi‑center designs. Traffic routing is handled via database proxies, SDK metadata, or application‑level switches (CDN, SLB, API gateway).
3.4 Database Proxy
Proxies provide read/write separation, request interception, rate limiting, and circuit breaking, allowing fine‑grained control over problematic SQL patterns.
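To make the proxy's role concrete, here is a small sketch of two of the controls mentioned above: interception of a dangerous SQL pattern and per-fingerprint rate limiting via a token bucket. This is an illustrative model, not Bilibili's proxy; the blocked pattern, fingerprinting rule, and rates are all assumptions.

```python
import re, time

# Reject obviously dangerous statements, e.g., DELETE without a WHERE clause.
BLOCKED = [re.compile(r"(?i)^delete\s+from\s+\w+\s*$")]

class TokenBucket:
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens = burst
        self.last = time.monotonic()
    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {}
def admit(sql, rate=100, burst=5):
    """Decide whether the proxy forwards, throttles, or rejects a statement."""
    if any(p.match(sql.strip()) for p in BLOCKED):
        return "rejected"
    fingerprint = re.sub(r"\d+", "?", sql)  # crude normalization of literals
    bucket = buckets.setdefault(fingerprint, TokenBucket(rate, burst))
    return "ok" if bucket.allow() else "throttled"

print(admit("DELETE FROM users"))                  # rejected
print(admit("SELECT * FROM users WHERE id = 7"))   # ok
```

Keying the bucket on a normalized fingerprint is what allows the fine-grained control the article describes: one runaway SQL pattern can be throttled without touching the rest of the traffic.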
3.5 Slow Query Alerting
A slow‑query alert system collects logs, performs streaming analysis, and triggers alerts based on multi‑dimensional thresholds (e.g., 7‑day baseline, growth trends). It also supports self‑healing actions.
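The multi-dimensional thresholding can be sketched as a function that fires only when today's slow-query count clears both an absolute floor and a multiple of the trailing 7-day baseline; the specific numbers below are illustrative assumptions, not the article's actual thresholds.

```python
def should_alert(today_count, last_7_days, abs_floor=50, growth_factor=2.0):
    """Alert when today's slow-query count exceeds both an absolute
    floor and a multiple of the trailing 7-day average, so neither
    tiny fluctuations nor steady low volumes trigger pages."""
    baseline = sum(last_7_days) / len(last_7_days)
    return today_count >= abs_floor and today_count >= growth_factor * baseline

history = [30, 28, 35, 31, 29, 33, 30]  # trailing week of daily counts
print(should_alert(40, history))   # False: within the normal range
print(should_alert(120, history))  # True: roughly 4x the baseline
```

Combining the two conditions is the point: the absolute floor suppresses noise on quiet databases, while the growth factor catches regressions on busy ones.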
3.6 Case Study: Replication Bug
Problem: Appointment creation failed because a replica returned no data. Investigation with show global variables like '%gtid%' revealed a stagnant gtid_binlog_pos, indicating a MariaDB multithreaded replication bug.
Resolution: Traffic was switched to the master via proxy configuration, and the canal component was pointed to the master, quickly containing the impact. Post‑mortem actions added a heartbeat table and automated failover logic based on HA metrics.
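A heartbeat table catches exactly this failure mode: the master updates a timestamped row on a fixed interval, and a checker compares the replica's copy against the clock, so a stalled SQL thread shows up as steadily growing lag even when built-in lag counters read zero. A minimal sketch of the check, with illustrative names and a 1-second write interval assumed:

```python
from datetime import datetime, timedelta

def heartbeat_lag(replica_heartbeat_ts, now, write_interval=timedelta(seconds=1)):
    """Replication lag as observed through the heartbeat table:
    how far the replica's last applied heartbeat trails the clock,
    minus the expected gap of one write interval."""
    lag = now - replica_heartbeat_ts - write_interval
    return max(lag, timedelta(0))

now = datetime(2023, 4, 1, 12, 0, 0)
stale = datetime(2023, 4, 1, 11, 58, 0)  # replica stopped applying events
print(heartbeat_lag(stale, now).total_seconds())  # 119.0
```

Feeding this measurement into the HA component is what makes the automated failover mentioned above possible: the decision no longer depends on replication counters that the bug itself can falsify.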
4. Database Fault Drills
Fault injection follows chaos engineering principles, simulating scenarios such as node crashes, CPU spikes, network latency, packet loss, disk saturation, I/O bursts, and OOM. Drills are integrated with application‑level tests.
Example: In Bilibili’s G‑zone architecture, a drill simulated an entire IDC outage, redirecting traffic via CDN to another data center to verify that the database, cache, and services could handle full load.
5. Summary and Outlook
Fault governance aims to prevent repeat incidents by proactively addressing root causes. Bilibili will continue to standardize fault definitions, regularize drills, and refine detection and remediation mechanisms to enhance system resilience.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.