Why Do Database Failures Happen and How to Prevent Them?
This article examines common hardware and network failures in data centers, analyzes real‑world outage cases, classifies fault domains, and presents comprehensive strategies for database fault handling—including logging, checkpointing, backup, replication, and high‑availability architectures—to improve reliability and reduce downtime.
Fault Overview
ACID defines four transaction properties; durability (the D) ensures data is not lost despite failures. As distributed databases grow, complexity and failure scenarios increase.
Common Hardware Failures
Typical data‑center failure probabilities are illustrated below.
Designs, Lessons and Advice from Building Large Distributed Systems, Jeff Dean
Network Failures
Beyond hardware, distributed systems must consider additional network faults such as brain split (partition) and multi‑network‑segment failures.
Fragile Data Centers
Monitoring data from CloudHarmony (Nov 22 2017) shows many data centers, including Azure, fall short of the advertised 99.95% availability.
Notable outage examples:
Azure Nordic data‑center (Sep 29 2017): fire‑suppression system released agent, shutting down air‑handling units and causing a 7‑hour service disruption.
Amazon S3 (Feb 28 2017): accidental deletion of many servers made S3 unavailable for over 4 hours.
Google Compute Engine (Apr 13 2016): deletion of unused IP blocks triggered a network‑config consistency check, causing an 18‑minute outage.
Hangzhou Telecom cut Alibaba fiber (May 27 2015): 2‑hour outage.
Ningxia Bank core DB failure (Jul 1 2014): backup system error led to a 37‑hour service interruption.
Fault Classification
Fault domains group components that become unavailable together, such as physical machines, shared‑power racks, shared‑network racks, single‑fiber‑affected centers, or geographically co‑located centers.
Fault Evolution
Disk failure rates increase with age; research shows 1.7% failure in year 1, rising to 8.6% by year 3. AI and big‑data techniques are now applied to predict disk failures.
Database Fault Handling
Log Systems
Databases record changes via logs (redo, undo, or combined). PostgreSQL log structure is shown below.
Two log types:
Physical logs: fast replay, large size, simple implementation.
Logical logs: slower replay, smaller size, beneficial for MVCC databases.
Key principles:
WAL: log must be flushed before page flush; sync is required at transaction commit.
Recovery needs a complete log and a consistent data snapshot as a starting point.
Logs are a universal abstraction for state changes, enabling recovery, backup, and notification systems.
Log Recycling
Logs grow with data changes; recycling frees space and reduces recovery time. MVCC databases can cap log size independent of transaction commit.
Checkpoint
Checkpoints reclaim logs. Full checkpoint steps: record log position, flush all dirty pages with sync, write checkpoint record, recycle older logs (causes I/O spike). Incremental checkpoint writes pages in order of first modification, recording positions, avoiding I/O peaks.
Torn Page
If power loss occurs during page flush, a partially written page (torn page) can corrupt data, making recovery impossible without a full snapshot.
Mitigations include double‑write (InnoDB) or full‑page write (PostgreSQL), and restoring from backups.
Disk Full & Damage
When disk space runs out, transactions cannot commit; proactive monitoring and alerts are essential. Disk damage prevents log‑based recovery, requiring alternative strategies.
Backup Strategies
Common backup types:
Full backup: checkpoint then copy all data and subsequent logs.
Incremental backup: checkpoint then copy only changed pages and logs since last backup.
Log archiving: periodic log export.
Typically, full > incremental > archiving in cost. Combining them reduces RPO. Aurora offers near‑real‑time backups (<5 min).
Hot Standby & Replication
Hot standby restores faster than backup restore. Single‑machine hot standby and read‑only replicas improve continuity. Deploy replicas with anti‑affinity to fault domains.
Master‑slave cascades allow multiple replicas per master, supporting read scaling.
Transaction Commit Timing
Three commit levels affect RTO/RPO:
Log flush on primary only (RTO < 1 min, RPO > 0).
Log flush on primary and send to replica (RTO < 1 min, RPO = 0).
Log flush on primary, send and flush on replica (RTO < 1 min, RPO = 0).
In‑city replicas often use the second level; remote replicas use the first.
Shared Disk
Shared‑disk solutions rely on shared storage; replicas replay logs in memory to promote quickly after primary failure. Suitable for same‑city deployments.
Master‑Master
Multi‑master architectures (e.g., Oracle RAC, IBM PureScale, sharding middleware, FDW, MySQL Group Replication, Spanner‑like systems) improve availability because at least one node remains operational.
Paxos/Raft
Paxos and Raft are dominant consensus protocols for distributed replication.
Engineering Reliability Techniques
System Calls
Most error‑prone calls are memory allocation (malloc) and I/O; defensive programming and timeouts are essential.
Checksum
Checksums detect data corruption during write and read phases.
Disk/Connection Heartbeat
Detect stuck processes by periodically touching a file or exposing a health endpoint.
Scheduling, Queues, Priorities, Flow Control
Database performance degrades after a certain connection count; connection pools and priority queues help maintain stability.
Remote Backup
Two main HA patterns: two‑site‑three‑center and multi‑active across regions.
Two‑Site‑Three‑Center
Typical for traditional databases: two local centers and one remote. Remote center often serves as cold standby, with limited performance and reliability.
Paxos does not fit this topology because it requires three equal fault domains.
Multi‑Active
Key concerns: resource allocation, fault isolation, and cross‑region data sync performance. Google Spanner’s timestamp allocation enables distributed operation without a central node.
References
SRE: Google Operations
Google High‑Availability Architecture
The Tail at Scale
Hard Disk Drive Failure
CAP Theorem
The Log: What Every Software Engineer Should Know About Real‑Time Data’s Unifying Abstraction
WAL Internals of PostgreSQL
AWS re:Invent 2016 – Getting Started with Amazon Aurora
Step‑by‑Step Understanding of Paxos
CloudHarmony Data‑Center Monitoring
Author: Xu Wentao Link: https://www.jianshu.com/p/d77ef1f3c939 Originally published on JianShu
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
