Databases 22 min read

Why Do Database Failures Happen and How to Prevent Them?

This article examines common hardware and network failures in data centers, analyzes real‑world outage cases, classifies fault domains, and presents comprehensive strategies for database fault handling—including logging, checkpointing, backup, replication, and high‑availability architectures—to improve reliability and reduce downtime.

Efficient Ops
Efficient Ops
Efficient Ops
Why Do Database Failures Happen and How to Prevent Them?

Fault Overview

ACID defines four transaction properties; durability (the D) ensures data is not lost despite failures. As distributed databases grow, complexity and failure scenarios increase.

Common Hardware Failures

Typical data‑center failure probabilities are illustrated below.

Designs, Lessons and Advice from Building Large Distributed Systems, Jeff Dean

Network Failures

Beyond hardware, distributed systems must consider additional network faults such as brain split (partition) and multi‑network‑segment failures.

Fragile Data Centers

Monitoring data from CloudHarmony (Nov 22 2017) shows many data centers, including Azure, fall short of the advertised 99.95% availability.

Notable outage examples:

Azure Nordic data‑center (Sep 29 2017): fire‑suppression system released agent, shutting down air‑handling units and causing a 7‑hour service disruption.

Amazon S3 (Feb 28 2017): accidental deletion of many servers made S3 unavailable for over 4 hours.

Google Compute Engine (Apr 13 2016): deletion of unused IP blocks triggered a network‑config consistency check, causing an 18‑minute outage.

Hangzhou Telecom cut Alibaba fiber (May 27 2015): 2‑hour outage.

Ningxia Bank core DB failure (Jul 1 2014): backup system error led to a 37‑hour service interruption.

Fault Classification

Fault domains group components that become unavailable together, such as physical machines, shared‑power racks, shared‑network racks, single‑fiber‑affected centers, or geographically co‑located centers.

Fault Evolution

Disk failure rates increase with age; research shows 1.7% failure in year 1, rising to 8.6% by year 3. AI and big‑data techniques are now applied to predict disk failures.

Database Fault Handling

Log Systems

Databases record changes via logs (redo, undo, or combined). PostgreSQL log structure is shown below.

Two log types:

Physical logs: fast replay, large size, simple implementation.

Logical logs: slower replay, smaller size, beneficial for MVCC databases.

Key principles:

WAL: log must be flushed before page flush; sync is required at transaction commit.

Recovery needs a complete log and a consistent data snapshot as a starting point.

Logs are a universal abstraction for state changes, enabling recovery, backup, and notification systems.

Log Recycling

Logs grow with data changes; recycling frees space and reduces recovery time. MVCC databases can cap log size independent of transaction commit.

Checkpoint

Checkpoints reclaim logs. Full checkpoint steps: record log position, flush all dirty pages with sync, write checkpoint record, recycle older logs (causes I/O spike). Incremental checkpoint writes pages in order of first modification, recording positions, avoiding I/O peaks.

Torn Page

If power loss occurs during page flush, a partially written page (torn page) can corrupt data, making recovery impossible without a full snapshot.

Mitigations include double‑write (InnoDB) or full‑page write (PostgreSQL), and restoring from backups.

Disk Full & Damage

When disk space runs out, transactions cannot commit; proactive monitoring and alerts are essential. Disk damage prevents log‑based recovery, requiring alternative strategies.

Backup Strategies

Common backup types:

Full backup: checkpoint then copy all data and subsequent logs.

Incremental backup: checkpoint then copy only changed pages and logs since last backup.

Log archiving: periodic log export.

Typically, full > incremental > archiving in cost. Combining them reduces RPO. Aurora offers near‑real‑time backups (<5 min).

Hot Standby & Replication

Hot standby restores faster than backup restore. Single‑machine hot standby and read‑only replicas improve continuity. Deploy replicas with anti‑affinity to fault domains.

Master‑slave cascades allow multiple replicas per master, supporting read scaling.

Transaction Commit Timing

Three commit levels affect RTO/RPO:

Log flush on primary only (RTO < 1 min, RPO > 0).

Log flush on primary and send to replica (RTO < 1 min, RPO = 0).

Log flush on primary, send and flush on replica (RTO < 1 min, RPO = 0).

In‑city replicas often use the second level; remote replicas use the first.

Shared Disk

Shared‑disk solutions rely on shared storage; replicas replay logs in memory to promote quickly after primary failure. Suitable for same‑city deployments.

Master‑Master

Multi‑master architectures (e.g., Oracle RAC, IBM PureScale, sharding middleware, FDW, MySQL Group Replication, Spanner‑like systems) improve availability because at least one node remains operational.

Paxos/Raft

Paxos and Raft are dominant consensus protocols for distributed replication.

Engineering Reliability Techniques

System Calls

Most error‑prone calls are memory allocation (malloc) and I/O; defensive programming and timeouts are essential.

Checksum

Checksums detect data corruption during write and read phases.

Disk/Connection Heartbeat

Detect stuck processes by periodically touching a file or exposing a health endpoint.

Scheduling, Queues, Priorities, Flow Control

Database performance degrades after a certain connection count; connection pools and priority queues help maintain stability.

Remote Backup

Two main HA patterns: two‑site‑three‑center and multi‑active across regions.

Two‑Site‑Three‑Center

Typical for traditional databases: two local centers and one remote. Remote center often serves as cold standby, with limited performance and reliability.

Paxos does not fit this topology because it requires three equal fault domains.

Multi‑Active

Key concerns: resource allocation, fault isolation, and cross‑region data sync performance. Google Spanner’s timestamp allocation enables distributed operation without a central node.

References

SRE: Google Operations

Google High‑Availability Architecture

The Tail at Scale

Hard Disk Drive Failure

CAP Theorem

The Log: What Every Software Engineer Should Know About Real‑Time Data’s Unifying Abstraction

WAL Internals of PostgreSQL

AWS re:Invent 2016 – Getting Started with Amazon Aurora

Step‑by‑Step Understanding of Paxos

CloudHarmony Data‑Center Monitoring

Author: Xu Wentao Link: https://www.jianshu.com/p/d77ef1f3c939 Originally published on JianShu
Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed Systemsdatabaseloggingfault toleranceBackup
Efficient Ops
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.