How Relational Databases Ensure Durability: Inside Pages, Undo & Redo Logs
This article explains the internal mechanisms of relational databases, covering data pages, buffer pools, undo and redo logs, checkpointing, and how these components work together to provide atomicity, durability, and crash recovery while minimizing disk I/O.
1. Introduction
When delivering high‑performance Java persistence training, I realized it was necessary to explain how relational databases work, otherwise many transaction concepts such as atomicity, durability, and checkpoints are hard to grasp.
This article provides a high‑level explanation of the internal workings of relational databases and hints at some implementation details.
2. A Picture Is Worth a Thousand Words
3. Data Pages
Disk access is slow, while memory is several orders of magnitude faster. For this reason, database vendors try to delay disk access as much as possible. Whether dealing with tables or indexes, data is divided into fixed‑size pages (e.g., 8 KB).
When data needs to be read, the relational database maps the disk‑based page into a memory buffer. When data is modified, the database changes the in‑memory pages. To synchronize memory pages with disk, a flush operation such as
fsyncis required.
The buffer pool that stores disk‑based pages is limited, so usually only a working set of data can be kept in memory. If the total data exceeds the buffer pool size, old pages must be evicted to make room for new ones.
4. Undo Log
Because changes in memory can be accessed by multiple concurrent transactions, concurrency control mechanisms (e.g., 2PL and MVCC) are needed to ensure data integrity. Thus, once a transaction modifies a row, the uncommitted changes are applied to the in‑memory structures while the previous data is temporarily stored in an append‑only undo log.
Although this structure is called an undo log in Oracle and MySQL, SQL Server uses the transaction log for the same purpose. PostgreSQL has no undo log but achieves the same effect with a multiversion table structure, allowing multiple versions of a row. All these structures provide rollback capability, a mandatory requirement for atomicity.
If a running transaction rolls back, the undo log is used to reconstruct the memory pages as they were at the start of the transaction.
5. Redo Log
Once a transaction commits, the changes in memory must become permanent. However, this does not mean every commit triggers an
fsync, which would severely degrade application performance. ACID guarantees require durability, meaning committed changes must survive even if the database engine is abruptly stopped.
How does a relational database provide durability without issuing an
fsyncon each commit? This is where the redo log comes into play.
The redo log is also an append‑only, disk‑based structure that records every change a transaction makes. When a transaction commits, each modified data page is also written to the redo log. Writing to the redo log is fast because sequential disk access is much quicker than random access, allowing rapid transaction processing.
In Oracle and MySQL this structure is called a redo log ; in SQL Server the transaction log serves the same role. PostgreSQL calls it the write‑ahead log (WAL).
When are the in‑memory changes flushed to disk? Relational database systems use checkpoints to synchronize dirty memory pages with their disk counterparts. To avoid I/O congestion, synchronization is performed in batches over longer intervals.
If the database crashes before all dirty pages are flushed, what happens? Upon restart, the database uses the redo log to reconstruct the disk‑based data pages that were not synchronized since the last successful checkpoint.
6. Conclusion
These design considerations overcome the high latency of disk‑based storage while still providing durable storage guarantees. An undo log supplies atomicity (rollback capability), and a redo log ensures durability of disk‑based pages (tables and indexes).
7. Translator’s Note
Hello everyone, I am Chun Ge from DreamTech (mica open‑source author). Translation is not easy; please share this with more colleagues, thank you!
Java Architecture Diary
Committed to sharing original, high‑quality technical articles; no fluff or promotional content.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.