How Aurora Redefines Cloud‑Native Relational Databases: Architecture, Benefits, and Limits

This article interprets the 2017 SIGMOD paper on Amazon Aurora, explaining its cloud‑native architecture, key advantages such as self‑healing storage and reduced network I/O, the challenges it introduces, a comparison with traditional databases and TiDB, and insights into deployment, replication, and recovery mechanisms.

dbaplus Community

Introduction

The 2017 SIGMOD paper "Amazon Aurora: Design Considerations for High‑Throughput Cloud‑Native Relational Databases" describes Aurora, an AWS‑hosted, MySQL‑compatible OLTP database that pushes redo‑log processing down into a distributed, multi‑tenant storage service. By treating the log as the database, Aurora reduces network I/O, enables fast recovery across replicas, and provides a self‑healing storage layer.

Background and Design Rationale

Aurora’s designers argue that in modern cloud environments the primary bottleneck shifts from compute and local storage I/O to network I/O between the database engine and storage. Pushing only redo logs to a fault‑tolerant storage service (spanning three Availability Zones) cuts network IOPS by an order of magnitude and eliminates the need for traditional checkpointing and crash‑recovery mechanisms.

System Architecture

Aurora follows a Shared‑Disk‑plus‑Process (SDP) model:

Process Engine (PE) : SQL parsing, optimization, transaction management, lock management, buffer cache, and access methods – essentially the MySQL query processor.

Storage Engine (SE) : A separate distributed storage service that stores only redo logs and assembles pages from those logs.

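The storage service is an independent, multi‑AZ, fault‑tolerant cluster. The volume is divided into 10 GB segments, and each segment is replicated six ways across three AZs (two copies per AZ). A write is durable once a write quorum of four of the six copies acknowledges the log record; a read quorum needs three copies. Three is not a majority of six, but it is sufficient because any three copies must overlap with any four that acknowledged a write (3 + 4 > 6).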
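The quorum sizes above can be checked with a few lines of arithmetic. The following is a minimal sketch, not Aurora code; the function names are hypothetical, and the failure scenarios (losing one AZ, or one AZ plus one more node) are illustrative assumptions based on the two‑copies‑per‑AZ layout.

```python
V = 6        # copies of each segment (two per AZ, three AZs)
V_WRITE = 4  # write quorum
V_READ = 3   # read quorum


def quorums_are_consistent(v: int, v_write: int, v_read: int) -> bool:
    """True if every read quorum overlaps every write quorum
    and any two write quorums overlap each other."""
    return v_read + v_write > v and 2 * v_write > v


def surviving_copies(failed_azs: int, extra_failed_nodes: int,
                     azs: int = 3, copies_per_az: int = 2) -> int:
    """Copies still reachable after losing whole AZs plus some extra nodes."""
    return max(0, (azs - failed_azs) * copies_per_az - extra_failed_nodes)


if __name__ == "__main__":
    print(quorums_are_consistent(V, V_WRITE, V_READ))  # True
    # One AZ down: 4 copies remain, so writes still reach quorum.
    print(surviving_copies(1, 0) >= V_WRITE)           # True
    # One AZ plus one node down: 3 copies remain, reads still work, writes do not.
    print(surviving_copies(1, 1) >= V_READ)            # True
    print(surviving_copies(1, 1) >= V_WRITE)           # False
```

With four of six for writes and three of six for reads, the volume stays writable through the loss of an entire AZ and stays readable even if one further node fails.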

Storage Node Workflow

1. Receive redo‑log records and enqueue them in memory.

2. Persist the records to local disk and acknowledge the write.

3. Organize the records and identify gaps in the log.

4. Gossip with peer nodes to fill in the missing records.

5. Coalesce log records into new versions of data pages.

6. Periodically back up pages and logs to S3 for durability.

7. Periodically garbage‑collect obsolete page versions.

8. Periodically validate page CRCs and repair corrupted blocks by fetching from peers.

Only steps 1 and 2 are required for a transaction commit; the remaining steps run asynchronously.
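The split between the foreground path and the background work can be sketched as follows. This is an illustrative model only: `LogRecord`, `StorageNode`, and the `prev_lsn` backlink used for gap detection are assumptions made for the example, and a dictionary stands in for the local disk.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LogRecord:
    lsn: int
    prev_lsn: int  # assumed backlink to the previous record for this segment (0 = none)
    payload: bytes = b""


@dataclass
class StorageNode:
    queue: deque = field(default_factory=deque)  # in-memory queue of incoming records
    disk: dict = field(default_factory=dict)     # stand-in for local disk, keyed by LSN

    def receive(self, record: LogRecord) -> None:
        """Foreground: enqueue an incoming redo record."""
        self.queue.append(record)

    def persist_and_ack(self) -> list[int]:
        """Foreground: persist queued records and return the LSNs to acknowledge."""
        acked = []
        while self.queue:
            record = self.queue.popleft()
            self.disk[record.lsn] = record
            acked.append(record.lsn)
        return acked

    def find_gaps(self) -> list[int]:
        """Background: LSNs that a stored record points back to but that are not stored."""
        missing = {r.prev_lsn for r in self.disk.values()
                   if r.prev_lsn != 0 and r.prev_lsn not in self.disk}
        return sorted(missing)


if __name__ == "__main__":
    node = StorageNode()
    node.receive(LogRecord(lsn=1, prev_lsn=0))
    node.receive(LogRecord(lsn=3, prev_lsn=2))  # record 2 never arrived
    print(node.persist_and_ack())  # [1, 3]
    print(node.find_gaps())        # [2]
```

Only `receive` and `persist_and_ack` sit on the commit path; `find_gaps` and everything after it could run later, for example by asking a peer for record 2.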

Read/Write Processing

Write Path

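When a transaction commits, the thread handling the request records the transaction's commit LSN in a list of pending commits and moves on to other work. The commit is acknowledged to the client once the Volume Durable LSN (VDL) is greater than or equal to the commit LSN; the VDL itself advances only as write quorums (four of six storage nodes) acknowledge log records. This asynchronous commit replaces the traditional synchronous commit, in which the worker thread blocks until the log write is acknowledged.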
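A minimal sketch of this bookkeeping, assuming a commit completes once the VDL reaches its commit LSN. `CommitTracker` and its method names are hypothetical and exist only for illustration.

```python
import heapq


class CommitTracker:
    """Tracks transactions waiting for their commit LSN to become durable."""

    def __init__(self) -> None:
        self.vdl = 0
        self._waiting: list[tuple[int, str]] = []  # min-heap of (commit_lsn, txn_id)

    def request_commit(self, txn_id: str, commit_lsn: int) -> None:
        """Record the commit LSN and return immediately; the caller does not block."""
        heapq.heappush(self._waiting, (commit_lsn, txn_id))

    def advance_vdl(self, new_vdl: int) -> list[str]:
        """Called when the VDL moves forward; returns transactions that can now be acknowledged."""
        self.vdl = max(self.vdl, new_vdl)
        done = []
        while self._waiting and self._waiting[0][0] <= self.vdl:
            _, txn_id = heapq.heappop(self._waiting)
            done.append(txn_id)
        return done


if __name__ == "__main__":
    tracker = CommitTracker()
    tracker.request_commit("t1", commit_lsn=10)
    tracker.request_commit("t2", commit_lsn=20)
    print(tracker.advance_vdl(9))   # []
    print(tracker.advance_vdl(10))  # ['t1']
    print(tracker.advance_vdl(25))  # ['t2']
```

The heap keeps the smallest pending commit LSN at the front, so each VDL advance only touches the transactions it actually releases.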

Read Path

Reads first check the buffer cache; on a miss, the request goes to the storage service. A cached page can be evicted only if its page LSN ≤ VDL, which guarantees that every change in the page is already durable in the log before eviction. Under normal operation a read does not need a quorum: the engine establishes a read point (the VDL at the time of the request) and reads from a single storage node that is complete up to that point. Quorum reads are needed only during fail‑over or crash recovery.
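The cache‑miss and eviction rules can be sketched like this. `BufferCache` and the `fetch` callback are hypothetical names; the callback stands in for a request to a single storage node at a given read point.

```python
from typing import Callable, Dict, Tuple

Page = Tuple[int, bytes]  # (page_lsn, data)


class BufferCache:
    def __init__(self, fetch: Callable[[int, int], Page]) -> None:
        # fetch(page_id, read_point) is a hypothetical storage call
        self._fetch = fetch
        self._pages: Dict[int, Page] = {}

    def read(self, page_id: int, vdl: int) -> bytes:
        """Serve from cache; on a miss, fetch the page as of the current VDL."""
        if page_id not in self._pages:
            self._pages[page_id] = self._fetch(page_id, vdl)
        return self._pages[page_id][1]

    def write(self, page_id: int, page_lsn: int, data: bytes) -> None:
        """Install a modified page together with the LSN of its latest change."""
        self._pages[page_id] = (page_lsn, data)

    def try_evict(self, page_id: int, vdl: int) -> bool:
        """Evict only if every change in the page is already durable (page LSN <= VDL)."""
        page_lsn, _ = self._pages[page_id]
        if page_lsn > vdl:
            return False
        del self._pages[page_id]
        return True


if __name__ == "__main__":
    cache = BufferCache(fetch=lambda page_id, read_point: (read_point, b"from-storage"))
    print(cache.read(1, vdl=5))       # b'from-storage'
    cache.write(2, page_lsn=9, data=b"dirty")
    print(cache.try_evict(2, vdl=5))  # False: change at LSN 9 is not yet durable
    print(cache.try_evict(2, vdl=9))  # True
```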

Replication and Scaling

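Aurora can attach up to 15 read replicas to the same shared storage volume as the single writer instance, so adding a replica consumes no extra storage and no extra disk writes. The replicas also serve as fail‑over targets. The writer sends its redo log stream to the replicas as well as to the storage nodes; a replica applies a log record only when the affected page is in its buffer cache and discards it otherwise, which keeps replica lag low. Amazon RDS provides the control plane: its Host Manager agent on each instance monitors health and triggers automatic fail‑over.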
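How a replica might consume the writer's log stream can be sketched as follows. The sketch assumes two rules described in the paper: records for pages that are not cached are discarded, and a record is applied only once its LSN is at or below the VDL. `Redo` and `ReadReplica` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Redo:
    lsn: int
    page_id: int
    value: str


class ReadReplica:
    def __init__(self) -> None:
        self.cache: dict[int, tuple[int, str]] = {}  # page_id -> (page_lsn, value)
        self._held: list[Redo] = []                  # records whose LSN is still above the VDL

    def on_log(self, record: Redo, vdl: int) -> None:
        """Receive one record from the writer's stream, then apply whatever is durable."""
        self._held.append(record)
        self.on_vdl(vdl)

    def on_vdl(self, vdl: int) -> None:
        """Apply held records with LSN <= VDL, in LSN order, to cached pages only."""
        ready = sorted((r for r in self._held if r.lsn <= vdl), key=lambda r: r.lsn)
        self._held = [r for r in self._held if r.lsn > vdl]
        for record in ready:
            cached = self.cache.get(record.page_id)
            if cached is not None and record.lsn > cached[0]:
                self.cache[record.page_id] = (record.lsn, record.value)
            # Records for uncached pages are dropped; storage can serve them later.


if __name__ == "__main__":
    replica = ReadReplica()
    replica.cache[1] = (0, "a")
    replica.on_log(Redo(lsn=5, page_id=1, value="b"), vdl=4)
    print(replica.cache[1])    # (0, 'a'): LSN 5 is not durable yet
    replica.on_vdl(5)
    print(replica.cache[1])    # (5, 'b')
    replica.on_log(Redo(lsn=6, page_id=2, value="c"), vdl=6)
    print(2 in replica.cache)  # False: page 2 was never cached
```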

Key Terminology

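LSN (Log Sequence Number): Monotonically increasing identifier assigned to each log record.

SCL (Segment Complete LSN): Per segment, the highest LSN below which that segment has received every log record.

VCL (Volume Complete LSN): Highest LSN for which the storage service can guarantee that all earlier log records are available.

CPL (Consistency Point LSN): An LSN that the database tags as a safe truncation point, typically the last record of a mini‑transaction.

VDL (Volume Durable LSN): Highest CPL that is less than or equal to the VCL, that is, the point up to which the volume is both durable and consistent.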
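These markers relate to each other in a simple way, sketched below under two assumptions taken from the paper's definitions: the VDL is the highest CPL at or below the VCL, and a segment can find its SCL by following per‑record backlinks until the chain breaks. Both function names are hypothetical.

```python
def segment_complete_lsn(backlinks: dict[int, int], start: int = 0) -> int:
    """backlinks maps each received LSN to the LSN of the previous record
    for the same segment. Walk forward from `start` while the chain is unbroken."""
    next_of = {prev: lsn for lsn, prev in backlinks.items()}
    scl = start
    while scl in next_of:
        scl = next_of[scl]
    return scl


def volume_durable_lsn(vcl: int, cpls: list[int]) -> int:
    """Highest consistency point that does not exceed the VCL (0 if there is none)."""
    return max((cpl for cpl in cpls if cpl <= vcl), default=0)


if __name__ == "__main__":
    # Records 10 and 20 arrived; 30 is missing, so 40 cannot be counted yet.
    print(segment_complete_lsn({10: 0, 20: 10, 40: 30}))  # 20
    # Log is complete up to 1007, but the CPLs are 900, 1000 and 1100.
    print(volume_durable_lsn(1007, [900, 1000, 1100]))    # 1000
```

The second call mirrors the paper's own example: the log is complete through LSN 1007, yet the volume is only durable and consistent through 1000.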

Recovery

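During crash recovery, Aurora uses read quorums to re‑establish the VCL, derives the VDL (the highest CPL at or below the VCL), and truncates every log record with an LSN above the VDL. No separate redo replay pass is needed at startup, because the storage nodes apply redo records continuously and in parallel across segments. Undo of transactions that were in flight at the crash is still required, but it can run after the database is back online, which keeps downtime short.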
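The truncation step can be illustrated with a toy log. This sketch assumes that truncation happens at the VDL (the highest CPL at or below the VCL) and that any transaction without a surviving commit record must be undone; `Rec` and `recover` are hypothetical names, and the record layout is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rec:
    lsn: int
    txn: str
    is_cpl: bool = False     # tagged by the database as a consistency point
    is_commit: bool = False  # commit record for the transaction


def recover(log: list[Rec], vcl: int) -> tuple[list[Rec], set[str]]:
    """Return the records that survive truncation and the transactions needing undo."""
    vdl = max((r.lsn for r in log if r.is_cpl and r.lsn <= vcl), default=0)
    kept = [r for r in log if r.lsn <= vdl]
    committed = {r.txn for r in kept if r.is_commit}
    in_flight = {r.txn for r in kept} - committed
    return kept, in_flight


if __name__ == "__main__":
    log = [
        Rec(1, "a", is_cpl=True),
        Rec(2, "a", is_cpl=True, is_commit=True),
        Rec(3, "b", is_cpl=True),
        Rec(4, "b"),  # complete in storage, but not a consistency point
    ]
    kept, undo = recover(log, vcl=4)
    print([r.lsn for r in kept])  # [1, 2, 3]
    print(undo)                   # {'b'}
```

Record 4 is dropped even though storage holds it, because it lies above the last consistency point; transaction "b" is then a candidate for undo once the database is serving traffic again.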

Benchmark (as reported in the SIGMOD paper)

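The paper reports SysBench throughput of up to roughly five times that of MySQL 5.6 and 5.7 on the same instance type. In a 30‑minute write‑only test against MySQL with mirrored storage, Aurora completed 35 times as many transactions while issuing 7.7 times fewer I/Os per transaction. No independent verification is provided in this summary.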

Comparison with TiDB (and other NewSQL systems)

Both Aurora and TiDB belong to the NewSQL family but follow different architectural directions:

Aurora : Shared‑storage design, MySQL‑compatible, single writer (all writes go through one instance), up to 15 read replicas, a 64 TB volume limit as described in the paper, low read/write latency, tightly coupled to AWS services (low portability).

TiDB : Shared‑nothing architecture built on Raft‑based TiKV (LSM‑tree), true horizontal write scalability, unlimited data size, open‑source, high portability across clouds and on‑premises, but higher write latency under contention due to distributed transaction coordination.

Conclusion

By decoupling compute from storage and delegating logging, recovery, and replication to a distributed storage service, Aurora achieves markedly higher write throughput for OLTP workloads while preserving MySQL compatibility. However, its single‑writer design means writes cannot scale out across instances. TiDB offers genuine scale‑out, multi‑region deployment, and open‑source flexibility at the cost of higher write latency in contention‑heavy scenarios. Both illustrate the NewSQL trend of moving expensive database operations into specialized distributed services.

References

https://amazonaws-china.com/cn/rds/Aurora/

http://www.allthingsdistributed.com/files/p1041-verbitski.pdf

https://www.slideshare.net/AmazonWebServices/aws-reinvent-2016-deep-dive-on-amazon-aurora-dat303

http://mp.weixin.qq.com/s/COa2Q8iwhFUSbdB5LrJ9sA

http://www.tuicool.com/articles/2uMR3y7
