Why Distributed Databases Matter: From Early DBMS to Modern NewSQL
This article traces the evolution of database systems—from the first network and hierarchical models through relational databases, NoSQL sharding, and finally modern distributed SQL—explaining why distributed databases emerged, how they handle data distribution, consistency, SQL-to‑KV mapping, and transaction challenges.
Why Distributed Databases?
All storage systems eventually become database systems, as noted by Turing‑award winner Jim Gray. Over the past decades, distributed databases have surged in popularity because traditional single‑node databases cannot meet the scalability, availability, and performance demands of modern internet services.
Evolution of Database Systems
1960s – First DBMS : Charles Bachman created the Integrated Data Store (IDS), built on the network model, and IBM introduced IMS, built on the hierarchical model. These early systems had notable limitations:
No tables and no SQL.
Data was accessed by following raw pointers.
Logical and physical layers were not separated, so schema changes required reloading all data.
1970s – Relational Model : Edgar F. Codd proposed the relational model to relieve developers of low‑level data handling. IBM’s System R implemented the first SQL engine and transaction support, making queries independent of storage and enabling optimizations behind the scenes. Oracle’s first commercial release in 1979 sparked commercial adoption.
Rise of NoSQL and Sharding (2000s)
With the explosion of internet traffic, companies needed services that stayed available around the clock, so they replicated and sharded data across many machines. Sharding splits a table by a key (e.g., year) and places each shard on a different node, often behind middleware that rewrites SQL to target the correct shard.
Typical drawbacks of sharding include lack of cross‑shard transactions and complex re‑sharding operations.
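The routing step that such middleware performs can be sketched as follows. This is a minimal illustration in Python, not the behavior of any particular product; the shard map, the node names, and the orders_<year> table naming are all hypothetical.

```python
from datetime import date

# Hypothetical shard map: each year of orders lives on a different node.
SHARDS = {2022: "db-node-1", 2023: "db-node-2", 2024: "db-node-3"}


def route(order_date: date) -> str:
    """Return the node that holds the shard for this order's year."""
    year = order_date.year
    if year not in SHARDS:
        raise KeyError(f"no shard configured for year {year}")
    return SHARDS[year]


def rewrite(sql_template: str, order_date: date) -> tuple:
    """Pick the target node and substitute the per-shard table name."""
    node = route(order_date)
    table = f"orders_{order_date.year}"
    return node, sql_template.format(table=table)


node, sql = rewrite("SELECT * FROM {table} WHERE id = 42", date(2023, 5, 1))
print(node, sql)  # db-node-2 SELECT * FROM orders_2023 WHERE id = 42
```

A query that spans several years would have to be sent to several nodes and merged by the middleware, which is where the lack of cross‑shard transactions starts to hurt.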
Google, among others, built massive sharded databases to achieve extreme scalability, often abandoning relational guarantees in favor of eventual consistency.
Modern Distributed SQL (NewSQL)
Distributed databases aim to combine the relational model’s rich features (SQL, ACID) with NoSQL‑style scalability. Projects such as Google Spanner, CockroachDB, TiDB, and OceanBase exemplify this “NewSQL” or “Distributed SQL” approach.
The Spanner paper argues that it is better for application programmers to deal with performance problems caused by overusing transactions, as bottlenecks arise, than to always code around the lack of transactions.
Implementation Details
Data Distribution
Data is typically partitioned (sharded) using either hash‑based or range‑based methods.
Hash sharding : A key is hashed; the hash value determines the node. Fast lookup but unsuitable for range queries.
Range sharding : Keys are divided into contiguous intervals (e.g., A‑Z). Supports efficient range scans but requires an extra lookup to locate the correct node.
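The difference between the two methods can be sketched in a few lines of Python. The node names and split points are made up for illustration, and real systems keep the range map in a metadata service rather than a module‑level list.

```python
import bisect
import hashlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical cluster


def hash_shard(key: str) -> str:
    """Hash sharding: a stable hash of the key picks the node directly."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]


# Range sharding: sorted split points define three contiguous ranges:
# keys < "h", keys from "h" up to (not including) "q", and keys >= "q".
SPLITS = ["h", "q"]


def range_shard(key: str) -> str:
    """Range sharding: an extra lookup finds the interval holding the key."""
    return NODES[bisect.bisect_right(SPLITS, key)]


def nodes_for_scan(start: str, end: str) -> list:
    """A scan over [start, end] touches only the ranges it overlaps."""
    first = bisect.bisect_right(SPLITS, start)
    last = bisect.bisect_right(SPLITS, end)
    return NODES[first:last + 1]


print(range_shard("apple"))      # node-0
print(nodes_for_scan("a", "m"))  # ['node-0', 'node-1']
```

With hash sharding, neighbouring keys land on unrelated nodes, so the same scan would have to ask every node.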
When a shard grows beyond a configured threshold (commonly 64 MB–128 MB), it is split into two smaller ranges to balance load.
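A threshold‑triggered split might look like the sketch below. It is a simplification under stated assumptions: the Range class is hypothetical, size is approximated as key length plus value length, and the split point is the median key rather than the byte‑size midpoint a real system might choose.

```python
from dataclasses import dataclass, field

SPLIT_THRESHOLD = 64 * 1024 * 1024  # bytes; assumed lower end of the 64 MB-128 MB band


@dataclass
class Range:
    start: str  # inclusive lower bound
    end: str    # exclusive upper bound
    rows: dict = field(default_factory=dict)  # key -> value bytes

    def size(self) -> int:
        """Approximate stored size of this range in bytes."""
        return sum(len(k) + len(v) for k, v in self.rows.items())


def maybe_split(r: Range, threshold: int = SPLIT_THRESHOLD) -> list:
    """Return [r] unchanged, or two ranges split at the median key."""
    if r.size() <= threshold or len(r.rows) < 2:
        return [r]
    keys = sorted(r.rows)
    mid = keys[len(keys) // 2]  # first key of the right-hand range
    left = Range(r.start, mid, {k: r.rows[k] for k in keys if k < mid})
    right = Range(mid, r.end, {k: r.rows[k] for k in keys if k >= mid})
    return [left, right]
```

After a split, either half can be moved to a less loaded node, which is how the cluster rebalances without manual re‑sharding.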
Data Consistency
Distributed systems replicate data across multiple nodes to survive failures. Common replication methods include primary‑secondary synchronization and consensus algorithms such as Paxos, Raft, and Zab. Raft, designed as a more understandable alternative to Paxos, is widely adopted for exactly that reason.
Raft achieves consistency by requiring a majority of nodes to acknowledge a write before it is considered committed. If the leader fails, the remaining nodes elect a new one. Because progress requires a majority, a cluster tolerates the failure of fewer than half of its nodes (for example, 1 of 3 or 2 of 5).
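The majority rule reduces to simple arithmetic, shown here as a small Python sketch (the function names are illustrative and not taken from any Raft library). Note that requiring a majority means strictly fewer than half of the nodes may be down.

```python
def quorum(n: int) -> int:
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def max_failures(n: int) -> int:
    """Nodes that can fail while a majority is still reachable."""
    return n - quorum(n)


def is_committed(acks: int, n: int) -> bool:
    """A write commits once a majority (leader included) has acknowledged it."""
    return acks >= quorum(n)


for n in (3, 4, 5):
    print(n, quorum(n), max_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
```

A four‑node cluster tolerates no more failures than a three‑node one, which is why replica counts are usually odd.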
SQL‑to‑KV Mapping
To store relational tables on a key‑value engine, each row and column is encoded as a hierarchical key, e.g., /table/index/key/column → value. Primary keys become part of the key path, ensuring uniqueness. Non‑unique indexes are stored similarly but may contain empty values.
Different databases adopt slightly different encodings; CockroachDB’s “Structured data encoding” and TiDB’s region‑based mapping are representative examples.
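As a rough illustration of the /table/index/key/column → value idea, the Python sketch below flattens one row into key‑value pairs. The string layout, the "primary" index name, and the choice to append the primary key to non‑unique index keys are assumptions made for readability; production systems use compact binary encodings.

```python
def encode_row(table: str, pk: int, row: dict) -> dict:
    """Primary index: /<table>/primary/<pk>/<column> -> column value."""
    return {f"/{table}/primary/{pk}/{col}": val for col, val in row.items()}


def encode_index_entry(table: str, index: str, indexed_value: str,
                       pk: int, unique: bool) -> tuple:
    """Secondary index entry as a (key, value) pair.

    Unique index: the key is the indexed value and the value points at the row.
    Non-unique index: the primary key is appended so each key stays distinct,
    and the value can be left empty.
    """
    if unique:
        return f"/{table}/{index}/{indexed_value}", str(pk)
    return f"/{table}/{index}/{indexed_value}/{pk}", ""


kv = encode_row("users", 1, {"name": "Ann", "city": "Oslo"})
print(kv)
# {'/users/primary/1/name': 'Ann', '/users/primary/1/city': 'Oslo'}
print(encode_index_entry("users", "idx_city", "Oslo", 1, unique=False))
# ('/users/idx_city/Oslo/1', '')
```

Because keys sharing a prefix sort next to each other, reading a whole row or scanning an index becomes a contiguous range scan in the KV store.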
Distributed Transactions
Distributed transactions must preserve ACID properties. Atomicity is typically achieved with two‑phase commit (2PC); isolation relies on two‑phase locking or multi‑version concurrency control (MVCC). Most NewSQL systems implement MVCC, which needs a way to order transactions by timestamp: Spanner assigns timestamps using its TrueTime API, while TiDB builds on Google’s Percolator design.
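The control flow of two‑phase commit can be sketched in Python as below. This is a toy in‑memory model with hypothetical class and function names; it omits the durable logging, timeouts, and coordinator recovery that a real implementation needs.

```python
class Participant:
    """A node holding part of the transaction's writes (toy model)."""

    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> bool:
        """Phase 1: vote yes only if the local writes can be made durable."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: list) -> bool:
    """Commit everywhere only if every participant votes yes; else abort all."""
    votes = [p.prepare() for p in participants]  # phase 1: collect votes
    if all(votes):
        for p in participants:                   # phase 2: commit
            p.commit()
        return True
    for p in participants:                       # phase 2: abort
        p.abort()
    return False


ok = two_phase_commit([Participant("a"), Participant("b")])
print(ok)  # True
```

A single "no" vote aborts the transaction on every participant, which is what makes the outcome all‑or‑nothing.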
Full discussion of distributed transaction protocols is beyond this article’s scope.
Conclusion
Distributed databases combine sharding, consensus‑based replication, KV‑based storage, and sophisticated transaction mechanisms to provide both scalability and rich relational features. Open‑source projects such as CockroachDB, TiDB, and OceanBase serve as excellent learning resources for engineers exploring this field.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
