Inside YugabyteDB: Architecture, Tablet Storage, and Distributed Transactions Explained
This article provides a comprehensive technical overview of YugabyteDB, covering its two‑layer logical architecture, tablet‑based distributed storage with Raft groups, RocksDB‑backed local storage design, hybrid hash‑range partitioning, and the MVCC‑based two‑phase‑commit transaction model using Hybrid Logical Clocks.
System Architecture
YugabyteDB follows a logical two‑layer design consisting of a query layer and a storage layer, both running inside the TServer process. The query layer exposes two APIs: SQL (a PostgreSQL‑compatible dialect) and CQL (Cassandra‑compatible). The storage layer is where the core functionality resides.
Tablet‑Based Distributed Storage
Data is split into tablets, the smallest unit of distribution, similar to HBase or Spanner. Each tablet forms a Raft group with multiple replicas spread across three nodes to ensure high availability. The Master node manages metadata such as tablet locations and schema information, also using Raft for its own fault tolerance.
YugabyteDB supports flexible partitioning schemes: hash‑only, range‑only, or a combination of hash‑then‑range, a design influenced by Cassandra. Hash partitioning maps keys to a 2‑byte space (0x0000 to 0xFFFF), which is then divided into contiguous ranges, one per tablet; this allows up to 64K tablets. Hash partitioning avoids write hotspots for append‑heavy workloads, but because adjacent keys hash to unrelated positions, even a small range scan (e.g., pk BETWEEN 1 AND 10) may have to visit many tablets.
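The mapping from key to tablet described above can be sketched as follows. This is a minimal illustration, not YugabyteDB's implementation: MD5 stands in for the real hash function, the even split of the hash space is an assumption, and the names hash16, split_points, and tablet_for are hypothetical.

```python
import bisect
import hashlib

HASH_SPACE = 0x10000  # 16-bit hash space: 0x0000 to 0xFFFF


def hash16(key: bytes) -> int:
    """Map a key to a 16-bit value. MD5 is a stand-in, not YugabyteDB's hash."""
    return int.from_bytes(hashlib.md5(key).digest()[:2], "big")


def split_points(num_tablets: int) -> list[int]:
    """Start of each tablet's hash range when the space is divided evenly."""
    return [i * HASH_SPACE // num_tablets for i in range(num_tablets)]


def tablet_for(key: bytes, starts: list[int]) -> int:
    """Index of the tablet whose hash range contains the key's hash."""
    return bisect.bisect_right(starts, hash16(key)) - 1


starts = split_points(4)  # ranges begin at 0x0000, 0x4000, 0x8000, 0xC000

# Ten adjacent primary keys: count how many distinct tablets a scan must visit.
touched = {tablet_for(str(pk).encode(), starts) for pk in range(1, 11)}
print(f"pk 1..10 touch {len(touched)} of {len(starts)} tablets")
```

Running the last two lines with different tablet counts shows how consecutive primary keys scatter across tablets, which is the cost of hash partitioning for range scans.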
RocksDB‑Backed Local Storage (DocDB)
Each TServer hosts a local DocDB built on RocksDB. Tuples and documents are encoded as key‑value pairs. The key consists of a 16‑bit hash (for hash partitioning), primary‑key columns, a column ID (to represent individual columns), and a hybrid timestamp used for MVCC. The value stores the column's actual data.
Key components: 16‑bit hash, primary‑key data, column ID, hybrid timestamp
Value component: column value
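To make the key layout concrete, here is a simplified sketch of an MVCC read over such keys. The byte framing is hypothetical and is not DocDB's actual encoding: the length prefix on the primary key and the fixed-width column ID are assumptions, as is storing the timestamp inverted so that newer versions of a column sort first. The names encode_key and read_at are illustrative only.

```python
import struct

MAX_U64 = 0xFFFFFFFFFFFFFFFF


def encode_key(hash16: int, pk: bytes, column_id: int, hybrid_ts: int) -> bytes:
    """Concatenate hash, primary key, column ID, and inverted timestamp."""
    return (
        struct.pack(">H", hash16)             # 16-bit hash
        + struct.pack(">H", len(pk)) + pk     # primary key (hypothetical length prefix)
        + struct.pack(">I", column_id)        # column ID
        + struct.pack(">Q", MAX_U64 - hybrid_ts)  # inverted: newest version sorts first
    )


def read_at(store: dict, hash16: int, pk: bytes, column_id: int, read_ts: int):
    """Return the newest value of a column written at or before read_ts."""
    prefix = encode_key(hash16, pk, column_id, 0)[:-8]  # key without the timestamp
    for key in sorted(store):                 # a real engine would seek, not sort
        if key.startswith(prefix):
            ts = MAX_U64 - struct.unpack(">Q", key[-8:])[0]
            if ts <= read_ts:
                return store[key]
    return None


# Two versions of the same column, written at hybrid times 100 and 200.
store = {
    encode_key(0x1A2B, b"user1", 1, 100): b"alice",
    encode_key(0x1A2B, b"user1", 1, 200): b"alicia",
}
print(read_at(store, 0x1A2B, b"user1", 1, 150))  # snapshot between the two writes
```

Because every version is a separate key, an update never overwrites older data; a reader simply picks the first version at or below its snapshot timestamp.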
Distributed Transactions: 2PC & MVCC
Timestamp
YugabyteDB uses a Hybrid Logical Clock (HLC) for transaction timestamps, combining a physical component (UNIX time) with a logical Lamport counter. Within the same millisecond the physical part stays constant while the logical part increments on each RPC, providing a partial order of events.
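The update rules of a hybrid logical clock can be sketched as below. This follows the general HLC algorithm rather than YugabyteDB's source code; the class names HybridTime and HybridClock, the millisecond granularity (taken from the description above), and the injectable now_fn parameter are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class HybridTime:
    physical: int  # wall-clock component, in milliseconds here
    logical: int   # Lamport-style counter that breaks ties


class HybridClock:
    def __init__(self, now_fn=None):
        # now_fn is injectable so the clock can be driven deterministically.
        self._now = now_fn or (lambda: time.time_ns() // 1_000_000)
        self._last = HybridTime(0, 0)

    def now(self) -> HybridTime:
        """Timestamp for a local event or an outgoing message."""
        wall = self._now()
        if wall > self._last.physical:
            self._last = HybridTime(wall, 0)
        else:
            # Same millisecond (or wall clock went backwards): bump the counter.
            self._last = HybridTime(self._last.physical, self._last.logical + 1)
        return self._last

    def update(self, remote: HybridTime) -> HybridTime:
        """Merge a timestamp received from another node."""
        physical = max(self._now(), self._last.physical, remote.physical)
        if physical == self._last.physical == remote.physical:
            logical = max(self._last.logical, remote.logical) + 1
        elif physical == self._last.physical:
            logical = self._last.logical + 1
        elif physical == remote.physical:
            logical = remote.logical + 1
        else:
            logical = 0  # local wall clock is strictly ahead of both
        self._last = HybridTime(physical, logical)
        return self._last


clock = HybridClock()
first = clock.now()
second = clock.update(HybridTime(first.physical + 5, 3))  # message from a faster node
print(first < second)
```

The key property is that every timestamp a node issues is greater than any timestamp it has already issued or received, even when its wall clock lags behind a peer's.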
HLC offers external consistency similar to Google’s TrueTime but without requiring a dedicated time‑serving node. An alternative is a centralized Timestamp Oracle (TSO) as used by TiDB, which simplifies timestamp acquisition but routes every timestamp request through one service, which can become a bottleneck or a single point of failure.
Transaction Commit
Transactions are implemented with two‑phase commit (2PC) and MVCC. Before a transaction commits, YugabyteDB writes its changes to DocDB as provisional records, which fall into three categories:
Primary provisional records – uncommitted data with a transaction ID, acting as a lock.
Transaction metadata – stores the tablet ID where the transaction state resides.
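Reverse index – maps a transaction ID to the keys of its primary provisional records, so that all records belonging to a transaction can be found during cleanup and recovery.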
The transaction state is kept in a separate tablet with three possible statuses: Pending, Committed, or Aborted. Transition to the Committed state marks the commit point, guaranteeing atomicity.
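The commit flow above can be modeled in a few lines. This is a toy, single-process sketch under stated assumptions, not YugabyteDB's protocol: the class MiniTxnStore and its methods are hypothetical, conflicts simply raise an error instead of going through real conflict resolution, and transaction metadata and timestamps are omitted.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    COMMITTED = "committed"
    ABORTED = "aborted"


class MiniTxnStore:
    def __init__(self):
        self.status_tablet = {}   # txn_id -> Status (the separate status tablet)
        self.provisional = {}     # key -> (txn_id, value): primary provisional records
        self.reverse_index = {}   # txn_id -> keys it wrote, used for cleanup
        self.committed = {}       # key -> value: regular, committed records

    def begin(self, txn_id):
        self.status_tablet[txn_id] = Status.PENDING
        self.reverse_index[txn_id] = []

    def write(self, txn_id, key, value):
        holder = self.provisional.get(key)
        if holder is not None and holder[0] != txn_id:
            # The provisional record acts as a lock held by another transaction.
            raise RuntimeError(f"{key!r} is locked by {holder[0]}")
        self.provisional[key] = (txn_id, value)
        if key not in self.reverse_index[txn_id]:
            self.reverse_index[txn_id].append(key)

    def commit(self, txn_id):
        self.status_tablet[txn_id] = Status.COMMITTED  # the atomic commit point
        self._cleanup(txn_id)

    def abort(self, txn_id):
        self.status_tablet[txn_id] = Status.ABORTED
        self._cleanup(txn_id)

    def _cleanup(self, txn_id):
        """Turn provisional records into regular ones, or discard them."""
        for key in self.reverse_index.pop(txn_id):
            _, value = self.provisional.pop(key)
            if self.status_tablet[txn_id] is Status.COMMITTED:
                self.committed[key] = value

    def read(self, key):
        holder = self.provisional.get(key)
        if holder and self.status_tablet[holder[0]] is Status.COMMITTED:
            return holder[1]  # committed but not yet cleaned up
        return self.committed.get(key)


db = MiniTxnStore()
db.begin("t1")
db.write("t1", "a", 1)
db.write("t1", "b", 2)
print(db.read("a"))  # still pending, so the write is invisible
db.commit("t1")
print(db.read("a"), db.read("b"))
```

All of a transaction's writes become visible together because visibility depends on a single status entry, not on the individual records.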
YugabyteDB also supports Snapshot Isolation and, as of version 2.0 GA, Serializable isolation, though details on write‑skew prevention are not yet documented.
Competitive Comparison
A comparison table (sourced from Yugabyte’s documentation) positions YugabyteDB alongside TiDB, CockroachDB, and other distributed databases, highlighting similarities in architecture, global distribution, and ACID transaction support.
ITPUB