TiDB Technical Deep Dive – Storage, Compute, and Scheduling Architecture
This article provides a comprehensive technical overview of TiDB, covering its HTAP design, TiKV storage engine with RocksDB and Raft replication, the mapping of relational tables to key‑value pairs, MVCC implementation, transaction handling, and the PD scheduler that balances replicas, leaders, and hot spots across a distributed cluster.
TiDB Technical Deep Dive – Storage Chapter
TiDB is an open‑source distributed HTAP database designed by PingCAP, compatible with MySQL, offering horizontal scalability, strong consistency, and high availability for both OLTP and OLAP workloads.
Highly compatible with MySQL, so most applications can migrate without code changes.
Horizontal elastic scaling by adding nodes.
100% ACID‑compliant distributed transactions.
Financial‑grade high availability using Raft consensus.
HTAP solution with TiSpark for complex analytics.
Cloud‑native design for public, private, and hybrid clouds.
TiKV Storage Engine
TiKV implements a key‑value model stored in RocksDB, providing an ordered map where keys are raw byte arrays.
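To make the "ordered map over raw byte keys" idea concrete, here is a minimal sketch in Python. The class `OrderedKV` and its methods are hypothetical stand-ins for illustration only; they are not the RocksDB or TiKV API. The point is that keeping keys sorted in byte order makes both point lookups and range scans cheap.

```python
import bisect


class OrderedKV:
    """Toy ordered map over raw byte keys (illustrative, not a real storage API)."""

    def __init__(self):
        self._keys = []     # keys kept sorted in byte order
        self._values = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes):
        return self._values.get(key)

    def scan(self, start: bytes, end: bytes):
        """Yield (key, value) pairs with start <= key < end, in key order."""
        i = bisect.bisect_left(self._keys, start)
        while i < len(self._keys) and self._keys[i] < end:
            k = self._keys[i]
            yield k, self._values[k]
            i += 1


kv = OrderedKV()
for k in (b"b", b"a", b"d", b"c"):   # insertion order does not matter
    kv.put(k, k.upper())
result = list(kv.scan(b"b", b"d"))
```

The scan seeks to the first key at or after `start` and then walks forward, which is the access pattern the rest of the design relies on.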
Key1 -> Value
Key2 -> Value
...
KeyN -> Value
With MVCC, each version is encoded as a suffix:
Key1-Version3 -> Value
Key1-Version2 -> Value
Key1-Version1 -> Value
...
KeyN-Version1 -> Value
Region Partitioning
Data is split into contiguous key ranges called Regions, each kept under a configurable size (64 MiB by default here; the default varies across TiKV releases). Regions are distributed across nodes, and the replicas of each Region form a Raft group.
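Because Regions are contiguous, non-overlapping key ranges, finding the Region that owns a key is a binary search over the Regions' start keys. The sketch below assumes a hypothetical in-memory Region table (`REGIONS`) with made-up boundaries; it does not reflect how PD or TiKV actually store this metadata.

```python
import bisect

# Hypothetical Region table: each Region covers [start, end).
# An empty end key on the last Region stands for "no upper bound".
REGIONS = [
    {"id": 1, "start": b"", "end": b"g"},
    {"id": 2, "start": b"g", "end": b"p"},
    {"id": 3, "start": b"p", "end": b""},
]
START_KEYS = [r["start"] for r in REGIONS]  # sorted by construction


def locate_region(key: bytes) -> dict:
    """Return the Region whose range contains `key`."""
    # The owning Region is the last one whose start key is <= key.
    idx = bisect.bisect_right(START_KEYS, key) - 1
    return REGIONS[idx]
```

For example, `locate_region(b"apple")` falls in Region 1 and `locate_region(b"g")` falls in Region 2, since start keys are inclusive and end keys exclusive.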
Raft Replication
Raft provides leader election, membership changes, and log replication. A write to a Region is committed only after its log entry has been replicated to a majority of that Region's replicas, so the data survives the loss of a minority of them.
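The majority rule can be illustrated with a small calculation. In this sketch, `match_index` is a hypothetical list holding, for each replica of one Region (leader included), the highest log index known to be stored on it. Real Raft additionally requires the entry to belong to the leader's current term before it is counted as committed; that check is omitted here.

```python
from typing import List


def commit_index(match_index: List[int]) -> int:
    """Highest log index that is stored on a majority of replicas."""
    ordered = sorted(match_index, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th largest value is present on at least `majority` replicas.
    return ordered[majority - 1]


three_replicas = commit_index([7, 5, 3])        # two replicas hold index >= 5
five_replicas = commit_index([7, 7, 3, 3, 3])   # only two hold 7, three are needed
```

With three replicas, one lagging follower does not hold back commits; with five, two can lag.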
Compute Chapter
The SQL layer maps relational tables to TiKV key‑value pairs. Each table receives a unique TableID, each index an IndexID, and each row a RowID.
Row encoding example:
Key: tablePrefix{tableID}_recordPrefixSep{rowID}
Value: [col1, col2, col3, col4]
Index encoding example (unique index):
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue
Value: rowID
Non‑unique index adds the RowID to guarantee uniqueness:
Key: tablePrefix{tableID}_indexPrefixSep{indexID}_indexedColumnsValue_rowID
Value: null
These encodings are memcomparable, preserving ordering after binary encoding, which enables efficient point lookups and range scans.
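The following sketch shows one way such memcomparable keys can be built. The prefix bytes (`t`, `_r`, `_i`) and the two encoders follow the scheme used in TiDB's codec packages as far as I know, but treat them as assumptions: the article does not spell them out, and real index values also carry a type flag byte that is omitted here. The function names are hypothetical.

```python
import struct

# Assumed prefix bytes; the article only names them symbolically.
TABLE_PREFIX = b"t"
RECORD_PREFIX_SEP = b"_r"
INDEX_PREFIX_SEP = b"_i"
SIGN_MASK = 0x8000000000000000


def encode_int(v: int) -> bytes:
    """Memcomparable int64: flip the sign bit, then write 8 big-endian bytes."""
    return struct.pack(">Q", (v & 0xFFFFFFFFFFFFFFFF) ^ SIGN_MASK)


def encode_bytes(data: bytes) -> bytes:
    """Memcomparable bytes: zero-padded 8-byte groups, each followed by a
    marker byte equal to 0xFF minus the number of padding bytes."""
    out = bytearray()
    for i in range(0, len(data) + 1, 8):
        chunk = data[i:i + 8]
        pad = 8 - len(chunk)
        out += chunk + b"\x00" * pad
        out.append(0xFF - pad)
    return bytes(out)


def row_key(table_id: int, row_id: int) -> bytes:
    return (TABLE_PREFIX + encode_int(table_id)
            + RECORD_PREFIX_SEP + encode_int(row_id))


def unique_index_key(table_id: int, index_id: int, value: bytes) -> bytes:
    return (TABLE_PREFIX + encode_int(table_id)
            + INDEX_PREFIX_SEP + encode_int(index_id) + encode_bytes(value))


def non_unique_index_key(table_id: int, index_id: int,
                         value: bytes, row_id: int) -> bytes:
    # Appending the row ID makes otherwise identical index entries distinct.
    return unique_index_key(table_id, index_id, value) + encode_int(row_id)


row_ids = [10, -5, 200, 0, 3]
by_key = sorted(row_ids, key=lambda r: row_key(1, r))
```

Sorting the row IDs by their encoded keys gives the same order as sorting them numerically, negative values included, which is what allows a byte-ordered store to serve range scans over rows and index entries.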
SQL Execution
SQL statements are translated into KV operations. For example, SELECT COUNT(*) FROM user WHERE name='TiDB' is executed by constructing the key range that covers the table's rows, scanning those rows, filtering on the name column, and counting the matches. Where possible, the filter and the aggregation are pushed down to TiKV, so each storage node returns only a partial count instead of the rows themselves.
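The push-down idea can be sketched as two small functions: one that would run next to the data and return a partial count, and one in the SQL layer that merges the partial counts. The data, the Region grouping, and the function names below are all made up for illustration; this is not TiDB's coprocessor interface.

```python
# Hypothetical rows of table `user`, grouped by the Region holding their keys.
# Each entry is (row_id, columns).
REGION_ROWS = {
    1: [(1, {"name": "TiDB"}), (2, {"name": "TiKV"})],
    2: [(3, {"name": "TiDB"}), (4, {"name": "PD"})],
    3: [(5, {"name": "TiDB"})],
}


def region_partial_count(rows, column, wanted):
    """Storage side: filter the Region's rows and return only a count."""
    return sum(1 for _row_id, cols in rows if cols[column] == wanted)


def count_where(region_rows, column, wanted):
    """SQL layer: sum the partial counts returned by each Region."""
    return sum(region_partial_count(rows, column, wanted)
               for rows in region_rows.values())


total = count_where(REGION_ROWS, "name", "TiDB")
```

Only one integer per Region travels back to the SQL layer, rather than every row in the table's key range.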
Scheduling Chapter
PD (Placement Driver) acts as the central scheduler, collecting node status and Region leader heartbeats to make placement decisions.
Ensures each Region has the correct number of replicas.
Distributes replicas across distinct locations (nodes, racks, data centers) using label‑based placement.
Balances replica and leader distribution for even load.
Detects hot spots and re‑balances them.
Controls migration speed to avoid impacting online services.
Supports manual node decommissioning.
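As an illustration of label-based placement, the sketch below picks a store for a new replica by preferring the candidate that is best isolated from the stores already holding replicas. The label hierarchy (`dc`, `rack`, `host`), the scoring rule, and the store records are assumptions chosen for the example, not PD's actual algorithm.

```python
# Hypothetical label hierarchy, outermost level first.
LABEL_LEVELS = ("dc", "rack", "host")


def isolation(a: dict, b: dict) -> int:
    """Higher is better: 3 if the data centers differ, 2 if only the racks
    differ, 1 if only the hosts differ, 0 if all labels match."""
    for depth, level in enumerate(LABEL_LEVELS):
        if a[level] != b[level]:
            return len(LABEL_LEVELS) - depth
    return 0


def pick_store(candidates, existing):
    """Choose the candidate whose worst-case isolation from existing replicas is largest."""
    def score(store):
        return min((isolation(store["labels"], e["labels"]) for e in existing),
                   default=len(LABEL_LEVELS))
    return max(candidates, key=score)


existing = [
    {"id": 1, "labels": {"dc": "a", "rack": "r1", "host": "h1"}},
    {"id": 2, "labels": {"dc": "a", "rack": "r2", "host": "h2"}},
]
candidates = [
    {"id": 3, "labels": {"dc": "a", "rack": "r1", "host": "h3"}},
    {"id": 4, "labels": {"dc": "b", "rack": "r1", "host": "h4"}},
    {"id": 5, "labels": {"dc": "a", "rack": "r3", "host": "h5"}},
]
chosen = pick_store(candidates, existing)
```

Here store 4 wins because it sits in a different data center from both existing replicas, so losing data center `a` would still leave one copy.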
PD issues three basic operations to TiKV: AddReplica, RemoveReplica, and TransferLeader, which are executed by Raft groups based on the scheduler’s plan.
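A much simplified scheduler pass that emits these three operations might look like the sketch below. The operation names come from the text; the Region records, the target replica count of 3, and the selection rules are hypothetical simplifications that ignore labels, store capacity, and rate limiting.

```python
def check_replicas(region, stores, target=3):
    """Emit AddReplica or RemoveReplica when the replica count is off target."""
    peers = region["peers"]
    if len(peers) < target:
        spare = [s for s in stores if s not in peers]
        if spare:
            return [("AddReplica", region["id"], spare[0])]
    elif len(peers) > target:
        # Never remove the leader's replica in this sketch.
        victim = next(p for p in peers if p != region["leader"])
        return [("RemoveReplica", region["id"], victim)]
    return []


def balance_leaders(regions, stores):
    """Emit one TransferLeader if leader counts differ by more than one."""
    leaders = {s: 0 for s in stores}
    for r in regions:
        leaders[r["leader"]] += 1
    busiest = max(stores, key=lambda s: leaders[s])
    idlest = min(stores, key=lambda s: leaders[s])
    if leaders[busiest] - leaders[idlest] <= 1:
        return []
    for r in regions:
        # The target must already hold a replica of the Region.
        if r["leader"] == busiest and idlest in r["peers"]:
            return [("TransferLeader", r["id"], idlest)]
    return []


stores = ["s1", "s2", "s3", "s4"]
regions = [
    {"id": 1, "peers": ["s1", "s2"], "leader": "s1"},              # too few
    {"id": 2, "peers": ["s1", "s2", "s3", "s4"], "leader": "s1"},  # too many
    {"id": 3, "peers": ["s1", "s2", "s3"], "leader": "s1"},        # on target
]
plan = [op for r in regions for op in check_replicas(r, stores)]
plan += balance_leaders(regions, stores)
```

The resulting `plan` is a list of `(operation, region_id, store)` tuples, one per decision, which mirrors the idea that the scheduler only proposes steps and the Region's Raft group carries them out.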
Overall, the article explains how TiDB combines a MySQL‑compatible front‑end with a distributed KV store, leverages Raft for fault‑tolerant replication, and uses PD to continuously balance resources and maintain high availability.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
