How ByteKV Achieves Strong Consistency and Horizontal Scalability with Range Partitioning
This article explains how ByteKV combines range partitioning, a custom Raft implementation, a multi-layer architecture, and multi-dimensional load balancing to provide strong consistency, flexible data models, and horizontal scalability for large-scale storage workloads.
Background
After Google published the Spanner paper, many companies built databases of their own to address scalability, and ByteDance adopted similar techniques for storing massive data sets. Different kinds of data (financial, social, metadata) have different consistency, availability, and scalability requirements, which created the need for a system that guarantees strong consistency while scaling horizontally.
Architecture Overview
The system consists of five components: SQLProxy, KVProxy, KVClient, KVMaster, and PartitionServer. SQLProxy handles SQL requests and KVProxy handles KV requests; both forward them to KVClient, which obtains timestamps and replica locations from KVMaster and then contacts the appropriate PartitionServer for reads and writes. Data is split into many Partitions, each replicated across PartitionServers using Raft for consistency. Proxies cache replica locations and refresh them when they become stale.
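The cache-and-refresh behaviour described above can be sketched as follows. This is a minimal illustration, not ByteKV's code: the `Master`, `Server`, and `Client` classes, the `StaleLocationError` exception, and the single retry are all assumptions made for the example.

```python
import bisect


class StaleLocationError(Exception):
    """Raised by a server that does not host the requested partition."""


class Master:
    """Hypothetical stand-in for KVMaster: holds the authoritative routing table."""

    def __init__(self, split_keys, owners):
        self.split_keys = list(split_keys)  # split_keys[i] is the first key of partition i + 1
        self.owners = list(owners)          # owners[i] is the server name hosting partition i

    def snapshot(self):
        return list(self.split_keys), list(self.owners)


class Server:
    """Hypothetical stand-in for a PartitionServer."""

    def __init__(self):
        self.partitions = {}  # partition id -> {key: value}

    def get(self, pid, key):
        if pid not in self.partitions:
            raise StaleLocationError(pid)
        return self.partitions[pid].get(key)


class Client:
    """Routes reads with a cached table and refreshes it when a server rejects the request."""

    def __init__(self, master, servers):
        self.master = master
        self.servers = servers            # server name -> Server
        self.routes = master.snapshot()   # cached copy, may go stale
        self.refreshes = 0

    def get(self, key):
        for _ in range(2):  # one attempt with the cache, one after a refresh
            split_keys, owners = self.routes
            pid = bisect.bisect_right(split_keys, key)
            try:
                return self.servers[owners[pid]].get(pid, key)
            except StaleLocationError:
                self.routes = self.master.snapshot()
                self.refreshes += 1
        raise StaleLocationError(key)
```

The cached table keeps the master off the hot path: it is consulted only after a server reports that the cached location is wrong.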
Layered Structure
The interface layer provides a KV SDK and a richer SQL SDK.
The transaction layer offers snapshot isolation via global timestamps and two-phase commit.
The elastic scaling layer automatically splits and merges partitions.
The consistency protocol layer uses the self-developed ByteRaft component.
The storage engine layer uses RocksDB for fast iteration and a custom BlockDB for future needs.
The space management layer handles local disks and shared storage.
External Interfaces
KV Interface
ByteKV exposes Put, Delete, and Get with optional CAS and TTL semantics. Advanced APIs include WriteBatch (atomic multi‑record writes), MultiGet (consistent snapshot reads), MultiWrite (best‑effort writes), and Scan (range, prefix, reverse scans).
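To make the CAS and TTL semantics concrete, here is a small in-memory sketch. The `TinyKV` class, its method signatures, and the `expected` and `ttl` parameter names are hypothetical and are not the ByteKV SDK; the sketch also treats a stored `None` as an absent key.

```python
import time

_UNSET = object()  # sentinel meaning "no CAS condition supplied"


class TinyKV:
    """In-memory model of Put/Get/Delete with optional CAS and TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at or None)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired records
            return None
        return value

    def put(self, key, value, ttl=None, expected=_UNSET):
        # CAS: write only if the current value equals `expected`
        # (expected=None means "only if the key is absent").
        if expected is not _UNSET and self.get(key) != expected:
            return False
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)
        return True

    def delete(self, key, expected=_UNSET):
        if expected is not _UNSET and self.get(key) != expected:
            return False
        self._data.pop(key, None)
        return True
```

Passing a fake clock, for example `TinyKV(clock=lambda: now[0])`, makes expiry deterministic in tests.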
Table Interface
On top of KV, the table layer offers SQL-like operations (Insert, Update, Delete, Select) with full support for WHERE, ORDER BY, GROUP BY, and aggregation. An ORM library is also provided for application developers.
Key Technologies
ByteRaft
ByteKV uses Raft as the replication algorithm. Each range forms a Raft group, resulting in many groups per cluster. ByteRaft is a C++ multi‑Raft library that merges small I/O requests, pipelines network traffic, and retains pending logs in memory to reduce latency. It also supports Learner and Witness roles to improve resource efficiency and fault tolerance.
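As a rough illustration of how a Learner differs from a voting replica, the sketch below computes a Raft leader's commit index. It assumes, as in common Raft implementations, that learners receive log entries but do not count toward the quorum; how ByteRaft treats Learners and Witnesses internally is not described here, and the function name and arguments are hypothetical. The sketch also omits Raft's rule that a leader only commits entries from its current term by counting replicas.

```python
def commit_index(match_index, voters):
    """Return the highest log index stored on a majority of voting replicas.

    match_index maps replica name -> highest log index known to be stored there.
    Replicas not listed in `voters` (for example learners) are ignored.
    """
    acked = sorted((match_index.get(v, 0) for v in voters), reverse=True)
    majority = len(voters) // 2 + 1
    # The majority-th largest value is stored on at least `majority` voters.
    return acked[majority - 1]
```

For example, with voters `a`, `b`, `c` at indexes 10, 7, and 4 and a learner at 12, the result is 7: the learner's progress does not advance the commit point.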
Storage Engine
RocksDB serves as the primary engine, offering high performance and rich configuration. To address RocksDB’s limitations for ByteKV’s workload, a custom engine called BlockDB was built. BlockDB provides per‑partition isolation, adaptive compaction, and a block‑level storage system that can run on bare disks or integrate with SPDK for ultra‑low latency.
Distributed Transactions
ByteKV implements snapshot isolation using a global logical clock. Transactions acquire a start timestamp, buffer writes locally, and commit via a two‑phase protocol. Transaction state is stored in a dedicated internal table, and WriteIntent records carry an “infinite” version to hide uncommitted data. Conflict resolution follows a first‑come‑first‑served order, and aborted or timed‑out transactions clean up their intents asynchronously.
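The role of the "infinite" version can be sketched with a single-node multi-version store. Everything here is an assumption made for illustration: the `Store` class, using the start timestamp as the transaction id, and the extra check against versions committed after the writer's snapshot (a standard snapshot-isolation rule). The real system spreads intents across partitions and commits through two-phase commit and a transaction state table, which this sketch collapses into one `commit` call.

```python
import itertools

INF = float("inf")  # version carried by uncommitted write intents


class Conflict(Exception):
    pass


class Store:
    def __init__(self):
        self.versions = {}               # key -> list of [version, value, txn_id]
        self.clock = itertools.count(1)  # stand-in for the global logical clock

    def begin(self):
        return next(self.clock)          # start timestamp, also used as txn id

    def read(self, key, ts):
        # Intents have version INF, so they never satisfy version <= ts.
        visible = [(v, val) for v, val, _ in self.versions.get(key, []) if v <= ts]
        return max(visible, key=lambda p: p[0])[1] if visible else None

    def write_intent(self, key, value, start_ts):
        recs = self.versions.setdefault(key, [])
        for rec in recs:
            if rec[0] == INF:
                if rec[2] != start_ts:
                    raise Conflict(f"{key!r} is locked by txn {rec[2]}")  # first writer wins
                rec[1] = value           # overwrite our own intent
                return
            if rec[0] > start_ts:
                raise Conflict(f"{key!r} was committed after our snapshot")
        recs.append([INF, value, start_ts])

    def commit(self, start_ts):
        commit_ts = next(self.clock)
        for recs in self.versions.values():
            for rec in recs:
                if rec[0] == INF and rec[2] == start_ts:
                    rec[0] = commit_ts   # the intent becomes a committed version
        return commit_ts

    def abort(self, start_ts):
        for key, recs in self.versions.items():
            self.versions[key] = [r for r in recs if not (r[0] == INF and r[2] == start_ts)]
```

Because a reader only accepts versions at or below its timestamp, an intent stays invisible until commit replaces `INF` with a real commit timestamp.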
Automatic Split & Merge
When a range exceeds a size threshold, ByteKV samples data to pick a split point and creates a new Raft group for the second half. Conversely, small or low‑traffic ranges are merged to avoid fragmentation. Both operations are logged and executed by the Raft leader of the affected range.
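One way to picture sampling-based split-point selection is to take the median of a random sample of keys. This is a sketch under assumptions: the function names, the sample size, and the use of key counts (rather than bytes) as the measure of size are all hypothetical.

```python
import random


def pick_split_key(keys, sample_size=64, seed=0):
    """Pick a split key near the middle of the range by sampling its keys."""
    rng = random.Random(seed)
    sample = keys if len(keys) <= sample_size else rng.sample(keys, sample_size)
    ordered = sorted(sample)
    return ordered[len(ordered) // 2]  # median of the sample


def split_range(start, end, split_key):
    """Split the half-open range [start, end) into [start, split) and [split, end)."""
    return (start, split_key), (split_key, end)
```

Sampling avoids scanning the whole range; the trade-off is that the two halves are only approximately equal in size.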
Load Balancing
ByteKV employs a multi‑dimensional balancing strategy. Initially, a single‑dimensional (disk space) “sufficient balance” model defines high/low watermarks. Later, additional dimensions (CPU, network, I/O) are added with priority ordering. The scheduler selects replica migrations that improve higher‑priority dimensions while respecting hard limits (e.g., disk saturation). Heterogeneous machine types are handled by converting load metrics into resource‑utilization percentages.
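The scheduling idea can be sketched as a function that walks the dimensions in priority order and proposes one replica move. The watermark values, the two dimensions shown, the node dictionary layout, and the `plan_move` name are assumptions for illustration only.

```python
PRIORITY = ["disk", "cpu"]   # higher-priority dimension first (assumed order)
HIGH, LOW = 0.75, 0.50       # assumed "sufficient balance" watermarks
HARD_LIMIT = 0.90            # assumed cap that no move may exceed on any dimension


def utilization(node):
    """Convert raw load into utilization fractions so unequal machines compare fairly."""
    return {d: node["used"][d] / node["capacity"][d] for d in PRIORITY}


def plan_move(nodes, replica_load):
    """Return (dimension, source, target) for one replica migration, or None."""
    for dim in PRIORITY:
        util = {name: utilization(n)[dim] for name, n in nodes.items()}
        source = max(util, key=util.get)
        if util[source] <= HIGH:
            continue  # this dimension is already sufficiently balanced
        candidates = []
        for name, n in nodes.items():
            if name == source or util[name] >= LOW:
                continue
            # Projected utilization of the target after receiving the replica.
            after = {d: (n["used"][d] + replica_load[d]) / n["capacity"][d] for d in PRIORITY}
            if all(v <= HARD_LIMIT for v in after.values()):
                candidates.append(name)
        if candidates:
            return dim, source, min(candidates, key=util.get)
    return None
```

In this sketch a node with plenty of free disk is still rejected as a target if the move would push its CPU past the hard limit, which mirrors the rule that improvements on one dimension must respect hard limits on the others.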
Future Directions
More Consistency Levels
Exploring hybrid logical clocks (HLC) to provide causal consistency and session guarantees for workloads that can tolerate weaker guarantees, reducing cross‑datacenter latency.
Cloud‑Native Integration
Investigating Kubernetes‑native deployment, auto‑scaling, and self‑healing to lower operational costs and improve usability for cloud‑native users.
References
Ongaro & Ousterhout, “In search of an understandable consensus algorithm”, 2013.
MyRocks record format (memcomparable format): https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#memcomparable-format
Rae et al., “Online, asynchronous schema change in F1”, VLDB 2013.
Kulkarni et al., “Logical physical clocks and consistent snapshots in globally distributed databases”, 2014.
Peng & Dabek, “Large‑scale incremental processing using distributed transactions and notifications”, 2010.
CockroachDB blogs on distributed transactions and isolation.
Corbett et al., “Spanner: Google’s globally distributed database”, 2013.
dbaplus Community
