Databases 47 min read

ByteKV: Design and Implementation of a Strongly Consistent Range-Partitioned KV Store

ByteKV is a C++-based, strongly consistent, range-partitioned key‑value storage system built by ByteDance, featuring a multi‑Raft consensus layer, custom storage engines (RocksDB and BlockDB), automatic partition splitting/merging, load balancing, distributed transactions, and a SQL table layer for rich data models.

DataFunTalk
DataFunTalk
DataFunTalk
ByteKV: Design and Implementation of a Strongly Consistent Range-Partitioned KV Store

ByteDance developed ByteKV to address the scalability and consistency challenges of massive data storage, adopting a range‑partitioned architecture that provides strong consistency while supporting flexible access patterns.

System Architecture

The system consists of five core components: SQLProxy, KVProxy, KVClient, KVMaster, and PartitionServer. SQLProxy handles SQL requests, KVProxy handles KV requests, both forward operations to KVClient, which interacts with KVMaster for timestamps and replica locations and then communicates with the appropriate PartitionServer for reads and writes. PartitionServers store user data, while KVMaster balances replicas across the cluster.

Key Technologies

ByteRaft

ByteKV uses a custom multi‑Raft library, ByteRaft, to replicate logs across many Raft groups (one per range). ByteRaft merges small I/O requests, pipelines network traffic, retains pending logs in memory, and supports Learner and Witness roles to improve scalability and reduce resource usage.

Storage Engine

RocksDB is used as the primary engine for rapid iteration, with Table Properties collectors employed for efficient range splitting, garbage collection, and compaction tuning. To overcome RocksDB’s limitations for ByteKV’s workload, a bespoke engine called BlockDB was created, storing data in 128 KB blocks and providing per‑partition resource isolation, adaptive compaction, and a lightweight block‑system for high‑performance SSDs.

Distributed Transactions

ByteKV provides snapshot‑isolated transactions using a global monotonically increasing timestamp service, multi‑version concurrency control, and two‑phase commit. Transaction state is stored in an internal KV table, and WriteIntent records use an infinite version number to hide uncommitted writes. Conflict detection is optimistic, aborting transactions when newer versions appear.

Automatic Partition Split & Merge

Ranges are split when size thresholds are exceeded, using sampled Table Properties to estimate split points without full scans. Small or low‑traffic ranges are merged to avoid fragmentation. Both operations are coordinated through KVMaster and executed as Raft log entries.

Load Balancing

ByteKV employs a multi‑dimensional load‑balancing scheduler that considers disk usage, CPU/IO utilization, and traffic. Nodes are classified into zones (high load, balanced, low load) using high/low watermarks, and migrations are performed respecting priority dimensions and hard resource limits, with special handling for heterogeneous machines.

Table Layer (ByteSQL)

On top of the KV store, ByteSQL provides a relational model with databases, tables, schemas, primary and secondary indexes, and an ORM library. Index rows are encoded as KV pairs; primary keys store full rows, unique indexes map indexed fields to primary keys, and non‑unique indexes store indexed fields plus primary keys with a null placeholder.

<span>Primary Key: pk_field1, pk_field2,... => non_pk_field1, non_pk_field2...</span></code><code><span>Unique Key: key_field1, key_field2,...=> pk_field1, pk_field2...</span></code><code><span>NonUnique Key: key_field1, key_field2,..., pk_field1, pk_field2...=> <null></span>

ByteSQL supports global secondary indexes, interactive snapshot‑isolated transactions, and a RETURNING clause to avoid extra reads. Schema changes are performed online using a three‑state workflow (DeleteOnly → WriteOnly → Public) inspired by Google F1, ensuring no lost writes or deletes during index creation.

Future Directions

Planned enhancements include additional consistency levels via hybrid logical clocks for causal consistency, multi‑region eventual consistency modes, and deeper Cloud‑Native integration with Kubernetes for automated deployment, scaling, and healing.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsSQLStorage EngineConsensusPartitioningkey-value store
DataFunTalk
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.