Databases 47 min read

How ByteKV Achieves Strong Consistency and Horizontal Scalability with Range Partitioning

This article explains ByteKV's design—using range partitioning, a custom Raft implementation, multi‑layer architecture, and advanced load‑balancing strategies—to provide strong consistency, flexible data models, and seamless scalability for large‑scale storage workloads.

dbaplus Community

Sep 29, 2021

How ByteKV Achieves Strong Consistency and Horizontal Scalability with Range Partitioning

Background

After Google released the Spanner paper, many companies introduced databases to address scalability, and ByteDance adopted similar techniques for massive data storage. Different data types (financial, social, metadata) have varying consistency, availability, and scalability requirements, prompting the need for a system that guarantees strong consistency while scaling horizontally.

Architecture Overview

The system consists of five components: SQLProxy , KVProxy , KVClient , KVMaster , and PartitionServer . SQLProxy handles SQL requests, KVProxy handles KV requests, both forward to KVClient, which interacts with KVMaster for timestamps and replica locations and then contacts the appropriate PartitionServer for reads and writes. Data is split into many Partitions , each replicated across PartitionServers using Raft for consistency. Proxies cache replica locations but refresh them when they become stale.

Layered Structure

Interface layer provides KV SDK and richer SQL SDK.

Transaction layer offers snapshot isolation via global timestamps and two‑phase commit.

Elastic scaling layer automatically splits and merges partitions.

Consistency protocol layer uses the self‑developed ByteRaft component.

Storage engine layer uses RocksDB for fast iteration and a custom BlockDB for future needs.

Space management layer handles local disks and shared storage.

External Interfaces

KV Interface

ByteKV exposes Put, Delete, and Get with optional CAS and TTL semantics. Advanced APIs include WriteBatch (atomic multi‑record writes), MultiGet (consistent snapshot reads), MultiWrite (best‑effort writes), and Scan (range, prefix, reverse scans).

Table Interface

On top of KV, the table layer offers SQL‑like operations ( Insert, Update, Delete, Select) with full support for WHERE, ORDER BY, GROUP BY, and aggregation. An ORM library is also provided for application developers.

Key Technologies

ByteRaft

ByteKV uses Raft as the replication algorithm. Each range forms a Raft group, resulting in many groups per cluster. ByteRaft is a C++ multi‑Raft library that merges small I/O requests, pipelines network traffic, and retains pending logs in memory to reduce latency. It also supports Learner and Witness roles to improve resource efficiency and fault tolerance.

Storage Engine

RocksDB serves as the primary engine, offering high performance and rich configuration. To address RocksDB’s limitations for ByteKV’s workload, a custom engine called BlockDB was built. BlockDB provides per‑partition isolation, adaptive compaction, and a block‑level storage system that can run on bare disks or integrate with SPDK for ultra‑low latency.

Distributed Transactions

ByteKV implements snapshot isolation using a global logical clock. Transactions acquire a start timestamp, buffer writes locally, and commit via a two‑phase protocol. Transaction state is stored in a dedicated internal table, and WriteIntent records carry an “infinite” version to hide uncommitted data. Conflict resolution follows a first‑come‑first‑served order, and aborted or timed‑out transactions clean up their intents asynchronously.

Automatic Split & Merge

When a range exceeds a size threshold, ByteKV samples data to pick a split point and creates a new Raft group for the second half. Conversely, small or low‑traffic ranges are merged to avoid fragmentation. Both operations are logged and executed by the Raft leader of the affected range.

Load Balancing

ByteKV employs a multi‑dimensional balancing strategy. Initially, a single‑dimensional (disk space) “sufficient balance” model defines high/low watermarks. Later, additional dimensions (CPU, network, I/O) are added with priority ordering. The scheduler selects replica migrations that improve higher‑priority dimensions while respecting hard limits (e.g., disk saturation). Heterogeneous machine types are handled by converting load metrics into resource‑utilization percentages.

Future Directions

More Consistency Levels

Exploring hybrid logical clocks (HLC) to provide causal consistency and session guarantees for workloads that can tolerate weaker guarantees, reducing cross‑datacenter latency.

Cloud‑Native Integration

Investigating Kubernetes‑native deployment, auto‑scaling, and self‑healing to lower operational costs and improve usability for cloud‑native users.

References

Ongaro & Ousterhout, “In search of an understandable consensus algorithm”, 2013.

RocksDB record format: https://github.com/facebook/mysql-5.6/wiki/MyRocks-record-format#memcomparable-format

Rae et al., “Online, asynchronous schema change in F1”, VLDB 2013.

Kulkarni et al., “Logical physical clocks and consistent snapshots in globally distributed databases”, 2014.

Peng & Dabek, “Large‑scale incremental processing using distributed transactions and notifications”, 2010.

CockroachDB blogs on distributed transactions and isolation.

Corbett et al., “Spanner: Google’s globally distributed database”, 2013.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

load balancing Distributed storage Schema Change strong consistency range partition BlockDB ByteRaft

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.