Databases 23 min read

Designing ZanKV: A Scalable Distributed KV Store Built on RocksDB, Raft, and Redis Protocol

This article details the design, architecture, and implementation of ZanKV—a high‑performance, distributed key‑value store that combines RocksDB storage, etcd‑Raft consensus, and a Redis‑compatible protocol, covering data partitioning, namespace isolation, expiration strategies, cross‑datacenter deployment, and performance tuning.

Youzan Coder

Aug 17, 2018

Designing ZanKV: A Scalable Distributed KV Store Built on RocksDB, Raft, and Redis Protocol

Background

Initially Youzan used MySQL for persistent storage and Codis for caching. As the business grew, a high‑performance NoSQL store was required, leading to the adoption of Aerospike. Aerospike’s in‑memory index and cost of scaling eventually motivated the development of a custom KV service that could evolve with future requirements.

Design and Architecture

The design goals were to use actively maintained open‑source components, avoid tight coupling to a single product, keep the stack simple, enable easy extension, and provide stable business‑level APIs despite backend changes.

A proxy layer hides backend details and exposes the Redis protocol to clients. This allows existing Aerospike and Codis clusters to be integrated while keeping the option to plug in other storage engines later.

Implementation Details

DataNode (Data Storage Node)

RocksDB is used as the underlying storage engine for its performance and community support. A mapping layer on top of RocksDB implements richer data structures required by the Redis protocol, inspired by projects such as Pika, Ledis, and TiKV.

Data replication and consistency are achieved with the Raft consensus algorithm provided by etcd‑raft. To scale beyond a single Raft group, ZanKV partitions the keyspace into multiple Raft groups, each handling a subset of partitions.

Namespace and Partitioning

ZanKV uses a prefix‑hash partitioning algorithm. Keys sharing the same prefix remain globally ordered, while the algorithm simplifies implementation and scaling. Namespaces provide logical isolation; each namespace can have independent replica counts, partition numbers, and placement policies, enabling smooth upgrades and feature roll‑outs without affecting existing workloads.

PlaceDriver Node (Global Management)

The PlaceDriver (PD) node is a stateless service that watches cluster changes via etcd, assigns partitions to DataNodes, and rebalances data when nodes fail or are added. Partition‑to‑node mapping is stored in an internal table rather than using consistent hashing, allowing flexible load distribution.

The read/write flow is: the client hashes the key, determines the partition ID, looks up the partition‑to‑node map, contacts the appropriate DataNode, and the node forwards the request to the leader of the Raft group that owns the partition. Errors trigger leader refresh and retry.

Data Expiration

ZanKV supports two expiration strategies:

Consistent expiration : Stores both the key’s TTL (Table 1) and a secondary index keyed by expiration timestamp (Table 2). Leaders scan Table 2, issue Raft‑based delete commands, and remove entries from both tables. This guarantees strong consistency but can generate heavy Raft traffic under massive expirations.

Inconsistent local deletion : Only Table 2 is kept; each node periodically scans it (every 5 minutes) and deletes expired keys locally without Raft coordination. This reduces write amplification and network load but sacrifices precise TTL queries and immediate deletion guarantees.

For massive time‑series workloads, a prefix‑based bulk‑clean API allows batch deletion of keys within a time range.

Cross‑Datacenter Deployment

Two deployment modes are offered:

Single multi‑datacenter cluster : A single Raft‑based cluster spans multiple data centers, with replicas evenly distributed. Raft tolerates the loss of any single data center while preserving consistency.

Multiple clusters with asynchronous sync : Separate clusters in each data center replicate Raft logs asynchronously via learner nodes. This reduces latency for local traffic but may introduce temporary inconsistency during failures.

Performance Tuning

Key tuning areas include:

RocksDB block cache sizing (10‑30% of total memory).

Write buffer configuration ensuring

level0_file_num_compaction_trigger*write_buffer_size*min_write_buffer_number_tomerge=max_bytes_for_level_base

Background I/O throttling.

Iterator upper‑bound optimization using rocksdb::ReadOptions::iterate_upper_bound.

Disabling transparent huge pages:

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled

# echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Roadmap

Future work includes:

Secondary indexing for hash types.

Optimizing Raft log storage to keep most logs on disk while retaining a small in‑memory tail.

Multi‑field index filtering for richer query capabilities.

Real‑time export and OLAP integration via Raft learners.

The project is open source at https://github.com/youzan/ZanRedisDB. Contributions are welcome.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Performance Tuning RocksDB KV store Raft Redis Protocol

Written by

Youzan Coder

Official Youzan tech channel, delivering technical insights and occasional daily updates from the Youzan tech team.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.