Cut Storage Costs 4×: Inside BitalosDB's High‑Performance KV Engine
An in‑depth look at BitalosDB, the home‑grown NoSQL storage engine behind Zuoyebang's massive KV traffic, covering its novel I/O architecture, KV‑separation design, Raft‑based consistency, multi‑cloud CRDT replication, and benchmark results that show roughly 4× lower cost than standard Redis.
Overview
Project background: To support massive cache demand and complex I/O scenarios for online services, the goal is to handle larger traffic and data at lower cost.
Project Status
Handles 90% of Zuoyebang’s KV storage traffic, peak QPS 15 million.
Cache & storage volume: 130 TB.
Average read latency: 0.1 ms; write latency: 0.15 ms.
Availability: 99.9999%.
Project Benefits
Compared with standard Redis, the current storage volume costs roughly one quarter as much, a 4× cost reduction.
Key Technologies
BitalosDB: a self‑developed storage engine with a new I/O architecture for extreme performance.
Raft Protocol: heavily optimized to boost write performance and data synchronization, with an improved election strategy for higher cluster stability.
Multi‑cloud Multi‑master (CRDT): ensures conflict‑free writes across multiple clouds, achieving eventual consistency.
Redis Compatibility: supports the Redis protocol for seamless migration to Stored.
Storage Panorama
Storage Engine
Problem
Standard LSM‑Tree suffers from read‑write amplification; as data scale grows, resource consumption for amplification increases. The challenge is to support larger write volumes and higher read traffic at lower cost.
Solution
Use Bitalos‑Trees to solve read amplification, Bithash for KV separation to solve write amplification, and separate hot‑cold data to further save memory and disk.
BitalosDB I/O Architecture
Bitalos‑Trees handle data updates and hot data storage, providing high‑performance indexing while eliminating read amplification.
Bithash stores values, delivering high‑performance reads/writes and eliminating write amplification.
Bitable stores cold data; based on data size and access frequency, cold data is moved to Bitable during low‑traffic periods, improving compression and reducing index memory usage.
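The division of labor among the three components can be pictured with a toy model. All class and method names below are illustrative stand‑ins, not BitalosDB's actual API: the index points hot keys into the value store, and cold keys fall back to a compressed cold tier.

```python
# Minimal sketch of the three-tier read/write path described above.
# Component names and interfaces are illustrative, not BitalosDB's API.

class BitalosDBSketch:
    def __init__(self):
        self.tree = {}     # Bitalos-Tree stand-in: hot key -> value locator
        self.bithash = {}  # Bithash stand-in: locator -> value
        self.bitable = {}  # Bitable stand-in: cold key -> value

    def put(self, key, value):
        loc = len(self.bithash)          # pretend file-offset handle
        self.bithash[loc] = value        # value body lives in Bithash
        self.tree[key] = loc             # index entry points into Bithash

    def get(self, key):
        if key in self.tree:             # hot path: index hit, one value read
            return self.bithash[self.tree[key]]
        return self.bitable.get(key)     # cold path: fall back to Bitable

    def demote_cold(self, key):
        """During low-traffic periods, move a cold key's value into Bitable,
        freeing index memory for hot data."""
        if key in self.tree:
            self.bitable[key] = self.bithash.pop(self.tree.pop(key))

db = BitalosDBSketch()
db.put("k1", b"v1")
assert db.get("k1") == b"v1"
db.demote_cold("k1")
assert db.get("k1") == b"v1"   # still readable from the cold tier
```

The point of the model is the routing, not the containers: hot reads touch only the index plus one value lookup, while demotion shrinks the in‑memory index without losing data.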
KV Separation – Technical Analysis
Option A
Option B
Option C
Analysis
Summary of Options
Options A & B require extra CPU and I/O for index queries/updates during vLog GC.
Option C triggers multiple random reads during vLog reads, leaving room for read‑performance improvement.
BitalosDB enables closed‑loop GC inside vLog without index queries/updates while maintaining high‑performance vLog reads.
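One way to make GC closed‑loop without touching the index (an assumption on my part, sketched here rather than taken from the article) is for index entries to reference values by file ID, and for GC to maintain a file‑ID remap table: live records move to a new file, the old ID redirects to the new one, and stale index entries still resolve.

```python
# Hypothetical closed-loop vLog GC: instead of rewriting index entries,
# GC copies live values to a fresh file and records old_file -> new_file
# in a remap table. Index entries that still reference the old file ID
# continue to resolve after GC. All names here are illustrative.

class VLogStore:
    def __init__(self):
        self.files = {}    # file_id -> {key: value}
        self.remap = {}    # old file_id -> new file_id
        self.next_id = 0

    def new_file(self):
        fid = self.next_id
        self.next_id += 1
        self.files[fid] = {}
        return fid

    def resolve(self, fid):
        while fid in self.remap:   # follow the remap chain to the live file
            fid = self.remap[fid]
        return fid

    def read(self, fid, key):
        return self.files[self.resolve(fid)].get(key)

    def gc(self, fid, live_keys):
        """Rewrite only live keys into a fresh file; no index update needed."""
        new_fid = self.new_file()
        for k in live_keys:
            self.files[new_fid][k] = self.files[fid][k]
        self.remap[fid] = new_fid
        del self.files[fid]
        return new_fid

store = VLogStore()
f0 = store.new_file()
store.files[f0] = {"a": b"1", "b": b"2", "dead": b"x"}
store.gc(f0, live_keys=["a", "b"])
assert store.read(f0, "a") == b"1"      # old file ID still resolves
assert store.read(f0, "dead") is None   # garbage reclaimed
```

Under this scheme GC costs one sequential rewrite plus a small remap entry, and never issues the index queries/updates that Options A and B require.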
BitalosDB KV‑Separation Technology (Bithash)
File Structure
Data Write
Data Read
Index Write
When a single file’s write volume exceeds the Bithash file capacity threshold, the current Bithash file is closed and the in‑memory index is flushed to disk.
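The rotation rule above can be sketched in a few lines. The threshold value, record layout, and class name are all illustrative, not the engine's real format:

```python
# Sketch of the Bithash file-rotation rule: append records to the active
# file, and once write volume crosses the capacity threshold, seal the
# file and flush its in-memory index alongside it. Names and the record
# layout are illustrative only.

FILE_CAPACITY = 64  # bytes; a tiny threshold just for the example

class BithashWriter:
    def __init__(self):
        self.sealed = []          # (records, index) pairs, i.e. "on disk"
        self.active = bytearray() # current Bithash file
        self.index = {}           # in-memory index: key -> offset in file

    def write(self, key, value):
        record = key.encode() + b"=" + value
        self.index[key] = len(self.active)   # record offset before append
        self.active += record
        if len(self.active) >= FILE_CAPACITY:
            self._rotate()

    def _rotate(self):
        # close the current file and persist its index with it
        self.sealed.append((bytes(self.active), dict(self.index)))
        self.active = bytearray()
        self.index = {}

w = BithashWriter()
for i in range(10):
    w.write(f"key{i}", b"0123456789")
assert len(w.sealed) == 2           # two rotations at this tiny threshold
assert w.sealed[0][1]["key0"] == 0  # first key of each file sits at offset 0
```

Keeping the index per file means a sealed file and its index are immutable together, which is what makes the later flush‑to‑disk step cheap and crash‑safe.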
BitalosDB Index Technology (Bitalos‑Tree)
Prefix‑tree based hierarchical B+ tree
Layered Process
Each dashed box represents a B+ tree; for each Trie Layer, the key is sliced by M bytes for indexing. Keys sharing the same first M bytes reside in the same layer.
In the extreme case where every key is exactly M bytes long, all keys belong to Trie Layer 0. A key longer than 10 × M bytes is not automatically placed in Trie Layer 10; its placement depends on how long a prefix it shares with other keys.
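To make the layering rule concrete, here is a toy reading of it (my interpretation, not the engine's actual data structure): keys are cut into M‑byte slices, and a key descends one Trie Layer for each leading slice it shares with some other key.

```python
# Toy model of the prefix-tree layered B+ tree's slicing rule. The layer
# function below is an illustrative interpretation, not BitalosDB's code.

M = 4

def slices(key: bytes):
    """Cut a key into M-byte slices."""
    return [key[i:i + M] for i in range(0, len(key), M)]

def trie_layer(key: bytes, other_keys):
    """Layer = number of leading M-byte slices this key shares with at
    least one other key (0 if it shares none)."""
    best = 0
    for other in other_keys:
        if other == key:
            continue
        shared = 0
        for a, b in zip(slices(key), slices(other)):
            if a != b:
                break
            shared += 1
        best = max(best, shared)
    return best

keys = [b"userA001", b"userA002", b"blogX999", b"cart"]
# "userA001" and "userA002" share only the first slice b"user", so the
# key sits in Layer 1 even though it is 2*M bytes long.
assert trie_layer(b"userA001", keys) == 1
# A key of exactly M bytes with no shared prefix stays in Layer 0.
assert trie_layer(b"cart", keys) == 0
```

This matches the text's caveat: a long key only reaches a deep layer if other keys actually share its leading slices.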
Performance
Benchmark against RocksDB V7.6.0 (latest at the time).
Machine configuration: Intel Xeon Platinum 8255C CPU @ 2.50 GHz; 2 × 3.5 TB NVMe SSD (RAID 0).
Test settings: Cgroup 8 Core; Concurrency 8; Key 32 B, Value 1 KB (100% random).
Benchmarks on data sizes 25 GB, 50 GB, 100 GB show BitalosDB outperforming RocksDB.
Storage Service
High‑Performance Data Consistency Based on Raft
Deeply optimized standard Raft synchronization: Bitalos‑Server leverages batch processing, full‑async I/O, and parallel transmission, achieving more than threefold write performance improvement over the standard Raft protocol.
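The batching part of that optimization is easy to illustrate generically. The sketch below shows the amortization idea only, with a hypothetical `BatchingLeader`, and is not Bitalos‑Server's implementation:

```python
# Generic sketch of Raft write batching: instead of replicating each write
# as its own append, the leader coalesces queued entries into one batch
# per round trip, amortizing fsync and network cost across many writes.

from collections import deque

class BatchingLeader:
    def __init__(self, max_batch=64):
        self.pending = deque()   # proposals waiting for replication
        self.max_batch = max_batch
        self.log = []            # committed entries
        self.rounds = 0          # replication round trips performed

    def propose(self, entry):
        self.pending.append(entry)

    def replicate_once(self):
        """One round trip: ship up to max_batch pending entries together."""
        batch = []
        while self.pending and len(batch) < self.max_batch:
            batch.append(self.pending.popleft())
        if batch:
            self.log.extend(batch)   # stand-in for quorum ack + apply
            self.rounds += 1
        return len(batch)

leader = BatchingLeader(max_batch=64)
for i in range(200):
    leader.propose(f"set k{i}")
while leader.pending:
    leader.replicate_once()
assert len(leader.log) == 200
assert leader.rounds == 4    # 200 entries in ceil(200/64) = 4 round trips
```

Combined with fully asynchronous I/O and parallel transmission to followers, this kind of coalescing is what lets per‑write overhead drop far below one round trip per entry.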
Pre‑Election Technique for Raft
Standard election triggers a vote as soon as any follower’s heartbeat times out, which can affect write traffic even if the timeout is caused by transient network jitter.
Bitalos‑Server adds a pre‑election phase: when a follower’s heartbeat times out, it first attempts a pre‑election by contacting other followers to verify the leader’s status before launching a formal election.
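The pre‑election decision can be sketched as a simple majority poll. This is a simplified single‑process model of the PreVote idea, not the Bitalos‑Server code:

```python
# Sketch of the pre-election (PreVote) idea: on heartbeat timeout, a
# follower first asks its peers whether they have ALSO lost the leader.
# Only if a majority agrees does it start a formal election (and bump
# its term); otherwise the timeout is treated as local network jitter.

def pre_vote(peers_see_leader_alive, cluster_size):
    """peers_see_leader_alive: booleans reported by the other nodes."""
    # a peer grants the pre-vote only if it has also lost the leader
    grants = 1 + sum(1 for alive in peers_see_leader_alive if not alive)
    return grants > cluster_size // 2   # majority required to proceed

def on_heartbeat_timeout(peers_see_leader_alive, cluster_size):
    if pre_vote(peers_see_leader_alive, cluster_size):
        return "start real election"
    return "stay follower"   # transient jitter; leader is still healthy

# Local jitter: peers still see the leader -> no election, so write
# traffic is not disturbed by a spurious term bump.
assert on_heartbeat_timeout([True, True, True, True], 5) == "stay follower"
# Leader actually down: a majority lost it -> proceed to formal election.
assert on_heartbeat_timeout([False, False, True, True], 5) == "start real election"
```

The key property is that a node with a flaky local link never increments its term, so it cannot force the healthy leader to step down and interrupt writes.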
Multi‑Cloud Multi‑Master Technology Based on CRDT
Background: Zuoyebang’s services run across multiple clouds; the KV store must provide low write latency and high availability. Multi‑master writes across clouds can cause conflicts that must be resolved.
Requirements: Idempotence (a☆a = a), Commutativity (a☆b = b☆a), Associativity (a☆(b☆c) = (a☆b)☆c).
Solution
Idempotence: Each write log in a single‑cloud cluster gets a Raft‑log‑id; cross‑cloud synchronization follows Raft‑log‑id order, ensuring idempotent updates.
Commutativity & Associativity: Stateless value updates (e.g., set, hset) use LWW‑Register semantics; stateful updates (e.g., incrby) use Counter semantics; collection types (hash, set, zset) use OR‑Set semantics.
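The two value semantics above can be illustrated with textbook CRDT merges. These are toy versions of standard LWW‑Register and per‑site counter (G‑Counter style) merges, not Stored's implementation, but they show why the three algebraic properties make cross‑cloud sync order irrelevant:

```python
# Toy merges for the two CRDT semantics. LWW-Register keeps the write with
# the highest (timestamp, site_id); the counter keeps per-site increment
# totals and merges by taking the per-site maximum. Both merges are
# idempotent, commutative, and associative. Illustrative only.

def lww_merge(a, b):
    """a, b: (timestamp, site_id, value). Later write wins; ties by site."""
    return a if (a[0], a[1]) >= (b[0], b[1]) else b

def counter_merge(a, b):
    """a, b: dict site_id -> local increment total (e.g. from incrby)."""
    return {site: max(a.get(site, 0), b.get(site, 0))
            for site in set(a) | set(b)}

def counter_value(state):
    return sum(state.values())

# LWW: merging in either order yields the same winner (commutativity).
w1, w2 = (10, "cloudA", "x"), (12, "cloudB", "y")
assert lww_merge(w1, w2) == lww_merge(w2, w1) == w2
assert lww_merge(w1, w1) == w1                      # idempotence (a☆a = a)

# Counter: cloudA ran incrby 5, cloudB ran incrby 3; merging from either
# side converges to the same total of 8.
a, b = {"cloudA": 5}, {"cloudB": 3}
assert counter_value(counter_merge(a, b)) == 8
assert counter_merge(a, b) == counter_merge(b, a)   # commutativity
assert counter_merge(a, a) == a                     # idempotence
```

Because each merge is a join in the CRDT sense, every cloud can apply remote updates in whatever order they arrive and still converge to the same state.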
Conclusion
As Zuoyebang continues to grow, the demand for highly available and high‑performance NoSQL databases rises. The team pursues extreme performance by innovating I/O architecture and storage algorithms, aiming for higher write/read throughput and lower latency. Further technical details and optimizations will be shared in future articles.
Zuoyebang Tech Team
Sharing technical practices from Zuoyebang