
How ByteDance’s Abase Achieves Extreme High Availability in KV Storage

This article explains the evolution, architecture, and high‑availability solutions of ByteDance’s Abase KV storage system, detailing its multi‑write design, leaderless approach, multi‑region deployment, consistency mechanisms, performance optimizations, and real‑world metrics that support billions of requests per second.

Volcano Engine Developer Services

Abase Overview

Abase is ByteDance’s online key‑value (KV) storage originally built for recommendation services in 2016, later expanding to support most of the company’s products such as search, advertising, e‑commerce, Douyin, Feishu, and more.

Evolution and Scale

From a single‑cluster, high‑performance KV service, Abase has grown into a company‑wide online KV store that serves over 90% of ByteDance’s KV needs, running on more than 50,000 servers and sustaining billions of QPS over petabyte‑scale data.

Key characteristics include large capacity, high throughput, low latency, high availability, and easy scalability.

High‑Availability Challenges

Traditional master‑slave architectures suffer seconds‑long windows of write unavailability during leader failover, which cannot meet ByteDance’s millisecond‑level latency requirements. Slow nodes, hardware faults, and network anomalies degrade availability further.

Abase 2.0 Architecture and Solutions

Abase 2.0 adopts a Dynamo‑inspired leaderless multi‑write design that eliminates single points of failure. The system comprises three modules: a central coordinator, data nodes, and a proxy layer.

Clusters span multiple regions and availability zones (PODs), ensuring that replicas are never co‑located in the same POD, thus tolerating room‑level failures.
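
The POD‑disjoint placement constraint can be sketched as follows. This is a minimal illustration, not Abase’s actual placement code: the node and POD names, the greedy selection, and the `place_replicas` function are all hypothetical.

```python
# Hypothetical sketch of POD-aware replica placement: choose one node per
# POD (failure domain) so that no two replicas can be lost to a single
# room-level failure. Names and selection policy are illustrative.
from collections import defaultdict

def place_replicas(nodes, replica_count):
    """nodes: list of (node_id, pod_id) pairs. Returns replica_count
    node ids drawn from distinct PODs, or raises if too few PODs exist."""
    by_pod = defaultdict(list)
    for node_id, pod_id in nodes:
        by_pod[pod_id].append(node_id)
    if len(by_pod) < replica_count:
        raise ValueError("not enough distinct PODs for placement")
    # Take one node from each of the first replica_count PODs.
    pods = sorted(by_pod)[:replica_count]
    return [by_pod[pod][0] for pod in pods]

nodes = [("n1", "pod-a"), ("n2", "pod-a"), ("n3", "pod-b"), ("n4", "pod-c")]
print(place_replicas(nodes, 3))  # → ['n1', 'n3', 'n4'], one node per POD
```

A real placer would also balance load and capacity; the invariant that matters here is simply that the chosen nodes never share a POD.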

Each replica runs on a dedicated CPU core with its own thread, and namespaces partition data into logical tables and partitions, each replicated multiple times.

Consistency and Conflict Resolution

Writes are timestamped with globally unique hybrid logical clock (HLC) values, enabling a “last write wins” policy while supporting CRDT‑based conflict resolution for complex data structures.
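
The last‑write‑wins rule can be stated in a few lines. This is a minimal sketch assuming each write carries a `(hlc_timestamp, replica_id)` version pair, with the replica id breaking ties; the function name and tuple layout are illustrative, not Abase’s interfaces.

```python
# Minimal last-write-wins merge over conflicting replica versions.
# Each version is (hlc_timestamp, replica_id, value); the replica id
# breaks timestamp ties so every replica converges to the same winner.
def lww_merge(versions):
    return max(versions, key=lambda v: (v[0], v[1]))[2]

print(lww_merge([(10, 1, "a"), (12, 0, "b"), (12, 2, "c")]))  # → "c"
```

Because the winner depends only on the version pairs, the merge is deterministic regardless of the order in which replicas exchange writes.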

Quorum settings are configurable: quorum = 2 ensures data is persisted on two replicas before acknowledging success, while quorum = 1 favors latency by acknowledging after a single replica write.
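
The quorum trade‑off described above can be sketched with threads standing in for replica RPCs. This is an assumption‑laden illustration (the `quorum_write` function and in‑memory “replicas” are invented for the example), not Abase’s write path.

```python
# Sketch of a configurable write quorum: acknowledge the client once
# `quorum` replicas have persisted the write; the remaining replicas
# complete asynchronously. Threads stand in for replica RPCs.
import threading

def quorum_write(replicas, value, quorum):
    acked = threading.Semaphore(0)
    def write_one(replica):
        replica.append(value)   # "persist" on this replica
        acked.release()         # count one acknowledgement
    for r in replicas:
        threading.Thread(target=write_one, args=(r,)).start()
    for _ in range(quorum):     # block until `quorum` acks arrive
        acked.acquire()
    return "ok"                 # stragglers finish in the background

replicas = [[], [], []]
print(quorum_write(replicas, "v1", quorum=2))  # → "ok" once 2 of 3 persist
```

With `quorum=2` the caller waits for two durable copies; with `quorum=1` it returns after the first, trading durability guarantees for latency exactly as the article describes.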

Key Technologies

Multi‑write architecture eliminates leader switch latency and masks slow nodes.

Hybrid logical clock timestamps provide globally unique ordering without relying on synchronized physical clocks.
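
A hybrid logical clock combines wall‑clock time with a logical counter so timestamps track physical time yet never move backwards. The sketch below follows the standard HLC recipe; Abase’s exact scheme (granularity, encoding, uniqueness via replica ids) may differ.

```python
# Minimal hybrid logical clock: the wall component tracks physical time,
# the logical counter breaks ties and absorbs clock skew, so locally
# issued timestamps are strictly monotonic.
import time

class HybridLogicalClock:
    def __init__(self):
        self.wall = 0       # largest physical component seen so far
        self.logical = 0    # tie-breaking counter

    def now(self):
        pt = int(time.time() * 1000)   # physical milliseconds
        if pt > self.wall:
            self.wall, self.logical = pt, 0
        else:
            self.logical += 1          # same millisecond: bump counter
        return (self.wall, self.logical)

    def observe(self, remote):
        """Merge a timestamp received from another replica so that local
        time never runs behind it."""
        rw, rl = remote
        pt = int(time.time() * 1000)
        new_wall = max(self.wall, rw, pt)
        if new_wall == self.wall == rw:
            self.logical = max(self.logical, rl) + 1
        elif new_wall == self.wall:
            self.logical += 1
        elif new_wall == rw:
            self.logical = rl + 1
        else:
            self.logical = 0
        self.wall = new_wall
```

Comparing `(wall, logical)` pairs lexicographically yields a total order that respects causality without requiring the replicas’ physical clocks to be synchronized.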

Two‑layer KV engine separates multi‑version logs from a compacted single‑version store, using RocksDB or a ByteDance‑optimized variant.
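
The two‑layer split can be illustrated with a toy model: an append‑only multi‑version log absorbing writes, and a compaction step that folds the newest version of each key into a single‑version store. The class and method names are invented for this sketch; the real engine sits on RocksDB, not Python dicts.

```python
# Toy two-layer engine: writes append to a multi-version log; compaction
# keeps only the newest version per key in the single-version store.
class TwoLayerStore:
    def __init__(self):
        self.log = []      # (timestamp, key, value), append-only
        self.store = {}    # compacted layer: key -> (timestamp, value)

    def put(self, ts, key, value):
        self.log.append((ts, key, value))   # fast path: just append

    def compact(self):
        # Fold the log into the store, newest timestamp winning.
        for ts, key, value in self.log:
            if key not in self.store or ts > self.store[key][0]:
                self.store[key] = (ts, value)
        self.log.clear()

    def get(self, key):
        # A log entry, if present, is newer than the compacted store.
        hits = [(ts, v) for ts, k, v in self.log if k == key]
        if hits:
            return max(hits)[1]
        return self.store[key][1] if key in self.store else None
```

Keeping multiple versions in the log is what makes multi‑write conflict resolution possible; compaction bounds the space cost by discarding superseded versions.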

In‑memory indexing for logs enables point‑lookups despite log‑only storage.
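
The idea behind the in‑memory log index can be sketched as a hash map from key to the offset of its latest log entry, turning a would‑be log scan into an O(1) point lookup. This is an illustrative model, not Abase’s index structure.

```python
# Sketch of an in-memory index over an append-only log: the log stores
# entries sequentially, while a hash index maps each key to the offset
# of its latest version, so reads never scan the log.
class IndexedLog:
    def __init__(self):
        self.log = []     # append-only entry log
        self.index = {}   # key -> offset of latest version

    def append(self, key, value):
        self.index[key] = len(self.log)   # point index at new entry
        self.log.append((key, value))

    def get(self, key):
        off = self.index.get(key)
        return None if off is None else self.log[off][1]
```

Because the index holds only offsets, it stays small relative to the log and can be rebuilt by replaying the log after a restart.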

Performance Metrics and Q&A

Abase sustains aggregate QPS at the hundred‑billion level with P99 latency under 50 ms (10 ms for high‑priority clusters), over petabyte‑scale data. Bottlenecks vary by workload: large values cause write amplification, while small‑value workloads are CPU‑bound.

Future work includes exploring persistent memory (PMem) for low‑latency small‑value writes.

Tags: distributed-systems, high availability, KV storage, ByteDance, leaderless architecture
Written by

Volcano Engine Developer Services

The Volcano Engine Developer Community, Volcano Engine's TOD community, connects the platform with developers, offering cutting-edge tech content and diverse events, nurturing a vibrant developer culture, and co-building an open-source ecosystem.
