Design and Implementation of Pegasus: Xiaomi’s Distributed Key‑Value Store
This article explains why Xiaomi built Pegasus to replace HBase, describes its architecture—including MetaServer, ReplicaServer, partitioning, multi‑replica design and the PacificA consensus algorithm—covers implementation challenges such as load balancing, consistency, latency, testing, and outlines current status and future plans.
Pegasus is a distributed key‑value storage system open‑sourced by Xiaomi to address the stability and performance limitations they encountered with HBase, which relied heavily on Zookeeper for node health checks and stored data on an external DFS.
The design goals of Pegasus are to retain HBase’s strong points—consistent view and dynamic scaling—while improving latency, availability, and scalability. The system adopts a centralized management model with a MetaServer (similar to HBase’s Master) and ReplicaServers (similar to RegionServers), uses hash‑based partitioning, and stores data directly in ReplicaServers rather than a third‑party DFS.
Key architectural differences include moving heartbeat management from Zookeeper to MetaServer, maintaining three replicas per partition across different ReplicaServers, and initially keeping MetaServer high‑availability on Zookeeper with a future plan to replace it with Raft. Consistency across replicas is achieved with the PacificA algorithm, which the authors claim is easier to implement than Raft.
Write requests are routed through MetaServer to locate the key’s partition, then sent to the primary ReplicaServer, which synchronizes the update to secondaries before acknowledging the client. Read requests go directly to the primary, which holds the full data.
Implementation challenges discussed include partition schema selection (hash vs. sorted), load‑balancing primary and secondary roles across servers, handling Zookeeper failures, and ensuring strong consistency while minimizing latency. The authors describe a deterministic testing framework inspired by Microsoft’s rDSN that injects pseudo‑random I/O errors and controls execution order to reproduce rare bugs.
Pegasus is written in C++ for performance reasons, with considerations for possible Rust adoption. The system employs an event‑driven, lock‑free architecture where all state‑changing operations are queued and processed by a single thread per consistency object, using thread pools to limit thread creation.
Current status: Pegasus has been stable in Xiaomi’s production for about a year, serving multiple services. Future work includes exposing cold‑backup, partition splitting, and further multi‑datacenter redundancy features. The project, its documentation, and related research papers are available on GitHub.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.