Reinventing WeChat’s Distributed Storage with Paxos: Inside the Memory Cloud Upgrade

This article details how WeChat transformed its massive memory‑cloud storage by replacing the QuorumKV NWR protocol with a non‑lease Paxos design, optimizing PaxosLog, adopting DirectIO on HDDs, and implementing operational safeguards, resulting in dramatically lower latency, higher availability, and reduced failure rates.

WeChat Backend Team

Background

WeChat’s QuorumKV is a distributed storage system that serves core backend services such as accounts, user information, social graph, and Moments. Inspired by Google Megastore, the team redesigned the system as PaxosStore and migrated thousands of machines, achieving an order‑of‑magnitude improvement in first‑access success rate and disaster‑recovery capability.

Challenges with QuorumKV

QuorumKV implements an NWR protocol (N=3, W=2, R=2) that reaches quorum on version numbers rather than on the data itself, and this design has two major weaknesses:

Data is written to one replica synchronously and replicated to the others asynchronously.

If the write‑read quorum cannot be formed, the key becomes unavailable and must be repaired before it can be served again.

These two problems account for most daily failures, especially in write‑heavy modules, whenever machines go offline or replica versions diverge. A minimal sketch of this write path follows.
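
To make the failure mode concrete, here is a minimal sketch of a version‑level quorum write under the assumptions above (three replicas, W=2); the `Replica` interface, method names, and always‑succeeding stubs are illustrative, not QuorumKV's actual code.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical replica interface; stubs stand in for real RPCs in this sketch.
struct Replica {
    bool WriteVersion(const std::string&, uint64_t) { return true; }                    // persist version only
    bool WriteData(const std::string&, uint64_t, const std::string&) { return true; }   // persist version + data
};

// NWR write with N=3, W=2: the data lands on one replica synchronously, only the
// version number must reach a second replica before the write is acknowledged,
// and the full data reaches the remaining replicas asynchronously (not shown).
// If the version quorum later cannot be formed (one replica offline, another
// holding a stale version), the key is unreadable until it is repaired.
bool QuorumWrite(std::vector<Replica>& replicas, const std::string& key,
                 uint64_t newVersion, const std::string& value) {
    int versionAcks = 0;
    bool dataWritten = false;
    for (auto& r : replicas) {
        if (!dataWritten) {
            dataWritten = r.WriteData(key, newVersion, value);
            if (dataWritten) ++versionAcks;
        } else if (r.WriteVersion(key, newVersion)) {
            ++versionAcks;                                  // only the version is replicated synchronously
        }
        if (dataWritten && versionAcks >= 2) return true;   // W = 2 reached
    }
    return false;   // quorum not formed: the key needs repair before reads succeed
}
```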

Adopting a Non‑Lease Paxos Design

To guarantee strong consistency while maximizing availability, the team selected a lease‑free Paxos design, avoiding the periods of unavailability that lease‑based leader election introduces when a leader fails.

PaxosStore comprises two main components: PaxosCertain and PaxosKV. The PaxosKV component, used by the memory cloud, consists of 1,912 lines of rigorously tested code and powers multiple key‑value services.

PaxosLog Optimizations

Two key refinements were applied to PaxosLog (PLog):

PLog as DB: Merge the log and the database so that each key has a 1:1 relationship with a log entry, eliminating duplicate writes.

Keep only the latest LogEntry: Retain only the most recent entry, enabling log compaction on every write and simplifying catch‑up without full snapshots.

These changes reduce write latency, cut storage overhead, and allow fast log catch‑up.
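
The sketch below illustrates what these two refinements imply for the per‑key record; the class and field names (`PLogEntry`, `PLogAsDB`, `ApplyChosen`) are assumptions for illustration, not PaxosStore's actual layout.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>

// Illustrative record layout: one entry per key, holding the latest chosen
// value together with the Paxos acceptor state for the next log position.
struct PLogEntry {
    uint64_t    entryId  = 0;   // log position of the chosen value
    std::string value;          // latest chosen value (the "DB" part)
    uint64_t    promised = 0;   // highest proposal number promised for entryId + 1
    uint64_t    accepted = 0;   // highest proposal number accepted for entryId + 1
    std::string pending;        // value accepted but not yet chosen, if any
};

// Because only the latest entry is kept, "compaction" is just an in-place
// overwrite on every successful write, and a lagging replica catches up by
// copying this single entry instead of replaying a log or shipping a snapshot.
class PLogAsDB {
public:
    void ApplyChosen(const std::string& key, uint64_t entryId, std::string value) {
        PLogEntry& e = table_[key];
        e.entryId = entryId;
        e.value   = std::move(value);
        e.promised = e.accepted = 0;   // reset Paxos state for the next position
        e.pending.clear();
    }
    const PLogEntry* Get(const std::string& key) const {
        auto it = table_.find(key);
        return it == table_.end() ? nullptr : &it->second;
    }
private:
    std::unordered_map<std::string, PLogEntry> table_;   // in-memory stand-in for the store
};
```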

The leader for each log position is a distinguished replica chosen alongside the preceding log position’s consensus value. The leader arbitrates which value may use proposal number zero. The first writer to submit a value to the leader wins the right to ask all replicas to accept that value as proposal number zero. All other writers must fall back on two‑phase Paxos.
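
Read as code, this arbitration might look like the sketch below; `EntryLeader` and its methods are hypothetical names used only to contrast the proposal‑number‑zero fast path with the two‑phase fallback.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Sketch of the proposal-number-0 arbitration; names and fields are illustrative.
class EntryLeader {
public:
    EntryLeader(uint32_t selfId, uint32_t leaderId) : isLeader_(selfId == leaderId) {}

    // Returns proposal number 0 for the first value submitted to the leader,
    // letting that writer ask all replicas to accept it directly (no prepare phase).
    // Everyone else gets std::nullopt and must run classic two-phase Paxos.
    std::optional<uint64_t> TryFastPath(const std::string& value) {
        if (!isLeader_ || fastPathTaken_) return std::nullopt;
        fastPathTaken_ = true;
        fastPathValue_ = value;
        return 0;   // proposal number 0 reserved for this value
    }

private:
    bool        isLeader_;
    bool        fastPathTaken_ = false;
    std::string fastPathValue_;
};
```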

Strong‑Consistency Read/Write Protocols

Read protocol: the reader asks the replicas to broadcast their PLog state; if a majority reports the same chosen entry, that entry is guaranteed to be the latest and can be returned directly. Write protocol: relative to QuorumKV, the optimizations save one disk write, two message sends, and two message receives per write, as shown in the performance comparison.
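
A minimal sketch of the read‑side majority check, using illustrative types (`PLogState`, `MajorityChosenEntry`) rather than PaxosStore's real interfaces:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <vector>

// One replica's report of its PLog head for a key; illustrative fields only.
struct PLogState {
    uint64_t chosenEntryId;   // highest entry this replica knows to be chosen
};

// If a majority of replicas report the same chosen entry id, that entry is the
// latest and can be served directly; otherwise the reader falls back to a slow
// path (a Paxos round or catch-up) to learn the latest value.
std::optional<uint64_t> MajorityChosenEntry(const std::vector<PLogState>& reports,
                                            size_t totalReplicas) {
    const size_t majority = totalReplicas / 2 + 1;
    std::map<uint64_t, size_t> votes;
    for (const auto& s : reports) {
        if (++votes[s.chosenEntryId] >= majority) {
            return s.chosenEntryId;   // safe to serve this entry as the latest
        }
    }
    return std::nullopt;              // no majority agreement: slow path required
}
```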

DirectIO on Mechanical Disks

A DirectIO storage layer was built to guarantee that data is persisted safely. Data is laid out in 4 KB blocks addressed by a BlockID, so disk space is reused by overwriting blocks in place rather than deleting files, and merge‑write speed is throttled to avoid performance spikes.
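
The sketch below shows the kind of 4 KB aligned block write that DirectIO (O_DIRECT on Linux) requires; the file name, the `WriteBlock` helper, and the BlockID scheme are illustrative assumptions, and the merge‑write throttling is omitted.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE              // needed for O_DIRECT on Linux
#endif
#include <fcntl.h>
#include <unistd.h>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>

constexpr size_t kBlockSize = 4096;   // O_DIRECT needs 4 KB aligned buffers and offsets

// Writes one block at the position named by blockId. Because blocks are
// addressed by id and overwritten in place, freed space is reused without
// ever deleting or truncating files.
bool WriteBlock(int fd, uint64_t blockId, const void* data, size_t len) {
    if (len > kBlockSize) return false;
    void* buf = nullptr;
    if (posix_memalign(&buf, kBlockSize, kBlockSize) != 0) return false;
    std::memset(buf, 0, kBlockSize);
    std::memcpy(buf, data, len);
    ssize_t n = pwrite(fd, buf, kBlockSize,
                       static_cast<off_t>(blockId * kBlockSize));
    free(buf);
    return n == static_cast<ssize_t>(kBlockSize);
}

int main() {
    int fd = open("plog.data", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    const char payload[] = "chosen log entry bytes";
    bool ok = WriteBlock(fd, /*blockId=*/42, payload, sizeof(payload));
    close(fd);
    return ok ? 0 : 1;
}
```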

Operational Enhancements

To keep all PLogs aligned, the system employs asynchronous catch‑up, a three‑level timeout queue for pending entries, and full‑data verification. The LearnerOnly mode allows a node to receive only already‑chosen entries, keeping it out of Paxos voting during Byzantine‑like failure scenarios.
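
A minimal sketch of how a LearnerOnly gate might work, with hypothetical message types and class names; the real protocol's message set and repair workflow are more involved.

```cpp
// Illustrative message kinds; the real protocol carries more state.
enum class MsgType { kPrepare, kAccept, kChosenNotify };

// While a node's local state is suspect (e.g. it is being repaired), it still
// learns values that are already chosen, but it refuses to vote, so it can
// never contribute a bad promise or accept to a quorum.
class ReplicaRole {
public:
    void EnterLearnerOnly() { learnerOnly_ = true; }
    void LeaveLearnerOnly() { learnerOnly_ = false; }

    bool ShouldHandle(MsgType type) const {
        if (!learnerOnly_) return true;            // normal replica: handle everything
        return type == MsgType::kChosenNotify;     // LearnerOnly: only learn chosen entries
    }

private:
    bool learnerOnly_ = false;
};
```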

Results

Stress tests (120 B values, read/write ratio 3.3:1) on 64 GB memory machines with HDDs show the new architecture reduces average latency and final failure rates compared to the old QuorumKV design, while using only six machines instead of nine.

During a half‑hour network outage, PaxosStore’s first‑access failure rate remained flat, whereas QuorumKV’s failure rate spiked, demonstrating superior disaster tolerance.

After migrating a city‑wide deployment, the first‑access success rate improved from five‑nines to six‑nines, confirming the reliability gains.
