How PhxQueue Achieves High‑Throughput, High‑Reliability Distributed Queuing with Paxos

PhxQueue, an open‑source, Paxos‑based distributed queue from WeChat, delivers at‑least‑once delivery, synchronous disk flushing, strict ordering, multi‑subscription, and high availability, outperforming Kafka in reliability and latency while maintaining comparable throughput, as demonstrated through detailed design, performance, and failover analyses.

WeChat Backend Team

Open source address: https://github.com/Tencent/phxqueue

PhxQueue is an open‑source, Paxos‑based distributed queue developed by WeChat that provides high availability, high throughput, and high reliability with At‑Least‑Once delivery, and is widely used internally for critical services such as WeChat Pay and the public platform.

Message Queue Overview

Message queues, as a mature asynchronous communication pattern, offer several advantages over synchronous communication:

Decoupling: heavy direct API integration puts system stability at risk; careless callers can overload providers, and a misbehaving provider can drag down its callers’ responsiveness. A queue isolates the two sides.

Peak‑shaving and flow control: producers never block, and bursty messages are buffered in the queue while consumers read at their own pace.

Reuse: one publish can serve multiple subscribers.

Background of PhxQueue

Legacy Queue

The original distributed queue (referred to as the legacy queue) was a self‑developed core component of WeChat’s backend, providing decoupling, caching, and asynchronous capabilities across many business scenarios.

The legacy queue used Quorum NRW as its synchronization mechanism (N=3, W=R=2) with asynchronous disk flushing, balancing performance and availability.
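
For context, the correctness of NRW rests on a simple overlap condition: as long as W + R > N, any read quorum intersects any write quorum, so a read always touches at least one replica holding the latest write. A minimal sketch (illustrative C++, not PhxQueue code):

// Quorum NRW correctness condition: every read set of R replicas
// must intersect every write set of W replicas, i.e. W + R > N,
// so a read always sees at least one replica with the latest write.
constexpr bool QuorumOverlaps(int n, int w, int r) { return w + r > n; }

int main() {
    // The legacy queue's configuration: N = 3, W = R = 2 (2 + 2 > 3).
    static_assert(QuorumOverlaps(3, 2, 2), "read/write quorums must intersect");
    // Overlap guarantees visibility of the latest value, not a total
    // order across replicas -- which is why NRW cannot provide the
    // strict ordering discussed later.
    return 0;
}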

New Requirements

As business grew, the legacy queue showed several shortcomings:

Asynchronous flushing raised data reliability concerns

For payment‑related services, data reliability is paramount. Most distributed queue solutions rely on synchronous replication plus asynchronous flushing, but we believe synchronous flushing is needed to further improve reliability.

Message ordering issues

Some services require strict ordering, which NRW cannot guarantee.

Other gaps, such as message deduplication and consumer load balancing, also motivated a new design.

Shortcomings of Existing Solutions

Kafka, a popular message queue in the big‑data domain, offers high throughput, automatic failover, and ordered enqueue/dequeue, but it falls short in scenarios demanding strong data reliability:

Conflict Between Performance and Synchronous Flushing

Enabling synchronous flushing (log.flush.interval.messages=1) dramatically reduces Kafka’s throughput due to SSD write amplification and limited effectiveness of producer batching in business‑level workloads.

SSD write amplification

Typical business messages are around 1 KB, while SSDs write in 4 KB pages. Flushing sub‑4 KB messages causes the physical write size to be several times larger than the logical message size, wasting bandwidth.
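
The effect is easy to quantify: a synchronous flush writes whole pages, so persisting a 1 KB message costs a full 4 KB physical write, a 4x amplification. A small illustrative calculation (sizes taken from the text, not measured values):

#include <cstdio>

// Physical bytes written when flushing one message of msg_bytes,
// given that the SSD persists data in whole pages of page_bytes.
long PhysicalWrite(long msg_bytes, long page_bytes) {
    long pages = (msg_bytes + page_bytes - 1) / page_bytes;  // round up
    return pages * page_bytes;
}

int main() {
    long msg = 1024, page = 4096;
    long physical = PhysicalWrite(msg, page);  // 4096 bytes on disk
    std::printf("amplification: %.1fx\n",
                static_cast<double>(physical) / msg);  // prints 4.0x
    return 0;
}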

Producer batch ineffective for business workloads

In business scenarios each request has its own context, making batching difficult. Even with a proxy layer, the large number of proxy nodes prevents effective batch reduction, so write amplification remains.

Kafka Replica Synchronization Design Issues

Kafka’s replica synchronization relies on the ISR (in‑sync replica) list. If a follower fails or lags, the leader removes it from ISR, which can degrade availability:

Broker failover success rate drops sharply

With three replicas, a single broker failure takes down the leaders of roughly one third of the partitions; reads and writes to those partitions fail until new leaders are elected and the ISR is updated.

Synchronous latency depends on the slowest node

In synchronous replication, the system must wait for acknowledgments from all nodes.
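
The contrast with a majority quorum is worth making concrete: waiting on every replica is bounded by the slowest round trip, while a Paxos commit only needs acknowledgments from a majority. A toy sketch with synthetic latencies (illustrative only):

#include <algorithm>
#include <cstdio>
#include <vector>

// Commit latency when the leader must wait for acks from `needed`
// replicas, given each replica's round-trip time.
double CommitLatencyMs(std::vector<double> rtts, size_t needed) {
    std::sort(rtts.begin(), rtts.end());
    return rtts[needed - 1];  // time until the needed-th ack arrives
}

int main() {
    // Synthetic per-replica RTTs; one replica is slow.
    std::vector<double> rtts = {1.0, 1.2, 25.0};
    // Waiting on every node is bounded by the slowest replica;
    // a Paxos majority (2 of 3) is not.
    std::printf("all nodes: %.1f ms\n", CommitLatencyMs(rtts, 3));  // 25.0
    std::printf("majority : %.1f ms\n", CommitLatencyMs(rtts, 2));  // 1.2
    return 0;
}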

Comparing Kafka’s replica model with Paxos, we conclude that Paxos offers a better synchronization approach.

Thus, we rebuilt the legacy queue using Paxos for synchronous logic and added multiple optimizations, resulting in PhxQueue.

PhxQueue Introduction

PhxQueue is widely used inside WeChat for services such as WeChat Pay and the public platform, handling billions of enqueues per day with peak rates of hundreds of millions per minute.

Its design goals are high data reliability, high availability, and high throughput, while supporting common queue features.

Key features include:

Synchronous flushing that guarantees no data loss, with built‑in real‑time reconciliation.

Strict ordering of enqueue and dequeue.

Multiple subscriptions.

Dequeue rate limiting.

Message replay.

Parallel scalability of all modules.

Batch flushing and synchronization in the storage layer for high throughput.

Support for multi‑datacenter deployment in the storage layer.

Automatic storage‑layer failover and load balancing.

Consumer automatic failover and load balancing.

PhxQueue Design

Overall Architecture

PhxQueue consists of five modules:

Store – Queue Storage

The Store module uses the PhxPaxos library to achieve replica synchronization via the Paxos protocol, providing linearizable reads and writes as long as a majority of nodes are operational.

Synchronous flushing is enabled by default without sacrificing performance.

Each Paxos group has a master that handles reads/writes; masters are dynamically balanced across Store nodes, and failover automatically promotes another node to master.

Producer – Message Producer

Producers route messages based on a key; messages with the same key are stored in the same queue, ensuring that dequeue order matches enqueue order.
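
A minimal sketch of this key-based routing (hypothetical helper names, not the actual PhxQueue API): because every message with the same key lands in the same queue, ordering within a key reduces to ordering within one queue.

#include <cstdio>
#include <functional>
#include <string>

// All messages with the same key map to the same queue, so their
// dequeue order matches their enqueue order within that key.
size_t RouteToQueue(const std::string& key, size_t num_queues) {
    return std::hash<std::string>{}(key) % num_queues;
}

int main() {
    // Two transfers for the same account always share a queue.
    std::printf("%zu\n", RouteToQueue("account:42", 128));
    std::printf("%zu\n", RouteToQueue("account:42", 128));  // same queue
    return 0;
}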

Consumer – Message Consumer

Consumers pull messages in batches from the Store and can process them concurrently using multiple coroutines. They are implemented as services where users provide callbacks to handle messages per topic and handler type.
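
In outline, using the Consumer amounts to registering a handler per topic and letting the framework drive batched pulls and dispatch. The sketch below uses hypothetical names and omits coroutine scheduling; it is not the real PhxQueue interface:

#include <functional>
#include <map>
#include <string>
#include <vector>

// Sketch of a callback-driven consumer: users register one handler
// per topic; the framework pulls batches from the Store and invokes
// the handler for each message (on coroutines, in practice).
using Handler = std::function<void(const std::string& payload)>;

class ConsumerSketch {
public:
    void AddHandler(const std::string& topic, Handler h) {
        handlers_[topic] = std::move(h);
    }
    // Called by the framework's pull loop with a batch of messages.
    void Dispatch(const std::string& topic,
                  const std::vector<std::string>& batch) {
        auto it = handlers_.find(topic);
        if (it == handlers_.end()) return;
        for (const auto& msg : batch) it->second(msg);  // concurrent in practice
    }
private:
    std::map<std::string, Handler> handlers_;
};

int main() {
    ConsumerSketch c;
    c.AddHandler("payment", [](const std::string& m) { (void)m; /* handle message */ });
    c.Dispatch("payment", {"msg1", "msg2"});
    return 0;
}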

Scheduler – Consumer Manager (optional)

Scheduler collects global load information from Consumers and performs failover and load balancing. If dynamic balancing is not needed, the Scheduler can be omitted; Consumers then rely on statically configured weights to determine which queues they handle.

When deployed, the Scheduler leader maintains heartbeats with all Consumers, adjusting their assignments as needed. If the Scheduler leader fails, a distributed lock service elects a new leader; during this period, Consumer failover and load balancing are temporarily unavailable, but normal consumption continues.

Lock – Distributed Lock (optional)

Lock provides a generic distributed lock service. It is required when Scheduler is deployed to elect the Scheduler leader and to prevent multiple Consumers from processing the same queue simultaneously.

If the business is tolerant of duplicate consumption, Lock can be omitted.
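
Conceptually, a Consumer takes a lease-style lock named after the queue before pulling from it, so two Consumers cannot process the same queue at once. A hedged sketch with a hypothetical, stubbed lock client (not the real Lock API):

#include <string>

// Hypothetical lease-based lock client (illustrative stub only).
class LockClient {
public:
    // Returns true if the caller now holds lock_name for lease_ms.
    bool TryAcquire(const std::string& lock_name, int lease_ms) {
        (void)lock_name; (void)lease_ms;
        return true;  // stub: a real client would call the Lock service
    }
    void Release(const std::string& lock_name) { (void)lock_name; }
};

void ConsumeQueue(LockClient& lock, int queue_id) {
    const std::string name = "queue-" + std::to_string(queue_id);
    // Without this lock, a rebalance could briefly let two Consumers
    // pull the same queue, causing duplicate consumption.
    if (!lock.TryAcquire(name, /*lease_ms=*/5000)) return;
    // ... pull and handle messages while periodically renewing the lease ...
    lock.Release(name);
}

int main() {
    LockClient lock;
    ConsumeQueue(lock, 7);
    return 0;
}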

Store Replication Process

The Store replicates using the PhxPaxos protocol, which consists of three layers: the app layer handling business requests, the Paxos layer executing the consensus algorithm, and the state machine layer updating business state.

The app layer proposes Paxos entries; the Paxos layer reaches agreement across nodes; the state machine consumes the agreed log to transition state, ensuring strong consistency.

Mapping queue concepts to Paxos:

Queue entries correspond directly to Paxos logs; the state machine stores the log sequence.

Strictly increasing instance IDs serve as queue offsets.

Data preceding the read offset can be safely deleted, analogous to checkpoints.

Thus, the queue state machine aligns well with Paxos.
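
The mapping can be sketched as a state machine whose only transition is appending the value chosen at the next instance ID; the consumer cursor is itself an instance ID, and entries below the checkpoint can be reclaimed. Illustrative types only, not the PhxPaxos API:

#include <cstdint>
#include <deque>
#include <string>

// Sketch: the queue state machine applies each chosen Paxos value
// in instance-ID order, so instance IDs double as queue offsets.
class QueueStateMachine {
public:
    // Called once per chosen Paxos instance, strictly in order.
    void Execute(uint64_t instance_id, const std::string& value) {
        entries_.push_back(value);
        next_offset_ = instance_id + 1;
    }
    // Entries before the consumer cursor act like a checkpoint and
    // can be deleted safely.
    void TruncateBefore(uint64_t cursor) {
        while (checkpoint_ < cursor && !entries_.empty()) {
            entries_.pop_front();
            ++checkpoint_;
        }
    }
private:
    std::deque<std::string> entries_;
    uint64_t next_offset_ = 0;   // next Paxos instance ID / queue offset
    uint64_t checkpoint_ = 0;    // lowest retained offset
};

int main() {
    QueueStateMachine sm;
    sm.Execute(0, "m0");
    sm.Execute(1, "m1");
    sm.TruncateBefore(1);  // consumer cursor has passed offset 0
    return 0;
}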

Store Group Commit – Efficient Flushing and Replication

Unoptimized Paxos does not solve write amplification caused by synchronous flushing, and its replication efficiency is lower than Kafka’s streaming batch approach.

Kafka batches replication per replica, while Paxos synchronizes per log entry, incurring one RTT plus one flush per entry.

In multi‑DC deployments with 4 ms of ping latency, one RTT per entry caps a single Paxos group’s theoretical maximum at 250 TPS (1 s ÷ 4 ms).

We deploy multiple Paxos groups and use Group Commit to address both write amplification and throughput limitations.

Multiple Paxos groups act as Group Commit units; each group aggregates enqueues from several queues over a time window or size threshold before triggering a single Paxos synchronization and synchronous flush, reducing overhead.
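
A minimal sketch of the Group Commit trigger (illustrative thresholds, hypothetical ProposeBatch call): requests accumulate in a buffer, and a single Paxos round plus a single flush covers the whole batch once either the size threshold or the time window is reached.

#include <chrono>
#include <string>
#include <vector>

// Sketch of a Group Commit unit: buffer enqueues from many queues,
// then propose the whole batch as one Paxos value and flush once.
class GroupCommitter {
public:
    GroupCommitter(size_t max_batch, std::chrono::milliseconds window)
        : max_batch_(max_batch), window_(window),
          opened_(std::chrono::steady_clock::now()) {}

    void Add(std::string msg) {
        buffer_.push_back(std::move(msg));
        if (ShouldCommit()) Commit();
    }

private:
    bool ShouldCommit() const {
        return buffer_.size() >= max_batch_ ||
               std::chrono::steady_clock::now() - opened_ >= window_;
    }
    void Commit() {
        // One Paxos round + one synchronous flush amortized over
        // buffer_.size() messages, instead of one per message.
        // ProposeBatch(buffer_);  // hypothetical Paxos call
        buffer_.clear();
        opened_ = std::chrono::steady_clock::now();
    }
    size_t max_batch_;
    std::chrono::milliseconds window_;
    std::chrono::steady_clock::time_point opened_;
    std::vector<std::string> buffer_;
};

int main() {
    GroupCommitter gc(/*max_batch=*/100, std::chrono::milliseconds(1));
    for (int i = 0; i < 250; ++i) gc.Add("msg");
    return 0;
}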

Compared with Kafka’s producer batching, this storage‑layer aggregation offers two benefits:

Business code does not need to manage batching.

Aggregating at the Paxos group level yields better batch efficiency than aggregation higher in the stack: there are far fewer Paxos groups than producers or proxy nodes, so each group sees enough traffic to form large batches.

PhxQueue vs. Kafka Comparison

We compare PhxQueue and Kafka across design, performance, and storage‑layer failover.

Design Comparison

Both systems share a similar architecture, but PhxQueue introduces unique design choices to improve reliability and latency under the same data‑reliability assumptions (minority node failures do not cause data loss).

Performance Comparison

Test Environment

CPU: 64 x Intel(R) Xeon(R) E5-2620 v3 @ 2.40 GHz
Memory: 64 GB
Network: 10 Gigabit Ethernet
Disk: SSD, RAID 10
Cluster nodes: 3
Ping latency: 1 ms

Test Benchmarks and Configuration

(Benchmark parameters and configuration are not reproduced in this republication.)

Test Results

With Producer Batch enabled: (results not reproduced in this republication)

Without Producer Batch: (results not reproduced in this republication)

In these scenarios, PhxQueue’s bottleneck is CPU usage (70‑80%).

Summary

PhxQueue’s performance is on par with Kafka.

For the same QPS, PhxQueue’s average latency is slightly better because it does not wait for the slowest node.

When Producer Batch is disabled, PhxQueue can achieve up to twice Kafka’s performance under synchronous flushing, thanks to storage‑layer batching that reduces write amplification.

Storage‑Layer Failover Comparison

We evaluate the impact on overall throughput when a storage node is killed.

Kafka

During failover, enqueue success rate drops to 0‑33%.

Failover duration is governed by the controller lease (default 10 s).

Test procedure: increase replica.lag.time.max.ms to 60 s, kill Broker 0, and observe ISR changes.

Phase 1 (before kill):
Topic: test-dis-p100 Partition: 96 Leader: 0 Replicas: 0,1,2 Isr: 1,0,2
...
Phase 2 (kill Broker 0, 8 s):
... Isr unchanged (writes blocked)
Phase 3 (≈1 min):
... ISR shrinks, writes succeed only on remaining leader
Phase 4 (recovery):
... ISR restored, throughput resumes

Kafka’s leader election and ISR management cause a temporary loss of write capability, leading to severe throughput degradation.

PhxQueue

During failover, enqueue success rate drops only to 66%.

Failover duration is controlled by a lease (default 5 s).

Enabling the “queue‑retry” feature keeps success rate above 90% during failover.

Test procedure: extend Store master lease to 60 s, kill Store 0, and measure producer success rates.

Without queue‑retry:

>>> kill store 0 here (success rate drops)
-- total: 192323
-- qps: 19203.49
-- retcode 0: 65.47% success (66% success rate)
... (details omitted)
-- after recovery: 100% success

With queue‑retry enabled:

>>> kill store 0 here (success rate drops)
-- total: 134752
-- qps: 13473.85
-- retcode 0: 94.56% success (94% success rate)
... (details omitted)
-- after recovery: 100% success

Overall Observation

During storage failover, PhxQueue maintains 66‑100% success, whereas Kafka drops to 0‑33%.

With queue‑retry, PhxQueue’s success stays above 90%.

Both systems eventually restore full success after master election.

Conclusion

PhxQueue makes extensive efforts at the storage layer: it implements automatic master switching while preserving linearizability, remains highly available during switches, and delivers synchronous‑flush throughput comparable to asynchronous flushing.

It also provides many practical queue features such as strict ordering, multi‑subscription, rate limiting, and message replay, suitable for a wide range of business scenarios.

PhxQueue is now used at massive scale within WeChat and has been open‑sourced. We will keep the open‑source version in sync with the internal implementation and welcome feedback from users.

Open source address: https://github.com/Tencent/phxqueue
