Design and Practices of Qunar's Self‑Developed High‑Availability Message Middleware

This article summarizes Qunar's architecture and practical experience in building a self-developed, high-availability message middleware. It covers the middleware's role in transaction processing, consistency guarantees, fault-tolerance mechanisms, isolation, monitoring, and consumer design, along with the trade-offs and operational considerations involved.

Qunar Tech Salon

Qunar's infrastructure team introduced the design of its self-developed message queue middleware, which underpins all transaction flows such as order and payment processing. The system was built in 2012 to meet the need for a reliable message-driven architecture after large monolithic services were split up.

The middleware follows a simple three-component model of producer, broker, and consumer. Each message is the smallest unit, which makes it easy to scale via hash-based partitioning. To guarantee consistency, the team uses a single-transaction approach: order data and the corresponding message are persisted in the same database transaction, and the message is sent only after the transaction commits.
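The single-transaction idea can be sketched in a few lines. This is a minimal illustration, not Qunar's implementation: it uses SQLite in place of the production database, and the table names (`orders`, `outbox`), the subject `order.created`, and the functions `create_schema`, `place_order`, and `pending_messages` are all hypothetical.

```python
import json
import sqlite3


def create_schema(conn):
    """Create a business table and a message table in the same database."""
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL NOT NULL)")
    conn.execute(
        "CREATE TABLE outbox ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " subject TEXT NOT NULL,"
        " payload TEXT NOT NULL,"
        " sent INTEGER NOT NULL DEFAULT 0)"
    )


def place_order(conn, order_id, amount, fail=False):
    """Persist the order and its message atomically.

    `with conn` commits if the block succeeds and rolls back if it raises,
    so the order row and the message row are stored together or not at all.
    """
    with conn:
        conn.execute("INSERT INTO orders (id, amount) VALUES (?, ?)", (order_id, amount))
        payload = json.dumps({"order_id": order_id, "amount": amount})
        conn.execute(
            "INSERT INTO outbox (subject, payload) VALUES (?, ?)",
            ("order.created", payload),
        )
        if fail:  # hypothetical switch to simulate a business failure
            raise RuntimeError("simulated business failure")


def pending_messages(conn):
    """Messages that were committed but not yet handed to the broker."""
    rows = conn.execute(
        "SELECT id, subject, payload FROM outbox WHERE sent = 0 ORDER BY id"
    )
    return rows.fetchall()
```

Only rows that survive the commit are ever visible to the sender, which is what ties message dispatch to a successful business operation.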

When a message fails to send, a background task retries delivery, so that a successful business operation is always eventually followed by a dispatched message. This design prioritizes reliability and stability over raw performance, acknowledging the high cost of message loss in e-commerce.
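A single pass of such a retry task might look like the sketch below. It is an assumption-laden illustration: the `PendingMessage` record, the `retry_pending` function, the `max_attempts` cap, and the use of `ConnectionError` to signal a failed send are hypothetical, and an in-memory list stands in for the persisted message table.

```python
from dataclasses import dataclass


@dataclass
class PendingMessage:
    message_id: int
    payload: str
    attempts: int = 0
    sent: bool = False


def retry_pending(store, send, max_attempts=5):
    """Run one pass of the background task over all stored messages.

    `send` is any callable that raises ConnectionError when the broker
    cannot be reached. Returns the ids delivered during this pass.
    """
    delivered = []
    for msg in store:
        if msg.sent or msg.attempts >= max_attempts:
            continue  # already delivered, or given up for manual handling
        msg.attempts += 1
        try:
            send(msg)
        except ConnectionError:
            continue  # stays unsent; the next pass tries again
        msg.sent = True
        delivered.append(msg.message_id)
    return delivered
```

In a real deployment the pass would be scheduled periodically and would read unsent rows from the database. Because a retry can deliver the same message more than once, consumers need to tolerate duplicates.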

Fault tolerance is achieved through multi‑region broker clusters, automatic failover to alternate data centers, and priority routing that prefers the local cluster but falls back when it becomes unavailable. The broker also supports dynamic quota enforcement and throttling to protect the system from abusive subjects.
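The local-first routing rule can be sketched as follows. The cluster records, the region names, and the `choose_cluster` and `is_healthy` names are hypothetical; the article does not describe how health is actually determined.

```python
def choose_cluster(clusters, local_region, is_healthy):
    """Prefer the local cluster; fall back to the first healthy alternative.

    `clusters` is a list of dicts with "name" and "region" keys, and
    `is_healthy` is a callable supplied by the caller (for example a
    cached health-check result).
    """
    # sorted() is stable, so clusters in the local region move to the front
    # and the remaining clusters keep their configured order.
    ordered = sorted(clusters, key=lambda c: c["region"] != local_region)
    for cluster in ordered:
        if is_healthy(cluster):
            return cluster
    raise RuntimeError("no healthy broker cluster available")
```

For example, a producer in `west` would use the `west` cluster while it passes health checks and switch to another data center only when it does not.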

Isolation is handled by assigning per‑subject quotas and dedicated thread pools or actor‑based dispatchers, preventing a single hot subject from exhausting resources. Governance features include dynamic quota adjustment, reliability levels, selective message replay, and detailed logging for troubleshooting.
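One common way to enforce a per-subject quota is a token bucket per subject. The sketch below assumes that technique; the source does not say which algorithm the broker uses. The `SubjectQuota` and `QuotaRegistry` classes and their parameters are hypothetical, and time is passed in explicitly to keep the example deterministic.

```python
class SubjectQuota:
    """Token bucket for one subject: `rate` tokens per second, up to `burst`."""

    def __init__(self, rate, burst, now):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.last = now

    def allow(self, now):
        # Refill according to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class QuotaRegistry:
    """Keeps an independent bucket per subject so one hot subject
    cannot consume the capacity of the others."""

    def __init__(self, default_rate, default_burst):
        self.default_rate = default_rate
        self.default_burst = default_burst
        self._quotas = {}

    def _get(self, subject, now):
        if subject not in self._quotas:
            self._quotas[subject] = SubjectQuota(self.default_rate, self.default_burst, now)
        return self._quotas[subject]

    def allow(self, subject, now):
        return self._get(subject, now).allow(now)

    def set_rate(self, subject, rate, now):
        """Adjust a subject's quota at runtime."""
        self._get(subject, now).rate = rate
```

A throttled subject is rejected while other subjects continue to be accepted, and an operator can raise or lower a subject's rate without a restart.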

Operational tools provide visual tracing of message flow, back‑tracking, replay, and monitoring of QPS and latency at the subject and consumer granularity. Full‑link tracing integrates with QTracer to follow a message across services.

On the consumer side, the team implemented health-check-based online/offline control, idempotency mechanisms (backed by Redis or MySQL), and best-effort ordering guarantees. They also discussed strategies for preserving message order, such as state-machine validation and version-based filtering.
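Idempotency and version-based filtering can be combined in a consumer roughly as follows. This is a sketch under stated assumptions: an in-memory set and dict stand in for Redis or MySQL, and the `OrderConsumer` class, its `handle` method, and the message fields are hypothetical.

```python
class OrderConsumer:
    def __init__(self):
        self.seen_ids = set()   # stand-in for a Redis set or a MySQL unique key
        self.versions = {}      # order_id -> highest version applied so far
        self.applied = []       # record of business effects, for illustration

    def handle(self, message_id, order_id, version, status):
        """Process one message and report what happened to it."""
        # Idempotency: a redelivered message must not be applied twice.
        if message_id in self.seen_ids:
            return "duplicate"
        self.seen_ids.add(message_id)

        # Version filter: ignore updates older than what was already applied,
        # so late-arriving messages cannot roll state backwards.
        if version <= self.versions.get(order_id, 0):
            return "stale"

        self.versions[order_id] = version
        self.applied.append((order_id, status))
        return "applied"
```

In a production consumer the deduplication record and the business update would need to be written atomically, for instance in one database transaction, so that a crash between the two steps cannot lose a message.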

The article concludes with a Q&A covering topics like message deduplication, cross‑region disaster recovery, idempotency requirements, storage implementation (MySQL + Java), and recent architectural refinements.
