Designing Scalable Asynchronous Message Queues: Ctrip’s Hermes Architecture Deep Dive

This article examines Ctrip's Hermes asynchronous messaging system, detailing its evolution from a simple Mongo‑backed queue to a broker‑centric, partitioned architecture with lease‑based cluster management, and shares practical techniques for building high‑performance, low‑latency message queues in large‑scale distributed environments.

21CTO

Distributed systems are a hot topic across the internet industry, and Ctrip’s senior engineers shared their experience building a large‑scale asynchronous messaging system called Hermes.

Advantages of Message Queues

Message queues decouple services, enable asynchronous processing, absorb traffic spikes, and support fan‑out scenarios, all of which are essential for meeting personalized, real‑time demands.

Ctrip MQ Architecture Evolution

Version 1.0 stored messages directly in MongoDB without a broker, which led to heavy client coordination and poor scalability. Version 2 introduced a broker (master‑slave) but still used MongoDB for coordination. The current architecture adds a meta‑server for cluster coordination, uses MySQL and Kafka as storage backends, and separates producers, brokers, and consumers.

Two Message Types

Kafka‑based storage offers high throughput but lacks features like message replay and priority. Critical messages are stored in MySQL, allowing rich queue features and fine‑grained monitoring, while less critical, high‑volume logs use Kafka.

How to Build an Efficient MQ

Single‑machine optimization focuses on fast writes (simple primary‑key‑only tables, batch inserts) and low‑latency reads (memory caching, event‑driven triggers). Moving from a single node to a cluster adds load‑balanced brokers and topic partitions, with message ordering guaranteed within each partition.
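The write‑path ideas above (batch inserts, partitions, ordering within a partition) can be illustrated with a minimal sketch. This is not Hermes code: the names `partition_for` and `BatchWriter`, the table layout, and the CRC32 hashing are assumptions made for illustration, and SQLite stands in for MySQL so the example runs with the standard library alone.

```python
import sqlite3
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a message key to a partition with a stable hash (CRC32).

    Hypothetical helper: the same key always lands in the same partition,
    which is what allows ordering to be preserved per key.
    """
    return zlib.crc32(key.encode("utf-8")) % partition_count


class BatchWriter:
    """Hypothetical writer: buffers messages, then inserts them in one batch."""

    def __init__(self, conn, partition_count=4, batch_size=100):
        self.conn = conn
        self.partition_count = partition_count
        self.batch_size = batch_size
        self.buffer = []
        # Assumed layout: an auto-increment primary key and no secondary
        # indexes, so each write only has to maintain the primary key.
        conn.execute(
            "CREATE TABLE IF NOT EXISTS messages ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "partition_id INTEGER NOT NULL, "
            "payload TEXT NOT NULL)"
        )

    def send(self, key, payload):
        """Buffer one message; flush automatically when the batch is full."""
        partition_id = partition_for(key, self.partition_count)
        self.buffer.append((partition_id, payload))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        """Write all buffered messages in a single transaction."""
        if not self.buffer:
            return
        with self.conn:  # commits on success, rolls back on error
            self.conn.executemany(
                "INSERT INTO messages (partition_id, payload) VALUES (?, ?)",
                self.buffer,
            )
        self.buffer.clear()

    def read_partition(self, partition_id):
        """Return payloads of one partition in insertion (id) order."""
        rows = self.conn.execute(
            "SELECT payload FROM messages WHERE partition_id = ? ORDER BY id",
            (partition_id,),
        )
        return [row[0] for row in rows]
```

A caller would create a connection with `sqlite3.connect(":memory:")`, call `send("order-42", ...)` repeatedly, and finish with `flush()`. Because every message with the same key maps to the same partition and rows are read back by ascending `id`, messages for one key come out in the order they were sent, even though they were written in batches.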

Lease‑based cluster management uses a meta‑server that grants time‑limited leases to brokers and consumers, enabling dynamic routing, automatic failover, and balanced partition assignment without relying on ZooKeeper for consumers.
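A lease of this kind can be sketched as follows. The sketch assumes a single in‑memory meta‑server and hypothetical names (`MetaServer`, `Lease`, `acquire`, `renew`, `owner`); the source does not describe Hermes's actual lease protocol, durations, or data structures, so treat this only as an illustration of the general mechanism.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str        # broker or consumer that owns the resource
    expires_at: float  # clock value after which the lease is void


class MetaServer:
    """Hypothetical meta-server granting time-limited leases on resources.

    A resource here is any hashable value, e.g. a (topic, partition) tuple.
    """

    def __init__(self, lease_seconds=10.0, clock=time.monotonic):
        self.lease_seconds = lease_seconds
        self.clock = clock  # injectable so expiry can be simulated
        self.leases = {}

    def acquire(self, resource, holder):
        """Grant the lease unless another holder still has a valid one."""
        now = self.clock()
        lease = self.leases.get(resource)
        if lease is not None and lease.expires_at > now and lease.holder != holder:
            return None
        lease = Lease(holder, now + self.lease_seconds)
        self.leases[resource] = lease
        return lease

    def renew(self, resource, holder):
        """Extend a lease; fails if it expired or belongs to someone else."""
        now = self.clock()
        lease = self.leases.get(resource)
        if lease is None or lease.holder != holder or lease.expires_at <= now:
            return None
        lease.expires_at = now + self.lease_seconds
        return lease

    def owner(self, resource):
        """Return the current valid holder, or None if the lease lapsed."""
        lease = self.leases.get(resource)
        if lease is None or lease.expires_at <= self.clock():
            return None
        return lease.holder
```

Failover falls out of expiry: a holder that crashes simply stops renewing, its lease lapses, and the next `acquire` from another broker or consumer succeeds without any explicit failure detection. Passing a fake `clock` makes this behavior easy to exercise without waiting in real time.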

Summary

Key takeaways include optimizing message write paths with batch operations, minimizing delivery latency via long‑polling and in‑memory caches, and simplifying cluster coordination through lease mechanisms that provide flexible, low‑overhead control over brokers, topics, and partitions.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
