How Ctrip Built Hermes: A Deep Dive into Scalable Message Queue Architecture

This article examines Ctrip’s Hermes messaging system, tracing its evolution from a simple Mongo‑based queue to a broker‑centric, MySQL/Kafka hybrid architecture, and explains the design choices, performance optimizations, cluster management via lease‑based meta‑server, and lessons learned for building high‑throughput, low‑latency MQ solutions.

IT Architects Alliance
IT Architects Alliance
IT Architects Alliance
How Ctrip Built Hermes: A Deep Dive into Scalable Message Queue Architecture

Message Queue Advantages

Message queues decouple producers and consumers, enable asynchronous processing, absorb traffic spikes, and support fan‑out scenarios, which are essential for real‑time personalization and high‑throughput workloads in large enterprises.

Evolution of Ctrip’s MQ Architecture

Version 1.0 – MongoDB‑Based Queue

The initial design stored messages directly in MongoDB collections without a broker. While development cost was low, scaling required client upgrades, coordination relied on DB‑level find‑and‑modify, and features such as consumer groups were unavailable.

Figure 1: Simple MongoDB queue architecture
Figure 1: Simple MongoDB queue architecture

Version 2.0 – Introducing a Broker

A broker layer was added while still persisting messages in MongoDB. The broker operated in a master‑slave mode, using MongoDB heartbeats for liveness. This enabled consumer groups and centralized coordination, reducing client‑side complexity.

Figure 2: Architecture with broker and MongoDB
Figure 2: Architecture with broker and MongoDB

Current Architecture – Hybrid MySQL/Kafka Backend

The latest design adds a meta‑server for cluster coordination and splits storage between MySQL (for critical data) and Kafka (for high‑throughput, less critical streams). Brokers now communicate with the meta‑server via ZooKeeper and lease‑based locks.

Figure 3: Hybrid MySQL/Kafka architecture with meta‑server
Figure 3: Hybrid MySQL/Kafka architecture with meta‑server

Message Types: Kafka vs MySQL

Kafka offers ultra‑high throughput by writing to memory and relying on OS‑level flushing, but lacks built‑in features such as message replay, priority, and fine‑grained filtering. For mission‑critical streams, Ctrip stores messages in MySQL, allowing full control over reliability, monitoring, and custom queue features.

Building an Efficient MQ

Single‑Node Optimizations

Design tables with only an ID primary key and minimal indexes to keep inserts fast.

Use partitioned tables (by ID or date) and drop partitions for cheap data cleanup.

Batch inserts (e.g., rewriteBatchedStatements=true) can yield 5× performance gains.

Implement event‑driven polling to avoid constant DB scans.

Cache recent messages in memory to reduce DB read latency.

Scaling from Single Node to Cluster

Adding brokers behind a load balancer increases capacity but introduces ordering challenges. Topic partitioning solves this: each partition is handled by a single broker, preserving order within the partition while allowing parallel processing across partitions.

Lease‑Based Cluster Management

The meta‑server issues time‑bound leases to brokers and consumers, acting as a lightweight lock. Brokers renew leases to retain ownership of partitions; if a lease expires, the meta‑server rebalances the partition to another broker. Consumers obtain leases to claim partitions, enabling dynamic load‑balancing without ZooKeeper coordination for consumers.

Figure 4: Lease‑based partition management
Figure 4: Lease‑based partition management

Meta‑server HA is achieved via DNS round‑robin and internal leader election using ZooKeeper; clients query a stable domain name, receive any meta‑server instance, and are redirected to the current leader if necessary.

Figure 5: Meta‑server HA and routing
Figure 5: Meta‑server HA and routing

Conclusion

Key takeaways for building a high‑performance MQ include: keep the write path lightweight with batch inserts and minimal indexing; use partition‑sticky routing to preserve order while scaling; employ a lease‑based meta‑server for simple, reliable cluster coordination; and combine MySQL for critical streams with Kafka for massive, less‑critical workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Distributed SystemsarchitectureScalabilityBackend DevelopmentMessage QueueHermesCtrip
IT Architects Alliance
Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.