
How Ctrip Built Hermes: A Deep Dive into Scalable Message Queue Architecture

This article examines Ctrip’s Hermes messaging system, tracing its evolution from a simple MongoDB‑based queue to a broker‑centric MySQL/Kafka hybrid architecture. It explains the design choices, performance optimizations, cluster management via a lease‑based meta‑server, and lessons learned for building high‑throughput, low‑latency MQ solutions.

IT Architects Alliance

Apr 26, 2021

Message Queue Advantages

Message queues decouple producers and consumers, enable asynchronous processing, absorb traffic spikes, and support fan‑out scenarios, which are essential for real‑time personalization and high‑throughput workloads in large enterprises.

Evolution of Ctrip’s MQ Architecture

Version 1.0 – MongoDB‑Based Queue

The initial design stored messages directly in MongoDB collections without a broker. While development cost was low, scaling required client upgrades, coordination relied on DB‑level find‑and‑modify, and features such as consumer groups were unavailable.

Figure 1: Simple MongoDB queue architecture

Version 2.0 – Introducing a Broker

A broker layer was added while still persisting messages in MongoDB. The broker operated in a master‑slave mode, using MongoDB heartbeats for liveness. This enabled consumer groups and centralized coordination, reducing client‑side complexity.

Figure 2: Architecture with broker and MongoDB

Current Architecture – Hybrid MySQL/Kafka Backend

The latest design adds a meta‑server for cluster coordination and splits storage between MySQL (for critical data) and Kafka (for high‑throughput, less critical streams). Brokers now communicate with the meta‑server via ZooKeeper and lease‑based locks.

Figure 3: Hybrid MySQL/Kafka architecture with meta‑server

Message Types: Kafka vs MySQL

Kafka achieves very high throughput by appending writes to the OS page cache and leaving disk flushes to the operating system, but it does not natively provide queue features such as message priority or fine‑grained server‑side filtering, and replay is limited to rewinding consumer offsets. For mission‑critical streams, Ctrip stores messages in MySQL, which gives it full control over reliability, monitoring, and custom queue features.

Building an Efficient MQ

Single‑Node Optimizations

Design tables with only an ID primary key and minimal indexes to keep inserts fast.

Use partitioned tables (by ID or date) and drop partitions for cheap data cleanup.

Batch inserts: for example, the MySQL Connector/J setting rewriteBatchedStatements=true rewrites a JDBC batch into multi‑row INSERT statements, which can yield 5× performance gains.

Implement event‑driven polling to avoid constant DB scans.

Cache recent messages in memory to reduce DB read latency.
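The batching, event‑driven polling, and caching ideas above can be sketched together in a few lines. This is a minimal illustration under stated assumptions, not Hermes code: the class name TinyStore, its methods, and the table layout are hypothetical, and Python's built‑in sqlite3 module stands in for MySQL so the example runs without a server.

```python
import sqlite3
import threading
from collections import deque


class TinyStore:
    """Hypothetical single-node message store (sqlite3 stands in for MySQL)."""

    def __init__(self, cache_size=1000):
        # One shared connection; every access is serialized by the condition's lock.
        self._db = sqlite3.connect(":memory:", check_same_thread=False)
        # Only an ID primary key, no secondary indexes, to keep inserts cheap.
        self._db.execute(
            "CREATE TABLE msg (id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
        )
        self._cond = threading.Condition()
        self._cache = deque(maxlen=cache_size)  # most recent (id, body) pairs
        self._last_id = 0

    def append_batch(self, bodies):
        """Insert all bodies in one transaction and wake waiting consumers."""
        if not bodies:
            return []
        with self._cond:
            first = self._last_id + 1
            rows = [(first + i, body) for i, body in enumerate(bodies)]
            with self._db:  # commits once for the whole batch
                self._db.executemany(
                    "INSERT INTO msg (id, body) VALUES (?, ?)", rows
                )
            self._cache.extend(rows)
            self._last_id = rows[-1][0]
            self._cond.notify_all()  # event-driven: no periodic table scans
            return [row_id for row_id, _ in rows]

    def poll(self, after_id, timeout=1.0):
        """Block until a message newer than after_id exists, or timeout."""
        with self._cond:
            self._cond.wait_for(lambda: self._last_id > after_id, timeout=timeout)
            if self._last_id <= after_id:
                return []
            # Serve from memory when the cache still covers the requested range.
            if self._cache and self._cache[0][0] <= after_id + 1:
                return [m for m in self._cache if m[0] > after_id]
            # Otherwise fall back to the database.
            cursor = self._db.execute(
                "SELECT id, body FROM msg WHERE id > ? ORDER BY id", (after_id,)
            )
            return cursor.fetchall()
```

A consumer would call poll with the last ID it has processed. The call returns immediately when newer messages exist and otherwise sleeps until a producer's append_batch wakes it, so an idle consumer puts no read load on the database.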

Scaling from Single Node to Cluster

Adding brokers behind a load balancer increases capacity but introduces ordering challenges. Topic partitioning solves this: each partition is handled by a single broker, preserving order within the partition while allowing parallel processing across partitions.
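As a rough illustration of partition‑based routing, the sketch below maps a message key to a partition with a stable hash and assigns each partition to exactly one broker. The function names, the CRC32 hash, and the round‑robin assignment are assumptions made for this example; the article does not say how Hermes chooses partitions or places them on brokers.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition using a hash that is stable across processes."""
    # zlib.crc32 is deterministic, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, brokers: list) -> dict:
    """Give every partition exactly one owning broker (simple round-robin)."""
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}


def route(key: str, num_partitions: int, assignment: dict) -> str:
    """Return the broker that must handle every message carrying this key."""
    return assignment[partition_for(key, num_partitions)]
```

Because all messages with the same key land on the same partition, and each partition has a single owner, per‑key ordering is preserved while different partitions are processed in parallel.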

Lease‑Based Cluster Management

The meta‑server issues time‑bound leases to brokers and consumers; each lease acts as a lightweight lock. Brokers renew their leases to retain ownership of partitions; if a lease expires, the meta‑server reassigns the partition to another broker. Consumers likewise obtain leases to claim partitions, which enables dynamic load balancing without requiring consumers to coordinate through ZooKeeper.

Figure 4: Lease‑based partition management
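The lease mechanism can be sketched as a small in‑memory table. This is an illustrative model only: the names LeaseTable, acquire, renew, and owner, and the 10‑second TTL, are hypothetical, and a real meta‑server would also persist and replicate this state.

```python
import time
from dataclasses import dataclass


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseTable:
    """Hypothetical meta-server state: one time-bound lease per partition."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so tests can control time
        self._leases = {}

    def acquire(self, partition, holder):
        """Grant the lease if it is free, expired, or already held by holder."""
        now = self._clock()
        current = self._leases.get(partition)
        held_by_other = (
            current is not None
            and current.expires_at > now
            and current.holder != holder
        )
        if held_by_other:
            return None
        lease = Lease(holder, now + self._ttl)
        self._leases[partition] = lease
        return lease

    def renew(self, partition, holder):
        """Extend a lease, but only for its current, unexpired holder."""
        now = self._clock()
        current = self._leases.get(partition)
        if current is None or current.holder != holder or current.expires_at <= now:
            return None
        current.expires_at = now + self._ttl
        return current

    def owner(self, partition):
        """Return the current holder, or None once the lease has expired."""
        current = self._leases.get(partition)
        if current is None or current.expires_at <= self._clock():
            return None
        return current.holder
```

A holder that stops renewing simply loses ownership when its lease runs out, so no explicit failure detection is needed before another broker or consumer can take over the partition.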

Meta‑server HA is achieved via DNS round‑robin and internal leader election using ZooKeeper; clients query a stable domain name, receive any meta‑server instance, and are redirected to the current leader if necessary.
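The client side of this lookup might look like the following sketch, which simulates the cluster in memory. Everything here is an assumption for illustration: the MetaServer class, the ("ok", ...) and ("redirect", ...) reply shapes, and the hop limit are hypothetical, and a random pick stands in for DNS round‑robin.

```python
import random
from dataclasses import dataclass


@dataclass
class MetaServer:
    name: str
    leader_name: str  # which node this instance believes is the leader

    def handle(self, request):
        """Serve the request if this node is the leader, else point to it."""
        if self.name == self.leader_name:
            return ("ok", f"{self.name} handled {request}")
        return ("redirect", self.leader_name)


def call_meta(cluster, request, max_hops=3, rng=random):
    """Contact any instance, then follow redirects until the leader answers."""
    node = rng.choice(sorted(cluster))  # stands in for DNS round-robin
    for _ in range(max_hops):
        status, payload = cluster[node].handle(request)
        if status == "ok":
            return payload
        node = payload  # the redirect payload names the leader to try next
    raise RuntimeError("no leader found within max_hops")
```

The hop limit guards against a redirect loop during a leadership change, when instances may briefly disagree about who the leader is.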

Conclusion

Key takeaways for building a high‑performance MQ include: keep the write path lightweight with batch inserts and minimal indexing; use partition‑sticky routing to preserve order while scaling; employ a lease‑based meta‑server for simple, reliable cluster coordination; and combine MySQL for critical streams with Kafka for massive, less‑critical workloads.
