Cloud Native 36 min read

How RocketMQ Evolved Its High‑Availability Architecture for Cloud‑Native Deployments

This article examines RocketMQ's high‑availability evolution—from early master‑slave and Raft‑based designs to the v5 DLedger fusion model—detailing replica groups, data sharding, election mechanisms, replication strategies, metric trade‑offs, log‑divergence handling, controller roles, heartbeat optimizations, and comparisons with Kafka and Pulsar, all illustrated with diagrams and code snippets.

Alibaba Cloud Native

Jun 26, 2023

How RocketMQ Evolved Its High‑Availability Architecture for Cloud‑Native Deployments

Background and Motivation

Distributed systems inevitably encounter network failures, node crashes, and disk errors, so services must provide redundancy and fault tolerance. RocketMQ is a critical component in logging, analytics, online transactions, and financial systems, and different deployment environments impose varying cost and reliability requirements.

Evolution of High‑Availability Designs

In RocketMQ v4 two mainstream HA designs existed:

Cold‑standby primary‑backup architecture (no automatic failover).

Raft‑based multi‑replica architecture (majority‑quorum).

The cold‑standby mode suffered low backup‑node utilization and special‑message availability issues, while Raft’s strict majority voting limited flexible cross‑datacenter deployments. RocketMQ v5 introduced the DLedger Controller, merging the advantages of both approaches and modularizing election logic.

Replica Groups and Data Sharding

A ReplicaSet (replica group) contains multiple copies of the same data to avoid loss. The granularity can be file‑level (e.g., HDFS) or partition/queue‑level (e.g., Kafka). RocketMQ assigns each Broker a ClusterName and BrokerName; brokers sharing the same name form a replica group. Each broker gets a unique BrokerId (0 is the leader/primary). Election is equivalent to granting one replica an exclusive write lock.

Metrics for Availability

RTO (Recovery Time Objective): time from outage to service restoration.

RPO (Recovery Point Objective): data loss window measured by the last successful backup.

SLA (Service‑Level Agreement): contractual service quality, higher SLA usually means higher cost.

Replica Count Trade‑offs

Single replica : lowest cost, but if the node fails all new writes are lost and existing data may become unavailable.

Two replicas : balances cost and performance; read traffic can be offloaded to the backup, but without a quorum the system cannot elect a new leader automatically.

Three or five replicas : provides strong self‑healing via Paxos/Raft, but incurs high storage cost and may struggle with write‑through requirements in some scenarios.

Log vs. Message Replication

To guarantee eventual consistency, RocketMQ can use either logical or physical replication:

Logical replication : messages are routed between systems (e.g., connectors to Elasticsearch or Flink). It offers high flexibility but adds overhead for service discovery, heartbeats, and may impact performance as topics increase.

Physical replication : akin to MySQL’s redo log, the primary appends messages to a commit log and streams them to followers. This approach uses a single log stream, simplifies consistency checks, and improves performance.

Log Divergence Scenarios

When a follower lags or a network partition occurs, six possible offset relationships arise (illustrated below). The article enumerates each case (1‑1, 1‑2/2‑2, 1‑3/2‑3, 3‑3) and explains how the system should handle them, including the concept of “unclean election” when a follower’s log is ahead of the leader.

Lease, Node Identity, and Central vs. Decentralized Control

Traditional centralized controllers (e.g., Zookeeper, etcd) manage leader election but add complexity. RocketMQ can operate without a central lease by embedding a lightweight Controller (or “sentinel Q”) that monitors heartbeats via Kubernetes liveness/readiness probes. The article describes a scenario where Q mistakenly marks the current leader as failed, leading to a temporary dual‑leader situation caused by network partition.

Raft requires the node with the latest committed log to become leader; Multi‑Paxos does not impose this restriction.

Controller‑Based Identity Assignment Process

Multiple Controllers elect a primary and redirect Broker requests.

When a Broker starts, it registers as a backup with the Controller.

The Broker receives a BrokerMemberGroup describing the replica set.

If assigned as backup, it connects to the leader and synchronizes logs; if assigned as leader, it waits for backups to connect and announces itself to the NameServer.

The leader periodically reports water‑mark differences and replication speed to the Controller.

RocketMQ’s design keeps the “single leader per term” invariant while using Leader Lease and Leader Stickiness to avoid dirty reads and frequent leader churn.

Log Continuity Guarantees

Raft validates log continuity before committing entries; Multi‑Paxos allows gaps and relies on explicit commit messages. Both mechanisms are reflected in RocketMQ’s confirm offset handling.

Epoch‑StartOffset Log Truncation Algorithm

To locate the fork point efficiently, RocketMQ compares Epoch‑StartOffset maps of leader and follower. The algorithm iterates the epoch map in descending order, finds the first matching epoch with identical start offset, and returns the minimum of the two end offsets as the consistent truncation point.

public long findLastConsistentPoint(final EpochStore compareEpoch) {
    long consistentOffset = -1L;
    final Map<Long, EpochEntry> descendingMap = new TreeMap<>(this.epochMap).descendingMap();
    for (Map.Entry<Long, EpochEntry> curLocalEntry : descendingMap.entrySet()) {
        final EpochEntry compareEntry = compareEpoch.findEpochEntryByEpoch(curLocalEntry.getKey());
        if (compareEntry != null && compareEntry.getStartOffset() == curLocalEntry.getValue().getStartOffset()) {
            consistentOffset = Math.min(curLocalEntry.getValue().getEndOffset(), compareEntry.getEndOffset());
            break;
        }
    }
    return consistentOffset;
}

Data Replay and Log Truncation After Failure

When a leader crashes after acknowledging a batch of messages, the new leader must not discard uncommitted entries. RocketMQ follows the “Leader Completeness” rule: any entry that was committed by the previous leader in its term must be retained. The new leader applies a larger version to commit pending entries, ensuring state‑machine safety.

Metadata Changes and Replication Modes

Metadata (e.g., topic creation/deletion) is stored in memory and synchronized every few seconds. Asynchronous metadata sync can cause temporary inconsistencies, similar to MySQL’s DDL lock issues. RocketMQ offers configurable replication modes via brokerRole (ASYNC_MASTER, SYNC_MASTER, SLAVE) and supports hybrid 2‑3 writes where one replica is synchronous and others asynchronous.

TotalReplicas : total number of replicas.

InSyncReplicas : required acknowledgments for a message.

MinInSyncReplicas : minimum sync replicas before fast‑fail.

AllAckInSyncStateSet : whether all members of the SyncStateSet must ack.

enableAutoInSyncReplicas : auto‑downgrade when live replicas drop.

Lightweight Heartbeat and Fast Isolation

RocketMQ v5 replaces the 30‑second broker registration interval with a 1‑second lightweight heartbeat, allowing the NameServer to detect and isolate failed brokers within seconds instead of minutes.

Comparison with Other Messaging Systems

Key differences include:

Controller : Kafka’s controller is heavyweight and tied to partition‑level ISR; RocketMQ’s controller operates at replica‑group granularity and can be embedded in NameServer.

Data Node Granularity : Kafka replicates at partition level, RocketMQ at broker level, Pulsar at journal shard level.

Write Path : Kafka and RocketMQ use a Y‑shaped write (leader then followers); Pulsar uses a star‑shaped write directly to bookies.

Future Directions

RocketMQ continues to refine edge‑case handling—node failures, network partitions, log consistency, and latency‑reliability trade‑offs. Emerging ideas such as RDMA‑based replication protocols are being explored to further improve performance while preserving the system’s strong consistency guarantees.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

distributed systems Cloud Native High Availability replication RocketMQ Raft DLedger

Written by

Alibaba Cloud Native

We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.