Backend Development

Design and Implementation of a Scalable Delay Message Service Using Apache BookKeeper

This article describes how the QMQ delay‑message service was refactored: the business logic was separated from storage, Apache BookKeeper was adopted as a highly available storage layer, and zone‑aware disaster recovery, a configurable DNS resolver, a ZooKeeper‑based task coordinator, and a multi‑level sliding‑time‑bucket scheduler were introduced, yielding a stateless, elastic architecture.

Ctrip Technology

Background – QMQ’s original delay‑message service stored pending messages on local disks in a master‑slave architecture, which suffered from stateful scaling, manual failover, and lack of consistency coordination.

Pain Points – The service could not elastically scale, required manual master‑slave switching on failures, and lacked a consistency coordinator.

Solution Overview – The team introduced Apache BookKeeper as a distributed, highly‑available storage layer, separating the message business tier from storage, making the delay service stateless and container‑friendly.

BookKeeper Basics – BookKeeper provides a scalable, fault‑tolerant, low‑latency, strongly consistent storage service. It relies on a ZooKeeper ensemble for metadata and node discovery, bookie nodes for data storage, and a “fat client” that communicates directly with both the ZooKeeper ensemble and the bookies.

Key Features

Entry, Ledger, Bookie, and Ensemble concepts define the data model.

Writes are replicated across an ensemble; reads follow the same order, ensuring consistency via the Last Add Confirmed (LAC) mechanism.

Fault tolerance is achieved through writer crash recovery, ensemble replacement, and bookie recovery.

Load balancing occurs automatically when new bookies are added.
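The LAC rule mentioned above can be illustrated with a minimal, self‑contained simulation (this is not the real BookKeeper client): the writer advances the Last Add Confirmed marker only over a contiguous prefix of quorum‑acknowledged entries, and readers may consume entries only up to that marker.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal illustration of the Last Add Confirmed (LAC) rule, not the real
// client: readers may only consume entries whose id is <= LAC, i.e. entries
// for which the write quorum ack of that entry and of all earlier entries
// is complete.
public class LacSimulation {
    private final boolean[] acked; // acked[i] == true once entry i reached its write quorum
    private long lac = -1;         // last add confirmed; -1 means nothing confirmed yet

    public LacSimulation(int capacity) {
        acked = new boolean[capacity];
    }

    // Called when entry id has been acknowledged by its write quorum.
    public void ackEntry(int id) {
        acked[id] = true;
        // LAC only advances over a contiguous prefix of acknowledged entries.
        while (lac + 1 < acked.length && acked[(int) (lac + 1)]) {
            lac++;
        }
    }

    public long lastAddConfirmed() {
        return lac;
    }

    // Entry ids a reader is allowed to read: 0..LAC inclusive.
    public List<Long> readableEntryIds() {
        List<Long> ids = new ArrayList<>();
        for (long i = 0; i <= lac; i++) {
            ids.add(i);
        }
        return ids;
    }
}
```

For example, if entries 0 and 2 are acknowledged but entry 1 is not, LAC stays at 0 and readers cannot yet see entry 2; once entry 1 is acknowledged, LAC jumps to 2.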

Zone‑Aware Multi‑Center Disaster Recovery – Enabling a zone‑aware ensemble placement policy distributes replicas across multiple availability zones. The policy requires that the ensemble size (E) be a multiple of the write quorum (Qw) and that each write quorum span at least the configured minimum number of zones.

Example configuration:

minNumOfZones = 2
desiredNumZones = 3
E (ensemble size) = 6
Qw (write quorum) = 3

Resulting zone placement: [z1, z2, z3, z1, z2, z3]

When a zone fails, the ensemble is rebuilt to satisfy the constraints while still preserving two‑zone redundancy.
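The zone layout shown in the example can be reproduced with a small round‑robin sketch. `ZoneRoundRobin` is a name introduced here for illustration; BookKeeper's actual `ZoneawareEnsemblePlacementPolicy` additionally accounts for upgrade domains and bookie load, so this only captures the zone ordering and the E‑divisible‑by‑Qw constraint.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the round-robin zone layout from the example above
// (E = 6, Qw = 3, zones z1..z3). Illustrative only; the real
// ZoneawareEnsemblePlacementPolicy is considerably more involved.
public class ZoneRoundRobin {
    public static List<String> placeEnsemble(int ensembleSize, int writeQuorum, List<String> zones) {
        // The policy requires the ensemble size to be a multiple of the write quorum.
        if (ensembleSize % writeQuorum != 0) {
            throw new IllegalArgumentException("ensemble size must be a multiple of the write quorum");
        }
        List<String> placement = new ArrayList<>();
        for (int i = 0; i < ensembleSize; i++) {
            placement.add(zones.get(i % zones.size())); // cycle through zones
        }
        return placement;
    }
}
```

With E = 6, Qw = 3 and zones z1, z2, z3 this yields the [z1, z2, z3, z1, z2, z3] layout from the example, so every write quorum of three consecutive bookies spans all three zones.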

Configurable DNS Resolver – To map bookie IPs to zones, a simple DNS‑to‑zone resolver was implemented by extending BookKeeper’s AbstractDNSToSwitchMapping. Example Java code:

import java.util.List;
import java.util.Map;

import org.apache.bookkeeper.net.AbstractDNSToSwitchMapping;

import com.google.common.collect.Lists;
import com.google.common.collect.Maps;

public class ConfigurableDNSToSwitchMapping extends AbstractDNSToSwitchMapping {

    // Static IP -> "/zone/upgrade-domain" mapping; in production this would be
    // loaded from configuration rather than hard-coded.
    private final Map<String, String> mappings = Maps.newHashMap();

    public ConfigurableDNSToSwitchMapping() {
        mappings.put("192.168.0.1", "/z1/192.168.0.1");
        mappings.put("192.168.1.1", "/z2/192.168.1.1");
        mappings.put("192.168.2.1", "/z3/192.168.2.1");
    }

    @Override
    public boolean useHostName() {
        return false; // resolve by IP address, not host name
    }

    @Override
    public List<String> resolve(List<String> names) {
        List<String> rNames = Lists.newArrayList();
        names.forEach(name -> rNames.add(
                mappings.getOrDefault(name, "/default-zone/default-upgradedomain")));
        return rNames;
    }
}

Stateless Refactoring – With storage externalized, the business layer becomes stateless. However, BookKeeper does not support shared writes to the same ledger, and readers have independent progress. To address this, a task coordinator based on ZooKeeper leader election was introduced to allocate storage shards to workers, ensuring exclusive read/write access and balanced distribution.
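The coordinator's allocation step can be sketched as follows. Leader election itself would run on ZooKeeper (for example via a recipe such as Curator's LeaderSelector) and is elided here; the round‑robin balancing strategy below is an assumption for illustration, since the article does not describe QMQ's exact algorithm. What the sketch does show is the invariant the coordinator enforces: every shard is owned by exactly one worker, so no ledger is read or written concurrently by two workers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the coordinator's allocation step: the elected leader assigns each
// storage shard to exactly one worker (exclusive ownership) while keeping the
// per-worker shard counts balanced. Round-robin assignment is an assumption;
// the article does not specify QMQ's actual balancing strategy.
public class ShardAllocator {
    public static Map<String, List<Integer>> allocate(List<String> workers, int shardCount) {
        Map<String, List<Integer>> assignment = new HashMap<>();
        workers.forEach(w -> assignment.put(w, new ArrayList<>()));
        for (int shard = 0; shard < shardCount; shard++) {
            // Each shard id maps to exactly one worker, so ownership is exclusive.
            assignment.get(workers.get(shard % workers.size())).add(shard);
        }
        return assignment;
    }
}
```

When workers join or leave, the leader simply recomputes the assignment and notifies the affected workers to release or acquire their shards.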

Multi‑Level Sliding Time‑Bucket Scheduler – The original design used many 10‑minute buckets, causing scalability and latency issues. A new scheduler uses four levels (L3w, L2d, L1h, L0m) with bucket sizes of 1 week, 1 day, 1 hour, and 1 minute respectively, reducing the total bucket count from >100 000 to 286 while preserving a two‑year scheduling horizon.

Each level delegates writes to a higher‑level scheduler when possible; otherwise it handles the write locally or passes it down. Near‑term buckets are pre‑loaded and forwarded to the next level, enabling fine‑grained (1‑minute) dispatch without loading large amounts of data into memory.
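The level‑selection rule can be sketched as picking the coarsest level whose bucket width still fits the remaining delay. The article names the four levels and their bucket widths; the exact thresholds below are an assumed reading of that design, and `BucketRouter` is a name introduced here.

```java
import java.time.Duration;

// Sketch of level selection in the four-level sliding time-bucket scheduler.
// The levels and bucket widths (L3w = 1 week, L2d = 1 day, L1h = 1 hour,
// L0m = 1 minute) come from the article; the routing rule itself (coarsest
// level whose bucket width fits the remaining delay) is an assumption.
public class BucketRouter {
    public static String levelFor(Duration remainingDelay) {
        if (remainingDelay.compareTo(Duration.ofDays(7)) >= 0) return "L3w"; // 1-week buckets
        if (remainingDelay.compareTo(Duration.ofDays(1)) >= 0) return "L2d"; // 1-day buckets
        if (remainingDelay.compareTo(Duration.ofHours(1)) >= 0) return "L1h"; // 1-hour buckets
        return "L0m";                                                         // 1-minute buckets
    }
}
```

Under this rule a message due in 90 minutes first lands in an L1h bucket; as its bucket comes due it is re‑enqueued into L0m, which provides the final minute‑granularity dispatch.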

Future Plans – Deploy BookKeeper clusters on Kubernetes for easier scaling, improve governance tooling, and extend disaster‑recovery capabilities beyond intra‑region to cross‑region and hybrid cloud scenarios.

Tags: Java, ZooKeeper, Delay Queue, Distributed Storage, BookKeeper, Elastic Architecture
Written by

Ctrip Technology

Official Ctrip Technology account, sharing and discussing growth.
