
Tencent Cloud’s Secrets to Scaling Apache Pulsar: Stability & Performance Hacks

This article details Tencent Cloud's year‑long production experience with Apache Pulsar, covering why Pulsar was chosen over Kafka, deep dives into Ack hole handling, TTL/Backlog/Retention strategies, zk‑node and ledger leaks, cache optimizations, and concrete code snippets that illustrate the stability and performance improvements.

Tencent Cloud Middleware

At the QCon Beijing conference, Tencent Cloud middleware engineer Ran Xiaolong presented the topic "Cloud‑Native Message Streaming System Apache Pulsar in Large‑Scale Production" and shared a series of stability and performance optimizations applied to Pulsar in Tencent Cloud.

Why Pulsar Over Kafka?

Customers running very large numbers of topics found Kafka costly: because Kafka's performance degrades as topic and partition counts grow, such workloads had to be spread across many clusters. Pulsar's storage-compute separation, tiered storage, and serverless Pulsar Functions allow a single cluster to handle millions of topics, dramatically reducing cost while supporting high scalability.

Practice 1: Impact of Ack Holes and Mitigation

When using a Shared subscription or acknowledging messages individually, users may encounter Ack holes. Pulsar records acknowledgment state in the individuallyDeletedMessages collection, which represents holes as open intervals and acknowledged messages as closed intervals. Early Pulsar versions had no Ack response mechanism: if the broker failed to process an Ack, the client was never told, and a hole formed silently.
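The interval bookkeeping can be illustrated with a minimal, self-contained sketch (the class and method names below are illustrative, not Pulsar's internal API): acknowledged message IDs are stored as closed intervals, and any gap between two intervals is an Ack hole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative model of individuallyDeletedMessages: acked ids are stored
// as closed intervals [lo, hi]; gaps between intervals are Ack holes.
public class AckRangeTracker {
    private final TreeMap<Long, Long> ranges = new TreeMap<>(); // lo -> hi

    public void ack(long id) {
        var floor = ranges.floorEntry(id);
        if (floor != null && floor.getValue() >= id) return; // already acked
        Long lo = (floor != null && floor.getValue() == id - 1) ? floor.getKey() : null;
        Long hi = ranges.get(id + 1); // interval starting right after id, if any
        long newLo = (lo != null) ? lo : id;
        long newHi = (hi != null) ? hi : id;
        if (lo != null) ranges.remove(lo);       // merge with the left neighbor
        if (hi != null) ranges.remove(id + 1);   // merge with the right neighbor
        ranges.put(newLo, newHi);
    }

    // Holes: un-acked gaps between the lowest and highest acked ids.
    public List<long[]> holes() {
        List<long[]> holes = new ArrayList<>();
        Long prevHi = null;
        for (var e : ranges.entrySet()) {
            if (prevHi != null && e.getKey() > prevHi + 1) {
                holes.add(new long[]{prevHi + 1, e.getKey() - 1});
            }
            prevHi = e.getValue();
        }
        return holes;
    }
}
```

Acking 1, 2, and 5 leaves the hole [3, 4]; acking 3 and 4 merges everything into a single closed interval.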

Mitigation approaches include:

Accurately calculating Backlog Size (complex and rarely used).

Broker‑side active compensation: each ManagedCursor can retrieve its individuallyDeletedMessages set, and the broker pushes missing acknowledgments to the client.

The broker’s compensation mechanism relies on the Backlog strategy, which defines three actions when producer‑consumer gaps grow:

Producer Exception: notifies the producer of a problem.

Producer Request Hold: pauses the producer without an explicit error.

Consumer Backlog Eviction: discards the oldest messages to keep the pipeline flowing.
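The three actions can be sketched as a simple decision function (names and return values below are illustrative, not Pulsar's API; in practice these policies are configured as namespace-level backlog quotas):

```java
// Illustrative sketch of the three backlog-quota actions. The enum names
// mirror the policies above; the broker's reaction is simplified to a string.
public class BacklogPolicyDemo {
    enum BacklogAction { PRODUCER_EXCEPTION, PRODUCER_REQUEST_HOLD, CONSUMER_BACKLOG_EVICTION }

    static String apply(BacklogAction action, long backlogSize, long quota) {
        if (backlogSize <= quota) return "accept"; // under quota: nothing happens
        switch (action) {
            case PRODUCER_EXCEPTION:        return "reject-with-error"; // producer sees an exception
            case PRODUCER_REQUEST_HOLD:     return "hold";              // send is paused, no error
            case CONSUMER_BACKLOG_EVICTION: return "evict-oldest";      // drop oldest messages
            default:                        return "accept";
        }
    }
}
```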

Practice 2: TTL, Backlog, and Retention Strategies

Definitions:

TTL: messages not Acked within a configured time are auto-Acked by the broker.

Backlog: the gap between messages produced and messages consumed.

Retention: how long acknowledged messages are kept on BookKeeper, measured per ledger.

When TTL and Retention are both set, the effective message lifecycle follows these rules:

if (TTL < Retention) {
    lifecycle = TTL + Retention;
} else {
    lifecycle = TTL;
}
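The rule above can be expressed as a tiny helper (a sketch; both values are in whatever unit TTL and Retention are configured in):

```java
// Effective message lifecycle when both TTL and Retention are configured.
// If TTL < Retention, an entry is auto-acked at TTL and then retained for
// the full Retention window; otherwise TTL alone bounds the lifecycle.
public class LifecycleRule {
    static long effectiveLifecycle(long ttl, long retention) {
        return (ttl < retention) ? ttl + retention : ttl;
    }
}
```

For example, with a 5-day TTL and 30-day Retention, a message can live for up to 35 days; with a 30-day TTL and 5-day Retention, it lives at most 30 days.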

The following snippet shows how Pulsar updates a cursor and triggers ledger trimming based on the new position:

void updateCursor(ManagedCursorImpl cursor, PositionImpl newPosition) {
    Pair<PositionImpl, PositionImpl> pair = cursors.cursorUpdated(cursor, newPosition);
    if (pair == null) {
        // Cursor was removed in the meantime: trim whatever is now consumed
        trimConsumedLedgersInBackground();
        return;
    }
    PositionImpl previousSlowestReader = pair.getLeft();
    PositionImpl currentSlowestReader = pair.getRight();
    if (previousSlowestReader.compareTo(currentSlowestReader) == 0) {
        // The slowest consumer has not moved; nothing to trim
        return;
    }
    // Only trigger trimming when the slowest reader crosses a ledger boundary
    if (previousSlowestReader.getLedgerId() != newPosition.getLedgerId()) {
        trimConsumedLedgersInBackground();
    }
}

Practice 3: Delayed Messages vs. TTL

In a real‑world case, a user set a 10‑day delay but a 5‑day TTL, causing all delayed messages to expire after five days. The original isEntryExpired method only checked the publish timestamp, ignoring the delay offset. Tencent contributed a PR that also checks the delay duration before applying TTL, preventing premature expiration of delayed messages.

// Original (pre-fix) check: only the publish timestamp is considered,
// so a delayed message's scheduled delivery time is ignored.
public static boolean isEntryExpired(int ttlSeconds, long entryTimestamp) {
    return ttlSeconds != 0 && (System.currentTimeMillis() > entryTimestamp + TimeUnit.SECONDS.toMillis(ttlSeconds));
}
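A sketch of the corrected check, under the assumption that the entry's scheduled delivery time is available (the signature below is illustrative, not the exact upstream fix): a delayed entry must not expire before it becomes deliverable.

```java
import java.util.concurrent.TimeUnit;

public class ExpiryCheck {
    // Illustrative fix: an entry is expired only once both its TTL deadline
    // (relative to publish time) and its scheduled delivery time have passed.
    // 'now' is passed in for testability; production code would use
    // System.currentTimeMillis().
    static boolean isEntryExpired(int ttlSeconds, long publishTimestamp,
                                  long deliverAtTimestamp, long now) {
        if (ttlSeconds == 0) return false; // TTL disabled
        long ttlDeadline = publishTimestamp + TimeUnit.SECONDS.toMillis(ttlSeconds);
        return now > ttlDeadline && now > deliverAtTimestamp;
    }
}
```

With the scenario from the text (10-day delay, 5-day TTL), a message published at day 0 is no longer expired at day 6, because its delivery time at day 10 has not yet arrived.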

Practice 4: Admin API Block Optimization

Earlier Pulsar code mixed synchronous calls inside asynchronous paths, causing thread blockage and occasional broker restarts. Additional issues included:

Blocking Http Lookup due to sync‑async mixing.

Poor Web service performance from misuse of CompletableFuture.

Metadata Store thread‑pool pressure on ZooKeeper.

The team removed the Metadata Store thread‑pool, added service‑side listeners to pinpoint slow Web paths, and introduced a 30‑second timeout that throws an exception instead of hanging the entire data flow.
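The timeout guard can be sketched with Java's built-in CompletableFuture.orTimeout (available since Java 9); the 30-second value described above is shortened here so the example runs quickly:

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimeoutGuardDemo {
    // Wraps an async call so a stuck future fails fast with an exception
    // instead of hanging the calling thread indefinitely.
    static <T> CompletableFuture<T> withTimeout(CompletableFuture<T> f,
                                                long timeout, TimeUnit unit) {
        return f.orTimeout(timeout, unit);
    }

    public static void main(String[] args) {
        CompletableFuture<String> stuck = new CompletableFuture<>(); // never completes
        try {
            withTimeout(stuck, 100, TimeUnit.MILLISECONDS).join();
        } catch (CompletionException e) {
            // join() wraps the TimeoutException in a CompletionException
            System.out.println("failed fast: " + (e.getCause() instanceof TimeoutException));
        }
    }
}
```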

Practice 5: zk‑node Leak Cleanup

Large numbers of stale ZooKeeper nodes were observed, up to five times the expected count. The cleanup process:

List all topic names by reading the zk‑path hierarchy.

Use pulsar-admin to verify each topic’s existence in the cluster; missing topics indicate dirty data.

Backup ZooKeeper data before any deletion.
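Steps 1 and 2 amount to a set difference between topics listed under the zk-path and topics the cluster actually knows about (a self-contained sketch; in practice the two sets come from ZooKeeper and pulsar-admin):

```java
import java.util.Set;
import java.util.TreeSet;

public class ZkNodeAudit {
    // Topics present in ZooKeeper metadata but unknown to the cluster are
    // dirty nodes and candidates for cleanup (after a backup!).
    static Set<String> findDirtyNodes(Set<String> zkTopics, Set<String> liveTopics) {
        Set<String> dirty = new TreeSet<>(zkTopics);
        dirty.removeAll(liveTopics);
        return dirty;
    }
}
```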

Practice 6: Bookie Ledger Leak

Even with a 30‑day Retention limit, some ledgers persisted for hundreds of days. The analysis showed that only Retention can trigger ledger deletion; certain CLI‑generated ledgers bypass Retention. The team recommends inspecting ledger metadata, matching ledgers to topics, and safely deleting orphaned ledgers, while preserving schema information.
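The recommended inspection can be sketched as a filter over ledger metadata: keep any ledger that maps to a live topic or carries schema information, and flag the rest as orphans (the record and field names below are illustrative, not BookKeeper's metadata format):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class LedgerAudit {
    record LedgerInfo(long ledgerId, String topic, boolean isSchemaLedger) {}

    // Orphans: ledgers whose owning topic no longer exists. Schema ledgers
    // are always preserved, per the recommendation above.
    static List<Long> findOrphans(Map<Long, LedgerInfo> ledgers, Set<String> liveTopics) {
        List<Long> orphans = new ArrayList<>();
        for (LedgerInfo info : ledgers.values()) {
            if (!info.isSchemaLedger() && !liveTopics.contains(info.topic())) {
                orphans.add(info.ledgerId());
            }
        }
        return orphans;
    }
}
```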

Practice 7: Multi‑Level Cache Optimization

Pulsar’s original cache iterated over all segments for each read, and a segment overflow cleared the entire cache, causing performance spikes. The new approach uses an OHC + LRU strategy to keep cache size stable.

try {
    // Scan segments from most- to least-recently written
    int size = cacheSegments.size();
    for (int i = 0; i < size; i++) {
        int segmentIdx = (currentSegmentIdx + (size - i)) % size;
        // check recent entries in this segment
    }
} catch (Exception e) { /* handle */ }

try {
    // Reserve space in the current segment; alignedSize is the entry size
    // padded to the segment's alignment
    int offset = currentSegmentOffset.getAndAdd(alignedSize);
    if (offset + alignedSize > segmentSize) {
        // Segment full: rotate to the next segment instead of clearing the whole cache
        currentSegmentIdx = (currentSegmentIdx + 1) % cacheSegments.size();
        currentSegmentOffset.set(alignedSize); // reserve the first slot of the new segment
        cacheIndexes.get(currentSegmentIdx).clear(); // only the reused segment's index is cleared
        offset = 0; // write the entry at the start of the new segment
    }
} catch (Exception e) { /* handle */ }
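The OHC + LRU direction can be illustrated with a bounded LRU built on LinkedHashMap (a sketch of the eviction behaviour only; OHC itself stores entries off-heap): instead of clearing a whole segment on overflow, the least-recently-used entry is evicted one at a time, keeping cache size stable.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LruEntryCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public LruEntryCache(int maxEntries) {
        super(16, 0.75f, true); // accessOrder = true -> iteration order is LRU
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Evict a single least-recently-used entry instead of clearing a segment
        return size() > maxEntries;
    }
}
```

With capacity 2, inserting a and b, touching a, then inserting c evicts b: recently accessed entries survive, and the cache never empties wholesale.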

Conclusion and Outlook

The article shares Tencent Cloud’s best practices for improving Apache Pulsar stability, including Ack‑hole mitigation, TTL/Backlog/Retention coordination, zk‑node and ledger leak handling, and cache redesign. The team also contributes upstream patches for timeout retries, broker and Bookie OOM prevention, and improved session handling between BookKeeper and ZooKeeper.

Tags: cloud-native, TTL, Message Queue, stability, Apache Pulsar
Written by

Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
