How Tencent Scaled Apache Pulsar to Process Billions of Messages Daily – Ops Lessons & Tuning Guide
This article details the design, deployment, and performance‑tuning techniques used by Tencent's TEG Data Platform to operate massive Apache Pulsar clusters handling hundreds of billions of messages per day, covering server‑side parameters, client‑side configurations, common bottlenecks, and step‑by‑step troubleshooting advice.
Background
The Tencent TEG Data Platform operates a large‑scale Apache Pulsar deployment (two clusters named T‑1 and T‑2) that serves as the backbone for a high‑throughput, low‑latency messaging service. Producers and consumers run in Kubernetes pods on the internal STKE container platform, while Pulsar brokers, BookKeeper bookies and ZooKeeper ensembles are deployed on Cloud Virtual Machines (CVM, 64‑core, 256 GB RAM, 4 SSD).
Cluster topology (server‑side)
Broker count: >100 per cluster (internal version 2.8.1.2) with three Discovery services.
Bookie count: >100 per cluster, co‑located with brokers.
ZooKeeper ensemble: 5 nodes per cluster (version 3.6.3, SA2.4XLARGE32 instances).
Namespaces: 3 per cluster, each containing a single topic.
Partitions per topic: >100, allowing >8 K consumers per partition.
Replication factor: 2 (ensemble size E=5, write quorum W=2, ack quorum A=2); see the configuration sketch below.
Retention: 1 day.
Each cluster handles hundreds of billions of 10 KB messages per day.
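For reference, the ensemble/quorum and retention figures above map onto standard namespace‑level policies. The sketch below shows one way to apply them through the broker admin REST API (the same operations `pulsar-admin namespaces set-persistence` / `set-retention` perform). It is a hedged illustration, not the deployment's actual tooling: the endpoint, tenant, and namespace are placeholders, and the REST paths and JSON field names should be verified against the Pulsar version in use.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// post sends a JSON body to a broker admin endpoint and logs the response status.
func post(url string, body any) {
	buf, _ := json.Marshal(body) // body is a static map here, so the error is ignored in this sketch
	resp, err := http.Post(url, "application/json", bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("%s -> %s", url, resp.Status)
}

func main() {
	// Placeholder admin endpoint, tenant, and namespace.
	base := "http://broker.example.internal:8080/admin/v2/namespaces/tenant/ns"

	// E=5, W=2, A=2, matching the replication settings described above.
	post(base+"/persistence", map[string]any{
		"bookkeeperEnsemble":             5,
		"bookkeeperWriteQuorum":          2,
		"bookkeeperAckQuorum":            2,
		"managedLedgerMaxMarkDeleteRate": 0,
	})

	// 1-day retention (time in minutes; -1 size means no size-based limit).
	post(base+"/retention", map[string]any{
		"retentionTimeInMinutes": 1440,
		"retentionSizeInMB":      -1,
	})
}
```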
Client‑side parameters
Maximum producers per partition: ~150.
Maximum consumers per partition: ~10 000.
Producer pods: ~150.
Consumer pods: ~10 000.
Clients use the Pulsar Go SDK (master branch) and are scheduled on STKE.
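To make the client‑side picture concrete, here is a minimal Go SDK producer sketch with the settings most relevant to the timeout issues discussed below. The service URL, topic, and numeric values are placeholders chosen for illustration, not the production configuration.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	// Placeholder service URL; the real clusters are reached through the Discovery services.
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL:               "pulsar://broker.example.internal:6650",
		ConnectionTimeout: 10 * time.Second,
		OperationTimeout:  30 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic:               "persistent://tenant/ns/example-topic", // placeholder topic
		SendTimeout:         30 * time.Second,                       // sends unacked within this window fail with a timeout
		MaxPendingMessages:  1000,                                   // bound the in-flight queue so backpressure surfaces early
		BatchingMaxMessages: 100,
		CompressionType:     pulsar.LZ4,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	if _, err := producer.Send(context.Background(), &pulsar.ProducerMessage{
		Payload: []byte("~10 KB payload goes here"),
	}); err != nil {
		log.Printf("send failed: %v", err) // Issues 1 and 2 below cover how such failures were diagnosed
	}
}
```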
Performance tuning – Issue 1: Client production timeout (server‑side causes)
Large acknowledgment (ack) ranges – “confirmation holes”.
Pulsar‑io thread‑pool stalls.
Slow ledger switching.
BookKeeper‑io single‑thread bottlenecks.
Debug‑level logging overhead.
Uneven topic‑partition distribution.
Analysis & mitigation
Confirmation holes: Pulsar stores ack ranges per subscription. When many consumers acknowledge at different speeds, the number of ranges grows, increasing storage pressure and production latency. Mitigation: reduce the consumer count, increase consumption speed, or lower the frequency at which ack ranges are persisted; a consumer‑side sketch follows this list.
Pulsar‑io thread stalls: The broker's I/O thread pool may stall or deadlock due to lock contention or unhandled exceptions in Future callbacks. Monitor connections stuck in CLOSE_WAIT and increase numIOThreads in broker.conf (e.g., from 8 to 16) to enlarge the pool.
Ledger switch latency: Ledger rollover is triggered by entry‑count, size, or age thresholds. Slow switches often stem from ZooKeeper latency or GC pauses. Upgrade ZooKeeper or tune its JVM GC parameters (e.g., -XX:+UseG1GC, lower the -XX:MaxGCPauseMillis target).
BookKeeper‑io bottleneck: Adjust the BookKeeper thread‑pool size (numIOThreads) and tune the ensemble parameters E/Qw/Qa to match the write load.
Debug logging: Debug‑level logs generate massive I/O. Switch broker, bookie, and client logging to INFO or ERROR in production.
Partition distribution: Pulsar maps bundles to brokers; an uneven bundle allocation overloads specific brokers. Rebalance by increasing the bundle count or switching to the topic_count_equally_divide bundle‑split algorithm.
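To illustrate the confirmation‑hole mitigation above, here is a minimal Go SDK consumer sketch: a Shared subscription that acknowledges each message as soon as processing finishes, keeping the gap between receive and ack (and therefore the ack‑range count) small. The service URL, topic, subscription name, and queue size are placeholders, not values from this deployment.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL:               "pulsar://broker.example.internal:6650", // placeholder service URL
		ConnectionTimeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:             "persistent://tenant/ns/example-topic", // placeholder topic
		SubscriptionName:  "example-sub",
		Type:              pulsar.Shared,
		ReceiverQueueSize: 1000, // bound prefetch so a slow pod does not hold many unacked messages
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		msg, err := consumer.Receive(context.Background())
		if err != nil {
			log.Printf("receive failed: %v", err)
			continue
		}
		// process(msg.Payload()) ...
		// Ack immediately after processing; long receive-to-ack gaps across many
		// consumers are what let individually-deleted ranges (confirmation holes) grow.
		consumer.Ack(msg)
	}
}
```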
Performance tuning – Issue 2: Frequent client disconnect / reconnect
Root causes identified in the Data project:
Client timeout‑based disconnect/reconnect logic.
Coarse exception handling in the Go SDK (many server errors are treated as fatal).
Sequence‑ID mismatches in older Go SDK versions.
Massive consumer churn after topic‑partition changes.
Solutions
Ensure client pods have sufficient CPU resources; monitor usage via top and the pod's cgroup metrics.
Upgrade to the latest Go SDK (v0.9.x or newer) where sequence‑ID handling and error classification are refined.
Reduce unnecessary consumer creation/destruction; if similar bugs appear in Java, upgrade the Java client as well.
Implement robust error handling: distinguish recoverable errors (e.g., ServerError_TooManyRequests) from fatal ones before closing the channel.
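As a rough illustration of that last point, a send wrapper can retry errors the SDK marks as transient and only tear down the producer on genuinely fatal ones. This is a sketch, not the project's actual code: the Go SDK exposes *pulsar.Error with a Result() accessor, but the exact set of retryable result codes varies by SDK version, and TimeoutError below stands in for whatever codes (such as a too‑many‑requests response) your version reports as recoverable.

```go
package retrysketch

import (
	"context"
	"errors"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// sendWithRetry classifies Go SDK errors before giving up: transient results are
// retried with a small backoff, everything else is surfaced so the caller can
// decide whether closing the producer/connection is really warranted.
// NOTE: which Result codes are safe to retry depends on the SDK version;
// TimeoutError is shown only as a plausible example.
func sendWithRetry(ctx context.Context, p pulsar.Producer, msg *pulsar.ProducerMessage, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		_, err := p.Send(ctx, msg)
		if err == nil {
			return nil
		}
		lastErr = err

		var perr *pulsar.Error
		if errors.As(err, &perr) && perr.Result() == pulsar.TimeoutError {
			// Recoverable: back off and retry rather than closing the channel.
			time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
			continue
		}
		return err // treat as fatal only after classification
	}
	return lastErr
}
```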
Performance tuning – Issue 3: ZooKeeper upgrade
The original ZooKeeper 3.4.6 suffered from ZOOKEEPER‑2044, causing intermittent leader election failures. Upgrading to ZooKeeper 3.6.3 eliminated the issue.
Comprehensive troubleshooting checklist
Cluster resource monitoring: Track CPU, memory, and disk I/O of brokers, bookies, and ZooKeeper; watch Java GC logs, especially Full GC pauses.
Client consumption health: Use pulsar-admin topics stats to check unackedMessages, backlog size, and consumer lag; a polling sketch against the admin REST API follows this checklist.
Acknowledgment size: Run pulsar-admin topics stats-internal and inspect individuallyDeletedMessages for large ack ranges.
Thread pool status: Examine pulsar-io, bookkeeper-io, and other executor threads with jstack and top -p <pid>.
Log analysis: Filter out DEBUG logs, search for exceptions, and correlate timestamps across client, broker, and bookie logs.
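The consumption‑health check above can be scripted against the broker's admin REST API, which serves the same data pulsar-admin prints. The sketch below assumes a placeholder admin endpoint and topic and only reads two subscription‑level fields, msgBacklog and unackedMessages, from the stats JSON; the warning threshold is arbitrary.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal slice of the topic-stats JSON returned by
// GET /admin/v2/persistent/{tenant}/{namespace}/{topic}/stats
// (for partitioned topics, the .../partitioned-stats endpoint aggregates across partitions).
type topicStats struct {
	Subscriptions map[string]struct {
		MsgBacklog      int64 `json:"msgBacklog"`
		UnackedMessages int64 `json:"unackedMessages"`
	} `json:"subscriptions"`
}

func main() {
	// Placeholder admin endpoint and topic path.
	url := "http://broker.example.internal:8080/admin/v2/persistent/tenant/ns/example-topic/stats"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var stats topicStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		log.Fatal(err)
	}

	// Flag subscriptions whose backlog or unacked window looks unhealthy.
	for name, sub := range stats.Subscriptions {
		fmt.Printf("subscription=%s backlog=%d unacked=%d\n", name, sub.MsgBacklog, sub.UnackedMessages)
		if sub.UnackedMessages > 100000 {
			fmt.Printf("  WARNING: %s has a large unacked window; check for confirmation holes\n", name)
		}
	}
}
```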
Key diagnostic diagrams
The original post includes diagrams for: the confirmation‑hole structure, a pulsar‑io thread stall (connections stuck in CLOSE_WAIT), a ledger‑switch latency snapshot, and the bundle‑to‑broker distribution.
Conclusion
The article documents a systematic approach to operating a massive Pulsar deployment, covering server‑side configuration, client‑side tuning, common failure modes, and a step‑by‑step troubleshooting workflow. Ongoing work includes further tuning of Pulsar read/write threads, entry cache sizes, and contributing improvements to the Pulsar Go SDK.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.