How Tencent Scaled Apache Pulsar to Process Billions of Messages Daily – Ops Lessons & Tuning Guide
This article details the design, deployment, and performance‑tuning techniques used by Tencent's TEG Data Platform to operate massive Apache Pulsar clusters handling hundreds of billions of messages per day, covering server‑side parameters, client‑side configurations, common bottlenecks, and step‑by‑step troubleshooting advice.
Background
The Tencent TEG Data Platform operates a large‑scale Apache Pulsar deployment (two clusters named T‑1 and T‑2) that serves as the backbone for a high‑throughput, low‑latency messaging service. Producers and consumers run in Kubernetes pods on the internal STKE container platform, while Pulsar brokers, BookKeeper bookies and ZooKeeper ensembles are deployed on Cloud Virtual Machines (CVM, 64‑core, 256 GB RAM, 4 SSD).
Cluster topology (server‑side)
Broker count: >100 per cluster (internal version 2.8.1.2) with three Discovery services.
Bookie count: >100 per cluster, co‑located with brokers.
ZooKeeper ensemble: 5 nodes per cluster (version 3.6.3, SA2.4XLARGE32 instances).
Namespaces: 3 per cluster, each containing a single topic.
Partitions per topic: >100, allowing >8 K consumers per partition.
Replication factor: 2 (ensemble size E=5, write quorum W=2, ack quorum A=2); see the configuration sketch below.
Retention: 1 day.
Each cluster handles hundreds of billions of 10 KB messages per day.
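For reference, the ensemble/quorum and retention figures above map onto standard namespace‑level policies. The sketch below shows one way to apply them through the broker admin REST API (the same operations `pulsar-admin namespaces set-persistence` / `set-retention` perform). It is a hedged illustration, not the deployment's actual tooling: the endpoint, tenant, and namespace are placeholders, and the REST paths and JSON field names should be verified against the Pulsar version in use.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
)

// post sends a JSON body to a broker admin endpoint and logs the response status.
func post(url string, body any) {
	buf, _ := json.Marshal(body) // body is a static map here, so the error is ignored in this sketch
	resp, err := http.Post(url, "application/json", bytes.NewReader(buf))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	log.Printf("%s -> %s", url, resp.Status)
}

func main() {
	// Placeholder admin endpoint, tenant, and namespace.
	base := "http://broker.example.internal:8080/admin/v2/namespaces/tenant/ns"

	// E=5, W=2, A=2, matching the replication settings described above.
	post(base+"/persistence", map[string]any{
		"bookkeeperEnsemble":             5,
		"bookkeeperWriteQuorum":          2,
		"bookkeeperAckQuorum":            2,
		"managedLedgerMaxMarkDeleteRate": 0,
	})

	// 1-day retention (time in minutes; -1 size means no size-based limit).
	post(base+"/retention", map[string]any{
		"retentionTimeInMinutes": 1440,
		"retentionSizeInMB":      -1,
	})
}
```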
Client‑side parameters
Maximum producers per partition: ~150.
Maximum consumers per partition: ~10 000.
Producer pods: ~150.
Consumer pods: ~10 000.
Clients use the Pulsar Go SDK (master branch) and are scheduled on STKE.
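To make the client‑side picture concrete, here is a minimal Go SDK producer sketch with the settings most relevant to the timeout issues discussed below. The service URL, topic, and numeric values are placeholders chosen for illustration, not the production configuration.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	// Placeholder service URL; the real clusters are reached through the Discovery services.
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL:               "pulsar://broker.example.internal:6650",
		ConnectionTimeout: 10 * time.Second,
		OperationTimeout:  30 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{
		Topic:               "persistent://tenant/ns/example-topic", // placeholder topic
		SendTimeout:         30 * time.Second,                       // sends unacked within this window fail with a timeout
		MaxPendingMessages:  1000,                                   // bound the in-flight queue so backpressure surfaces early
		BatchingMaxMessages: 100,
		CompressionType:     pulsar.LZ4,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	if _, err := producer.Send(context.Background(), &pulsar.ProducerMessage{
		Payload: []byte("~10 KB payload goes here"),
	}); err != nil {
		log.Printf("send failed: %v", err) // Issues 1 and 2 below cover how such failures were diagnosed
	}
}
```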
Performance tuning – Issue 1: Client production timeout (server‑side causes)
Large acknowledgment (ack) ranges – “confirmation holes”.
Pulsar‑io thread‑pool stalls.
Slow ledger switching.
BookKeeper‑io single‑thread bottlenecks.
Debug‑level logging overhead.
Uneven topic‑partition distribution.
Analysis & mitigation
Confirmation holes: Pulsar stores ack ranges per subscription. When many consumers acknowledge at different speeds, the number of ranges grows, increasing storage pressure and production latency. Mitigation: reduce the consumer count, increase consumption speed, or lower the frequency at which ack ranges are persisted; a consumer‑side sketch follows this list.
Pulsar‑io thread stalls: The broker's I/O thread pool may stall or deadlock due to lock contention or unhandled exceptions in Future callbacks. Monitor connections stuck in CLOSE_WAIT and increase numIOThreads in broker.conf (e.g., from 8 to 16) to enlarge the pool.
Ledger switch latency: Ledger rollover is triggered by entry‑count, size, or age thresholds. Slow switches often stem from ZooKeeper latency or GC pauses. Upgrade ZooKeeper or tune its JVM GC parameters (e.g., -XX:+UseG1GC, lower the -XX:MaxGCPauseMillis target).
BookKeeper‑io bottleneck: Adjust the BookKeeper thread‑pool size (numIOThreads) and tune the ensemble parameters E/Qw/Qa to match the write load.
Debug logging: Debug‑level logs generate massive I/O. Switch broker, bookie, and client logging to INFO or ERROR in production.
Partition distribution: Pulsar maps bundles to brokers; an uneven bundle allocation overloads specific brokers. Rebalance by increasing the bundle count or switching to the topic_count_equally_divide bundle‑split algorithm.
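To illustrate the confirmation‑hole mitigation above, here is a minimal Go SDK consumer sketch: a Shared subscription that acknowledges each message as soon as processing finishes, keeping the gap between receive and ack (and therefore the ack‑range count) small. The service URL, topic, subscription name, and queue size are placeholders, not values from this deployment.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{
		URL:               "pulsar://broker.example.internal:6650", // placeholder service URL
		ConnectionTimeout: 10 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	consumer, err := client.Subscribe(pulsar.ConsumerOptions{
		Topic:             "persistent://tenant/ns/example-topic", // placeholder topic
		SubscriptionName:  "example-sub",
		Type:              pulsar.Shared,
		ReceiverQueueSize: 1000, // bound prefetch so a slow pod does not hold many unacked messages
	})
	if err != nil {
		log.Fatal(err)
	}
	defer consumer.Close()

	for {
		msg, err := consumer.Receive(context.Background())
		if err != nil {
			log.Printf("receive failed: %v", err)
			continue
		}
		// process(msg.Payload()) ...
		// Ack immediately after processing; long receive-to-ack gaps across many
		// consumers are what let individually-deleted ranges (confirmation holes) grow.
		consumer.Ack(msg)
	}
}
```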
Performance tuning – Issue 2: Frequent client disconnect / reconnect
Root causes identified in the Data project:
Client timeout‑based disconnect/reconnect logic.
Coarse exception handling in the Go SDK (many server errors are treated as fatal).
Sequence‑ID mismatches in older Go SDK versions.
Massive consumer churn after topic‑partition changes.
Solutions
Ensure client pods have sufficient CPU resources; monitor usage via top and the pod's cgroup metrics.
Upgrade to the latest Go SDK (v0.9.x or newer) where sequence‑ID handling and error classification are refined.
Reduce unnecessary consumer creation/destruction; if similar bugs appear in Java, upgrade the Java client as well.
Implement robust error handling: distinguish recoverable errors (e.g., ServerError_TooManyRequests) from fatal ones before closing the channel.
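As a rough illustration of that last point, a send wrapper can retry errors the SDK marks as transient and only tear down the producer on genuinely fatal ones. This is a sketch, not the project's actual code: the Go SDK exposes *pulsar.Error with a Result() accessor, but the exact set of retryable result codes varies by SDK version, and TimeoutError below stands in for whatever codes (such as a too‑many‑requests response) your version reports as recoverable.

```go
package retrysketch

import (
	"context"
	"errors"
	"time"

	"github.com/apache/pulsar-client-go/pulsar"
)

// sendWithRetry classifies Go SDK errors before giving up: transient results are
// retried with a small backoff, everything else is surfaced so the caller can
// decide whether closing the producer/connection is really warranted.
// NOTE: which Result codes are safe to retry depends on the SDK version;
// TimeoutError is shown only as a plausible example.
func sendWithRetry(ctx context.Context, p pulsar.Producer, msg *pulsar.ProducerMessage, attempts int) error {
	var lastErr error
	for i := 0; i < attempts; i++ {
		_, err := p.Send(ctx, msg)
		if err == nil {
			return nil
		}
		lastErr = err

		var perr *pulsar.Error
		if errors.As(err, &perr) && perr.Result() == pulsar.TimeoutError {
			// Recoverable: back off and retry rather than closing the channel.
			time.Sleep(time.Duration(i+1) * 100 * time.Millisecond)
			continue
		}
		return err // treat as fatal only after classification
	}
	return lastErr
}
```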
Performance tuning – Issue 3: ZooKeeper upgrade
The original ZooKeeper 3.4.6 suffered from ZOOKEEPER‑2044, causing intermittent leader election failures. Upgrading to ZooKeeper 3.6.3 eliminated the issue.
Comprehensive troubleshooting checklist
Cluster resource monitoring: Track CPU, memory, and disk I/O of brokers, bookies, and ZooKeeper; watch Java GC logs, especially Full GC pauses.
Client consumption health: Use pulsar-admin topics stats to check unackedMessages, backlog size, and consumer lag; a polling sketch against the admin REST API follows this checklist.
Acknowledgment size: Run pulsar-admin topics stats-internal and inspect individuallyDeletedMessages for large ack ranges.
Thread pool status: Examine pulsar-io, bookkeeper-io, and other executor threads with jstack and top -p <pid>.
Log analysis: Filter out DEBUG logs, search for exceptions, and correlate timestamps across client, broker, and bookie logs.
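The consumption‑health check above can be scripted against the broker's admin REST API, which serves the same data pulsar-admin prints. The sketch below assumes a placeholder admin endpoint and topic and only reads two subscription‑level fields, msgBacklog and unackedMessages, from the stats JSON; the warning threshold is arbitrary.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// Minimal slice of the topic-stats JSON returned by
// GET /admin/v2/persistent/{tenant}/{namespace}/{topic}/stats
// (for partitioned topics, the .../partitioned-stats endpoint aggregates across partitions).
type topicStats struct {
	Subscriptions map[string]struct {
		MsgBacklog      int64 `json:"msgBacklog"`
		UnackedMessages int64 `json:"unackedMessages"`
	} `json:"subscriptions"`
}

func main() {
	// Placeholder admin endpoint and topic path.
	url := "http://broker.example.internal:8080/admin/v2/persistent/tenant/ns/example-topic/stats"

	resp, err := http.Get(url)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var stats topicStats
	if err := json.NewDecoder(resp.Body).Decode(&stats); err != nil {
		log.Fatal(err)
	}

	// Flag subscriptions whose backlog or unacked window looks unhealthy.
	for name, sub := range stats.Subscriptions {
		fmt.Printf("subscription=%s backlog=%d unacked=%d\n", name, sub.MsgBacklog, sub.UnackedMessages)
		if sub.UnackedMessages > 100000 {
			fmt.Printf("  WARNING: %s has a large unacked window; check for confirmation holes\n", name)
		}
	}
}
```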
Key diagnostic diagrams
The original post includes diagrams for: the confirmation‑hole structure, a pulsar‑io thread stall (connections stuck in CLOSE_WAIT), a ledger‑switch latency snapshot, and the bundle‑to‑broker distribution.
Conclusion
The article documents a systematic approach to operating a massive Pulsar deployment, covering server‑side configuration, client‑side tuning, common failure modes, and a step‑by‑step troubleshooting workflow. Ongoing work includes further tuning of Pulsar read/write threads, entry cache sizes, and contributing improvements to the Pulsar Go SDK.
Tencent Cloud Middleware
Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.