
Why Tencent Chose Pulsar: Lessons from Cloud‑Native Messaging at Scale

This article reviews the evolution of message queues, explains why Pulsar was adopted at Tencent, details the challenges encountered with open‑source solutions, and presents the architectural design, the performance and observability optimizations, a real‑world case study, and the future roadmap.

Tencent Cloud Middleware

Message Queue Evolution

The open‑source messaging ecosystem started with ActiveMQ (2003) and later introduced Kafka, RocketMQ, RabbitMQ, and Apache Pulsar (created at Yahoo in 2012 and open‑sourced in 2016). Kafka dominates offline batch processing, RocketMQ excels at low‑latency online workloads, RabbitMQ offers simplicity but limited scalability, and Pulsar was designed to serve online and offline workloads in one system while supporting massive multi‑tenant deployments.

Pulsar Design Goals

Cloud‑Native Architecture

Separate compute (Brokers) and storage (BookKeeper) to enable independent scaling on Kubernetes or other container platforms.

Cross‑region and cross‑rack replication for higher fault tolerance.

Operator‑based zero‑code deployment for rapid provisioning.

Multi‑Tenant & Massive Topic Support

Namespace and topic‑level ACLs allow many teams to share a single cluster safely.

The architecture is built to handle millions of topics (e.g., more than 30k partitions per cluster).

Unified Stream‑Batch Model

All‑in‑one broker cluster simplifies operations compared with separate Kafka and RocketMQ stacks.

Built‑in Kafka connectors ease migration from existing Kafka deployments.

Typical Pulsar Deployment

A minimal production cluster consists of:

2 Brokers (compute nodes)

3 ZooKeeper nodes (metadata coordination)

3 BookKeeper nodes (durable storage)

Clients can use Java, Go, C++, or Node.js SDKs; the most common are Java and Go.

Challenges in Open‑Source Messaging

Control‑Plane Issues

Fine‑grained permission management for metadata (topics, namespaces) is complex.

When a cluster exceeds ~30k partitions, policy updates become unstable.

Lack of audit trails for metadata changes makes troubleshooting difficult.

Data‑Plane Issues

Producers throttled by the broker receive no immediate error, causing silent back‑pressure.

Many configuration items require broker restarts, reducing availability.

ListenerName‑based network configuration is inflexible for scaling.

Message gaps (holes) are not automatically recovered.

Seamless migration between overloaded clusters is missing.

Observability Issues

Exporting metrics for tens of thousands of topics puts heavy GC pressure on brokers.

No end‑to‑end message traceability across producer‑broker‑consumer.

Complex alert rules hinder proactive detection of issues such as backlog growth.

Engineering Solutions Implemented

Control‑Plane Enhancements

Performance optimization: Refactored broker metadata handling and added stress‑testing pipelines to sustain more than 1,000 metadata operations per second.

Stability monitoring: Built a ZooKeeper growth monitor that compares actual node size against expected increments and auto‑corrects anomalies.
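
At its core, such a monitor compares an observed growth delta against the expected increment plus a tolerance. The following is a minimal sketch of that check; the names (check_growth, expected_delta, tolerance) and the 50% tolerance are illustrative assumptions, not taken from the TDMQ codebase.

```python
def check_growth(prev_count, curr_count, expected_delta, tolerance=0.5):
    """Return True when metadata node growth stays within the expected
    bound; False flags an anomaly for investigation or auto-correction."""
    actual_delta = curr_count - prev_count
    # Allow observed growth to exceed the expectation by a tolerance
    # factor before raising an alarm.
    return actual_delta <= expected_delta * (1 + tolerance)
```

The same comparison can run on node counts, byte sizes, or both, depending on which dimension of ZooKeeper growth is being guarded.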

Metadata tagging: Added storage tags to enable post‑mortem analysis of metadata objects.

Resource traceability: Every metadata operation is logged and queryable via a management API.

Data‑Plane Improvements

Active push of gap messages: brokers detect missing sequence numbers that persist beyond a configurable timeout and push the missing messages to consumers.
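
A minimal sketch of this detection logic, assuming per‑partition sequence IDs and a configurable timeout; the GapDetector class and its method names are illustrative, not the broker's actual implementation:

```python
import time

class GapDetector:
    """Track sequence IDs on one partition; IDs that stay missing longer
    than `timeout_s` are reported so the missing messages can be re-pushed
    to consumers."""

    def __init__(self, timeout_s=5.0):
        self.timeout_s = timeout_s
        self.next_expected = 0
        self.missing = {}  # seq -> time it was first seen missing

    def observe(self, seq, now=None):
        now = time.monotonic() if now is None else now
        if seq > self.next_expected:
            # Record every skipped sequence ID as a potential hole.
            for missing_seq in range(self.next_expected, seq):
                self.missing.setdefault(missing_seq, now)
        self.missing.pop(seq, None)  # a late arrival fills its hole
        self.next_expected = max(self.next_expected, seq + 1)
        # Report only holes that have persisted beyond the timeout.
        return [s for s, t in self.missing.items() if now - t >= self.timeout_s]
```

Late arrivals within the timeout fill their own holes silently; only gaps that persist trigger the active push.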

Fail‑fast throttling: producers receive immediate TooManyRequests errors when the broker applies rate limits.
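
On the producer side, an immediate error makes back‑pressure visible and retryable. The sketch below shows the idea as a backoff loop; TooManyRequests here is a stand‑in exception and send_with_backoff a hypothetical helper, not the real client API:

```python
class TooManyRequests(Exception):
    """Stand-in for the immediate throttling error the broker now returns."""

def send_with_backoff(send, msg, max_retries=3, base_delay_s=0.1):
    """Retry a throttled send with exponential backoff. Returns the send
    result plus the backoff schedule applied; real client code would sleep
    for each delay before the next attempt."""
    delays = []
    for attempt in range(max_retries + 1):
        try:
            return send(msg), delays
        except TooManyRequests:
            if attempt == max_retries:
                raise  # fail fast after exhausting the retry budget
            delays.append(base_delay_s * 2 ** attempt)
```

Because the error arrives immediately instead of as a silent stall, the application can choose to back off, shed load, or alert.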

Wildcard domain allocation replaces static ListenerName, allowing dynamic DNS‑based routing and easier scaling.

Global OHC+LRU cache for BookKeeper client buffers; applied bug‑fixes to improve BookKeeper stability.

Dynamic configuration via ZooKeeper and Apollo: configuration changes (e.g., load‑balancing ratios) take effect without broker restarts.
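The restart‑free behaviour boils down to readers always seeing an atomically swapped configuration snapshot while a watcher (ZooKeeper or Apollo in this design) applies updates. A minimal sketch, with illustrative names and keys:

```python
import threading

class DynamicConfig:
    """Hold a configuration snapshot that a watcher thread can replace
    while brokers keep serving, so no restart is needed."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._snapshot = dict(initial)

    def update(self, changes):
        with self._lock:
            merged = dict(self._snapshot)
            merged.update(changes)
            self._snapshot = merged  # swap the whole snapshot atomically

    def get(self, key, default=None):
        return self._snapshot.get(key, default)  # lock-free snapshot read
```

Swapping an immutable snapshot, rather than mutating a shared dict in place, is what lets readers stay lock‑free.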

Message trace feature records producer IP, message ID, publish latency, consumer ID, and ack time, providing full lifecycle visibility.
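
The trace fields named above can be pictured as one record that is filled in across the message lifecycle; this shape is illustrative, not the wire format used by TDMQ Pulsar:

```python
from dataclasses import dataclass, asdict

@dataclass
class MessageTrace:
    """One trace record per message: producer-side fields are set at
    publish time, consumer-side fields when the message is acknowledged."""
    message_id: str
    producer_ip: str
    publish_latency_ms: float
    consumer_id: str = ""
    ack_time_ms: int = 0

    def is_complete(self):
        # A trace is complete once the consumer side has acknowledged.
        return bool(self.consumer_id) and self.ack_time_ms > 0
```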

Observability Enhancements

Push‑based metric aggregation: brokers batch topic metrics and push them to Prometheus, reducing scrape size from dozens of MB to a few hundred KB per interval.
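
The payload reduction comes from rolling per‑topic samples up to a coarser level before the push, so the payload no longer scales with topic count. A toy version, assuming topic names of the form persistent://tenant/namespace/name and two example metrics:

```python
from collections import defaultdict

def aggregate_topic_metrics(samples):
    """Aggregate (topic, metrics) samples to namespace level before
    pushing, so payload size tracks namespaces rather than topics."""
    agg = defaultdict(lambda: {"msgRateIn": 0.0, "backlog": 0})
    for topic, metrics in samples:
        namespace = topic.rsplit("/", 1)[0]  # strip the topic's own name
        agg[namespace]["msgRateIn"] += metrics["msgRateIn"]
        agg[namespace]["backlog"] += metrics["backlog"]
    return dict(agg)
```

With tens of thousands of topics collapsing into a few hundred namespaces, each push interval carries correspondingly fewer series.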

Full‑link tracing: integrates with distributed tracing systems to correlate producer, broker, and consumer spans.

Fine‑grained alert templates: built‑in alert rules for backlog, unacknowledged messages, and ZooKeeper growth.
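
As a toy version of one such rule, a backlog alert can fire both on absolute size and on projected growth; the thresholds and 30‑minute horizon are illustrative assumptions:

```python
def backlog_alert(backlog, threshold, growth_per_min, horizon_min=30):
    """Return an alert string, or None when the backlog is healthy.
    Fires early when current growth would breach the threshold within
    the projection horizon."""
    if backlog >= threshold:
        return "critical: backlog above threshold"
    if growth_per_min > 0 and backlog + growth_per_min * horizon_min >= threshold:
        return "warning: backlog projected to breach threshold"
    return None
```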

Automated health inspection: periodic health checks evaluate composite metrics and notify operators proactively.

Case Study – King’s Camp App

The King’s Camp mobile game streams user login, team, room, highlight, and kill events to Pulsar. Key configuration details:

Topic naming uses environment suffixes (e.g., login-prod, login-test).

Message TTL is set to 2 hours because expired data does not need to be retained.

Client SDK: Go Pulsar library (github.com/apache/pulsar-client-go).

Throughput reaches ~100k messages per second for both production and consumption.

Using TDMQ Pulsar’s peak‑shaving and shared subscription modes reduces RPC overhead and smooths bursty traffic.
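
The naming and TTL conventions above can be captured in a few lines; the tenant and namespace values here are placeholders, not the app's real configuration:

```python
MESSAGE_TTL_SECONDS = 2 * 60 * 60  # 2 hours: expired data need not be retained

def topic_name(tenant, namespace, base, env):
    """Build a fully qualified topic with the environment suffix used in
    the case study (e.g. login-prod vs. login-test)."""
    return f"persistent://{tenant}/{namespace}/{base}-{env}"
```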

Future Roadmap for TDMQ Pulsar

Support message‑trace queries by business ID.

Enrich operational dashboards with business‑oriented metrics.

Provide diagnostic tools for producer‑consumer latency and backlog analysis.

Optimize handling of large‑delay (long‑latency) messages.

Introduce gray‑release capable dynamic broker configuration (e.g., feature flags via Apollo).

Tags: cloud-native, Message Queue, Tencent, Apache Pulsar
Written by

Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
