
Scaling Apache Pulsar on Tencent Cloud: Multi‑Network Access, Cluster Migration & HA Tips

This article details Tencent Cloud engineers' technical solutions for large‑scale Apache Pulsar deployments, covering multi‑network access challenges, a routing‑addressing redesign, product deployment models, a four‑step cluster migration process with subscription‑progress compensation, and high‑availability best practices such as rack‑aware and cross‑AZ replica distribution.

Tencent Cloud Middleware

Multi‑Network Access

In cloud environments, a Pulsar cluster must be reachable from three network planes: the internal network, VPC, and the public internet. The traditional solution used AdvertisedListeners plus ListenerName to map each IP address, which led to large per-broker mapping files, configuration drift, and high maintenance overhead.
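For reference, the legacy per-broker mapping looked like the broker.conf fragment below (all addresses are placeholders). Every broker repeated this block with its own addresses for each network plane, which is what made the configuration grow and drift:

```properties
# broker.conf (legacy approach): one advertised address per network plane;
# every address below is a placeholder
advertisedListeners=internal:pulsar://10.0.0.5:6650,vpc:pulsar://172.16.0.5:6650,public:pulsar://203.0.113.5:6650
internalListenerName=internal
```

Clients then select a plane by passing the matching listener name when they connect.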

Routing‑Addressing Redesign

A LookupService backed by a database centralizes network‑to‑broker address resolution. Brokers no longer handle IP mapping, which simplifies configuration, reduces operational complexity, and eases multi‑cluster management and migration.
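A minimal sketch of database-backed lookup, in Python for illustration; the table layout and function names here are assumptions, not Pulsar's actual LookupService API:

```python
# Routing table keyed by (cluster, network_type); in production this
# would live in a database rather than an in-memory dict.
ROUTING_TABLE = {
    ("pulsar-a", "internal"): "pulsar://10.0.0.5:6650",
    ("pulsar-a", "vpc"):      "pulsar://172.16.0.5:6650",
    ("pulsar-a", "public"):   "pulsar://203.0.113.5:6650",
}

def resolve_broker(cluster: str, network_type: str) -> str:
    """Centralized resolution: brokers no longer carry per-IP mappings."""
    try:
        return ROUTING_TABLE[(cluster, network_type)]
    except KeyError:
        raise LookupError(f"no route for {cluster}/{network_type}")
```

Because resolution is centralized, adding a cluster or migrating a topic means updating rows in one place instead of editing every broker's configuration.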

Deployment Models

TDMQ Pulsar supports three typical broker/bookie configurations:

Shared broker & shared bookie – lowest cost but limited isolation and stability.

Exclusive broker & exclusive bookie – full isolation and stability, higher resource consumption.

Exclusive broker & shared bookie – a trade‑off between cost and reliability.

Cluster Migration Procedure

When scaling, migration from the shared model to exclusive models follows four steps:

Metadata synchronization.

Data synchronization (geo‑replication).

Subscription‑progress synchronization (geo + compensation).

Cluster switch (unload + address adjustment).

Pulsar subscription progress consists of MarkDeletePosition and IndividuallyDeleteMessages. Geo‑only sync copies only MarkDeletePosition, leaving progress incomplete and potentially causing duplicate message processing.
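A toy model of the gap, assuming integer positions (real cursors use ledger/entry pairs); `unacked` computes what a consumer would be redelivered under a given cursor state:

```python
mark_delete = 3                # everything up to position 3 is acked contiguously
individually_deleted = {6, 8}  # acked out of order, beyond the mark

def unacked(positions, mark, individual):
    """Positions a consumer would still receive under a given cursor state."""
    return [p for p in positions if p > mark and p not in individual]

stream = list(range(1, 11))

# Full cursor state on the source cluster:
assert unacked(stream, mark_delete, individually_deleted) == [4, 5, 7, 9, 10]

# Geo-only sync drops IndividuallyDeleteMessages, so 6 and 8 come back:
assert unacked(stream, mark_delete, set()) == [4, 5, 6, 7, 8, 9, 10]
```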

Compensation Mechanism

When replicating messages to the target cluster, embed the original cluster’s message ID in the message metadata.

Synchronize both MarkDeletePosition and IndividuallyDeleteMessages to the target cluster.

After consumers switch, filter out messages whose embedded IDs indicate they have already been consumed in the source cluster.

This eliminates incomplete progress and prevents duplicate consumption.
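The mechanism can be sketched as a consumer-side filter, assuming the replicator stamped each message's properties with its source-cluster message ID (the property name `source_msg_id` is illustrative, not a Pulsar built-in):

```python
# IDs already acked on the source cluster, reconstructed from the synced
# MarkDeletePosition and IndividuallyDeleteMessages.
consumed_on_source = {"src:6", "src:8"}

def should_skip(properties: dict) -> bool:
    """Drop messages the consumer already processed before the switch."""
    return properties.get("source_msg_id") in consumed_on_source

msgs = [{"source_msg_id": f"src:{i}"} for i in range(4, 11)]
survivors = [m for m in msgs if not should_skip(m)]
# survivors carry src:4, src:5, src:7, src:9 and src:10, each delivered once
```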

High‑Availability Practices

Beyond single‑point‑failure protection, Pulsar deployments require zone‑level fault tolerance and cross‑region resilience. Storage strategies focus on replica placement.

Rack Awareness

Distribute BookKeeper replicas across different racks so that a rack failure does not affect data availability.
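In a Pulsar deployment, one common way to switch this on is the rack-aware placement flag in broker.conf; rack metadata itself must still be registered for each bookie separately:

```properties
# broker.conf: use BookKeeper's rack-aware placement policy so that the
# replicas of each entry are spread across racks
bookkeeperClientRackawarePolicyEnabled=true
```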

Cross‑AZ Distribution

Place replicas in multiple availability zones. Configure the replica count (write quorum, w) and the acknowledgment count (ack quorum, a) so that after a failure at least w - a + 1 healthy replicas remain; since every acknowledged entry is stored on at least a bookies, this guarantees data safety.
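A worked instance of the condition, with w = 3 replicas and a = 2 acknowledgments (values chosen purely for illustration):

```python
w, a = 3, 2  # write quorum (replica count) and ack quorum

# An acknowledged entry exists on at least `a` bookies, so it survives
# the loss of up to a - 1 of them.
max_tolerated_failures = a - 1

# Equivalently, at least w - a + 1 replicas must stay healthy.
min_healthy = w - max_tolerated_failures
assert min_healthy == w - a + 1  # with these values: 2 healthy replicas
```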

Sticky Reads and Read Strategy

When a replica node fails, enable sticky reads or adjust the read strategy to avoid read backlog and performance degradation.
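Sticky reads are controlled at the broker level; the broker.conf flag below is the relevant knob, though its default varies by Pulsar version, so treat this as a sketch:

```properties
# broker.conf: pin a ledger's reads to a single bookie to improve its
# read-cache hit rate; revisit this if that bookie becomes unhealthy
bookkeeperEnableStickyReads=true
```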

Conclusion

Ongoing work aims to further optimize Apache Pulsar on cloud platforms, improving performance, stability, and availability while contributing enhancements back to the open‑source community.

Tags: high availability, Message Queue, Apache Pulsar, Cluster Migration, Tencent Cloud
Written by

Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
