
How We Scaled Apache Pulsar on Kubernetes for WeChat’s Billion‑User Real‑Time Recommendations

This article details the WeChat engineering team’s practical experience deploying and optimizing Apache Pulsar on Kubernetes for massive real‑time recommendation workloads, covering cloud‑native advantages, non‑persistent topics, load‑balancing tweaks, broker cache improvements, COS offloader development, and future roadmap.

Tencent Cloud Middleware

Background

WeChat processes billions of messages daily for recommendation, risk control, monitoring, and AI platforms. Data is collected via SDKs, routed through messaging middleware (Pulsar) and then consumed by compute engines such as Flink and TensorFlow before being persisted in HDFS, HBase, Redis, etc.

Why Pulsar?

Stateless broker layer combined with replicated, peer‑to‑peer BookKeeper storage enables elastic scaling and high availability.

Resource isolation (soft or hard) prevents interference between services.

Fine‑grained Namespace/Topic policies simplify cluster administration.

Horizontal expansion of both brokers and bookies handles traffic spikes.

Multi‑language client support (C/C++, Python, and others) matches WeChat's heterogeneous stack, including TensorFlow‑based training pipelines.

Practice 1 – Kubernetes Deployment

The official Pulsar Helm chart was deployed on a Tencent Cloud Kubernetes cluster. The native architecture consists of a Proxy layer, stateless Brokers, and Bookies for persistent storage.

Bottlenecks identified:

Proxy hides client IP, making operations and troubleshooting difficult.

Internal network bandwidth becomes a bottleneck under high load; with three Proxy replicas, inbound traffic can reach ~50 GB/s and outbound ~30 GB/s.

To remove the Proxy bottleneck, brokers were given elastic network interfaces and exposed directly to external clients via a load balancer. Clients perform a Lookup request to obtain the broker address, eliminating the extra Proxy bandwidth.

Additional K8s optimizations:

Proxy removed entirely.

Bookie extended to support multi‑disk, multi‑directory storage; contributed back to the community as PR #113: https://github.com/apache/pulsar-helm-chart/pull/113.

Unified log collection with Tencent Cloud CLS.

Metrics collection via Grafana + Kvass + Thanos, scaled horizontally once performance bottlenecks in the collection pipeline were fixed.
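The Proxy removal and multi‑disk Bookie changes above map to Helm chart overrides. A minimal values sketch follows, assuming value names in the style of the official apache/pulsar-helm-chart; the exact keys vary across chart releases, so verify against your chart version's values.yaml:

```yaml
# values-override.yaml (key names illustrative; check the chart's values.yaml)
components:
  proxy: false            # remove the Proxy layer entirely
broker:
  service:
    type: LoadBalancer    # expose brokers directly; clients Lookup the broker address
bookkeeper:
  volumes:
    ledgers:
      useMultiVolumes: true   # multi-disk, multi-directory ledger storage (PR #113)
      multiVolumes:
        - name: ledgers0
          size: 500Gi
        - name: ledgers1
          size: 500Gi
```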

Practice 2 – Non‑Persistent Topics

In non‑persistent topics, messages bypass the Managed Ledger and are dispatched directly from the Dispatcher to consumers, avoiding BookKeeper I/O. This reduces bandwidth pressure on Bookies and is suitable when occasional data loss is acceptable, such as:

High‑throughput training tasks where consumers cannot keep up.

Latency‑critical training jobs.

Sampling‑based evaluation tasks.
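Clients opt into this delivery mode purely through the topic name's domain prefix: Pulsar chooses the dispatch path from the `persistent://` or `non-persistent://` scheme. A minimal sketch of that naming convention in plain Java (no Pulsar client dependency; the helper names are illustrative):

```java
// Sketch: Pulsar selects the dispatch path from the topic-name domain.
// "persistent://..."      -> Managed Ledger + BookKeeper (durable)
// "non-persistent://..."  -> dispatched from memory; may drop messages
public class TopicDomain {
    static boolean isNonPersistent(String topic) {
        return topic.startsWith("non-persistent://");
    }

    static String fullName(boolean persistent, String tenant, String ns, String topic) {
        return (persistent ? "persistent" : "non-persistent")
                + "://" + tenant + "/" + ns + "/" + topic;
    }

    public static void main(String[] args) {
        String t = fullName(false, "wechat", "training", "samples");
        System.out.println(t);                  // non-persistent://wechat/training/samples
        System.out.println(isNonPersistent(t)); // true
    }
}
```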

Practice 3 – Load Balancing & Broker Cache Optimization

The production cluster experienced repeated bundle unloads, causing load oscillation between brokers. Original load‑manager configuration:

loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
loadBalancerBrokerThresholdShedderPercentage=10
loadBalancerBrokerOverloadedThresholdPercentage=70
Bundle placement strategy: org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate

Issues:

ThresholdShedder averages CPU, traffic, and memory into a single score, which can mask a hot resource on an individual broker; bundles are then shed and reacquired repeatedly, so migration never settles.

Optimizations applied:

Bundle selection changed to a random broker whose load is below the cluster average.

Even bundle distribution disabled (loadBalancerDistributeBundlesEvenlyEnabled=false). See PR 16059: https://github.com/apache/pulsar/pull/16059.

Weight‑based placement strategy switched to LeastResourceUsageWithWeight. See PR 16281: https://github.com/apache/pulsar/pull/16281.

Result: traffic stabilized and load ping‑pong eliminated.
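The two placement changes above (filter to brokers below the cluster average, then choose one at random so bundles spread out instead of piling onto the single least‑loaded broker) can be sketched as follows. This is an illustrative re‑implementation of the idea, not Pulsar's actual LeastResourceUsageWithWeight code, and the names and weights are assumptions:

```java
import java.util.*;

// Illustrative sketch of "random broker below cluster average" placement.
public class UnderloadedBrokerPicker {

    // Combine resource signals into one load score.
    // Equal weights here for simplicity; LeastResourceUsageWithWeight makes them configurable.
    static double weightedLoad(double cpu, double memory, double bandwidth) {
        return (cpu + memory + bandwidth) / 3.0;
    }

    // Keep only brokers whose load is below the cluster average,
    // then pick one at random to avoid herding onto a single broker.
    static Optional<String> pick(Map<String, Double> brokerLoad, Random rnd) {
        double avg = brokerLoad.values().stream()
                .mapToDouble(Double::doubleValue).average().orElse(0.0);
        List<String> candidates = new ArrayList<>();
        for (Map.Entry<String, Double> e : brokerLoad.entrySet()) {
            if (e.getValue() < avg) candidates.add(e.getKey());
        }
        if (candidates.isEmpty()) return Optional.empty();
        return Optional.of(candidates.get(rnd.nextInt(candidates.size())));
    }
}
```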

Broker cache eviction refinement

Original eviction logic removed entries already read by active cursors and entries older than a timestamp:

void doCacheEviction(long maxTimestamp) {
    if (entryCache.getSize() <= 0) return;
    // Evict everything already read by all active cursors
    PositionImpl slowestReaderPos = getEarlierReadPositionForActiveCursors();
    if (slowestReaderPos != null) {
        entryCache.invalidateEntries(slowestReaderPos);
    }
    // Also evict entries older than the time threshold
    entryCache.invalidateEntriesBeforeTimestamp(maxTimestamp);
}

Improved logic introduces configurable flags to optionally evict based on the earliest *mark‑deleted* position, reducing unnecessary eviction of slow‑consuming data:

void doCacheEviction(long maxTimestamp) {
    if (entryCache.getSize() <= 0) return;
    if (factory.getConfig().isRemoveReadEntriesInCache()) {
        PositionImpl evictionPos;
        if (config.isCacheEvictionByMarkDeletedPosition()) {
            // Evict only up to the earliest mark-deleted (acknowledged) position,
            // so slow consumers keep their unacknowledged entries in cache
            PositionImpl earlierMarkDeleted = getEarlierMarkDeletedPositionForActiveCursors();
            evictionPos = earlierMarkDeleted != null ? earlierMarkDeleted.getNext() : null;
        } else {
            // Original behavior: evict up to the slowest read position
            evictionPos = getEarlierReadPositionForActiveCursors();
        }
        if (evictionPos != null) {
            entryCache.invalidateEntries(evictionPos);
        }
    }
    entryCache.invalidateEntriesBeforeTimestamp(maxTimestamp);
}

This refinement improves cache efficiency for workloads with many consumers of varying speeds and frequent task restarts.
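Assuming configuration keys that mirror the getters in the snippet (isRemoveReadEntriesInCache, isCacheEvictionByMarkDeletedPosition; the exact broker.conf names may differ across Pulsar versions), the new behavior is opt‑in:

```
# Keep dropping entries once they have been read by all active cursors
removeReadEntriesInCache=true
# Opt in to evicting only up to the earliest mark-deleted position,
# preserving cached entries for slow or restarting consumers
cacheEvictionByMarkDeletedPosition=true
```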

Practice 4 – COS Offloader Development

Pulsar’s tiered storage offloader moves aged ledgers to cheaper remote storage. The upstream version did not support Tencent Cloud Object Storage (COS), so a custom COS offloader plugin was developed and deployed in production.

Reasons for using an offloader:

Bookie SSD cost is high.

Long‑term storage volume is massive.

Backlog tolerance requires data retention for extended periods.

Data replay scenarios.

The plugin migrates old ledgers to COS while keeping recent data on Bookies, reducing storage costs without sacrificing availability.
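Tiered storage is wired up through Pulsar's standard offloader configuration. A sketch follows; the driver name "COS" is hypothetical since the plugin is an in‑house build, while offloadersDirectory, managedLedgerOffloadDriver, and the set-offload-threshold command are standard Pulsar configuration surface:

```
# broker.conf -- offloader settings (driver name "COS" is illustrative)
offloadersDirectory=./offloaders
managedLedgerOffloadDriver=COS
# Trigger offload per namespace once backlog exceeds 10 GB:
#   bin/pulsar-admin namespaces set-offload-threshold --size 10G tenant/namespace
```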

Future Plans

The team will continue collaborating with the Pulsar community, targeting upcoming improvements such as PIP 192 (broker load‑balancing & cache optimization) and PIP 180 (shadow‑topic read amplification mitigation). Monitoring of the broader Pulsar ecosystem, including Flink integration and end‑to‑end data‑lake pipelines, is also ongoing.

Tags: Cloud Native, Kubernetes, Load Balancing, Apache Pulsar, Real-Time Recommendation
Written by Tencent Cloud Middleware

Official account of Tencent Cloud Middleware. Focuses on microservices, messaging middleware and other cloud‑native technology trends, publishing product updates, case studies, and technical insights. Regularly hosts tech salons to share effective solutions.
