
Boosting Real-Time Recommendations: Apache Pulsar Optimizations at WeChat

This article details how WeChat's Gemini‑2.0 big‑data platform leverages Apache Pulsar, outlining cloud‑native advantages, load‑balancing refinements, cache and SSD tuning, high‑availability safeguards, and cost‑saving strategies that together enable large‑scale, real‑time, deep‑learning recommendation workloads.

WeChat Backend Team

Background

Pulsar serves as the message queue for WeChat's big‑data platform Gemini‑2.0, supporting real‑time data and recommendation scenarios.

Gemini‑2.0 is an internal cloud‑native big‑data platform built on Tencent Cloud TKE, offering compute‑storage separation, unified AI/Big‑Data orchestration, high‑performance compute components, and flexible extensibility.

As recommendation systems evolve into a "large‑scale + full‑real‑time + deep‑learning" era, the data platform must boost processing capacity, making the message queue a critical data bus.

Why Apache Pulsar?

Cloud‑native features: distributed, elastic scaling, read/write separation, stateless brokers, and replicated Bookies.

Resource isolation: soft or hard isolation to prevent cross‑service interference.

Flexible policy control: namespace/topic‑level policies.

Rapid scaling: stateless brokers and peer‑to‑peer Bookies enable quick expansion.

Multi‑language clients for seamless AI component integration.

High Performance – Load‑Balancing Optimization

In production, some brokers experienced fluctuating loads despite overall low cluster utilization. The root cause was the load‑balancing strategy repeatedly unloading and loading bundles based on mismatched metrics.

loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
loadBalancerBrokerThresholdShedderPercentage=10
# The handler class responsible for loading bundles is: org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate

Optimizations included unifying bundle load/unload logic, allowing all brokers as candidates under isolation policies, and preferring nodes below average load. After changes, daily load‑adjustment cycles dropped from ~1000 to single‑digit levels, eliminating frequent rebalancing.
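The placement preference described above can be sketched as follows. This is an illustrative sketch, not Pulsar's actual implementation; the function name and load representation are assumptions made for clarity.

```python
# Illustrative sketch (not Pulsar source): when placing an unloaded bundle,
# prefer brokers whose load is at or below the cluster average.

def pick_broker(broker_loads: dict[str, float]) -> str:
    """broker_loads maps broker address -> load percentage (0-100)."""
    avg = sum(broker_loads.values()) / len(broker_loads)
    # Candidates are brokers at or below the average load; if the filter
    # is empty, fall back to considering all brokers.
    below_avg = {b: load for b, load in broker_loads.items() if load <= avg}
    candidates = below_avg or broker_loads
    # Among candidates, pick the least-loaded broker.
    return min(candidates, key=candidates.get)

print(pick_broker({"broker-1": 80.0, "broker-2": 20.0, "broker-3": 40.0}))  # broker-2
```

Favoring below-average nodes keeps a single hot broker from repeatedly receiving bundles that it will soon have to shed again, which is what drove the rebalancing churn.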

Catch‑Up Read Optimization

Catch‑up reads (consuming historical data) stress the Bookie layer, risking cache eviction and storage overload. By introducing topic‑level cache‑duration settings and deduplicating overlapping read requests, cache hit rates rose from 85% to 95%, and storage read traffic dropped significantly.
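The deduplication of overlapping read requests amounts to coalescing entry ranges before they reach the storage layer. A minimal sketch of that idea (not Pulsar's actual code; the inclusive-range representation is an assumption):

```python
# Illustrative sketch (not Pulsar source): merge overlapping or adjacent
# entry-range read requests so repeated catch-up reads over the same
# entries hit the BookKeeper storage layer only once.

def coalesce_reads(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge overlapping/adjacent (start_entry, end_entry) ranges, inclusive."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or is adjacent to the previous range: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

print(coalesce_reads([(0, 50), (40, 100), (120, 130)]))  # [(0, 100), (120, 130)]
```

Two lagging consumers reading entries 0-50 and 40-100 then trigger one storage read of 0-100 instead of two overlapping ones.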

SSD Tuning

Modified the Helm chart to allow multiple journal/ledger disks per Bookie, added extra journal directories, and tuned buffer sizes:

writeBufferSizeBytes=67108864
dbStorage_rocksDB_writeBufferSizeMB=128
readBufferSizeBytes=4096

These changes let Bookie throughput approach disk limits and reduce I/O spikes.
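For reference, spreading journal and ledger I/O across multiple SSDs is expressed in bookkeeper.conf as comma-separated directory lists; the paths below are illustrative, not the values used in WeChat's deployment:

```
# bookkeeper.conf: comma-separated lists spread journal and ledger I/O
# across several SSDs (paths are illustrative).
journalDirectories=/mnt/ssd0/journal,/mnt/ssd1/journal
ledgerDirectories=/mnt/ssd0/ledgers,/mnt/ssd1/ledgers,/mnt/ssd2/ledgers
```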

High Availability

Overload protection: per‑topic rate limits, traffic‑aware topic segregation, and auto‑scaling when average load exceeds thresholds.

Disaster recovery: multi‑zone deployment of all components, rack‑aware Bookie placement, and increased rereplicationEntryBatchSize to speed up ledger replication.

rereplicationEntryBatchSize=100
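The threshold-triggered auto-scaling mentioned under overload protection can be sketched as a simple sizing calculation. This is an illustrative sketch only; the function name and the threshold/target values are assumptions, not the platform's actual controller logic.

```python
import math

# Illustrative sketch (not the platform's controller): when the average
# broker load exceeds a threshold, compute how many brokers to add so the
# projected average load falls back to a target level.

def brokers_to_add(loads: list[float], threshold: float = 70.0,
                   target: float = 50.0) -> int:
    """loads are per-broker load percentages; returns brokers to add."""
    avg = sum(loads) / len(loads)
    if avg <= threshold:
        return 0  # Cluster is healthy; no scale-out needed.
    total = sum(loads)
    # Smallest n such that total / (len(loads) + n) <= target.
    return max(0, math.ceil(total / target) - len(loads))

print(brokers_to_add([90.0, 80.0, 85.0]))  # prints 3
```

Because brokers are stateless, adding the computed number of replicas is cheap, and the load balancer then moves bundles onto the new nodes.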

Ease of Maintenance

Centralized log collection via Tencent Cloud Log Service (ES backend) for fast querying.

Distributed Prometheus monitoring using Kvass + Thanos, enabling horizontal scaling and per‑cluster isolation.

Integrated alerting for user‑level backlog detection.

Cost Reduction

Optimized network flow by exposing broker IPs for direct client connections, eliminating proxy‑induced bandwidth waste. Leveraged non‑persistent topics for workloads tolerant of data loss, and implemented a COS offloader to move aged ledgers to cheap object storage, dramatically lowering SSD costs.

Conclusion

Through a series of refinements to load balancing, caching, SSD configuration, high availability, and monitoring, Pulsar now delivers higher performance, availability, and maintainability at lower cost, fully supporting WeChat's massive, real-time, deep-learning recommendation system.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

cloud native, performance optimization, Big Data, real-time data, Message Queue, Apache Pulsar
Written by

WeChat Backend Team

Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.