Boosting Real-Time Recommendations: Apache Pulsar Optimizations at WeChat
This article details how WeChat's Gemini‑2.0 big‑data platform leverages Apache Pulsar, outlining cloud‑native advantages, load‑balancing refinements, cache and SSD tuning, high‑availability safeguards, and cost‑saving strategies that together enable large‑scale, real‑time, deep‑learning recommendation workloads.
Background
Pulsar serves as the message queue for WeChat's big‑data platform Gemini‑2.0, supporting real‑time data and recommendation scenarios.
Gemini‑2.0 is an internal cloud‑native big‑data platform built on Tencent Cloud TKE, offering compute‑storage separation, unified AI/Big‑Data orchestration, high‑performance compute components, and flexible extensibility.
As recommendation systems evolve into a "large‑scale + full‑real‑time + deep‑learning" era, the data platform must boost processing capacity, making the message queue a critical data bus.
Why Apache Pulsar?
Cloud‑native features: distributed, elastic scaling, read/write separation, stateless brokers, and replicated Bookies.
Resource isolation: soft or hard isolation to prevent cross‑service interference.
Flexible policy control: namespace/topic‑level policies.
Rapid scaling: stateless brokers and peer‑to‑peer Bookies enable quick expansion.
Multi‑language clients for seamless AI component integration.
High Performance – Load‑Balancing Optimization
In production, some brokers experienced fluctuating loads despite overall low cluster utilization. The root cause was the load‑balancing strategy repeatedly unloading and loading bundles based on mismatched metrics.
loadManagerClassName=org.apache.pulsar.broker.loadbalance.impl.ModularLoadManagerImpl
loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
loadBalancerBrokerThresholdShedderPercentage=10
# The class responsible for bundle placement: org.apache.pulsar.broker.loadbalance.impl.LeastLongTermMessageRate

Optimizations included unifying the bundle load/unload logic, allowing all brokers to act as candidates under isolation policies, and preferring nodes below the average load. After these changes, daily load-adjustment operations dropped from roughly 1,000 to single digits, eliminating the frequent rebalancing.
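The "prefer nodes below average load" preference can be sketched as a small selection function. This is an illustrative model only, not Pulsar's actual load-manager internals; the function name and load representation are assumptions.

```python
# Hypothetical sketch: among candidate brokers, prefer those whose load
# is below the cluster average, falling back to all brokers otherwise.
def pick_broker(loads: dict[str, float]) -> str:
    """Pick a broker for a bundle, preferring nodes below average load."""
    avg = sum(loads.values()) / len(loads)
    below_avg = [b for b, load in loads.items() if load < avg]
    candidates = below_avg or list(loads)  # fall back if none are below average
    return min(candidates, key=loads.get)  # least-loaded eligible broker

print(pick_broker({"broker-1": 0.9, "broker-2": 0.3, "broker-3": 0.5}))
```

Favoring below-average nodes keeps a newly placed bundle from immediately pushing a busy broker over the shedding threshold, which is what caused the repeated unload/load cycles.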
Catch‑Up Read Optimization
Catch‑up reads (consuming historical data) stress the Bookie layer, risking cache eviction and storage overload. By introducing topic‑level cache‑duration settings and deduplicating overlapping read requests, cache hit rates rose from 85% to 95%, and storage read traffic dropped significantly.
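The deduplication idea can be sketched as a coalescer that lets concurrent identical catch-up reads share one storage fetch. This is a simplified illustration, not Pulsar's actual read path; the class and callback names are assumptions.

```python
# Illustrative sketch: overlapping read requests for the same entry range
# trigger only one Bookie read; later callers reuse the in-flight result.
class ReadCoalescer:
    def __init__(self, storage_read):
        self._storage_read = storage_read   # the expensive Bookie-layer read
        self._in_flight = {}                # (ledger, start, end) -> result

    def read(self, ledger, start, end):
        key = (ledger, start, end)
        if key not in self._in_flight:
            self._in_flight[key] = self._storage_read(ledger, start, end)
        return self._in_flight[key]

calls = []
def slow_read(ledger, start, end):
    calls.append((ledger, start, end))  # record each real storage read
    return [f"entry-{i}" for i in range(start, end + 1)]

coalescer = ReadCoalescer(slow_read)
coalescer.read(7, 0, 99)
coalescer.read(7, 0, 99)  # identical request: served from the first read
print(len(calls))         # only one storage read was issued
```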
SSD Tuning
Modified the Helm chart to allow multiple journal/ledger disks per Bookie, added extra journal directories, and tuned buffer sizes:
writeBufferSizeBytes=67108864
dbStorage_rocksDB_writeBufferSizeMB=128
readBufferSizeBytes=4096

These changes let Bookie throughput approach disk limits and reduce I/O spikes.
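A multi-disk Bookie layout of the kind described above might look like the following bookkeeper.conf excerpt. The directory paths are illustrative assumptions; `journalDirectories` and `ledgerDirectories` accept comma-separated lists so journal and ledger I/O can be spread across devices.

```
# Hypothetical bookkeeper.conf excerpt — paths are illustrative
journalDirectories=/pulsar/journal0,/pulsar/journal1
ledgerDirectories=/pulsar/ledger0,/pulsar/ledger1
```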
High Availability
Overload protection: per‑topic rate limits, traffic‑aware topic segregation, and auto‑scaling when average load exceeds thresholds.
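Per-topic rate limiting is commonly implemented as a token bucket; the minimal sketch below illustrates the idea only. Pulsar's real limiter lives inside the broker and is configured through publish-rate policies, so this class and its names are assumptions.

```python
# Minimal token-bucket sketch of per-topic publish rate limiting.
class TopicRateLimiter:
    def __init__(self, rate_per_sec: int):
        self.rate = rate_per_sec
        self.tokens = rate_per_sec

    def tick(self):
        """Refill the bucket once per second."""
        self.tokens = self.rate

    def try_publish(self) -> bool:
        """Consume one token; reject the publish if none remain."""
        if self.tokens > 0:
            self.tokens -= 1
            return True
        return False

limiter = TopicRateLimiter(rate_per_sec=2)
results = [limiter.try_publish() for _ in range(3)]
print(results)  # the third publish within the same second is rejected
```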
Disaster recovery: multi‑zone deployment of all components, rack‑aware Bookie placement, and increased rereplicationEntryBatchSize to speed up ledger replication.
rereplicationEntryBatchSize=100

Ease of Maintenance
Centralized log collection via Tencent Cloud Log Service (ES backend) for fast querying.
Distributed Prometheus monitoring using Kvass + Thanos, enabling horizontal scaling and per‑cluster isolation.
Integrated alerting for user‑level backlog detection.
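User-level backlog detection reduces to scanning per-subscription backlog figures and flagging any that exceed a threshold. The stats shape below is a simplified assumption, not the exact schema Pulsar's topic stats return.

```python
# Hedged sketch: flag subscriptions whose backlog exceeds a threshold.
def backlog_alerts(topic_stats: dict, threshold: int) -> list[str]:
    alerts = []
    for topic, subs in topic_stats.items():
        for sub, backlog in subs.items():
            if backlog > threshold:
                alerts.append(f"{topic}/{sub}: backlog={backlog}")
    return alerts

stats = {
    "persistent://tenant/ns/clicks": {"recsys": 1_200_000, "audit": 10},
}
print(backlog_alerts(stats, threshold=100_000))
```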
Cost Reduction
Optimized network flow by exposing broker IPs for direct client connections, eliminating proxy‑induced bandwidth waste. Leveraged non‑persistent topics for workloads tolerant of data loss, and implemented a COS offloader to move aged ledgers to cheap object storage, dramatically lowering SSD costs.
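Tiered-storage offloading of this kind is driven by a handful of broker settings; a sketch is below. The `cos` driver name is an assumption standing in for WeChat's custom COS offloader (Pulsar's built-in drivers cover other object stores, such as S3-compatible ones), and the threshold value is illustrative.

```
# Hedged broker.conf sketch — "cos" is a custom offloader, not built-in
managedLedgerOffloadDriver=cos
# Offload ledgers once a topic's storage exceeds ~10 GB
managedLedgerOffloadAutoTriggerSizeThresholdBytes=10737418240
```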
Conclusion
Through a series of architectural refinements, load‑balancing, cache, SSD, and monitoring optimizations, Pulsar now delivers higher performance, availability, maintainability, and lower cost, fully supporting WeChat's massive, real‑time, deep‑learning recommendation system.
WeChat Backend Team
Official account of the WeChat backend development team, sharing their experience in large-scale distributed system development.
