Why MultiTopicsConsumerImpl Slowed Down Pulsar and How We Boosted Its Throughput 4×
A Pulsar community expert investigated why MultiTopicsConsumerImpl delivered only a fraction of the expected throughput, identified lock contention and EventLoop overhead as the main culprits, applied lock‑removal and thread‑pool optimizations, and achieved nearly four‑fold performance gains.
Background
The Pulsar community asked for help diagnosing a performance issue where the MultiTopicsConsumerImpl, which should aggregate multiple ConsumerImpl instances for a multi‑partition topic, performed worse than a single ConsumerImpl.
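For context, an application never constructs a MultiTopicsConsumerImpl directly: subscribing to a partitioned topic through the Java client yields one, wrapping one internal ConsumerImpl per partition. A minimal sketch, assuming a broker at pulsar://localhost:6650 (the service URL is a placeholder):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;

public class MultiTopicsConsumerExample {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Subscribing to a partitioned topic makes the client create one
        // internal ConsumerImpl per partition and aggregate them behind a
        // single MultiTopicsConsumerImpl.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/default/p-topic")
                .subscriptionName("my-sub-6")
                .subscribe();

        Message<byte[]> msg = consumer.receive();
        consumer.acknowledge(msg);

        consumer.close();
        client.close();
    }
}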
Problem Statement
Despite the expectation that parallel consumption across partitions would increase throughput, the observed throughput of MultiTopicsConsumerImpl was only about one‑seventh that of a single ConsumerImpl.
Test Setup
A three‑node Pulsar cluster was deployed on 8‑core, 16 GB VMs, with topics created with four partitions. The built‑in pulsar-perf tool was used to benchmark both implementations for a two‑minute consumption period.
bin/pulsar-perf consume -u 'http://x.x.x.x:8080' -s my-sub-6 -sp Earliest -q 100000 persistent://public/default/p-topic

Initial Findings
Performance numbers showed:
MultiTopicsConsumerImpl: 11,715,556 records, 68,813.420 msg/s, 537.605 Mbit/s
ConsumerImpl: 78,403,434 records, 462,640.204 msg/s, 3,614.377 Mbit/s

Flame graphs revealed that 40.65% of CPU time was spent in business threads, with 14% in MessageReceived and 8.22% in re‑entrant locks; overall, lock contention accounted for roughly 20% of CPU time.
Optimization Steps
Replace custom locking with a thread‑safe BlockingQueue, eliminating redundant locks.
Reduce lock acquisition frequency by adding pre‑checks before attempting to lock.
Refactor logic to remove unnecessary locks entirely.
Where possible, substitute re‑entrant locks with read‑write locks so concurrent readers no longer block each other; see the sketch after this list.
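To make the first, second, and fourth steps concrete, here is an illustrative Java sketch; the class and field names are hypothetical, not the actual Pulsar client internals:

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.locks.ReentrantLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockOptimizationSketch {

    // Step 1: ArrayBlockingQueue synchronizes internally, so an external
    // re-entrant lock guarding the receive buffer becomes redundant.
    private final BlockingQueue<byte[]> incomingMessages = new ArrayBlockingQueue<>(1000);

    void enqueue(byte[] payload) throws InterruptedException {
        incomingMessages.put(payload); // thread-safe without extra locking
    }

    // Step 2: pre-check before locking. A cheap volatile read keeps the
    // common path off the lock; the flag is re-checked once the lock is held.
    private volatile boolean hasPendingReceives = false;
    private final ReentrantLock pendingLock = new ReentrantLock();

    void completePendingReceives() {
        if (!hasPendingReceives) {
            return; // no lock traffic on the hot path
        }
        pendingLock.lock();
        try {
            if (hasPendingReceives) {
                // ... hand queued messages to waiting receive() calls ...
                hasPendingReceives = false;
            }
        } finally {
            pendingLock.unlock();
        }
    }

    // Step 4: a read-write lock lets frequent readers proceed in parallel;
    // only the rare writer excludes them, where a re-entrant lock would
    // serialize every access.
    private final ReentrantReadWriteLock stateLock = new ReentrantReadWriteLock();
    private boolean paused = false;

    boolean isPaused() {
        stateLock.readLock().lock();
        try {
            return paused;
        } finally {
            stateLock.readLock().unlock();
        }
    }

    void setPaused(boolean value) {
        stateLock.writeLock().lock();
        try {
            paused = value;
        } finally {
            stateLock.writeLock().unlock();
        }
    }
}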
Lock‑Removal Results
// before optimization
Aggregated throughput stats --- 11715556 records --- 68813.420 msg/s --- 537.605 Mbit/s

// after optimization
Aggregated throughput stats --- 25062077 records --- 161656.814 msg/s --- 1262.944 Mbit/s

EventLoop Optimization
Further profiling showed that Netty's EventLoopGroup consumed 12.63% of CPU time, largely in the Native.eventFdWrite system call used to wake event‑loop threads. Replacing the Netty EventLoop with a standard ThreadPoolExecutor backed by a BlockingQueue reduced this overhead.
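A minimal sketch of the substitution, assuming the goal is simply to move per‑message callback dispatch off the Netty event loop (the pool sizing and executor name are illustrative):

import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ExecutorSwapSketch {
    public static void main(String[] args) {
        // A plain JDK thread pool backed by a BlockingQueue: submitting a
        // task enqueues it without the eventfd wake-up syscall
        // (Native.eventFdWrite) that Netty issues to signal its event
        // loops from other threads.
        ThreadPoolExecutor internalExecutor = new ThreadPoolExecutor(
                4, 4,                   // fixed pool, e.g. one thread per partition
                60L, TimeUnit.SECONDS,  // keep-alive for idle threads
                new LinkedBlockingQueue<>());

        internalExecutor.execute(() -> {
            // ... dispatch a received message to the application ...
        });

        internalExecutor.shutdown();
    }
}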
// before EventLoop optimization
Aggregated throughput stats --- 11715556 records --- 68813.420 msg/s --- 537.605 Mbit/s

// after EventLoop optimization
Aggregated throughput stats --- 18392800 records --- 133314.602 msg/s --- 1041.520 Mbit/s

Final Performance
Combining lock removal and the EventLoop replacement yielded a roughly four‑fold throughput increase for MultiTopicsConsumerImpl:
// final results
MultiTopicsConsumerImpl before: 11,715,556 records, 68,813.420 msg/s, 537.605 Mbit/s
MultiTopicsConsumerImpl after: 40,140,549 records, 275,927.749 msg/s, 2,155.686 Mbit/s
ConsumerImpl (baseline): 78,403,434 records, 462,640.204 msg/s, 3,614.377 Mbit/s

Conclusion
The primary bottlenecks were lock contention and EventLoop signaling overhead; eliminating unnecessary locks and switching to a more efficient thread pool dramatically improved throughput. Even so, the optimized MultiTopicsConsumerImpl reaches only about 60% of the single ConsumerImpl's throughput (275,927 vs. 462,640 msg/s), so further architectural changes would be needed to close the remaining gap.
