Root Cause Analysis and Optimization of High Load on Alibaba Cloud RocketMQ Consumer Service
The article investigates why a RocketMQ consumer service running on a 4‑core ECS experiences sustained high load despite low CPU, I/O and memory usage, identifies excessive thread creation and frequent trace‑module context switches as the main causes, and proposes configuration and SDK upgrades to resolve the issue.
Background: a service that consumes MQ messages through the Alibaba Cloud RocketMQ SDK 1.2.6 runs on an ECS instance (4 cores, 8 GB). Once the consumer count grows beyond 200, the instance shows sustained high load even though CPU, I/O and memory usage stay low.
Load analysis shows load_15m and load_5m staying between 3‑5, while load_1m frequently exceeds the number of cores, indicating intermittent congestion.
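The comparison between load and core count can also be made from inside the JVM. The following is a minimal sketch, not part of the original investigation: the class name LoadCheck and the helper isSaturated are hypothetical, and it relies only on the standard OperatingSystemMXBean, which reports the 1-minute load average (or a negative value where the platform does not provide one).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    /** True when, on average, more tasks are runnable or waiting than there are cores. */
    static boolean isSaturated(double loadAverage, int cores) {
        return loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        int cores = os.getAvailableProcessors();
        // 1-minute system load average; negative if the platform cannot report it.
        double load1m = os.getSystemLoadAverage();

        if (load1m < 0) {
            System.out.println("load average not available on this platform");
        } else {
            System.out.printf("cores=%d load_1m=%.2f saturated=%b%n",
                    cores, load1m, isSaturated(load1m, cores));
        }
    }
}
```

On the 4-core instance described here, a load_1m above 4 would report saturated=true even while CPU utilization stays low.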
Investigation using vmstat and pidstat reveals frequent context switches and interrupts; CPU usage remains low but the system spends much time handling thread scheduling.
ECS configuration: 4 cores, 8 GB
Number of physical CPUs = 4
Cores per physical CPU = 1
Single-core, multi-processor
Further inspection shows thousands of Java threads (≈9700), many of them performing more than 100 context switches per second, especially those belonging to the RocketMQ consumer.
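To see which component owns most of the threads, one option is to group live threads by name. The sketch below is an illustration under assumptions, not the tooling used in the article: ThreadCensus, poolName and countByPool are hypothetical names, the suffix-stripping rule is a guess at common pool naming ("name-12", "name_3"), and it only inspects the JVM it runs in. For a separate process, a jstack dump gives the same names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ThreadCensus {
    /** Strips a trailing numeric suffix such as "-12" or "_3" so pool members group together. */
    static String poolName(String threadName) {
        return threadName.replaceAll("[-_#]?\\d+$", "");
    }

    /** Counts thread names per pool prefix, sorted by prefix. */
    static Map<String, Integer> countByPool(Iterable<String> threadNames) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String name : threadNames) {
            counts.merge(poolName(name), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        // Snapshot of all live threads in the current JVM.
        for (Thread t : Thread.getAllStackTraces().keySet()) {
            names.add(t.getName());
        }
        countByPool(names).forEach((pool, n) -> System.out.println(n + "\t" + pool));
    }
}
```

A prefix with a count in the thousands points at the pool worth examining first.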
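Tip: a high system load does not mean that CPU resources are insufficient. High load only means that too many tasks have accumulated in the run queue. The queued tasks may be CPU-bound, but they may equally be waiting on I/O or other factors.
Root cause identification: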
Excessive consumer threads: each consumer creates a thread pool with a default core size of 20 and a maximum of 64. At 200 consumers this allows 200 × 20 = 4,000 threads at the core size and up to 200 × 64 = 12,800 at the maximum, most of them idle.
Trace reporting module (AsyncArrayDispatcher) uses a bounded ArrayBlockingQueue; its poll(5 ms) call causes threads to repeatedly block and unblock, generating many context switches.
traceContextQueue.poll(5, TimeUnit.MILLISECONDS);
Code analysis shows the trace queue is an ArrayBlockingQueue guarded by a non‑fair ReentrantLock. Blocking is implemented via Unsafe.park, which returns on timeout, interrupt, unpark, or a spurious wakeup.
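The effect of a 5 ms timed poll on an idle queue can be reproduced in isolation. This is a minimal sketch, not the SDK's code: PollWakeups and countTimedOutPolls are hypothetical names, and the queue capacity of 1024 is an arbitrary assumption. It polls an empty ArrayBlockingQueue for a fixed duration and counts how often the call times out, each of which corresponds to one park and one wakeup of the thread.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

public class PollWakeups {
    /** Polls an always-empty queue for about durationMs and counts the timed-out polls. */
    static int countTimedOutPolls(long timeoutMs, long durationMs) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024); // capacity is arbitrary
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(durationMs);
        int wakeups = 0;
        try {
            while (System.nanoTime() < deadline) {
                // The queue is empty, so the thread parks and the timeout wakes it again.
                if (queue.poll(timeoutMs, TimeUnit.MILLISECONDS) == null) {
                    wakeups++;
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop counting if interrupted
        }
        return wakeups;
    }

    public static void main(String[] args) {
        int wakeups = countTimedOutPolls(5, 1000);
        System.out.println("timed-out polls in 1 s on an idle queue: " + wakeups);
    }
}
```

Because each empty poll lasts at least 5 ms, one idle thread can time out at most about 200 times per second. If every consumer runs its own dispatcher thread, as the description of the newer SDK suggests, 200 consumers would add up to roughly 40,000 such wakeups per second while doing no useful work.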
ArrayBlockingQueue uses a non‑fair lock; park blocks the thread until one of four conditions occurs.
Optimization proposals:
Configure each consumer’s consumeThreadMin/consumeThreadMax to appropriate values to reduce total thread count.
Upgrade to RocketMQ SDK 1.8.5, which adds a switch to use a single trace‑dispatch thread and a single bounded queue for all consumers.
Applying these changes should lower context‑switch overhead, reduce load, and improve overall system stability.
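The idea of a single dispatch thread draining one bounded queue for all consumers can be sketched as follows. This illustrates the pattern only and is not the SDK 1.8.5 implementation: SharedTraceDispatcher and all of its members are hypothetical, trace records are modeled as plain strings, and using a blocking take() instead of a timed poll is a design choice made for this sketch so that an idle dispatcher is not woken every few milliseconds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CopyOnWriteArrayList;

public class SharedTraceDispatcher {
    private final BlockingQueue<String> queue;
    private final List<String> delivered = new CopyOnWriteArrayList<>();
    private final Thread worker;

    SharedTraceDispatcher(int capacity) {
        queue = new ArrayBlockingQueue<>(capacity); // one bounded queue for all consumers
        worker = new Thread(this::drainLoop, "shared-trace-dispatcher");
        worker.setDaemon(true);
        worker.start();
    }

    /** Non-blocking submit; returns false (record dropped) when the queue is full. */
    boolean submit(String traceRecord) {
        return queue.offer(traceRecord);
    }

    private void drainLoop() {
        List<String> batch = new ArrayList<>();
        try {
            while (true) {
                batch.add(queue.take());  // blocks without a timeout while idle
                queue.drainTo(batch);     // pick up anything else already queued
                delivered.addAll(batch);  // stand-in for reporting the batch
                batch.clear();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // exit on shutdown
        }
    }

    /** Waits until at least n records were delivered or the timeout elapses. */
    boolean awaitDelivered(int n, long timeoutMs) {
        long end = System.currentTimeMillis() + timeoutMs;
        while (delivered.size() < n && System.currentTimeMillis() < end) {
            try {
                Thread.sleep(10);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return delivered.size() >= n;
    }

    void shutdown() {
        worker.interrupt();
    }

    public static void main(String[] args) {
        SharedTraceDispatcher dispatcher = new SharedTraceDispatcher(1024);
        // Simulate several consumers submitting to the same dispatcher.
        for (int consumer = 0; consumer < 5; consumer++) {
            dispatcher.submit("trace from consumer " + consumer);
        }
        System.out.println("all delivered: " + dispatcher.awaitDelivered(5, 2000));
        dispatcher.shutdown();
    }
}
```

With this shape the number of dispatcher threads stays at one regardless of how many consumers are created, and a full queue drops records instead of blocking the consumer threads.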
DevOps Operations Practice