Root Cause Analysis and Optimization of RocketMQ Message Dispatch in the Tesla Interface Testing Platform
The article investigates a performance bottleneck in Tesla's interface testing platform where large test tasks experience up to nine‑minute report delays, identifies the hidden latency in RocketMQ message dispatch to handlers, and proposes a redesign of channel scheduling that reduces execution time to under four minutes.
Today we discuss a performance issue originating from the test case execution layer of the Tesla interface testing platform, which relies on RocketMQ for message queuing. Large tasks (over 500 test case groups) sometimes take up to nine minutes to return reports.
Initial investigation of several representative reports showed that consumption of certain test case group messages was significantly delayed. Because a task is considered complete only after all of its messages have been consumed, a few straggling messages were enough to make the whole task look slow.
We examined possible causes such as execution‑machine bottlenecks, slow message generation or MQ dispatch, insufficient consumer channel size, and slow request handling, but metrics and logs disproved each hypothesis.
The missing piece was the latency between "MQ dispatch" and "handler start processing", which is not captured by RocketMQ's built‑in timing statistics or handler logs.
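One way to surface this blind spot is to stamp each message as it leaves the dispatcher and measure the wait on the handler side. A minimal Go sketch of such instrumentation (the timedMessage type and handle function are hypothetical, not part of the RocketMQ SDK):

```go
package dispatch

import (
	"log"
	"time"
)

// timedMessage wraps a message body with the moment it left the
// dispatcher, so the handler can report how long it sat in the channel.
type timedMessage struct {
	body         []byte
	dispatchedAt time.Time
}

func handle(tm timedMessage) {
	// This is exactly the dispatch-to-handler gap that the built-in
	// timing statistics and handler logs did not cover.
	wait := time.Since(tm.dispatchedAt)
	if wait > time.Second {
		log.Printf("message waited %v between dispatch and handling", wait)
	}
	// ... execute the test case group ...
}
```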
Digging into the RocketMQ client SDK's source revealed that messages are dispatched to worker channels by an indiscriminate random or round-robin algorithm. When a worker channel receives a "long-tail" message (e.g., a test case group containing hundreds of cases), the channel fills up and blocks further dispatch, delaying every message queued behind it, as the sketch below illustrates.
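In miniature, the failure mode looks like the following sketch, which assumes buffered worker channels and a rotating dispatch index (the identifiers are illustrative, not the SDK's actual code):

```go
package dispatch

// Message stands in for a serialized test case group (illustrative only).
type Message struct {
	Body []byte
}

// dispatchRoundRobin assigns messages to workers in strict rotation,
// mirroring the indiscriminate scheduling described above. If workers[i]
// is stuck draining a long-tail group, its buffer fills and the send
// blocks, stalling every later message even while other workers sit idle.
func dispatchRoundRobin(msgs <-chan Message, workers []chan Message) {
	i := 0
	for msg := range msgs {
		workers[i] <- msg // blocks once workers[i]'s buffer is full
		i = (i + 1) % len(workers)
	}
}
```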
To solve the problem we introduced an AggressiveMode in the RocketMQ SDK and changed the scheduling logic: each worker channel's size is set to 1, so at most one message occupies a channel at a time, and reflection is used to build dynamic SelectCase statements so that whichever channel clears first immediately receives the next message.
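A hedged reconstruction of that scheduling change is sketched below, reusing the Message type from the previous snippet. reflect.Select is Go's standard way to select over a channel set whose size is only known at runtime; the function and helper names here are our assumptions, not the SDK's actual API:

```go
package dispatch

import "reflect"

// dispatchAggressive sends each message to whichever worker channel frees
// up first. Every channel has capacity 1, so a send succeeds only once a
// worker has cleared its previous message.
func dispatchAggressive(msgs <-chan Message, workers []chan Message) {
	// One send case per worker; reflect.Select picks any case that is
	// ready, no matter which index it lives at.
	cases := make([]reflect.SelectCase, len(workers))
	for i, ch := range workers {
		cases[i] = reflect.SelectCase{
			Dir:  reflect.SelectSend,
			Chan: reflect.ValueOf(ch),
		}
	}
	for msg := range msgs {
		v := reflect.ValueOf(msg)
		for i := range cases {
			cases[i].Send = v
		}
		reflect.Select(cases) // blocks until some worker can accept the message
	}
}

// newWorkers builds the size-1 channels described above.
func newWorkers(n int) []chan Message {
	ws := make([]chan Message, n)
	for i := range ws {
		ws[i] = make(chan Message, 1) // at most one queued message per worker
	}
	return ws
}
```

With this scheme a long-tail message still ties up one worker, but it can no longer absorb messages into a deep buffer, so the remaining traffic flows to idle workers instead of queuing behind it.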
After deployment, testing with the platform's heaviest users showed task execution time dropping from 7-9 minutes to between 30 seconds and 4 minutes, confirming the issue was resolved.
The key lessons are: (1) when message consumption times are uneven, indiscriminate random or round‑robin dispatch should be avoided; (2) monitoring and logs may not cover all pipeline stages, and hidden blind spots can harbor performance problems.
Byte Quality Assurance Team
World-leading audio and video quality assurance team, safeguarding the AV experience of hundreds of millions of users.
