How We Fixed Repeated Kafka Message Backlog in a High‑Traffic Restaurant System
This article walks through four Kafka message backlog incidents in a busy restaurant ordering system. It traces them to slow consumers, poorly indexed database queries, table growth, and a sudden traffic spike, and describes the fix applied in each case: batch queries, index tuning, data archiving, and thread‑pool adjustments.
Introduction
I worked at a restaurant company where the ordering system sends Kafka messages to a kitchen display system. The kitchen display system persists the orders and dishes, then notifies chefs and waiters. The kitchen relies on these messages to know which dishes to prepare and serve, so any Kafka issue directly affects the user experience.
The article discusses several message backlog problems we encountered and the solutions applied.
1. First Backlog
As user volume grew, the kitchen display showed delayed dish lists. Investigation revealed Kafka message backlog. Common causes are consumer crashes or producer speed exceeding consumer speed. Our consumers were running, but processing slowed.
Two slow spots were identified:
A loop that queried the database for each user individually.
A multi‑condition query without proper indexing.
We optimized by replacing the per‑user query with a batch query:
public List<User> queryUser(List<User> searchList) {
    if (CollectionUtils.isEmpty(searchList)) {
        return Collections.emptyList();
    }
    List<Long> ids = searchList.stream().map(User::getId).collect(Collectors.toList());
    return userMapper.getUserByIds(ids);
}
We also added a composite index to the multi‑condition query, which significantly increased consumer throughput and resolved the backlog.
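One detail the batch query leaves open is how large the id list may get. The following is a minimal sketch, not the original system's code, of splitting the ids into fixed-size chunks so that each batch query stays bounded. `BatchQueries`, `queryInChunks`, and the chunk size are hypothetical names and values chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical helper (not from the original system): runs one batch
// query per fixed-size chunk of ids and concatenates the results.
final class BatchQueries {
    private BatchQueries() {}

    static <K, V> List<V> queryInChunks(List<K> ids, int chunkSize,
                                        Function<List<K>, List<V>> batchQuery) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<V> result = new ArrayList<>();
        for (int from = 0; from < ids.size(); from += chunkSize) {
            int to = Math.min(from + chunkSize, ids.size());
            // subList is a view; the callback should not keep or modify it.
            result.addAll(batchQuery.apply(ids.subList(from, to)));
        }
        return result;
    }
}
```

With the mapper from the snippet above, a call might look like `BatchQueries.queryInChunks(ids, 500, userMapper::getUserByIds)`, where 500 is an assumed value that would need tuning against the real database.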
2. Second Backlog
Months later a sporadic backlog appeared. Monitoring and the DBA's slow‑query reports showed that queries with an identical WHERE clause but different parameter values led MySQL to pick different indexes, some of them sub‑optimal.
We forced the correct index using FORCE INDEX, which eliminated the small‑scale backlog.
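As a rough illustration of what such a hint looks like in application code, here is a sketch of a query constant using MySQL's `FORCE INDEX` syntax. The table `orders`, its columns, and the index name `idx_store_status_time` are all hypothetical; the article does not give the real schema.

```java
// Hypothetical query holder: table, column, and index names are
// assumptions for illustration, not the original system's schema.
final class OrderQueries {
    private OrderQueries() {}

    // FORCE INDEX tells the MySQL optimizer to use the named index
    // instead of choosing one based on the parameter values.
    static final String PENDING_ORDERS_SQL =
        "SELECT id, store_id, status, create_time "
      + "FROM orders FORCE INDEX (idx_store_status_time) "
      + "WHERE store_id = ? AND status = ? AND create_time >= ?";
}
```

The trade-off is that the hint is hard-coded: if the index is later renamed or dropped, the query fails rather than falling back to another plan, so the hint has to be maintained alongside the schema.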
3. Third Backlog
Six months later the order table grew to 30 million rows, slowing queries. Since sharding was not yet feasible, we archived data older than 30 days to a history table, keeping the active table small and restoring performance.
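The archiving job itself is not shown in the article. A minimal sketch of one way to structure it, moving rows in small batches so that no single statement holds locks for long, could look like the following. `OrderArchiveStore` and its methods are hypothetical; a real implementation would back them with SQL and run the copy and delete of each batch in one transaction.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical storage abstraction, assumed for this sketch only.
interface OrderArchiveStore {
    List<Long> findIdsCreatedBefore(Instant cutoff, int limit);
    void copyToHistory(List<Long> ids);
    void deleteFromActive(List<Long> ids);
}

final class OrderArchiver {
    private OrderArchiver() {}

    // Moves orders older than the retention window to the history table,
    // batchSize rows at a time, and returns how many rows were archived.
    static int archiveOlderThan(OrderArchiveStore store, Instant now,
                                Duration retention, int batchSize) {
        Instant cutoff = now.minus(retention);
        int archived = 0;
        while (true) {
            List<Long> ids = store.findIdsCreatedBefore(cutoff, batchSize);
            if (ids.isEmpty()) {
                return archived;
            }
            store.copyToHistory(ids);     // copy first so no row is lost
            store.deleteFromActive(ids);  // then shrink the active table
            archived += ids.size();
        }
    }
}
```

A scheduled job could call `OrderArchiver.archiveOlderThan(store, Instant.now(), Duration.ofDays(30), 1000)` during off-peak hours; the batch size of 1000 is an assumption.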
4. Fourth Backlog
A sudden backlog occurred in the afternoon after a batch job updated tens of thousands of order statuses, flooding Kafka with messages that consumers could not keep up with.
We increased the thread‑pool core and maximum threads to 50, allowing faster consumption. After the surge, we reset the pool to a safer size (core 8, max 10) for normal operation.
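The resize described above can be done at runtime on a `java.util.concurrent.ThreadPoolExecutor` without restarting the consumer. The sketch below assumes the consumer hands messages to such a pool; `ConsumerPoolResizer` is a hypothetical helper. The order of the two setter calls matters, because the executor rejects a maximum below the core size, and recent JDKs also reject a core size above the maximum.

```java
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical helper: adjusts a running pool's core and max sizes
// in an order that never leaves core greater than max.
final class ConsumerPoolResizer {
    private ConsumerPoolResizer() {}

    static void resize(ThreadPoolExecutor pool, int core, int max) {
        if (core < 0 || max <= 0 || core > max) {
            throw new IllegalArgumentException("need 0 <= core <= max and max > 0");
        }
        if (max >= pool.getMaximumPoolSize()) {
            // Growing: raise the ceiling first, then the core size.
            pool.setMaximumPoolSize(max);
            pool.setCorePoolSize(core);
        } else {
            // Shrinking: lower the core size first, then the ceiling.
            pool.setCorePoolSize(core);
            pool.setMaximumPoolSize(max);
        }
    }
}
```

During the surge this would be `ConsumerPoolResizer.resize(pool, 50, 50)`, and afterwards `ConsumerPoolResizer.resize(pool, 8, 10)`. Raising the core size as well as the maximum is significant: a `ThreadPoolExecutor` only creates threads beyond the core size once its work queue is full, so with a large or unbounded queue the maximum alone would have little effect.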
Note: Using a thread pool to consume MQ messages is not a universal solution; it can affect message order, increase CPU usage, and overload downstream services.
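If ordering matters for a given key (for example, all status changes of one order), one common way to keep it while still consuming in parallel is to route every message with the same key to the same single-threaded lane. This is a sketch of that idea, not something the original system is known to have used; `KeyedDispatcher` is a hypothetical class, and it does not address offset commits or downstream load.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical dispatcher: tasks with the same key run on the same
// single-threaded lane, so they execute in submission order.
final class KeyedDispatcher {
    private final ExecutorService[] lanes;

    KeyedDispatcher(int laneCount) {
        if (laneCount <= 0) {
            throw new IllegalArgumentException("laneCount must be positive");
        }
        lanes = new ExecutorService[laneCount];
        for (int i = 0; i < laneCount; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    void dispatch(String key, Runnable task) {
        // floorMod keeps the lane index non-negative for negative hash codes.
        int lane = Math.floorMod(key.hashCode(), lanes.length);
        lanes[lane].execute(task);
    }

    // Stops accepting tasks and waits for queued ones to finish.
    boolean shutdownAndWait(long timeoutSeconds) {
        for (ExecutorService lane : lanes) {
            lane.shutdown();
        }
        try {
            for (ExecutorService lane : lanes) {
                if (!lane.awaitTermination(timeoutSeconds, TimeUnit.SECONDS)) {
                    return false;
                }
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

Messages for different keys still run concurrently across lanes, while messages for one key are processed one at a time in the order they were dispatched.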
In summary, Kafka backlog can stem from producer‑consumer speed imbalance, inefficient database access, or sudden traffic spikes. Effective mitigation requires monitoring, targeted code optimizations, proper indexing, data archiving, and scalable consumer concurrency.
Su San Talks Tech
Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
