How We Resolved Repeated Kafka Message Backlogs in a High‑Traffic Restaurant System
This article recounts a series of Kafka message backlog incidents in a restaurant ordering system and explains how targeted optimizations (batch database queries, index tuning, data archiving, and thread‑pool scaling) eliminated the delays and restored reliable kitchen display performance.
Preface
I used to work at a restaurant company whose ordering system saw high concurrency during the lunch and dinner peaks. The kitchen display system receives order messages via Kafka, processes them, persists the order and dish data, and shows them on the kitchen client.
When Kafka fails, the kitchen display is affected.
This article shares the message backlog problems we encountered and how we solved them.
1 First Message Backlog
Initially traffic was low and Kafka worked fine. As user volume grew, the number of orders increased and the kitchen display table grew with it. One day at noon we received complaints about delayed dish lists: the kitchen was seeing dishes minutes late.
Investigation pointed to a Kafka message backlog. The usual causes are a crashed consumer or a producer that outpaces the consumer. Our consumer was still running, so its processing speed must have dropped.
We added detailed logging to measure processing time at key points.
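The timing instrumentation can be sketched roughly as follows. This is a minimal illustration, not the code we ran: `StepTimer` and the step names are hypothetical, and a real consumer would write the summary through its logging framework.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical helper: records elapsed time per named step while handling one message. */
final class StepTimer {
    private final Map<String, Long> elapsedNanos = new LinkedHashMap<>();

    /** Runs the step, records how long it took, and returns the step's result. */
    <T> T time(String step, Supplier<T> body) {
        long start = System.nanoTime();
        try {
            return body.get();
        } finally {
            // Accumulate, so a step executed inside a loop is reported as one total.
            elapsedNanos.merge(step, System.nanoTime() - start, Long::sum);
        }
    }

    /** Builds one log line per message, with steps in the order they first ran. */
    String summary() {
        StringBuilder sb = new StringBuilder();
        elapsedNanos.forEach((step, nanos) ->
                sb.append(step).append('=').append(nanos / 1_000_000).append("ms "));
        return sb.toString().trim();
    }

    Map<String, Long> elapsedNanos() {
        return elapsedNanos;
    }
}
```

Wrapping each suspect call, for example `timer.time("queryDishes", () -> ...)`, and logging `timer.summary()` once per message makes the slow step stand out without flooding the log.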
We identified two long‑running operations:
A for‑loop that queried the database one record at a time.
A multi‑condition query without proper indexing.
We optimized by replacing the per‑record queries with a batch query using a list of IDs:
// Fetch all requested users in one query instead of one query per record.
public List<User> queryUser(List<User> searchList) {
    // Skip the database call when there is nothing to look up.
    if (CollectionUtils.isEmpty(searchList)) {
        return Collections.emptyList();
    }
    List<Long> ids = searchList.stream()
            .map(User::getId)
            .collect(Collectors.toList());
    // Assumed to map to a single "WHERE id IN (...)" statement.
    return userMapper.getUserByIds(ids);
}

We also added a composite index for the multi‑condition query. After these changes the consumer processed messages much faster and the backlog disappeared.
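Once the rows arrive in a single batch, the loop that used to query per record still has to match each item to its row. One way to do that, sketched here under assumptions, is to index the batch result by ID and look rows up in memory. The `User` class below is a minimal stand-in with hypothetical fields, not the real entity.

```java
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

final class BatchLookup {
    /** Minimal stand-in for the article's User entity (fields are hypothetical). */
    static final class User {
        private final Long id;
        private final String name;

        User(Long id, String name) {
            this.id = id;
            this.name = name;
        }

        Long getId() {
            return id;
        }

        String getName() {
            return name;
        }
    }

    /** Indexes one batch result by ID; if an ID repeats, the first row wins. */
    static Map<Long, User> indexById(List<User> users) {
        return users.stream()
                .collect(Collectors.toMap(User::getId, Function.identity(), (first, dup) -> first));
    }
}
```

The processing loop then calls `byId.get(id)` where it previously called the database, so the number of round trips stays at one regardless of batch size.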
2 Second Message Backlog
Months later a sporadic backlog appeared. Monitoring and the DBA's slow‑query reports showed that the database sometimes chose different indexes for statements with identical WHERE conditions, and some of the resulting plans were slow.
We forced the correct index with FORCE INDEX. This resolved the small‑scale backlog.
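For illustration, a forced‑index query might look like the constant below. This assumes MySQL, where `FORCE INDEX` is an index hint placed right after the table name. The table, columns, and index name are hypothetical and are not our real schema.

```java
final class KitchenOrderSql {
    // Hypothetical table, columns, and index name; FORCE INDEX is MySQL index-hint syntax.
    static final String FIND_BY_STORE_AND_STATUS =
            "SELECT id, order_id, dish_name, status "
          + "FROM kitchen_order FORCE INDEX (idx_store_status_created) "
          + "WHERE store_id = ? AND status = ? AND created_at >= ? "
          + "ORDER BY created_at";
}
```

A hint like this pins the plan, so it is worth revisiting if the index is renamed or the data distribution changes.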
3 Third Message Backlog
Six months later the kitchen display table grew to 30 million rows. Large tables degrade query and write performance, leading to slow processing.
We chose to archive historical data and keep only the latest 30 days in the active table, effectively reducing the active row count to a few million.
Archiving by time in this way eliminated the backlog.
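An archive job along these lines can be sketched as a loop that moves old rows in small batches. Everything here is an assumption for illustration: `OrderStore` is a hypothetical abstraction over the SQL that copies rows to an archive table and deletes them from the active one, and the in‑memory implementation exists only to exercise the loop.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

final class ArchiveJob {
    /** Hypothetical storage abstraction; a real one would run SQL inside a transaction. */
    interface OrderStore {
        /** Moves up to limit rows created before cutoff to the archive; returns rows moved. */
        int archiveOlderThan(LocalDate cutoff, int limit);
    }

    /** Archives in small batches so that no single statement touches a huge row range. */
    static int run(OrderStore store, LocalDate today, int retentionDays, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        LocalDate cutoff = today.minusDays(retentionDays);
        int total = 0;
        int moved;
        do {
            moved = store.archiveOlderThan(cutoff, batchSize);
            total += moved;
        } while (moved == batchSize); // a short batch means nothing older is left
        return total;
    }

    /** In-memory stand-in: each entry is the creation date of one row. */
    static final class InMemoryStore implements OrderStore {
        final List<LocalDate> active = new ArrayList<>();
        final List<LocalDate> archive = new ArrayList<>();

        @Override
        public int archiveOlderThan(LocalDate cutoff, int limit) {
            int moved = 0;
            Iterator<LocalDate> it = active.iterator();
            while (it.hasNext() && moved < limit) {
                LocalDate created = it.next();
                if (created.isBefore(cutoff)) {
                    archive.add(created);
                    it.remove();
                    moved++;
                }
            }
            return moved;
        }
    }
}
```

Running such a job outside the lunch and dinner peaks keeps the archive traffic from competing with live orders.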
4 Fourth Message Backlog
After a year another backlog occurred in the afternoon, an unusual time. Investigation revealed that a batch job updated tens of thousands of order statuses, generating a sudden surge of Kafka messages that the consumer could not keep up with.
We temporarily raised the consumer's thread pool to 50 core and 50 maximum threads, which cleared the backlog of hundreds of thousands of messages in about 20 minutes. For normal operation we later settled on 8 core and 10 maximum threads.
Note: Using a thread pool to consume messages is not a universal solution; it may affect message order and increase CPU usage, and can overload third‑party services.
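A rough sketch of handing polled messages to a bounded thread pool, using only `java.util.concurrent`. `PooledDispatcher`, its handler, and the queue capacity are hypothetical, and no Kafka client API is shown. The pool sizes in the usage note are the ones mentioned above.

```java
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical dispatcher: the polling thread hands each message to a worker pool. */
final class PooledDispatcher<T> {
    private final ThreadPoolExecutor pool;
    private final Consumer<T> handler;

    PooledDispatcher(int core, int max, int queueCapacity, Consumer<T> handler) {
        this.handler = handler;
        this.pool = new ThreadPoolExecutor(
                core, max, 60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                // When the queue is full, the polling thread runs the task itself,
                // which slows polling instead of dropping messages.
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Submits every message in the batch; completion order is not guaranteed. */
    void dispatch(List<T> batch) {
        for (T message : batch) {
            pool.execute(() -> handler.accept(message));
        }
    }

    /** Stops accepting work and waits for queued messages; returns false on timeout. */
    boolean shutdownAndWait(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            return pool.awaitTermination(timeout, unit);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With `new PooledDispatcher<>(8, 10, 1000, handler)`, up to 8 threads work in steady state and 2 more start only when the queue fills. Because `dispatch` returns before the handlers finish, offsets committed right after it may run ahead of actual processing, so this pattern suits messages that tolerate reordering and reprocessing.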
In summary, message backlog is caused by producer speed exceeding consumer speed, but the root causes vary: consumer slowdown, inefficient SQL, lack of indexing, data volume, or sudden spikes. Monitoring, proper indexing, batch queries, data archiving, and scalable consumer threading are effective mitigation strategies.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
