Two Years of Kafka in a Restaurant Order System: Problems, Solutions, and Lessons Learned

This article recounts the author's two‑year experience with Kafka in a high‑traffic restaurant ordering system, detailing why message ordering matters, the pitfalls of synchronous retries, message backlog, partition routing, primary‑key conflicts, database replication lag, and practical mitigation strategies for reliable backend processing.

Wukong Talks Architecture

The author, a backend engineer for a restaurant ordering platform, explains how their system relies on Kafka messages to synchronize order status from the order service to a kitchen display application, enabling chefs to see and fulfill orders in real time.

Message order is crucial: processing a "payment" or "cancellation" message before the corresponding "order" message would corrupt the order state. Kafka guarantees ordering only within a single partition, so the solution is to write related messages to the same partition, typically by keying them on merchant ID or order ID; the consumer assigned to that partition then reads them in the order they were written.
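The key-to-partition idea can be sketched in a few lines. This is a minimal illustration, not the author's code: the function name `partition_for`, the order ID, and the partition count are hypothetical, and CRC32 is used only as a deterministic standard-library stand-in (Kafka's Java client hashes keyed records with murmur2).

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a message key to a partition index.

    Stand-in for a Kafka partitioner: CRC32 is an assumption chosen because
    it is in the standard library and stable across runs.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Hypothetical events for one order, in the order they were produced.
events = [
    ("order-1001", "created"),
    ("order-1001", "paid"),
    ("order-1001", "cancelled"),
]

# Every event carries the same key, so all of them map to one partition.
partitions = {partition_for(order_id, 6) for order_id, _ in events}
```

Because the mapping depends only on the key and the partition count, all status changes for one order stay in one partition as long as the partition count does not change.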

When network instability caused processing of the initial "order" message to fail, the subsequent messages for that order were lost, leading to invisible orders. The team introduced an asynchronous retry mechanism that stores failed messages in a retry table and reprocesses them later, and capped the number of retry attempts to avoid blocking other partitions.
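A retry table of this kind might look roughly like the sketch below. Everything here is an assumption made for illustration: the in-memory list stands in for a database table, and `MAX_ATTEMPTS`, `RetryRow`, `consume`, and `run_retry_job` are hypothetical names, since the article does not give the real schema or retry limit.

```python
from dataclasses import dataclass
from typing import Callable, List

MAX_ATTEMPTS = 3  # assumed cap; the article does not state the real value

@dataclass
class RetryRow:
    message: dict
    attempts: int = 0
    status: str = "pending"  # pending -> done | failed

retry_table: List[RetryRow] = []  # in-memory stand-in for a database table

def consume(message: dict, handler: Callable[[dict], None]) -> None:
    """Process a message once; on failure, park it instead of retrying inline."""
    try:
        handler(message)
    except Exception:
        retry_table.append(RetryRow(message))

def run_retry_job(handler: Callable[[dict], None]) -> None:
    """One pass of the scheduled retry job over all pending rows."""
    for row in retry_table:
        if row.status != "pending":
            continue
        row.attempts += 1
        try:
            handler(row.message)
            row.status = "done"
        except Exception:
            if row.attempts >= MAX_ATTEMPTS:
                row.status = "failed"  # give up; leave the row for inspection
```

The consumer never waits on a failing message, so the partition keeps moving; the scheduled job absorbs the retries and stops after the cap.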

Message backlog emerged as the number of merchants grew. The root causes identified were oversized message payloads (causing extra network and disk I/O), uneven partition distribution due to merchant‑ID routing, and batch jobs that flooded a single partition. Solutions included trimming payloads to essential fields, routing by order ID for better balance, and increasing consumer parallelism with a thread‑pool and elastic‑job for retry handling.
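The effect of the routing key on partition balance can be seen with a small simulation. The workload below is synthetic and the numbers are assumptions (one large merchant, ten small ones, eight partitions); CRC32 again stands in for the real partitioner hash.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 8  # assumed partition count

def partition_of(key: str) -> int:
    """Deterministic stand-in for a keyed partitioner."""
    return zlib.crc32(key.encode("utf-8")) % NUM_PARTITIONS

# Synthetic workload: one large merchant with 900 orders, ten small ones with 10 each.
orders = [("merchant-big", f"order-{i}") for i in range(900)]
orders += [
    (f"merchant-{m}", f"order-{900 + m * 10 + j}")
    for m in range(10)
    for j in range(10)
]

# Count how many messages each partition receives under each routing key.
by_merchant = Counter(partition_of(merchant) for merchant, _ in orders)
by_order = Counter(partition_of(order) for _, order in orders)

max_by_merchant = max(by_merchant.values())  # dominated by the large merchant
max_by_order = max(by_order.values())        # spread across all partitions
```

Keying by merchant ID sends every message of the large merchant to a single partition, while keying by order ID spreads the same traffic out. The trade-off is that ordering is then guaranteed per order rather than per merchant.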

Primary‑key conflicts under high concurrency were resolved by using MySQL's INSERT INTO ... ON DUPLICATE KEY UPDATE syntax, eliminating the need for explicit locks.
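The upsert pattern can be tried out locally with SQLite, whose `ON CONFLICT ... DO UPDATE` clause (SQLite 3.24 or later) plays the same role as MySQL's `ON DUPLICATE KEY UPDATE`. This is a sketch under assumptions: the `orders` table, its columns, and `save_order` are hypothetical, and the MySQL statement appears only as a comment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, status TEXT NOT NULL)"
)

# MySQL form of the same statement (hypothetical table and columns):
#   INSERT INTO orders (order_id, status) VALUES (%s, %s)
#   ON DUPLICATE KEY UPDATE status = VALUES(status);
UPSERT = """
INSERT INTO orders (order_id, status) VALUES (?, ?)
ON CONFLICT(order_id) DO UPDATE SET status = excluded.status
"""

def save_order(order_id: int, status: str) -> None:
    """Insert the order, or update its status if the primary key already exists."""
    with conn:  # commits on success, rolls back on error
        conn.execute(UPSERT, (order_id, status))

save_order(1001, "created")
save_order(1001, "created")  # duplicate delivery: no error, no second row
save_order(1001, "paid")     # later status overwrites the earlier one

rows = conn.execute("SELECT order_id, status FROM orders").fetchall()
```

A single statement handles both the insert and the update path, so two concurrent writers for the same key cannot hit a duplicate-key error between a "check" and an "insert".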

Database master‑slave replication lag (up to three seconds) caused occasional missing order details; the team added retry logic when queries returned incomplete data.
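The retry-on-incomplete-read logic might be sketched as follows. The helper `fetch_with_retry`, its parameters, and the default completeness check are assumptions for illustration; the article does not describe the actual retry count or delay.

```python
import time
from typing import Callable, Optional

def fetch_with_retry(
    fetch: Callable[[], Optional[dict]],
    is_complete: Callable[[Optional[dict]], bool] = lambda row: row is not None,
    attempts: int = 3,
    delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[dict]:
    """Query a replica, retrying while the result looks incomplete.

    Returns the first complete result, or None if every attempt fell short.
    """
    for attempt in range(attempts):
        result = fetch()
        if is_complete(result):
            return result
        if attempt < attempts - 1:
            sleep(delay_seconds)  # give replication time to catch up
    return None
```

With the observed lag of up to three seconds, the total wait (`delay_seconds` times the number of retries) would need to cover that window; the `sleep` parameter is injectable so the function can be tested without real delays.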

To avoid duplicate consumption, the author notes Kafka's delivery semantics (at‑most‑once, at‑least‑once, exactly‑once) and emphasizes designing idempotent operations, which the INSERT ... ON DUPLICATE KEY UPDATE pattern naturally provides.

Cross‑environment consumption issues were highlighted when a pre‑release consumer mistakenly subscribed to a production topic, causing message loss; the fix was to prefix topics per environment and reset offsets when needed.
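One way to enforce per-environment topic prefixes is to build every topic name through a single helper. The environment names, the `.` separator, and the function `topic_name` are hypothetical; the article only states that topics were prefixed per environment.

```python
KNOWN_ENVS = {"dev", "test", "pre", "prod"}  # hypothetical environment names

def topic_name(base: str, env: str) -> str:
    """Build an environment-scoped topic name, e.g. 'pre.order-status'."""
    if env not in KNOWN_ENVS:
        raise ValueError(f"unknown environment: {env!r}")
    return f"{env}.{base}"

# A pre-release consumer and a production consumer resolve to different topics.
pre_topic = topic_name("order-status", "pre")
prod_topic = topic_name("order-status", "prod")
```

Because producers and consumers derive the name from their own environment setting, a pre-release consumer cannot subscribe to the production topic by accident.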

Overall, the article shares practical troubleshooting steps, architectural adjustments, and operational best practices for building a robust, high‑throughput backend system using Kafka.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Kafka, troubleshooting, distributed-systems, message-queue
Written by

Wukong Talks Architecture

Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
