Avoid Kafka Pitfalls: Ensuring Message Order, Handling Retries, and Preventing Backlog
This article shares a two‑year journey of using Kafka in a high‑traffic restaurant ordering system, covering why message order matters, how network glitches and partition routing cause failures, and the practical retry, partition‑balancing, and database strategies that finally eliminated backlog and duplication issues.
Introduction
The author worked on a kitchen‑display system for a restaurant ordering platform where Kafka was the backbone for transmitting order events. High lunch and dinner traffic required reliable, ordered processing to keep chefs and waiters in sync.
Why Message Order Matters
Orders progress through states such as placed, paid, completed, and canceled. If a later state is processed before an earlier one, the data becomes inconsistent. Kafka does not guarantee ordering across the partitions of a topic, but it does preserve order within each partition, so the system must route related messages (for example, every event for one order) to the same partition.
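As a minimal sketch of key-based routing, the snippet below maps a message key to a partition with a stable hash, so every event carrying the same key lands on the same partition. The partition count, the order IDs, and the function name partition_for are assumptions for illustration, and md5 is only a stand-in for a real client's partitioner (the Kafka Java client's default hashes the key bytes with murmur2).

```python
import hashlib

NUM_PARTITIONS = 8  # assumed partition count, for illustration only


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition using a stable hash.

    Stand-in for a Kafka client's key-based partitioner; md5 is used only
    because it ships with Python and gives the same result on every run.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


# Hypothetical events for one order, in the sequence they were produced.
events = [
    ("order-1001", "placed"),
    ("order-1001", "paid"),
    ("order-1001", "completed"),
]

# Every event shares the key, so the set holds exactly one partition number.
partitions = {partition_for(order_id) for order_id, _status in events}
```

Because the partition depends only on the key, the consumer of that partition sees placed, paid, and completed in the order they were produced.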
Unexpected Failures
Network instability caused occasional timeouts and database connection failures. When the order‑placed message failed, subsequent payment or completion messages could not be persisted, leaving the order invisible to users. The system had no failure‑retry mechanism, so a single transient failure affected every later event for that order.
Solution Process
Initial synchronous retry (3‑5 attempts) blocked other merchants and reduced throughput.
Switched to an asynchronous retry table: failed messages are stored for later processing.
To preserve order, the consumer first checks the retry table for the order ID; if present, the current message is also stored there.
Scheduled an elastic‑job task that retries each stored message up to 7 times, then marks it as failed and sends an alert email.
These changes reduced the number of orders completely missing from the UI, leaving only occasional delays.
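The retry flow above can be sketched roughly as follows. This is an in-memory illustration, not the team's implementation: the RetryTable class, consume, and retry_once are hypothetical names, a dictionary stands in for the database retry table, and marking all queued messages of an order as failed once the head message runs out of attempts is an assumption the source does not spell out.

```python
from collections import defaultdict, deque

MAX_ATTEMPTS = 7  # retry limit mentioned in the text


class RetryTable:
    """In-memory stand-in for the database retry table (hypothetical)."""

    def __init__(self):
        self.pending = defaultdict(deque)  # order_id -> messages in arrival order
        self.attempts = defaultdict(int)   # order_id -> failed retries of head message
        self.failed = []                   # messages given up on (would trigger an alert)


def consume(message, handler, table):
    """Process a message, or park it if its order already has parked messages."""
    order_id = message["order_id"]
    if table.pending.get(order_id):
        # An earlier message for this order is still waiting: keep the order intact.
        table.pending[order_id].append(message)
        return "parked"
    try:
        handler(message)
        return "processed"
    except Exception:
        table.pending[order_id].append(message)
        return "parked"


def retry_once(handler, table):
    """One pass of the scheduled retry job over all parked orders."""
    for order_id in list(table.pending):
        queue = table.pending[order_id]
        while queue:
            try:
                handler(queue[0])
                queue.popleft()
                table.attempts.pop(order_id, None)  # next message starts fresh
            except Exception:
                table.attempts[order_id] += 1
                if table.attempts[order_id] >= MAX_ATTEMPTS:
                    # Assumption: give up on everything queued for this order.
                    table.failed.extend(queue)
                    queue.clear()
                break  # stop at the first failure to preserve per-order sequence
        if not queue:
            del table.pending[order_id]
            table.attempts.pop(order_id, None)
```

Parking later messages behind a failed one trades a short delay for correctness: a paid event is never applied before the placed event it depends on, and other orders keep flowing.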
Message Backlog
As the merchant base grew, the volume of messages increased, causing consumer lag and visible delays (up to half an hour). The root causes identified were:
Oversized message payloads causing extra network and disk I/O.
Uneven partition routing: a few merchants with high order volume were all mapped to the same partition, creating a hotspot.
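One way to make such a hotspot visible is to compute consumer lag per partition, which is the difference between a partition's latest offset and the consumer group's committed offset. The sketch below uses made-up offsets and a hypothetical helper named partition_lag; in practice the numbers would come from the cluster's own tooling or client APIs.

```python
def partition_lag(end_offsets, committed_offsets):
    """Return lag per partition: latest offset minus committed offset."""
    return {
        partition: end_offsets[partition] - committed_offsets[partition]
        for partition in end_offsets
    }


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1500, 1: 98000, 2: 1700}
committed_offsets = {0: 1490, 1: 12000, 2: 1695}

lag = partition_lag(end_offsets, committed_offsets)
hottest_partition = max(lag, key=lag.get)  # partition with the largest backlog
```

A lag that is large on one partition and small on the others points to skewed routing rather than to a consumer that is slow overall.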
Optimizations applied:
Trimmed Kafka payloads to only id and status, fetching full details via a synchronous order‑detail API.
Changed routing key from merchant ID to order ID for a more uniform distribution.
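The effect of the routing-key change can be illustrated with a small simulation. The traffic mix (one large merchant placing 9,000 of 10,000 orders), the partition count, and the helper names are assumptions, and md5 again stands in for a real client's key hash.

```python
import hashlib
from collections import Counter


def partition_for(key: str, num_partitions: int) -> int:
    """Stable key hash, standing in for a Kafka client's partitioner."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


def load_by_partition(keys, num_partitions: int = 8) -> Counter:
    """Count how many messages each partition would receive."""
    return Counter(partition_for(key, num_partitions) for key in keys)


# Hypothetical traffic: (merchant_id, order_id) pairs, dominated by one merchant.
orders = [
    ("merchant-big" if i < 9000 else f"merchant-{i % 50}", f"order-{i}")
    for i in range(10000)
]

by_merchant = load_by_partition(merchant for merchant, _ in orders)
by_order = load_by_partition(order for _, order in orders)

hottest_by_merchant = max(by_merchant.values())  # dominated by the big merchant
hottest_by_order = max(by_order.values())        # roughly an even share
```

Keying by order ID still keeps all events of a single order on one partition, so per-order sequencing is preserved while the load spreads across partitions.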
Primary‑Key Conflict
High concurrency caused duplicate‑key errors when two processes attempted to insert the same order simultaneously. Instead of adding locks, the team switched to MySQL’s INSERT INTO ... ON DUPLICATE KEY UPDATE syntax, which safely upserts records.
INSERT INTO table (column_list)
VALUES (value_list)
ON DUPLICATE KEY UPDATE
c1 = v1,
c2 = v2,
...;
This eliminated the duplicate‑key exceptions.
Database Replication Lag
Occasional 3‑second lag between primary and replica caused the consumer to read stale or missing order data, especially when the end‑to‑end processing time was under 3 seconds. The team added a retry mechanism for empty or incomplete responses, mitigating the impact.
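A retry on empty or incomplete reads might look like the sketch below. The function name fetch_with_retry, the attempt count, and the delay are assumptions rather than the team's actual values; fetch stands for whatever call reads the order from the replica.

```python
import time


def fetch_with_retry(fetch, order_id, attempts=3, delay_seconds=1.0,
                     is_complete=bool, sleep=time.sleep):
    """Call fetch(order_id) until the result looks complete or attempts run out.

    is_complete decides whether a result is usable; the default treats any
    non-empty value as complete. Returns None if every attempt comes back empty.
    """
    for attempt in range(1, attempts + 1):
        result = fetch(order_id)
        if is_complete(result):
            return result
        if attempt < attempts:
            sleep(delay_seconds)  # give the replica time to catch up
    return None
```

For example, passing is_complete=lambda order: bool(order) and "status" in order would also retry when a row exists but a required field has not replicated yet.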
Duplicate Consumption
Kafka’s default at‑least‑once mode can lead to duplicate processing. The system already used the upsert statement, which is idempotent, ensuring that repeated consumption does not corrupt data.
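The idempotence argument can be checked with a small experiment. The sketch below uses SQLite's ON CONFLICT ... DO UPDATE clause as an analog of MySQL's ON DUPLICATE KEY UPDATE (the two syntaxes differ); the table layout and the apply_event helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, status TEXT NOT NULL)")


def apply_event(conn, order_id, status):
    """Insert the order, or update its status if the row already exists."""
    conn.execute(
        "INSERT INTO orders (id, status) VALUES (?, ?) "
        "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
        (order_id, status),
    )


apply_event(conn, "order-1001", "paid")
apply_event(conn, "order-1001", "paid")  # same message delivered a second time

rows = conn.execute("SELECT id, status FROM orders").fetchall()
```

Applying the same event twice leaves a single row with the same status. Note that this covers redelivery of an identical message; replaying an older state after a newer one would still overwrite it, which is one more reason per-order ordering matters.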
Multi‑Environment Consumption Issues
The pre‑release and production environments shared the same Kafka cluster and database. A misconfiguration caused the pre‑release environment to consume production topics, so those messages were lost to production. Resetting the consumer offsets recovered the missing messages.
Conclusion
The author emphasizes that most problems stemmed from operational oversights rather than Kafka itself. Key takeaways include:
Notify downstream teams before bulk operations.
Stress‑test multithreaded order‑detail calls.
Ensure core services can handle high concurrency.
Monitor message backlog and adjust partition routing.
Use idempotent upserts to handle duplicate consumption.
By iteratively applying these fixes, the system achieved stable, ordered processing with minimal latency.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
