How We Resolved a Kafka Consumer Production Outage Step by Step
The article recounts a production incident where a Kafka‑based consumer in a finance microservice hit thread‑pool exhaustion and slow‑query alerts, analyzes the root causes of async processing and bulk message bursts, and outlines a three‑phase remediation that includes data repair, switching to synchronous consumption, and request‑level batching to prevent future failures.
Problem Background
The business requires that after the after‑sale period, the finance platform settles orders and credits virtual assets to customers' virtual accounts. To decouple the order service from the finance platform, the team uses asynchronous MQ messages (Kafka) with the following flow:
Order Center queries overdue orders and sends an MQ message to the Finance Center.
Finance Center receives the message, validates transaction data, and calls the fund platform to settle points.
Fund Platform settles points and credits virtual assets.
The Finance Center consumes the MQ using a custom Kafka wrapper that processes messages asynchronously with an internal thread pool, a pattern that has been stable for a year.
Incident Details
At 06:00, a P1 alarm fired: the internal thread pool was saturated, causing 737 task rejections, and a 10‑second slow SQL query appeared because the validation step filtered on a non‑indexed column.
Root‑Cause Analysis
The consumer method in the Finance Center uses the default asynchronous mode with a 200‑thread pool. When a burst of messages arrives faster than they can be processed, the pool reaches its limit and triggers the rejection policy, which also degrades other async consumers that share the same pool.
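This failure mode can be reproduced in miniature without Kafka: a bounded pool with an abort policy starts rejecting work as soon as every worker and queue slot is occupied. The pool sizes and burst count below are illustrative, not the service's real configuration:

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSaturationDemo {

    // Submits `burst` blocking tasks to a bounded pool and returns how many were rejected.
    static int simulateBurst(int workers, int queueCapacity, int burst) throws InterruptedException {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                workers, workers, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy()); // throws once queue and workers are full

        AtomicInteger rejected = new AtomicInteger();
        CountDownLatch release = new CountDownLatch(1);

        for (int i = 0; i < burst; i++) {
            try {
                // Each "message" parks until the burst is over, keeping every worker busy,
                // just like slow validation queries kept the incident's threads occupied.
                pool.execute(() -> {
                    try { release.await(); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected.incrementAndGet(); // mirrors the task rejections in the incident
            }
        }
        release.countDown();
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        return rejected.get();
    }

    public static void main(String[] args) throws InterruptedException {
        // 2 workers plus a queue of 5 can absorb 7 tasks; the other 43 of 50 are rejected.
        System.out.println("rejected=" + simulateBurst(2, 5, 50)); // prints rejected=43
    }
}
```

The same arithmetic applies at production scale: once in-flight work exceeds pool size plus queue capacity, every additional message is rejected rather than queued.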
The Order Center updates overdue orders in a tight for loop, sending a separate MQ message for each order. This simultaneous burst of messages overwhelms a consumer that already has a thread‑pool bottleneck, leaving messages unprocessed.
Remediation Process
The team identified three concrete steps:
Data Repair: Re‑run the SQL logic for the records that failed during consumption to immediately restore correct data and avoid customer impact.
Switch to Synchronous Consumption: Disable asynchronous execution for the affected topic, removing the thread‑pool dependency. Implement a synchronous listener (e.g., @KafkaListener(topics = "xxx", groupId = "appName.beanName.methodName")) that leverages Kafka's consumer‑group mechanism to handle back‑pressure internally. A future optimization will remove the async mode entirely.
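A minimal sketch of such a synchronous listener using Spring Kafka might look as follows; the topic name, group id, and the SettlementService collaborator are placeholders for illustration, not the team's actual code:

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class SettlementConsumer {

    private final SettlementService settlementService; // hypothetical downstream service

    public SettlementConsumer(SettlementService settlementService) {
        this.settlementService = settlementService;
    }

    // Runs on the listener container's own thread: the next record is polled only
    // after this method returns, so the consumer group itself provides back-pressure
    // instead of an unbounded handoff to a shared thread pool.
    @KafkaListener(topics = "finance.settlement", groupId = "appName.beanName.methodName")
    public void onMessage(String payload) {
        settlementService.validateAndSettle(payload); // validate, then credit virtual assets
    }
}
```

With this shape, throughput is tuned via the listener container's concurrency (bounded by partition count) rather than an internal pool, and a slow downstream call simply slows consumption instead of triggering rejections.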
Business‑Level Mitigation: Evaluate the request pattern and merge frequent calls. For the loop that sends order notifications, validation, and fund settlement, batch the requests where possible, then split large batches (e.g., 100,000 orders) into manageable chunks to avoid overwhelming downstream services.
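The split‑into‑chunks step can be as simple as partitioning the order IDs before sending, so each batch message stays bounded. The chunk size of 500 below is an arbitrary illustration, not a figure from the article:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSplitter {

    // Splits a large batch into fixed-size chunks so each downstream call stays bounded.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) throw new IllegalArgumentException("chunkSize must be positive");
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            chunks.add(items.subList(i, Math.min(i + chunkSize, items.size())));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> orderIds = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) orderIds.add(i);

        // 100,000 overdue orders become 200 batch messages of 500 orders each,
        // instead of 100,000 individual MQ messages fired from a tight loop.
        List<List<Integer>> batches = partition(orderIds, 500);
        System.out.println(batches.size()); // prints 200
    }
}
```

Sending one message per chunk (optionally with a small delay or rate limit between sends) smooths the burst that saturated the consumer's thread pool in the first place.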
The article also shares a demand‑analysis method used by the technical director for eight years (illustrated by the accompanying diagram).
Problem Summary
Immediately fix production data to protect customer experience.
Apply a temporary fix that targets the root cause (e.g., switch to sync consumption).
Implement long‑term avoidance by rationally assessing demand and batching requests.
Architect's Journey
E‑commerce, SaaS, AI architect; DDD enthusiast; SKILL enthusiast