Why Did Our Payment System Auto‑Recover? A Deep Dive into Queue Backlog and Transaction Locks
Thousands of payment-related messages piled up in an OTA company's queue, and then the system recovered on its own. The investigation traced the backlog to a stuck MySQL transaction: an HTTP call made inside the transaction had no response timeout configured, so the transaction never committed and the lock it held blocked the job that consumes the queue.
Fault Description
A veteran OTA company, whose orders originally came from PC sites and call centers, experienced a baffling incident shortly after a new employee joined. Early one sunny morning, alarms indicated a massive buildup of messages in the online queue, affecting credit‑card and distribution payments while other payment methods remained functional. The employee logged into the bastion host, inspected logs on dozens of machines but found no errors, and observed that the queue held several thousand pending messages (e.g., counts of 604 and 881).
Root Cause Analysis
Further investigation uncovered that a scheduled job responsible for consuming messages was repeatedly failing on all four high-availability nodes because its SQL statement could not acquire the necessary lock. With the consumer failing while producers kept inserting messages, the backlog grew.
Two diagnostic approaches were considered:
Review all code related to the table, which proved impractical due to the difficulty of reproducing the scenario.
Leverage external help by examining the database layer to identify why the SQL could not execute.
The DBA, after a day of work, identified two key issues:
An uncommitted transaction held a lock because the connection was never closed, causing lock‑wait timeouts on subsequent updates.
The idle connection remained open for 3600 seconds (the same duration as the outage) before MySQL terminated it. Closing the connection ended the transaction and released the lock, which is why the system appeared to recover on its own.
Fundamental Cause
The payment application inserts a record into the queue and, within the same transaction, calls a third‑party service via httpclient. Only the connection timeout (30 seconds) was set; the response timeout (soTimeout) was omitted. When network issues occurred, the call blocked indefinitely, preventing the transaction from committing.
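The difference between the two timeouts can be reproduced locally. The following is a minimal sketch, not the application's actual code: it uses the JDK's HttpURLConnection instead of httpclient, where setConnectTimeout corresponds to the connection timeout and setReadTimeout plays the role of the response timeout (soTimeout). The class name ReadTimeoutDemo, the helper timesOut, and the silent local server are assumptions made for illustration.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.Proxy;
import java.net.ServerSocket;
import java.net.SocketTimeoutException;
import java.net.URI;
import java.net.URL;

class ReadTimeoutDemo {

    /**
     * Calls the URL with both timeouts set and reports whether waiting
     * for the response timed out. Hypothetical helper for illustration.
     */
    static boolean timesOut(URL url, int connectTimeoutMs, int readTimeoutMs) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection(Proxy.NO_PROXY);
        conn.setConnectTimeout(connectTimeoutMs); // limit on establishing the TCP connection
        conn.setReadTimeout(readTimeoutMs);       // limit on waiting for response data
        try (InputStream in = conn.getInputStream()) {
            in.read();
            return false;
        } catch (SocketTimeoutException e) {
            return true; // the caller gets control back instead of blocking
        } finally {
            conn.disconnect();
        }
    }

    public static void main(String[] args) throws IOException {
        // A local socket that completes the TCP handshake but never replies,
        // standing in for a third-party service that hangs.
        try (ServerSocket silent = new ServerSocket(0)) {
            URL url = URI.create("http://127.0.0.1:" + silent.getLocalPort() + "/").toURL();
            System.out.println("timed out: " + timesOut(url, 1000, 500));
        }
    }
}
```

Because the connection itself succeeds in this scenario, the connect timeout never fires. Only the read timeout bounds how long the caller, and any transaction wrapped around it, stays blocked.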
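The scheduled job then attempted to update the queue records but could not obtain the lock still held by the uncommitted transaction, so it repeatedly reported "lock wait timeout" errors. This pattern is common in services that use Spring AOP-based declarative transactions, where a remote call inside a service method ends up wrapped in the same transaction as the database work.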
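The contention pattern can be imitated in-process. The sketch below is only an analogy under stated assumptions: a ReentrantLock stands in for the database lock, a sleeping thread stands in for the transaction stuck on the HTTP call, and tryLock with a timeout stands in for the job's lock wait. The class LockWaitSketch and all parameter names are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

class LockWaitSketch {

    /**
     * Counts how many of the job's attempts time out while the stuck
     * "transaction" holds the lock for holdMs milliseconds.
     */
    static int failedAttempts(long holdMs, long waitTimeoutMs, int maxAttempts)
            throws InterruptedException {
        ReentrantLock recordLock = new ReentrantLock();
        CountDownLatch held = new CountDownLatch(1);

        Thread stuckTx = new Thread(() -> {
            recordLock.lock();
            try {
                held.countDown();
                Thread.sleep(holdMs); // stands in for the HTTP call that never returns
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                recordLock.unlock(); // stands in for MySQL closing the idle connection
            }
        });
        stuckTx.start();
        held.await(); // make sure the lock is taken before the job starts

        int failures = 0;
        for (int i = 0; i < maxAttempts; i++) {
            if (recordLock.tryLock(waitTimeoutMs, TimeUnit.MILLISECONDS)) {
                recordLock.unlock(); // the job finally gets through
                break;
            }
            failures++; // analogous to one "lock wait timeout" error
        }
        stuckTx.join();
        return failures;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("failed attempts: " + failedAttempts(600, 50, 100));
    }
}
```

The job's attempts fail for as long as the holder keeps the lock and succeed as soon as it is released, which mirrors the recovery without intervention seen in the incident.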
Resolution
Temporary fix: Configure a proper response timeout for the HTTP client to reduce the chance of the issue recurring.
Long‑term solution: Refactor the framework to use programmatic transactions and separate all remote calls from the transactional context.
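The long-term idea can be sketched as two short transactions with the remote call between them. This is an illustrative sketch, not the team's actual refactoring: TxRunner and RemoteGateway are hypothetical interfaces, and the fake runner only records transaction boundaries so the ordering is visible. In a Spring application, the role of TxRunner would typically be played by TransactionTemplate.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class SplitTransactionSketch {

    /** Hypothetical stand-in for a programmatic transaction API. */
    interface TxRunner {
        <T> T inTransaction(Supplier<T> work);
    }

    /** Hypothetical stand-in for the third-party payment call. */
    interface RemoteGateway {
        String charge(long recordId);
    }

    // Records what happened and in which order
    final List<String> events = new ArrayList<>();

    // Fake runner: a real one would begin, commit, and roll back on exceptions
    final TxRunner tx = new TxRunner() {
        @Override
        public <T> T inTransaction(Supplier<T> work) {
            events.add("begin");
            T result = work.get();
            events.add("commit");
            return result;
        }
    };

    String pay(RemoteGateway gateway) {
        // 1. Short transaction: insert the queue record and commit right away
        long id = tx.inTransaction(() -> {
            events.add("insert");
            return 1L; // placeholder record id
        });

        // 2. Remote call with no transaction open, so no lock is held while waiting
        events.add("remote");
        String outcome = gateway.charge(id);

        // 3. Second short transaction: store the outcome
        tx.inTransaction(() -> {
            events.add("update:" + outcome);
            return null;
        });
        return outcome;
    }

    public static void main(String[] args) {
        SplitTransactionSketch sketch = new SplitTransactionSketch();
        sketch.pay(id -> "OK");
        System.out.println(sketch.events);
    }
}
```

With this layout, a hanging remote call can still stall the calling thread, but it can no longer keep a database lock open. The trade-off is that the record inserted in step 1 is visible before the outcome is known, so consumers need to tolerate records whose remote call has not yet completed.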
Knowledge Points
Transaction management and Spring AOP pitfalls.
httpclient timeout configuration (connectionTimeout vs. soTimeout).
Reference: https://tech.meituan.com/2018/04/19/trade-high-availability-in-action.html
dbaplus Community
