Analysis and Optimization of InnoDB lock_wait_thread Contention in a Tencent Cloud Database
The article investigates intermittent slow update performance in a Tencent Cloud internal system caused by massive lock_wait_thread contention, analyzes the underlying InnoDB lock mechanisms and thread behavior, implements a fix by disabling lock_wait_suspend_thread triggers, and demonstrates substantial latency reduction through benchmark results.
1 Problem Phenomenon
Recently, an internal Tencent Cloud system intermittently experienced very slow row-update operations: many UPDATE statements took over 10 seconds and some exceeded 100 seconds, severely affecting normal operation. Operations staff had to kill sessions during off-peak hours (typically around midnight), which imposed a heavy burden on the online service.
2 Problem Analysis
2.1 Initial Suspicions
Operations staff suspected deadlocks because of the sporadic slow queries. Scripts were used to capture the InnoDB status and thread information when the issue occurred, but analysis of SHOW ENGINE INNODB STATUS showed no deadlock entries, so the cause was sought from the business model perspective.
2.2 Business Model
Three sub‑systems access the same database with a similar pattern: each offline request triggers a workflow that may update the same row multiple times in a short period, retrying up to ten times on lock timeout. In extreme cases, over 2,000 threads update concurrently, leading to massive connection waiting.
The hotspot row updates acquire row locks that are released at transaction commit, waking waiting threads. Under normal load this reduces throughput but does not cause multi‑second waits; therefore the lock acquisition, release, or wake‑up mechanisms were suspected.
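The client-side retry behavior described above can be sketched as follows. This is a hypothetical model, not the actual sub-system code: `do_update` stands in for issuing the UPDATE and returns True on success, False on a (simulated) lock-wait timeout.

```python
def update_with_retry(do_update, max_retries=10):
    """Retry a hotspot-row UPDATE when it hits a lock-wait timeout.

    Hypothetical sketch of the workflow described above; `do_update`
    returns True on success, False on a simulated lock-wait timeout.
    """
    for attempt in range(1, max_retries + 1):
        if do_update():
            return attempt          # attempts actually used
    return None                     # gave up after max_retries
```

With 2,000+ such clients retrying against the same row, every timeout feeds straight back into another lock-wait, which is why the lock subsystem itself came under suspicion.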
2.3 Reproducing the Issue
An offline test environment was built, and pt-pmp and perf were used to analyze the database. Thread-wait analysis revealed two abnormal groups: 233 threads waiting on the mutex inside lock_wait_suspend_thread (specifically in lock_wait_table_release_slot) and 181 threads waiting on the entry mutex of lock_wait_suspend_thread itself.
These mutexes are entered by two functions:
User threads calling lock_wait_suspend_thread
Background thread lock_wait_timeout_thread
lock_wait_suspend_thread suspends all calling threads; when a hotspot row is being updated, only one thread proceeds while the others wait on the row lock. The 1,442 threads shown in the diagram are waiting for this row lock.
lock_wait_timeout_thread monitors lock-wait timeouts, scanning each suspended thread and waking any that have timed out. It is triggered either by a one-second periodic timer or whenever lock_wait_suspend_thread notifies it; this notification is the key factor that makes hotspot updates slow.
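The scan pass performed on each wake-up can be modeled as below. This is a simplified sketch of the behavior described above, not InnoDB source; the slot fields are hypothetical.

```python
def scan_waiting_slots(slots, now, lock_wait_timeout):
    """One scan pass of the timeout monitor (simplified model): mark every
    suspended thread whose wait has exceeded lock_wait_timeout and return
    the ids of threads that should be woken with a timeout error."""
    to_wake = []
    for slot in slots:
        if slot["in_use"] and now - slot["suspended_at"] >= lock_wait_timeout:
            slot["timed_out"] = True
            to_wake.append(slot["thread_id"])
    return to_wake
```

Note that the pass visits every suspended slot regardless of how many actually timed out, which is what makes extra wake-ups expensive under high concurrency.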
The relationship between the two threads is illustrated in the following diagram:
Each newly waiting thread triggers an additional wake-up of lock_wait_timeout_thread, which then scans all already-suspended threads. With n waiters each triggering a scan over up to m suspended slots, lock-scan operations grow on the order of n × m, further aggravating contention on lock_wait_mutex_enter() and forming a vicious cycle.
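The blow-up is easy to quantify. Assuming each of n newly arriving waiters wakes the monitor once, which then rescans all waiters registered so far, the total slot visits form the sum 1 + 2 + ... + n, versus a single n-slot pass per one-second tick without the notification:

```python
def scans_with_notification(n):
    # Every newly suspended waiter wakes the monitor, which rescans all
    # waiters so far: 1 + 2 + ... + n = n(n+1)/2 slot visits in total.
    return n * (n + 1) // 2

def scans_periodic_only(n, ticks=1):
    # With only the one-second periodic wake-up, each tick visits n slots once.
    return n * ticks
```

For the 2,000-thread scenario above, that is about 2 million slot visits with per-waiter notification versus 2,000 per periodic tick, all performed under the same lock-wait mutex.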
3 Problem Solution
The fix disables the call from lock_wait_suspend_thread to lock_wait_timeout_thread , reducing the number of lock‑wait scans under high concurrency and thereby eliminating the hotspot.
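The shape of the change can be sketched as a simplified model (not the actual InnoDB patch; slot layout and names are hypothetical): the waiter still registers its slot, but no longer signals the monitor, which is left to its one-second tick.

```python
import threading

def suspend_thread(slots, thread_id, monitor_event, notify_monitor=True):
    """Simplified model of lock_wait_suspend_thread's slot registration.
    The fix is equivalent to notify_monitor=False: the slot is still
    registered, but the timeout monitor is no longer woken for every
    new waiter and instead relies on its one-second periodic wake-up."""
    slots.append({"thread_id": thread_id, "in_use": True})
    if notify_monitor:
        monitor_event.set()   # the extra wake-up that the fix removes
```

The trade-off is that a timed-out waiter may now sleep up to one extra second before being woken, which is negligible next to the multi-second stalls the notification storm was causing.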
4 Results
A simulation tool reproduced the production model with 2,000 concurrent requests, each updating once per second. Execution times before and after the optimization were measured.
Execution Time | 5.6 Before Optimization | 5.6 After Optimization
All            | 6,968                   | 32,460
1-9 s          | 1,504 (21.6%)           | 802 (2.47%)
>10 s          | 12 (0.17%)              | 0
The proportion of requests taking 1-9 seconds dropped from 21.6% to 2.47%, roughly one-tenth of the original, while the total number of completed requests rose from 6,968 to 32,460, effectively eliminating the hotspot.
pt-pmp stack traces show that most threads are suspended on lock_wait_suspend_thread's slot event, which is normal. After the fix, only five threads remain waiting on the entry mutex.
The fix was released in the latest internal MySQL 5.6 build; after a month of production monitoring, the slow-update issue has not reappeared, confirming the problem is resolved.
Tencent Database Technology Team provides various database products (e.g., CDB, CTSDB, CKV, CMonGo) and focuses on strengthening core database capabilities, enhancing performance, ensuring system stability, and solving user‑facing problems.