TiDB Cluster Write‑Write Conflict Investigation and Resolution
This article analyzes a TiDB cluster performance incident where QPS dropped and duration spiked due to write‑write conflicts, detailing the monitoring data, root‑cause investigation of server‑busy and scheduler latch issues, and the attempted mitigation steps such as enabling txn‑local‑latches and adjusting insert statements.
Problem Background – A production TiDB cluster imported data into new physical partitions, using SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to avoid write hotspots. A few days later, at around 01:24 on June 21, QPS dropped sharply and query duration surged.
Observed Symptoms – Monitoring showed a sudden drop in QPS, increased duration, and alerts such as "server is busy" and "kv:9007 Write conflict". Region count grew slowly without large-scale rebalancing, and PD reported stagnant disk usage.
Cluster Configuration
Cluster version: v3.0.5
Cluster configuration: standard SSD disks, 128 GB memory, 40-core CPU
TiDB nodes: tidb21, tidb22, …
TiKV nodes: tidb01–tidb20, wtidb29, wtidb30
Analysis Process – The "server is busy" alert pointed to one TiKV node (IP ending in 218). Its logs showed no obvious errors, so the node was restarted, which moved its pending commands and scheduler worker CPU load to other nodes.
Log extraction command used:
cat 218.log | grep conflict | awk -F 'tableID=' '{print $2}'
The resulting logs revealed 1,147 write-write conflicts within ten minutes, each entry containing fields such as kv:9007, txnStartTS, conflictStartTS, conflictCommitTS, and the conflicting key (tableID, indexID, indexValues).
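The same pipeline can be extended to aggregate conflicts per table and surface the hottest ones. A minimal sketch: the sample log lines below are fabricated to mimic the fields the article describes (the exact layout of real TiKV/TiDB conflict logs differs), and only the tableID extraction mirrors the command above.

```shell
# Fabricated sample in the spirit of the kv:9007 write-conflict entries;
# field names (txnStartTS, conflictStartTS, tableID) come from the article,
# the surrounding layout is an assumption.
cat > conflict_sample.log <<'EOF'
[2019/06/21 01:24:01.000] [WARN] write conflict [kv:9007] txnStartTS=409000000000000001 conflictStartTS=409000000000000002 conflictCommitTS=409000000000000003 key=tableID=1477,indexID=1
[2019/06/21 01:24:02.000] [WARN] write conflict [kv:9007] txnStartTS=409000000000000004 conflictStartTS=409000000000000005 conflictCommitTS=409000000000000006 key=tableID=1477,indexID=1
[2019/06/21 01:24:03.000] [WARN] write conflict [kv:9007] txnStartTS=409000000000000007 conflictStartTS=409000000000000008 conflictCommitTS=409000000000000009 key=tableID=1478,indexID=2
EOF

# Count conflicts per tableID, hottest first.
grep conflict conflict_sample.log \
  | awk -F 'tableID=' '{split($2, a, ","); print a[1]}' \
  | sort | uniq -c | sort -rn
```

Sorting the counts makes it obvious when conflicts concentrate on one or two tables, which is exactly the pattern the investigation found.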
Using pd-ctl to convert timestamps and curl to map tableID to table names, the investigation confirmed that conflicts were concentrated on specific keys and regions.
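The timestamp conversion can also be done locally, assuming the standard TiDB TSO layout (high bits are physical time in milliseconds since the Unix epoch, low 18 bits a logical counter); pd-ctl's `tso` subcommand performs the same conversion server-side. The sample TSO below is fabricated for illustration, and the commented curl endpoint is the TiDB HTTP API path as I understand it, worth verifying against your version's docs.

```shell
# Decode a TiDB TSO into its physical (ms since epoch) and logical parts.
decode_tso() {
  physical_ms=$(( $1 >> 18 ))
  logical=$(( $1 & ((1 << 18) - 1) ))
  echo "physical_ms=${physical_ms} logical=${logical}"
}

# Fabricated TSO corresponding to 2019-06-20 17:24:00 UTC:
decode_tso 409220268687360000
# → physical_ms=1561051440000 logical=0

# Mapping a tableID from the conflict log back to a table name
# (assumed TiDB HTTP API endpoint; requires a live tidb-server):
# curl http://<tidb-server>:10080/db-table/<tableID>
```

Converting txnStartTS and conflictCommitTS this way lets you line the conflicts up against the QPS drop on the monitoring graphs.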
Version Difference – Prior to TiDB v3.0.8, the default optimistic transaction model detected conflicts only at COMMIT, so a burst of write-write conflicts caused retries and high duration. From v3.0.8 onward, new clusters default to the pessimistic model, which takes locks during each DML statement, avoiding such commit-time conflicts without application changes.
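On versions that support it, the transaction mode can also be chosen explicitly rather than relying on the default; a sketch using TiDB's tidb_txn_mode system variable (the table name is hypothetical):

```sql
-- Applies to new sessions; existing connections keep their current mode.
SET GLOBAL tidb_txn_mode = 'pessimistic';

-- Or per transaction, regardless of the global default:
BEGIN PESSIMISTIC;
INSERT INTO t VALUES (1);
COMMIT;
```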
Root Cause – Write‑write conflicts triggered TiKV scheduler latch waiting, especially on the overloaded node (IP 218). Scheduler worker CPU spiked, and the node reported many “not leader” errors because regions were busy handling conflicting writes.
Mitigation Attempts
Enabled TiDB's txn-local-latches option to shift latch waiting from the TiKV scheduler to TiDB, hoping to relieve TiKV pressure.
Adjusted application SQL from plain INSERT to INSERT IGNORE so duplicate key errors (error 1062) are handled by the database instead of causing repeated conflicts.
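For reference, local latches in TiDB 3.x are enabled in the TiDB configuration file; a sketch, with the capacity value shown only as an illustrative setting:

```toml
# tidb.toml — serialize conflicting writes through TiDB-side latches
# instead of the TiKV scheduler.
[txn-local-latches]
enabled = true
# Number of latch slots; sized relative to expected write concurrency.
capacity = 2048000
```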
The parameter change alone did not resolve the issue; switching to INSERT IGNORE eliminated the conflict spike and restored normal QPS and duration.
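The SQL rewrite can be illustrated with a minimal sketch; the table and column names are hypothetical:

```sql
-- Before: a duplicate key raises error 1062, the application retries,
-- and the retry re-runs the same conflicting write.
INSERT INTO orders (order_id, payload) VALUES (1001, 'x');

-- After: duplicates are silently skipped, making the insert idempotent
-- and removing the retry-driven write-write conflicts.
INSERT IGNORE INTO orders (order_id, payload) VALUES (1001, 'x');
```

INSERT IGNORE trades the 1062 error for a warning, so it only fits workloads where a duplicate row genuinely means "already done".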
Conclusion – The incident was caused by concentrated write‑write conflicts on a few hot keys/regions, leading to scheduler latch bottlenecks and server‑busy errors. Proper transaction mode (pessimistic) and idempotent insert logic are effective preventive measures.
360 Tech Engineering