How a TiDB Write Conflict Crashed a Cluster and What Fixed It
This article analyzes a TiDB production cluster outage caused by write‑write conflicts, walks through the monitoring data and log investigation, explains the underlying transaction model differences, and shares the step‑by‑step troubleshooting process that led to the final resolution.
Problem Background
A business team imported data into a newly created partitioned table, using SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to avoid write hotspots. A few days later, in the early morning, the cluster experienced a sharp QPS drop and abnormal duration spikes.
Symptoms and Analysis
At around 01:24 on June 21, QPS declined sharply and duration increased. Monitoring showed a gradual rise in region count, no large‑scale region balancing, and flat disk usage on the PD nodes. TiDB‑>KV error panels displayed "server is busy" alerts, pointing to Raftstore thread blockage or excessive write flow control in TiKV.
Investigation identified the TiKV instance on node 218 (IP ending in .218) as the problem, with unusually high pending commands and scheduler worker CPU usage. Restarting this node temporarily reduced both the pending commands and the CPU load.
Log Analysis
Filtering the log from node 218 for conflicts and extracting the table ID:
<code>cat 218.log | grep conflict | awk -F 'tableID=' '{print $2}'</code>
Mapping a table ID back to its schema and table name:
<code>SELECT * FROM information_schema.tables WHERE tidb_table_id='93615';</code>
A representative conflict entry:
<code>["commit failed"] [conn=250060] ["finished txn"="Txn{state=invalid}"] [error="[kv:9007]Write conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1, indexValues={...}} ..."]</code>
The logs revealed 1,147 write‑write conflicts on a single table within ten minutes, with each entry recording the transaction timestamps, the conflicting key, and the nature of the conflict.
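As a sketch of how the log fields can be mined programmatically (the regex and helper names below are illustrative, not from the original investigation), the conflict entries can be parsed and tallied per table, mirroring the awk one-liner above:

```python
import re
from collections import Counter

# Illustrative pattern for TiDB "Write conflict" log lines like the one above.
CONFLICT_RE = re.compile(
    r"txnStartTS=(?P<start_ts>\d+), "
    r"conflictStartTS=\d+, "
    r"conflictCommitTS=(?P<commit_ts>\d+), "
    r"key=\{tableID=(?P<table_id>\d+)"
)

def parse_conflict(line):
    """Return (start_ts, commit_ts, table_id) for a conflict line, or None."""
    m = CONFLICT_RE.search(line)
    if m is None:
        return None
    return int(m["start_ts"]), int(m["commit_ts"]), int(m["table_id"])

# Sample line reconstructed from the log entry quoted above.
sample = (
    '["commit failed"] [conn=250060] [error="[kv:9007]Write conflict, '
    'txnStartTS=417517629610917903, conflictStartTS=417517692315762921, '
    'conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1}"]'
)

# Counting conflicts per table quickly surfaces the hot table.
counts = Counter()
parsed = parse_conflict(sample)
if parsed:
    counts[parsed[2]] += 1
print(counts)  # Counter({93643: 1})
```

Run over the full ten-minute window, a tally like this is what surfaced the 1,147 conflicts concentrated on one table.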
Version Differences
Before TiDB v3.0.8, new clusters defaulted to the optimistic transaction model, which performs conflict detection only at commit time, so heavy write conflicts inflate duration through commit retries. Starting with v3.0.8, newly created clusters default to the pessimistic model, which acquires locks as each DML statement executes, preventing most commit‑time conflicts without application changes.
Causes of Write Conflict
TiDB uses a Percolator‑style two‑phase commit. A write conflict arises during the prewrite phase when another transaction has already committed a write to the same key, i.e., its commit_ts is greater than the current transaction's start_ts. Whether TiDB automatically retries the transaction depends on the tidb_disable_txn_auto_retry and tidb_retry_limit settings.
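The conflict condition can be checked directly against the timestamps in the log line quoted earlier. TiDB's TSO timestamps, as issued by PD, pack a physical part (Unix milliseconds) above 18 logical bits, so shifting right by 18 recovers wall-clock time; the helper below is a sketch based on that documented layout:

```python
from datetime import datetime, timezone

# PD-issued TSOs reserve the low 18 bits for a logical counter;
# the remaining high bits are Unix time in milliseconds.
LOGICAL_BITS = 18

def tso_to_datetime(ts):
    physical_ms = ts >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

# Values taken from the conflict log entry above.
txn_start_ts = 417517629610917903
conflict_commit_ts = 417517692315762921

# The prewrite-time conflict condition: another transaction committed
# the same key after ours obtained its start_ts.
assert conflict_commit_ts > txn_start_ts

print(tso_to_datetime(txn_start_ts))  # falls on 2020-06-21 (UTC)
```

Decoding the start_ts lands on June 21, 2020, consistent with the incident date reported in the monitoring section.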
Investigation Steps
Monitoring
Initial monitoring indicated a busy TiKV (server is busy). Exported TiDB and TiKV‑details metrics for the abnormal period.
Query Details
Write‑write conflicts spiked sharply around 01:15 on the 21st.
KV Duration
KV duration concentrated on store 16 (node 218), with timeouts appearing from 01:15 onward.
Errors
Raftstore error panels showed many "not leader" messages for node 218, caused by overloaded regions unable to respond.
gRPC
gRPC message count dropped while message duration doubled, indicating slower prewrite processing.
Thread CPU
Raftstore and async apply CPU usage fell after 01:15, matching the gRPC slowdown; scheduler worker CPU on node 218 remained high, reflecting heavy task scheduling.
Storage
Async write duration decreased sharply after 01:15, suggesting the bottleneck was not in write persistence but elsewhere.
RocksDB – KV
WAL and write duration in RocksDB also dropped, confirming the issue was not in RocksDB or apply stages.
Scheduler – prewrite
Scheduler‑prewrite metrics showed rising command and latch‑wait durations, aligning with increased gRPC prewrite latency. The root cause was identified as long scheduler‑prewrite wait times.
Conclusion
The outage was caused by intensive write‑write conflicts that saturated TiKV scheduler latches on a few hot keys/regions, producing the "server is busy" timeouts. Enabling TiDB's txn-local-latches to shift latch contention to the TiDB layer was attempted but did not alleviate the problem. Ultimately, the application logic was adjusted: the INSERT statements were changed to INSERT IGNORE, which turns duplicate‑key errors (error 1062) into warnings. This eliminated the conflict spikes and restored normal QPS and duration.
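The effect of the fix can be illustrated with a small runnable example. SQLite is used here only because it ships with Python and spells the statement INSERT OR IGNORE; in TiDB/MySQL the equivalent is INSERT IGNORE, and the table and columns below are made up for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# A plain INSERT of a duplicate key would raise an error
# (TiDB/MySQL error 1062); INSERT OR IGNORE skips the row instead.
conn.execute("INSERT INTO events VALUES (1, 'first')")
conn.execute("INSERT OR IGNORE INTO events VALUES (1, 'duplicate')")

rows = conn.execute("SELECT payload FROM events").fetchall()
print(rows)  # [('first',)] -- the duplicate was silently skipped
```

Because the duplicate row is dropped instead of fought over, concurrent writers no longer pile up on the same key, which is what relieved the scheduler latch contention in this incident.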
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.