How a TiDB Write Conflict Crashed a Cluster and What Fixed It
This article analyzes a TiDB production cluster outage caused by write‑write conflicts, walks through the monitoring data and log investigation, explains the underlying transaction model differences, and shares the step‑by‑step troubleshooting process that led to the final resolution.
Problem Background
A business team imported data into a newly created partitioned table, using SHARD_ROW_ID_BITS and PRE_SPLIT_REGIONS to avoid write hotspots. A few days later, in the early morning, the cluster experienced a sharp QPS drop and abnormal duration spikes.
Symptoms and Analysis
At around 01:24 on June 21, QPS declined sharply and duration increased. Monitoring showed a gradual rise in region count, no large‑scale region balancing, and flat disk usage on the PD nodes. TiDB‑>KV error panels displayed "server is busy" alerts, pointing to Raftstore thread blockage or excessive write flow control in TiKV.
Investigation identified the TiKV instance on node 218 (IP ending in .218) as the problem, with unusually high pending commands and scheduler worker CPU usage. Restarting this node temporarily reduced both the pending commands and the CPU load.
Log Analysis
Filtering the log from node 218 for conflicts and extracting the table ID:
<code>cat 218.log | grep conflict | awk -F 'tableID=' '{print $2}'</code>
Mapping a table ID back to its schema and table name:
<code>SELECT * FROM information_schema.tables WHERE tidb_table_id='93615';</code>
A representative conflict entry:
<code>["commit failed"] [conn=250060] ["finished txn"="Txn{state=invalid}"] [error="[kv:9007]Write conflict, txnStartTS=417517629610917903, conflictStartTS=417517692315762921, conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1, indexValues={...}} ..."]</code>
The logs revealed 1,147 write‑write conflicts on a single table within ten minutes, with each entry recording the transaction timestamps, the conflicting key, and the nature of the conflict.
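As a sketch of how the log fields can be mined programmatically (the regex and helper names below are illustrative, not from the original investigation), the conflict entries can be parsed and tallied per table, mirroring the awk one-liner above:

```python
import re
from collections import Counter

# Illustrative pattern for TiDB "Write conflict" log lines like the one above.
CONFLICT_RE = re.compile(
    r"txnStartTS=(?P<start_ts>\d+), "
    r"conflictStartTS=\d+, "
    r"conflictCommitTS=(?P<commit_ts>\d+), "
    r"key=\{tableID=(?P<table_id>\d+)"
)

def parse_conflict(line):
    """Return (start_ts, commit_ts, table_id) for a conflict line, or None."""
    m = CONFLICT_RE.search(line)
    if m is None:
        return None
    return int(m["start_ts"]), int(m["commit_ts"]), int(m["table_id"])

# Sample line reconstructed from the log entry quoted above.
sample = (
    '["commit failed"] [conn=250060] [error="[kv:9007]Write conflict, '
    'txnStartTS=417517629610917903, conflictStartTS=417517692315762921, '
    'conflictCommitTS=417517692315762921, key={tableID=93643, indexID=1}"]'
)

# Counting conflicts per table quickly surfaces the hot table.
counts = Counter()
parsed = parse_conflict(sample)
if parsed:
    counts[parsed[2]] += 1
print(counts)  # Counter({93643: 1})
```

Run over the full ten-minute window, a tally like this is what surfaced the 1,147 conflicts concentrated on one table.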
Version Differences
Before TiDB v3.0.8, new clusters defaulted to the optimistic transaction model, which performs conflict detection only at commit time, so heavy write conflicts inflate duration through commit retries. Starting with v3.0.8, newly created clusters default to the pessimistic model, which acquires locks as each DML statement executes, preventing most commit‑time conflicts without application changes.
Causes of Write Conflict
TiDB uses a Percolator‑style two‑phase commit. A write conflict arises during the prewrite phase when another transaction has already committed a write to the same key, i.e., its commit_ts is greater than the current transaction's start_ts. Whether TiDB automatically retries the transaction depends on the tidb_disable_txn_auto_retry and tidb_retry_limit settings.
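The conflict condition can be checked directly against the timestamps in the log line quoted earlier. TiDB's TSO timestamps, as issued by PD, pack a physical part (Unix milliseconds) above 18 logical bits, so shifting right by 18 recovers wall-clock time; the helper below is a sketch based on that documented layout:

```python
from datetime import datetime, timezone

# PD-issued TSOs reserve the low 18 bits for a logical counter;
# the remaining high bits are Unix time in milliseconds.
LOGICAL_BITS = 18

def tso_to_datetime(ts):
    physical_ms = ts >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

# Values taken from the conflict log entry above.
txn_start_ts = 417517629610917903
conflict_commit_ts = 417517692315762921

# The prewrite-time conflict condition: another transaction committed
# the same key after ours obtained its start_ts.
assert conflict_commit_ts > txn_start_ts

print(tso_to_datetime(txn_start_ts))  # falls on 2020-06-21 (UTC)
```

Decoding the start_ts lands on June 21, 2020, consistent with the incident date reported in the monitoring section.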
Investigation Steps
Monitoring
Initial monitoring indicated a busy TiKV (server is busy). Exported TiDB and TiKV‑details metrics for the abnormal period.
Query Details
Write‑write conflicts spiked sharply around 01:15 on the 21st.
KV Duration
KV duration concentrated on store 16 (node 218), with timeouts appearing from 01:15 onward.
Errors
Raftstore error panels showed many "not leader" messages for node 218, caused by overloaded regions unable to respond.
gRPC
gRPC message count dropped while message duration doubled, indicating slower prewrite processing.
Thread CPU
Raftstore and async apply CPU usage fell after 01:15, matching the gRPC slowdown; scheduler worker CPU on node 218 remained high, reflecting heavy task scheduling.
Storage
Async write duration decreased sharply after 01:15, suggesting the bottleneck was not in write persistence but elsewhere.
RocksDB – KV
WAL and write duration in RocksDB also dropped, confirming the issue was not in RocksDB or apply stages.
Scheduler – prewrite
Scheduler‑prewrite metrics showed rising command and latch‑wait durations, aligning with increased gRPC prewrite latency. The root cause was identified as long scheduler‑prewrite wait times.
Conclusion
The outage was caused by intensive write‑write conflicts that saturated TiKV scheduler latches on a few hot keys/regions, producing the "server is busy" timeouts. Enabling TiDB's txn-local-latches to shift latch contention to the TiDB layer was attempted but did not alleviate the problem. Ultimately, the application logic was adjusted: the INSERT statements were changed to INSERT IGNORE, which turns duplicate‑key errors (error 1062) into warnings. This eliminated the conflict spikes and restored normal QPS and duration.
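The effect of the fix can be illustrated with a small runnable example. SQLite is used here only because it ships with Python and spells the statement INSERT OR IGNORE; in TiDB/MySQL the equivalent is INSERT IGNORE, and the table and columns below are made up for the demonstration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")

# A plain INSERT of a duplicate key would raise an error
# (TiDB/MySQL error 1062); INSERT OR IGNORE skips the row instead.
conn.execute("INSERT INTO events VALUES (1, 'first')")
conn.execute("INSERT OR IGNORE INTO events VALUES (1, 'duplicate')")

rows = conn.execute("SELECT payload FROM events").fetchall()
print(rows)  # [('first',)] -- the duplicate was silently skipped
```

Because the duplicate row is dropped instead of fought over, concurrent writers no longer pile up on the same key, which is what relieved the scheduler latch contention in this incident.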
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.