How to Recover a TiKV Cluster After Multiple Node Failures

This article demonstrates how to simulate and recover TiKV cluster failures by shutting down one, two, or three nodes, explains the impact on Raft groups and region availability, and provides step‑by‑step commands for disabling PD scheduling, using tikv‑ctl, and restoring data integrity.

Xiaolei Talks DB

All the following analysis assumes a TiKV region with the default three‑replica configuration.

In a large‑scale deployment, each region’s three replicas are scheduled on three different TiKV nodes, and a region becomes unavailable once a majority of its replicas are lost. When two TiKV nodes fail simultaneously, there is a high probability that some Raft group has two of its three peers on the failed nodes, which is why the analysis starts from this probability.
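As a back‑of‑the‑envelope sketch of that probability (an illustration added here, not a calculation from the original article): with three replicas spread across five nodes, a given region loses its Raft quorum exactly when both failed nodes are among its three replica‑holding nodes, i.e. C(3,2)/C(5,2):

```shell
# Illustrative calculation: probability that one specific 3-replica region
# loses quorum when exactly 2 of 5 TiKV nodes fail simultaneously.
# The region loses quorum iff both failed nodes hold one of its replicas:
# C(3,2) / C(5,2) = 3 / 10.
awk 'BEGIN {
  c52 = 5 * 4 / 2        # C(5,2) = 10 ways to pick the 2 failed nodes
  c32 = 3 * 2 / 2        # C(3,2) = 3 of those picks hit 2 replica nodes
  printf "%.2f\n", c32 / c52
}'
# prints 0.30
```

So each individual region has a 30% chance of losing quorum, and with thousands of regions in the cluster it is near certain that some regions become unavailable.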

Cluster availability means the whole cluster continues to serve requests without DBA intervention. For example, in a five‑node TiKV cluster, a single node failure triggers leader election and follower promotion, and after about 30 minutes (the default max‑store‑down‑time) the remaining nodes begin replicating the missing replicas.

Data loss here refers to the situation where, despite the three‑replica configuration, a majority (two) of a region’s replicas become unavailable while at least one replica survives: the region loses its Raft quorum and stops serving reads and writes until it is manually repaired.

Experiment: Simulating Failures in a Five‑Node Cluster

Test Environment

Ten tables with 500,000 rows each are loaded using sysbench, and sysbench read traffic simulates requests.

<code>sysbench /usr/share/sysbench/oltp_read_write.lua --mysql-host=10.xxxx.160 --mysql-port=4000 --mysql-db=test --mysql-user=root --mysql-password='xxx' --table_size=500000 --tables=10 --threads=30 --time=220 --report-interval=10 --db-driver=mysql prepare
sysbench /usr/share/sysbench/oltp_read_only.lua --mysql-host=10.xxxx.160 --mysql-port=4000 --mysql-db=test --mysql-user=root --mysql-password='xxx' --table_size=500000 --tables=10 --threads=30 --time=2000 --report-interval=10 --db-driver=mysql run</code>

Single‑Node Failure

When one TiKV node goes down, the majority of replicas stay alive, a new leader is elected, and the system quickly recovers. The QPS drops briefly (from ~12,000 to ~4,000) during leader election, as shown in the following screenshot.

<code>tiup ctl:v5.1.1 pd -u http://10.xxxxx.173:2379 store|grep -B 10 'Disconnected'
{
  "store": {
    "id": 5,
    "address": "10.xxxx.155:20160",
    "version": "5.1.1",
    "status_address": "10.xxxx.155:20180",
    "state_name": "Disconnected"
  }
}</code>

When a node’s state changes to Disconnected, it becomes Down after max‑store‑down‑time (default 30 minutes). Only after it becomes Down do the other TiKV nodes start replicating the missing replicas.
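The threshold is a PD configuration item, max-store-down-time. As a hedged sketch (the PD endpoint is a placeholder), it can be inspected and adjusted with pd-ctl:

```shell
# Inspect the current down-store threshold (PD endpoint is a placeholder)
tiup ctl:v5.1.1 pd -u http://<pd-host>:2379 config show | grep max-store-down-time

# Optionally shorten it so replica supplementation starts sooner
tiup ctl:v5.1.1 pd -u http://<pd-host>:2379 config set max-store-down-time 10m
```

Lowering the value speeds up self‑healing after a permanent node loss, at the cost of triggering unnecessary replication during short outages.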

Simultaneous Failure of Two Nodes

Two TiKV nodes are forcibly failed by deleting their data directories (rm -rf on the TiKV data directory). The following image shows the cluster state after the failure.

Sysbench reports “tikv server timeout” and “region is unavailable”, and QPS drops to zero.

Recovery steps:

Identify the disconnected stores (IDs 1 and 6) using tiup ctl pd store.

Disable PD scheduling to avoid interference:

<code>$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set region-schedule-limit 0
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set replica-schedule-limit 0
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set leader-schedule-limit 0
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set merge-schedule-limit 0</code>

Stop the remaining TiKV nodes to prevent new writes and release file locks.

<code>$ tiup cluster stop BA-xxxx_bak -R tikv</code>

On each surviving node, run tikv‑ctl unsafe‑recover remove‑fail‑stores to remove the failed stores from all regions:

<code>$ ./tikv-ctl --data-dir /data6/tikv-20180 unsafe-recover remove-fail-stores -s 6,1 --all-regions</code>

The command logs show peer changes for many regions, confirming the removal.

Re‑enable PD scheduling:

<code>$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set region-schedule-limit 2048
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set replica-schedule-limit 64
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set leader-schedule-limit 4
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set merge-schedule-limit 8</code>

Restart the TiKV cluster (preferably only the three healthy nodes):

<code>$ tiup cluster start BA-analyse-tidb_shyc_bak -R tikv</code>

After restart, the cluster returns to full availability; sysbench shows the original QPS and table row counts (500,000 rows each for sbtest10 and sbtest3).

<code>mysql> select count(1) from sbtest10;
+----------+
| count(1) |
+----------+
|   500000 |
+----------+

mysql> select count(1) from sbtest3;
+----------+
| count(1) |
+----------+
|   500000 |
+----------+</code>

Simultaneous Failure of Three Nodes

Three TiKV servers are shut down to simulate a hardware outage. The recovery procedure mirrors the two‑node case but with additional steps to handle regions that lost all three replicas.

Identify the three disconnected stores (IDs 1, 2, 8).

Disable PD scheduling.

List regions where a majority of peers reside on the failed stores using a jq filter.

<code>tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(2,8,1) then . else empty end) | length>=$total-length)}'</code>

Regions 52, 46, 175, 58 have lost all three replicas and their data is unrecoverable; other regions have only one remaining peer.
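To see what the jq filter actually selects, here is a self‑contained run against made‑up sample data (the region IDs and peers below are invented for illustration): a region is emitted only when its peers on the failed stores (2, 8, 1 here) form at least half of its peer list, i.e. when it has lost its Raft majority.

```shell
# Made-up sample: region 52 has all three peers on failed stores 2, 8, 1;
# region 99 still has a live quorum (only one peer on a failed store).
echo '{"regions":[{"id":52,"peers":[{"store_id":2},{"store_id":8},{"store_id":1}]},{"id":99,"peers":[{"store_id":2},{"store_id":4},{"store_id":5}]}]}' |
jq -c '.regions[] | {id: .id, peer_stores: [.peers[].store_id] |
  select(length as $total
         | map(if .==(2,8,1) then . else empty end)
         | length >= $total - length)}'
# prints only: {"id":52,"peer_stores":[2,8,1]}
```

The select drops any region whose failed‑store peer count is below half its total peers, so region 99 (one of three peers failed) is filtered out.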

Stop the two surviving TiKV nodes.

<code>tiup cluster stop <cluster-name> -N <remaining-tikv-nodes></code>

Remove the failed stores from all regions on the surviving nodes:

<code>./tikv-ctl --data-dir /data6/tikv-20180/ unsafe-recover remove-fail-stores -s 2,8,1 --all-regions</code>

Recreate empty regions, one at a time, for the regions that lost all replicas (shown here for 52 and 46; regions 175 and 58 are handled the same way):

<code>./tikv-ctl --data-dir /data6/tikv-20180/ recreate-region -p 10.xxxxx:2379 -r 52
./tikv-ctl --data-dir /data6/tikv-20180/ recreate-region -p 10.xxxxx:2379 -r 46</code>

Re‑enable PD scheduling and restart the two healthy TiKV nodes.

<code>$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set region-schedule-limit 2048
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set replica-schedule-limit 64
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set leader-schedule-limit 4
$ tiup ctl:v5.1.1 pd -u http://10.xxxxx:2379 config set merge-schedule-limit 8
$ tiup cluster start BA-analyse-tidb_shyc_bak -N <healthy-nodes></code>

After these steps, the majority of regions become healthy again, though tables sbtest5 and sbtest6 have lost roughly half of their rows.

<code>mysql> select count(1),count(c) from sbtest5;
+----------+----------+
| count(1) | count(c) |
+----------+----------+
|   280555 |   280555 |
+----------+----------+

mysql> select count(1),count(c) from sbtest6;
+----------+----------+
| count(1) | count(c) |
+----------+----------+
|   247312 |   247312 |
+----------+----------+</code>

Summary and Reflections

Summary

The article explains the behavior of a TiKV cluster when one, two, or three nodes fail and provides a practical recovery workflow using tikv‑ctl and PD configuration tweaks. With three‑replica Raft groups, a single node failure is handled automatically, two nodes require manual intervention, and three nodes can cause irreversible data loss for some regions.

Thoughts

When multiple TiKV nodes become unavailable, front‑end load balancers (e.g., HAProxy/LVS) should be taken offline to stop new writes.

Increasing the replica count (e.g., to five) can improve tolerance against simultaneous node failures.
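The replica count is a PD setting. As a hedged sketch (the PD endpoint is a placeholder), raising it looks like this; note that five replicas increase storage cost and write latency, since each write must reach a quorum of three:

```shell
# Raise the default replica count from 3 to 5 (PD endpoint is a placeholder).
# With 5 replicas, a region survives 2 simultaneous node failures.
tiup ctl:v5.1.1 pd -u http://<pd-host>:2379 config set max-replicas 5
```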

Disabling PD scheduling during recovery prevents inconsistent metadata.

Recreating empty regions may affect index consistency; verify data correctness after recovery.

Regular TiDB backups (BR) are essential for disaster recovery.
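A minimal sketch of a periodic full backup with BR (the PD endpoint, storage path, and cluster details are placeholders, not values from the original article):

```shell
# Full cluster backup to shared storage (all values are placeholders)
tiup br backup full \
    --pd "<pd-host>:2379" \
    --storage "local:///backup/full-$(date +%F)" \
    --log-file backup.log
```

With a recent full backup, regions that lose all three replicas can be restored from the backup instead of being recreated empty.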

Active‑active architectures (e.g., TiCDC‑based multi‑region setups) provide higher availability.

Future work may explore automated chaos‑engineering tools to test these scenarios more systematically.

Tags: TiDB, Data Loss, Raft, TiKV, Cluster Recovery, PD
Written by

Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.
