Why Does TiKV Scale-In Get Stuck After Expansion? Diagnosis and Fix
This guide explains why a TiKV node remains pending offline after a scale‑out and scale‑in operation, walks through detailed log inspection, region checks, and command‑line troubleshooting, and provides a step‑by‑step solution to forcefully remove the problematic region and clean up the store.
TiKV node scale-in often gets stuck. Two causes are common: a three-node TiKV cluster, where no remaining node can accept the region peers of the node being removed, and high disk usage on the remaining nodes that hits the schedule high-space-ratio=0.6 threshold, which stops PD from migrating regions onto them.
When shrinking a three-node cluster, the operation therefore hangs for lack of a node to receive the departing peers, and even after scaling out first and then scaling in, the node may stay in pending offline state for days.
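When the stall is caused by the disk-usage threshold, the scheduling config can be inspected and temporarily loosened with pd-ctl. A minimal sketch, assuming a placeholder PD address and an illustrative target value; revert the setting once the scale-in finishes:

```shell
# Show the current scheduling config, including high-space-ratio
tiup ctl:v5.2.1 pd -u http://pd-ip:2379 config show

# Temporarily raise the threshold so PD keeps scheduling regions
# onto fuller stores (0.8 is an illustrative value, not a recommendation)
tiup ctl:v5.2.1 pd -u http://pd-ip:2379 config set high-space-ratio 0.8
```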
Problem Scenario
Version: TiDB v5.2.1 with TiFlash nodes, upgraded from 3.x. After adding a new TiKV node on the 24th and then attempting to take an existing node offline, that node remained in pending offline for two days, even though leader and region movement was observed.
The monitoring panels showed that both the scale-out and the scale-in had completed.
Investigation
(1) Check TiKV node logs for errors.
The logs contain KvService::batch_raft send response fail errors, which in TiKV 4.x were caused by oversized Raft messages exceeding gRPC limits, blocking region scheduling. Reducing raft-max-size-per-msg can help, but the cluster is already on 5.2.1, so this fix does not apply.
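For clusters still on 4.x, that message-size fix would be applied roughly as follows. A sketch: the config key sits under raftstore, the cluster name is a placeholder, and the value shown is illustrative:

```shell
# Open the cluster topology for editing, then under server_configs -> tikv add:
#   raftstore.raft-max-size-per-msg: "512KB"
tiup cluster edit-config <cluster-name>

# Roll the change out to the TiKV nodes only
tiup cluster reload <cluster-name> -R tikv
```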
<code>$ grep 'ERROR' tikv.log
[2022/03/28 09:34:38.062 +08:00] [ERROR] [kv.rs:729] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2022/03/28 09:34:38.227 +08:00] [ERROR] [pd.rs:83] ["Failed to send read flow statistics"] [err="channel has been closed"]
[2022/03/28 09:34:55.711 +08:00] [ERROR] [server.rs:1030] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]</code>
(2) Inspect the store status. The problematic store shows state_name: "Offline" with zero leaders and zero regions, yet it remains pending offline.
<code>tiup ctl:v5.2.1 pd -u http://pd-ip:2379 store 5
{ "store": { "id": 5, "state_name": "Offline", ... }, "status": { "leader_count": 0, "region_count": 0, ... } }</code>
(3) Query the region information for the store.
<code>$ tiup ctl:v5.2.1 pd -u http://pd-ip:2379 region store 5
{ "count": 1, "regions": [ { "id": 434317, "peers": [ {"store_id":1,"role_name":"Voter"}, {"store_id":4,"role_name":"Voter"}, {"store_id":5,"role_name":"Voter"}, {"store_id":390553,"role_name":"Learner"}, {"store_id":390554,"role_name":"Learner"} ], "leader": {"role_name":"Voter"} } ] }</code>
Region 434317 has no leader, and its peers sit on the offline store and on two old TiFlash stores, which prevents store 5 from ever reaching the Tombstone state.
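Problem stores and leaderless regions like this can also be hunted down cluster-wide instead of one id at a time. A sketch, assuming jq is installed, the PD address is a placeholder, and that leaderless regions report a null leader id (matching the output shown here):

```shell
# Print every store's id, state, and region count to spot stores stuck in Offline
tiup ctl:v5.2.1 pd -u http://pd-ip:2379 store \
  | jq -r '.stores[] | "\(.store.id)\t\(.store.state_name)\t\(.status.region_count)"'

# List the ids of regions whose leader field carries no id
tiup ctl:v5.2.1 pd -u http://pd-ip:2379 region \
  | jq '.regions[] | select(.leader.id == null) | .id'
```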
(4) Verify that the region holds no actual data.
<code>$ curl http://tidb-server-ip:10080/regions/434317
{ "region_id": 434317, "frames": null }</code>
Solution
(1) Attempt to move the leader or modify peers via pd-ctl. All attempts fail because the region lacks a leader.
<code># Transfer leader to store 4
$ tiup ctl:v5.2.1 pd -u http://pd-ip:2379 operator add transfer-leader 434317 4
Failed! [500] "cannot build operator for region with no leader"
# Add a peer on store 1094772
$ tiup ctl:v5.2.1 pd -u http://pd-ip:2379 operator add add-peer 434317 1094772
Failed! [500] "cannot build operator for region with no leader"
# Remove the peer from store 5
$ tiup ctl:v5.2.1 pd -u http://pd-ip:2379 operator add remove-peer 434317 5
Failed! [500] "cannot build operator for region with no leader"</code>
Since every pd-ctl operator requires the region to have a leader, this path is blocked.
(2) Try deleting the store directly.
<code>tiup ctl:v5.2.1 pd -u http://pd-ip:2379 store delete 5
Success!</code>
The store still appears in PD's records, so this does not resolve the issue.
(3) Force a scale-in with tiup cluster scale-in --force.
<code>tiup cluster scale-in dsp_report -N 10.203.93.36:20160 --force</code>
The TiKV node disappears from TiUP, but PD still lists the store, and the problem persists.
(4) Finally, use tikv-ctl to tombstone the problematic region, which is safe because the region is empty.
<code>./tikv-ctl --db /data/deploy/data/db tombstone -r 434317 --force</code>
After this command the region is removed and the store can finally transition to the Tombstone state.
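Once the store has turned Tombstone, its leftover metadata can be purged from both PD and TiUP. A sketch, reusing the PD address placeholder and the cluster name from the example in this article:

```shell
# Remove all Tombstone stores from PD's metadata
tiup ctl:v5.2.1 pd -u http://pd-ip:2379 store remove-tombstone

# Drop Tombstone nodes from the TiUP topology as well
tiup cluster prune dsp_report
```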
For similar issues, you can first stop the scale‑in, upgrade to a newer TiDB version, and then retry the operation. To abort an ongoing scale‑in, you can reset the store state via:
<code>curl -X POST http://${pd_ip}:2379/pd/api/v1/store/${store_id}/state?state=Up</code>
Store State Transitions
Up: The store is serving traffic.
Disconnect: No heartbeat for more than 20 seconds; after max-store-down-time it becomes Down.
Down: No connection for the configured timeout (default 30 min); PD starts replicating the store's regions to other stores.
Offline: The store is manually taken offline via PD; its regions are moved away. When both the leader count and region count drop to 0, the store becomes Tombstone. The store must stay running during this phase.
Tombstone: Fully offline; the store can be safely removed with the remove-tombstone API.
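These states can also be watched through PD's HTTP API, which is convenient for scripting. A sketch, with store id 5 and the PD address following the examples in this article:

```shell
# Inspect one store's current state
curl -s http://pd-ip:2379/pd/api/v1/store/5

# Purge every Tombstone store in one call
curl -s -X DELETE http://pd-ip:2379/pd/api/v1/stores/remove-tombstone
```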
Source: TiDB official documentation on scheduling and store states.
Xiaolei Talks DB
Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.