
How to Recover a Failed TiDB PD Cluster with pd-recover: Step‑by‑Step Guide

This article walks through a real‑world TiDB PD cluster outage, explains how to diagnose the failure, retrieve the necessary IDs, install and use the pd‑recover tool, and finally restore the cluster to a healthy state, with detailed commands.

Xiaolei Talks DB

Disaster Description

On Sunday a colleague reported that an online TiDB business line's cluster was unavailable. The TiDB/PD cluster was a mixed deployment with three PD nodes, but only one PD node remained alive, rendering the whole cluster unusable.

Many PD/TiDB instances were down and TiKV was unavailable, causing panic.

The cluster had been down for a while; attempts such as cluster restart, forced PD removal, and PD scaling were made without success.

On‑Site Inspection

Only PD1 responded to ping and its process was running. PD2 was unreachable, and PD3 could be logged into but failed to start the service (tiup start hung, manual start reported errors).

Problem Analysis

In a three‑node PD cluster, a majority of PD nodes must be alive. The goal was to bring at least one of PD2 or PD3 back online.
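The majority requirement follows directly from Raft quorum arithmetic, which PD's embedded etcd uses for leader election. A minimal sketch of the math (the `quorum` helper is illustrative, not a PD tool):

```shell
#!/bin/sh
# Raft quorum math: a cluster of n voting members needs floor(n/2) + 1
# members alive, and therefore tolerates the loss of n - quorum members.
quorum() { echo $(( $1 / 2 + 1 )); }

for n in 3 5; do
  q=$(quorum "$n")
  echo "PD nodes: $n  quorum: $q  tolerated failures: $(( n - q ))"
done
```

With three PD nodes the quorum is two, which is why losing PD2 and PD3 simultaneously took the whole cluster down.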

Two approaches were taken:

Trigger an automatic restart task on the unreachable PD2 server, hoping that once PD2 came back, two live PD nodes would restore the majority.

Attempt to restart PD3, which had been cleaned with tiup scale‑in --force but still retained its data directory and binaries. The startup log showed an "etcd cluster ID mismatch" error.

The error was resolved by removing the incorrect member from the --initial-cluster parameter. Meanwhile, PD2 started successfully.
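The fix amounts to removing the stale entry from the `--initial-cluster` member list so it matches the etcd cluster ID recorded on disk. A minimal sketch of that edit, using hypothetical member names and addresses:

```shell
#!/bin/sh
# Hypothetical --initial-cluster value; pd-3 is the stale member whose
# entry triggers the "etcd cluster ID mismatch" error on startup.
initial_cluster="pd-1=http://10.0.0.1:2380,pd-2=http://10.0.0.2:2380,pd-3=http://10.0.0.3:2380"

# Split the comma-separated list, drop the mismatched member, and rejoin.
fixed=$(echo "$initial_cluster" | tr ',' '\n' | grep -v '^pd-3=' | paste -sd',' -)

echo "$fixed"
# pd-1=http://10.0.0.1:2380,pd-2=http://10.0.0.2:2380
```

The corrected value is then passed back to the PD start command in place of the original list.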

"Solution"

After PD2 recovered, the cluster became functional, but a more robust recovery method was needed for cases where multiple PD nodes fail.

PD Cluster Recovery Tool: pd‑recover

pd‑recover is a disaster‑recovery utility for PD clusters that can restore PD nodes that cannot start normally.

Installation

wget https://download.pingcap.org/tidb-v5.3.0-linux-amd64.tar.gz
tar -zxvf tidb-v5.3.0-linux-amd64.tar.gz
cd tidb-v5.3.0-linux-amd64/bin/

Obtaining the Cluster ID

The earliest Cluster ID can be found in PD, TiKV, or TiDB logs. Example command:

cat pd.log | grep "init cluster id"
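For illustration, here is how that grep behaves against a sample log line (the line below is hypothetical; real PD logs carry the same "init cluster id" marker but may differ in surrounding fields):

```shell
#!/bin/sh
# Write a hypothetical PD log line to a scratch file.
cat > /tmp/pd-sample.log <<'EOF'
[2022/01/25 13:00:00.000 +08:00] [INFO] [server.go:350] ["init cluster id"] [cluster-id=6917597403461510168]
EOF

# Extract the numeric cluster ID from the matching line.
cluster_id=$(grep "init cluster id" /tmp/pd-sample.log \
  | awk -F'cluster-id=' '{print $2}' | tr -d ']')

echo "$cluster_id"
# 6917597403461510168
```

The extracted value is the `-cluster-id` argument passed to pd-recover later.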

If PD logs are unavailable, the same query can be run against TiDB or TiKV logs.

Getting the Alloc‑ID

The alloc-id must be larger than the current maximum allocated ID. It can be retrieved from the PD monitoring panels (Prometheus/Grafana), or by running the following on each PD node:

cat pd*.log | grep "idAllocator allocates a new id" | awk -F'=' '{print $2}' | awk -F']' '{print $1}' | sort -r -n | head -n 1

Result example: 1609000. Multiply by 100 to obtain the alloc-id (e.g., 160900000).
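The same pipeline applied to sample data, end to end (the log lines and the `alloc-id=` key below are hypothetical; the pipeline only assumes the ID is the field between `=` and `]`):

```shell
#!/bin/sh
# Hypothetical PD log lines recording ID allocations.
cat > /tmp/pd-alloc.log <<'EOF'
[2022/01/25 12:00:00.000 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1608000]
[2022/01/25 12:30:00.000 +08:00] [INFO] [id.go:91] ["idAllocator allocates a new id"] [alloc-id=1609000]
EOF

# Largest allocated ID seen in the logs.
max_id=$(grep "idAllocator allocates a new id" /tmp/pd-alloc.log \
  | awk -F'=' '{print $2}' | awk -F']' '{print $1}' \
  | sort -rn | head -n 1)

# Multiply by 100 to leave generous headroom above the observed maximum.
alloc_id=$(( max_id * 100 ))
echo "$alloc_id"
# 160900000
```

Any value safely above the true maximum works; the multiplication is just a cheap way to guarantee headroom.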

Deploying a New PD Cluster (Optional)

If the original PD machines are still usable, you can delete their data directories and restart the PD services, which effectively creates a new, empty PD cluster without adding machines.

systemctl stop pd-2379.service
mv /data/pd-2379 /data/bak/pd2379
tiup cluster start zhanshi_2 -R pd

Using pd‑recover to Restore the Cluster

./pd-recover -endpoints http://pd-id:2379 -cluster-id 6917597403461510168 -alloc-id 160900000

recover success! please restart the PD cluster

Restart the Whole Cluster

After the success message, restart the entire TiDB cluster and verify the status.

Common Issues

Multiple Cluster IDs found: use the earliest one from logs for pd‑recover.
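Picking the earliest ID can be automated by sorting the matching log lines on their timestamp prefix. A sketch with two hypothetical "init cluster id" lines from different incarnations of the cluster:

```shell
#!/bin/sh
# Hypothetical logs containing two cluster IDs (e.g., after a botched re-deploy).
cat > /tmp/pd-mixed.log <<'EOF'
[2021/11/01 09:00:00.000 +08:00] [INFO] [server.go:350] ["init cluster id"] [cluster-id=6917597403461510168]
[2022/01/25 13:00:00.000 +08:00] [INFO] [server.go:350] ["init cluster id"] [cluster-id=7031234567890123456]
EOF

# The timestamp prefix sorts lexicographically, so the first line after
# sorting is the earliest; extract its cluster ID.
earliest_id=$(grep "init cluster id" /tmp/pd-mixed.log | sort | head -n 1 \
  | awk -F'cluster-id=' '{print $2}' | tr -d ']')

echo "$earliest_id"
# 6917597403461510168
```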

pd‑recover fails because PD is not running: ensure the PD service is deployed and started before running pd‑recover.

{"level":"warn","ts":"2022-01-25T13:36:48.549+0800","msg":"retrying of unary invoker failed","error":"rpc error: code = DeadlineExceeded desc = latest balancer error: all SubConns are in TransientFailure, latest connection error: transport: Error while dialing dial tcp XXX:2379: connect: connection refused"}
context deadline exceeded

Reflection

1. If the cluster is blocked by problematic SQL, consider using VIP throttling or setting max_execution_time to quickly relieve pressure.

2. Avoid panic‑driven actions; plan multiple solutions before executing.

3. Record all operations for post‑mortem analysis.

4. Regularly report slow SQL and perform pre‑deployment reviews.

5. For critical clusters, ship logs to external storage (e.g., S3) to ensure availability during outages.

The article focuses on recovering a PD majority failure; future posts will address scenarios where most TiKV replicas are unavailable.

Tags: TiDB, Database Operations, Cluster Recovery, PD, pd-recover
Written by

Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience. Your support is appreciated.
