
Mastering TiDB Upgrades: Strategies, Risks, and Real‑World Case Studies

This comprehensive guide explains why TiDB upgrades are essential, compares offline, forced, and smooth upgrade methods, outlines detailed step-by-step procedures, weighs advantages and risks, and shares practical checkpoints and real-world case studies to help DBAs ensure a safe, performant migration.

Xiaolei Talks DB

Why Upgrade?

Older TiDB versions lack full component support (e.g., CDC suffers from memory leaks).

Encountered TiDB panic bugs in production.

Need higher performance and throughput under cost‑reduction pressure.

Contribute real business scenarios to the community.

How to Upgrade?

TiDB upgrades can be performed as offline upgrades or online upgrades. Online upgrades are further divided into smooth upgrades and forced upgrades (using the --force flag).

Forced Upgrade

Download the target version binaries to each host.

Upgrade component pd.

Upgrade component tikv (replace binaries and restart tikv without evicting leaders first).

Upgrade component tidb (sessions on the node will be disconnected).

Upgrade monitoring components (Prometheus, Grafana, Alertmanager).

Advantages: Short upgrade time (e.g., restarting a 200 GB tikv node takes ~1 minute) and controllable results.
Risks: If a restarted TiKV node still holds leaders that are receiving client requests, reads and writes may fail; applications need a reconnection mechanism to survive the session disconnections in step 4.
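With a tiup-managed cluster, a forced upgrade is a single command; the cluster name and target version below are placeholders for your deployment (a sketch, not a prescription):

```shell
# Forced upgrade: --force skips leader eviction, so each component is
# restarted immediately. Expect brief read/write failures on restarted nodes.
tiup cluster upgrade mycluster v5.4.3 --force

# Afterwards, confirm every component reports the new version.
tiup cluster display mycluster
```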

Smooth Upgrade

Download the target version binaries.

Upgrade component pd.

Upgrade component tikv (leader is evicted with a default 10‑minute timeout, then the node restarts).

Upgrade component tidb (all sessions on the node are disconnected).

Upgrade monitoring components.

Advantages: Most business read/write operations remain successful because leader eviction is completed before restart.
Risks: Longer upgrade duration (e.g., 3 pd + 3 tidb + 100 tikv may take ~16 hours) and possible mixed‑version cluster state if a step fails.
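A smooth upgrade is tiup's default behavior; the --transfer-timeout flag caps how long tiup waits for leader eviction on each TiKV node (600 seconds matches the 10-minute default mentioned above). Cluster name and version are placeholders:

```shell
# Smooth (default) upgrade: each TiKV node has its leaders evicted before
# it is restarted, which is why the total duration is much longer.
tiup cluster upgrade mycluster v5.4.3 --transfer-timeout 600
```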

Offline Upgrade

Shut down the entire TiDB cluster (all components).

Replace binaries.

Start the TiDB cluster.

Risk: Service downtime during the upgrade.
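The three offline steps map onto tiup commands roughly as follows (cluster name and version are placeholders):

```shell
# Offline upgrade: the whole cluster is down between stop and start.
tiup cluster stop mycluster
tiup cluster upgrade mycluster v5.4.3 --offline   # upgrade the stopped cluster
tiup cluster start mycluster
```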

Core Issues Before/After Upgrade

Finding a suitable rollback plan.

SQL compatibility problems.

Performance regression concerns.

1. Rollback Plan

TiDB does not provide a native rollback. Common approaches include:

Backup/restore using br (low operational cost but long recovery time).

TiDB replica (near‑real‑time sync, requires ETL tools).

Dual‑write architecture (business controls writes, but transaction atomicity may be broken).

No rollback.

Backup/Restore

Use br to back up to local storage or S3, then restore. Suitable when CDC support is insufficient.
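A minimal br sketch of this rollback path; the PD addresses and S3 bucket are placeholders, and restore assumes a standby cluster still running the old version:

```shell
# Back up the old cluster to S3 before upgrading.
br backup full --pd "10.0.1.1:2379" --storage "s3://backup-bucket/pre-upgrade"

# If a rollback is needed, restore the backup into the old-version cluster.
br restore full --pd "10.0.2.1:2379" --storage "s3://backup-bucket/pre-upgrade"
```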

TiDB Replica

Recommended when CDC is mature; use tools like TiDB Binlog (pump/drainer) or TiCDC for near-real-time sync.
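With TiCDC, keeping a standby old-version cluster in sync is one changefeed; the PD address, downstream address, and credentials below are placeholders:

```shell
# Create a changefeed replicating the upgraded cluster to a standby TiDB
# cluster (the mysql:// sink works for TiDB downstreams). Requires the cdc
# component to be installed via tiup.
tiup cdc cli changefeed create \
  --pd=http://10.0.1.1:2379 \
  --sink-uri="mysql://root:password@10.0.2.1:4000/" \
  --changefeed-id="upgrade-standby"
```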

Dual‑Write

Business writes to both old and new clusters; DBA involvement is minimal, but transaction consistency must be handled by the application.

2. SQL Compatibility

Three practical methods (a "three-pronged" approach):

Replay traffic with mysql‑replay (capture packets via tcpdump, replay on the target cluster, monitor error logs).

Read TiDB changelog and focus on compatibility changes and bug fixes.

Involve business teams to test the new version in a staging environment.
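The capture half of the replay method can be sketched as below; the interface name is a placeholder, and 4000 is TiDB's default client port. The exact mysql-replay invocation varies by tool version, so consult its documentation rather than the comment here:

```shell
# Capture production traffic on the TiDB port for later replay.
tcpdump -i eth0 -w tidb-traffic.pcap port 4000

# Then replay tidb-traffic.pcap against the target-version cluster with
# mysql-replay (flags depend on the tool version) and monitor its error logs.
```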

3. Performance Regression

Two mitigation strategies:

Manually collect statistics for tables with low health before upgrade.

Export and re‑bind execution plans (e.g., mysql.bind_info schema changes require re‑binding after upgrade).
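Both mitigations can be driven from the mysql client; host, credentials, and the table name are placeholders, and SHOW STATS_HEALTHY / SHOW GLOBAL BINDINGS are TiDB-specific statements:

```shell
# Find tables with low statistics health before the upgrade.
mysql -h 10.0.1.1 -P 4000 -u root -e "SHOW STATS_HEALTHY WHERE Healthy < 80;"

# Manually refresh statistics for an unhealthy table.
mysql -h 10.0.1.1 -P 4000 -u root -e "ANALYZE TABLE app.orders;"

# Export current plan bindings for re-binding after the upgrade.
mysql -h 10.0.1.1 -P 4000 -u root -e "SHOW GLOBAL BINDINGS;" > sqlbind.sql
```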

Upgrade Check Points

1. Preparation

Test upgrade duration for each method (offline, forced, smooth).

Research and verify rollback plans.

Check host health (CPU, memory, disks, etc.).

SQL compatibility checks (mysql‑replay, changelog, business testing).

Prevent performance regression (statistics collection, plan export).

Upgrade the tiup tool itself.

Communicate with vendor engineers and business owners.

Accumulate upgrade experience on non‑critical clusters.

2. Execution

Collect cluster metrics using tiup diag collect for before/after comparison.

Confirm rollback plan is deployed and validated.

Notify business to stop DDL operations during upgrade.

Ensure upstream/downstream services are ready.

Temporarily disable automatic statistics collection.

Export current execution plans (e.g., sqlbind.sql).

Check region replica health with pd‑ctl region.

Pause region scheduling (pd-ctl config set …-schedule-limit 0).

Verify SSH connectivity to all nodes.

Perform the upgrade with tiup cluster upgrade … v5.x.x (add --force for forced upgrade).

Import execution plans and verify.

Re‑enable automatic statistics collection.

Set performance‑enhancing parameters (e.g., tidb_enable_async_commit, tidb_enable_1pc, gc.enable-compaction-filter).

Validate read/write functionality with business services.

Confirm performance meets expectations.

Backup the new version using br to keep tools and data in sync.

Collect post‑upgrade cluster information with the clinic tool.
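The scheduling, upgrade, and parameter steps in the checklist above can be condensed into a command sketch. Addresses, cluster name, versions, and the specific scheduling limits are placeholders (0 pauses scheduling; 4 and 2048 are the usual PD defaults), and SET CONFIG for TiKV requires v4.0+:

```shell
# Baseline metrics before the upgrade (PingCAP Clinic).
tiup diag collect mycluster

# Pause region scheduling via pd-ctl.
tiup ctl:v4.0.16 pd -u http://10.0.1.1:2379 config set leader-schedule-limit 0
tiup ctl:v4.0.16 pd -u http://10.0.1.1:2379 config set region-schedule-limit 0

# Perform the upgrade (add --force for a forced upgrade).
tiup cluster upgrade mycluster v5.4.3

# Restore scheduling limits to their defaults.
tiup ctl:v5.4.3 pd -u http://10.0.1.1:2379 config set leader-schedule-limit 4
tiup ctl:v5.4.3 pd -u http://10.0.1.1:2379 config set region-schedule-limit 2048

# Enable the performance parameters mentioned above.
mysql -h 10.0.1.1 -P 4000 -u root \
  -e "SET GLOBAL tidb_enable_async_commit = ON; SET GLOBAL tidb_enable_1pc = ON;"
```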

3. Follow‑up

Update backup scripts for the new version.

Adjust slow‑log collection scripts (regex changes between v4 and v5).

Case Studies

Case 1 – PD Node Caused Upgrade Failure

During an upgrade, a 4-node PD cluster failed: one PD node could not participate in leader election, leaving only two of the four votes, so no leader could be elected and the upgrade aborted.

Solution: Scale in (shrink) the PD cluster and retry the upgrade, which then succeeded.

Case 2 – Large‑Scale Cluster A Upgrade

Cluster A handles >50,000 QPS and >20,000 TPS, with tables approaching 100 billion rows and a 99.99% SLA. An upgrade was needed for better CDC support.

Chosen method: Forced upgrade for short, controllable downtime.

Challenges addressed:

Massive cluster size (20+ TiDB, 40+ TiKV nodes).

Short maintenance window (essentially no downtime tolerated).

Rollback via backup/restore (initially 86 hours, later reduced to ~5 hours with more TiKV nodes).

SQL compatibility using the three methods above.

Performance regression mitigated by statistics collection and plan re‑binding.

Results: 97 % query latency reduction (4 s → 125 ms), CPU usage stayed below alert thresholds, write performance improved due to async commit and 1PC, and GC in compaction filter reduced performance jitter.

Final Thoughts

Database upgrades are critical projects that require coordinated effort across DBAs and business teams. While tiup cluster upgrade automates the technical steps, ensuring business acceptance, thorough pre‑checks, clear rollback strategies, and effective communication are essential to avoid “black‑swan” incidents and achieve a successful migration.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: performance, database, TiDB, upgrade, rollback
Written by Xiaolei Talks DB

Sharing daily database operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB ops and development experience.