
TiDB Cluster Splitting: Full Backup, Binlog Incremental Sync, and Migration Strategy

This article details a comprehensive TiDB cluster splitting project, covering background, challenges, backup and restore tools, multi‑stage migration steps, binlog incremental synchronization, CDC integration, and practical tips to ensure data consistency and minimal service impact.

Aikesheng Open Source Community

May 9, 2022

Background: To improve TiDB availability, a large‑scale deployment with hundreds of TiDB clusters needed to be split into two independent sets. The clusters ran several versions (4.0.9, 5.1.1, 5.1.2), and data volumes reached up to 62 TB.

Challenges included heterogeneous TiDB versions, massive data sizes (several clusters >10 TB), diverse usage patterns (direct TiDB reads/writes, MySQL‑>TiDB aggregation, TiDB‑>CDC‑>Hive), full‑backup impact on peak traffic, and guaranteeing consistency during the split.

Solution Overview: The official TiDB synchronization tools were evaluated:

DM full + incremental (only MySQL‑>TiDB)

BR full physical backup + CDC incremental

BR full physical backup + binlog incremental (Pump + Drainer)

BR (Backup & Restore) and TiDB Binlog were selected for a four‑stage migration.

Phase 1 – Preparation

1. Clean up obsolete data (e.g., tables older than three months).

2. Upgrade all 15 existing TiDB clusters to version 5.1.2, so that one splitting procedure applies to every cluster regardless of its original version.

Phase 2 – Deploy New Clusters

1. Deploy new TiDB clusters (v5.1.2) on fresh machines.

set @@global.tidb_analyze_version = 1;

2. Mount NFS on both source and target clusters and install BR on PD nodes.

External storage uses Tencent Cloud NFS with auto‑scaling and rate‑limited backup to mitigate bandwidth pressure.

3. Deploy three dedicated Drainer machines to serve the 12 new TiDB clusters, and enable binlog on the source clusters.

Pump collects binlog per cluster; Drainer runs on a dedicated 16C/32G machine for maximum incremental performance.

server_configs:
  tidb:
    binlog.enable: true
    binlog.ignore-error: true

4. Extend Pump GC retention to 7 days.

pump_servers:
  - host: xxxxx
    config:
      gc: 7
# Reload (restart) the TiDB nodes so that binlog recording takes effect

5. Perform full backups to NFS with rate limiting.

mkdir -p /tidbbr/0110_dfp
chown -R tidb.tidb /tidbbr/0110_dfp
# Rate-limited backup of all business application databases
./br backup full \
     --pd "xxxx:2379" \
     --storage "local:///tidbbr/0110_dfp" \
     --ratelimit 80 \
     --log-file /data/dbatemp/0110_backupdfp.log
# Rate-limited backup of a single specified database
./br backup db \
   --pd "xxxx:2379" \
   --db db_name \
   --storage "local:///tidbbr/0110_dfp" \
   --ratelimit 80 \
   --log-file /data/dbatemp/0110_backupdfp.log
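
When planning the backup window, it can help to estimate how long a rate‑limited backup will run. The sketch below is only a back‑of‑the‑envelope calculation: it assumes `--ratelimit` caps throughput per TiKV node in MiB/s, that every node sustains that cap for the whole run, and that the size passed in is the single‑replica data size. The function name and the example figures (10 TiB, 10 TiKV nodes) are hypothetical and do not come from this project.

```python
def estimate_backup_hours(data_tib: float, tikv_nodes: int, ratelimit_mib_s: float) -> float:
    """Best-case duration (hours) of a rate-limited backup.

    Assumption: each TiKV node backs up its share of the data at exactly
    `ratelimit_mib_s` MiB/s, so cluster throughput is nodes * ratelimit.
    Real runs are slower when load is uneven or storage is the bottleneck.
    """
    if tikv_nodes <= 0 or ratelimit_mib_s <= 0:
        raise ValueError("tikv_nodes and ratelimit_mib_s must be positive")
    total_mib = data_tib * 1024 * 1024          # TiB -> MiB
    cluster_mib_s = tikv_nodes * ratelimit_mib_s
    return total_mib / cluster_mib_s / 3600     # seconds -> hours


if __name__ == "__main__":
    # Hypothetical cluster: 10 TiB of data, 10 TiKV nodes, --ratelimit 80
    print(f"{estimate_backup_hours(10, 10, 80):.2f} hours")
```

Treat the result as a lower bound; if the estimate already overlaps peak traffic, the backup probably needs to start earlier or be done per database.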

6. Export user and privilege information from the source clusters using pt-show-grants (BR does not back up the MySQL system databases).

7. Restore the full backup to each new TiDB cluster, ensuring sufficient disk space (new clusters may consume a few terabytes more due to LZ4 compression differences).

Phase 3 – Incremental Synchronization & CDC

1. Verify data consistency after full and incremental sync; monitor drainer TSO lag via PD or Grafana.

# Lag check, method 1: read the latest replicated TSO from the drainer status on the source TiDB, then resolve it through PD to see the lag
mysql> show drainer status;
+-------------+-------------+--------+--------------------+---------------------+
| NodeID      | Address     | State  | Max_Commit_Ts      | Update_Time         |
+-------------+-------------+--------+--------------------+---------------------+
| xxxxxx:8249 | xxxxxx:8249 | online | 430547587152216733 | 2022-01-21 16:50:58 |
+-------------+-------------+--------+--------------------+---------------------+

tiup ctl:v5.1.2 pd -u http://xxxxxx:2379 -i tso 430547587152216733;
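
The PD lookup above converts a TSO into wall‑clock time. The same conversion can be sketched offline, which is convenient for scripting a lag check. A TiDB TSO packs a physical timestamp in milliseconds into the high bits and an 18‑bit logical counter into the low bits. The helper names below are hypothetical, and the lag is only as accurate as the clock used for `now`.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

LOGICAL_BITS = 18  # the low 18 bits of a TSO hold the logical counter
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def decode_tso(tso: int) -> Tuple[int, int]:
    """Split a TSO into (physical milliseconds since epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)


def tso_to_datetime(tso: int) -> datetime:
    """Return the UTC wall-clock time encoded in the physical part of a TSO."""
    physical_ms, _ = decode_tso(tso)
    return EPOCH + timedelta(milliseconds=physical_ms)


def lag_seconds(max_commit_ts: int, now: datetime) -> float:
    """Seconds between `now` (timezone-aware) and the drainer's Max_Commit_Ts."""
    return (now - tso_to_datetime(max_commit_ts)).total_seconds()


if __name__ == "__main__":
    max_commit_ts = 430547587152216733  # Max_Commit_Ts from `show drainer status`
    print(tso_to_datetime(max_commit_ts).isoformat())
    print(lag_seconds(max_commit_ts, datetime.now(timezone.utc)))
```

Note that the result is in UTC, whereas `Update_Time` in the drainer status is shown in the server's local time zone.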

2. Deploy TiFlash replicas on the new clusters.

SELECT * FROM information_schema.tiflash_replica WHERE TABLE_SCHEMA = '<db_name>' and TABLE_NAME = '<table_name>';
SELECT concat('alter table ',table_schema,'.',table_name,' set tiflash replica 1;') FROM information_schema.tiflash_replica where table_schema like 'dfp%';

3. Set up CDC pipelines to Kafka and DRC‑TiDB for bidirectional sync.

4. Configure drainer workers per database to avoid lag; three drainers in parallel achieved ~12 k TPS target write throughput versus ~6 k TPS source.

# Get the TSO at which the full backup started from the backup log
grep "BackupTS=" /data/dbatemp/0110_backupdfp.log
430388153465177629
# First round: key configuration for incremental sync with a single drainer
drainer_servers:
- host: xxxxxx
  commit_ts: 430388153465177629
  deploy_dir: "/data/tidb-deploy/drainer-8249"
  config:
    syncer.db-type: "tidb"
    syncer.to.host: "xxxdmall.db.com"
    syncer.worker-count: 550

# Second round: parallel incremental sync with multiple drainers
drainer_servers:
- host: xxxxxx
  commit_ts: 430505424238936397
  config:
    syncer.replicate-do-db: [db1,db2,....]
    syncer.db-type: "tidb"
    syncer.to.host: "xxxdmall.db.com"
    syncer.worker-count: 550
    syncer.to.checkpoint.schema: "tidb_binlog2"
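
The throughput figures in step 4 explain why parallel drainers matter: lag only shrinks when the target applies changes faster than the source produces them. The following sketch works through that arithmetic under the simplifying assumption that both rates stay constant. The function name and the one‑hour starting lag are hypothetical; the 6 k and 12 k TPS figures are the ones quoted above.

```python
def catch_up_seconds(lag_s: float, source_tps: float, apply_tps: float) -> float:
    """Time for a drainer to drain its backlog, assuming constant rates.

    backlog      = lag_s * source_tps          (transactions not yet applied)
    net progress = apply_tps - source_tps      (backlog removed per second)
    """
    if apply_tps <= source_tps:
        raise ValueError("apply rate must exceed source rate or lag never shrinks")
    backlog = lag_s * source_tps
    return backlog / (apply_tps - source_tps)


if __name__ == "__main__":
    # One hour behind, source writing ~6k TPS, target applying ~12k TPS
    print(catch_up_seconds(3600, 6_000, 12_000) / 60, "minutes")
```

With an apply rate of twice the source rate, the catch‑up time equals the current lag; as the apply rate approaches the source rate, the catch‑up time grows without bound, which matches the lag seen with a single drainer under heavy write load.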

Phase 4 – Cut‑over & Cleanup

1. Update DNS to point applications to the new TiDB compute nodes.

2. Batch‑kill old‑cluster connections to free resources.

3. Remove the Drainer links that replicate from the old clusters to the new ones. Because the Drainer machines are shared by several clusters, changes to them affect node_exporter alerts across all of those clusters.

4. Verify final consistency, run ANALYZE TABLE on new clusters, and ensure all monitoring dashboards reflect the new topology.
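
The batch kill in step 2 can be scripted by generating statements from the process list instead of typing connection IDs by hand. This is a minimal sketch, assuming the rows were fetched from `information_schema.processlist` on an old‑cluster TiDB node and contain `ID` and `USER` columns. The helper name and the list of protected users are hypothetical. It only builds the statements and does not execute them.

```python
from typing import Iterable, List, Mapping, Set


def build_kill_statements(rows: Iterable[Mapping[str, object]],
                          keep_users: Set[str]) -> List[str]:
    """Return one `KILL TIDB <id>;` statement per connection not owned by keep_users."""
    statements = []
    for row in rows:
        if row["USER"] in keep_users:
            continue  # leave DBA / monitoring sessions alone
        statements.append(f"KILL TIDB {int(row['ID'])};")
    return statements


if __name__ == "__main__":
    # Hypothetical rows as returned by a MySQL client library
    sample = [
        {"ID": 11, "USER": "app_rw"},
        {"ID": 12, "USER": "root"},
        {"ID": 13, "USER": "app_ro"},
    ]
    for stmt in build_kill_statements(sample, keep_users={"root"}):
        print(stmt)
```

Unless global kill is enabled, a `KILL TIDB` statement generally has to be executed on the TiDB instance that owns the connection, so the list should be built and run per node rather than through a load balancer.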

Summary

Standardizing all clusters to the same TiDB version simplifies splitting and reduces operational cost.

Ensure target clusters have ample disk space for restored data.

When source write pressure is high, use per‑database parallel drainer workers to keep binlog lag low.

Pre‑execute as many preparatory steps as possible; the actual split window should be short.

Thanks to the TiDB community for its technical support.


