How TiDB Achieves Multi-Active High Availability Across Multiple Data Centers
This article explains TiDB's multi‑active high‑availability architectures, including same‑city dual‑center, same‑city triple‑center, and two‑region three‑center deployments. It details the hard requirements, RPO/RTO goals, placement‑rule configuration, how the adaptive sync modes affect failover, and practical disaster‑recovery recommendations for distributed database clusters.
TiDB multi‑active is the high‑availability solution most often sought by internet companies running TiDB in core scenarios. To meet the availability requirements of a distributed database, clusters are deployed across multiple data centers: same‑city primary‑backup, same‑city dual‑center, same‑city triple‑center, two‑region three‑center, and so on.
Same‑city multi‑center hard requirements:
Data centers no more than 50 km apart, usually in the same city or in neighboring cities.
Two fiber links between the data centers, with latency under 1.5 ms and stable over the long term.
Both links providing bandwidth above 10 Gbps.
Key concepts:
RPO (Recovery Point Objective): the amount of data loss that can be tolerated; core services often require RPO = 0.
RTO (Recovery Time Objective): the maximum acceptable downtime; typically RTO < 30 s.
Multi‑Raft: TiKV replicates data with Raft; each Region has 3 replicas by default, and a write succeeds once a majority of the Raft group's nodes acknowledge it.
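The majority rule behind Multi‑Raft is easy to verify with a few lines of arithmetic. A hypothetical helper (not part of TiDB) that computes the quorum size and the number of replica failures a Raft group can survive:

```python
def quorum(replicas: int) -> int:
    """Smallest majority of a Raft group with `replicas` members."""
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can fail while writes still commit."""
    return replicas - quorum(replicas)

# With TiKV's default of 3 replicas per Region, a write commits once
# 2 replicas acknowledge it, and the group survives 1 replica failure.
print(quorum(3), tolerated_failures(3))  # 2 1
# A 5-replica setup (used later for two-region three-center) survives 2.
print(quorum(5), tolerated_failures(5))  # 3 2
```

This is why 3 replicas tolerate the loss of one failure domain, and 5 replicas the loss of two.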
Recommendations for multi‑active scenarios:
Server‑level disaster recovery: at least 3 servers.
Rack‑level disaster recovery: at least 3 racks.
Data‑center disaster recovery: at least 3 data centers.
City‑level disaster recovery: at least 3 cities.
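PD can only isolate replicas across servers, racks, data centers, and cities if it knows the topology. That is usually done by tagging each TiKV instance with labels and declaring the label hierarchy in PD. A minimal sketch, assuming the label keys `dc`, `rack`, and `host` (the key names and values are illustrative):

```toml
# tikv.toml on one instance; each server gets its own values.
[server]
labels = { dc = "dc1", rack = "rack1", host = "host1" }

# pd.toml — the isolation hierarchy PD uses when spreading replicas.
[replication]
location-labels = ["dc", "rack", "host"]
max-replicas = 3
```

With this in place, PD tries to keep the 3 replicas of each Region in different `dc` values first, then different racks, then different hosts.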
Primary‑Backup Cluster
Based on TiCDC, same‑city or cross‑city data synchronization enables a standby cluster that can offload read traffic and take over when the primary data center fails. RPO may be non‑zero because the TiCDC sink can lag behind the primary, but RTO can be kept under 30 s with automated failover scripts.
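For orientation, a primary‑to‑standby replication task is created with the TiCDC CLI roughly as follows; the PD address, sink address, and changefeed ID below are placeholders, so consult the TiCDC documentation for the full set of options:

```shell
# Create a changefeed that replicates the primary cluster's changes
# into the standby TiDB cluster (addresses and IDs are examples).
cdc cli changefeed create \
    --pd="http://<primary-pd-host>:2379" \
    --sink-uri="mysql://user:password@<standby-tidb-host>:4000/" \
    --changefeed-id="primary-to-dr"
```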
Note: This article focuses on Placement‑rule based multi‑center solutions; asynchronous TiCDC replication is not detailed. For deeper TiCDC information, see my previous article “TiCDC Application Scenarios”.
Same‑City Dual‑Center
TiDB 4.0 introduced Placement‑rule clusters; TiDB 5.4 added Data Replication Auto‑Synchronous (DR Auto‑Sync) for dual‑center deployment. Architecture: two IDC rooms in the same city – primary IDC and DR IDC.
Configuration:
Four replicas: primary IDC holds 2 voter replicas, DR IDC holds 1 voter and 1 learner; leader resides in primary IDC.
Adaptive sync switches among sync, async, and sync‑recover modes.
In sync mode, at least one replica in the DR IDC stays fully synchronized, achieving RPO = 0. If the primary IDC fails, however, the DR IDC's single voter cannot form a Raft majority on its own, so failover requires manual intervention with tikv-ctl and RTO depends on operator response time.
If a network fault lasts longer than PD's wait‑store‑timeout (default 60 s), replication switches to async mode: writes in the primary IDC continue to succeed, so service there is uninterrupted, but the DR IDC may lag behind and RPO = 0 is no longer guaranteed.
When the fault recovers, the cluster enters sync‑recover mode to catch the DR IDC back up, and PD eventually switches the replication state back to sync.
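The adaptive replication mode described above is driven by PD configuration. A sketch following the shape of the TiDB documentation; the label key and the IDC names are illustrative and must match the labels actually set on your TiKV instances:

```toml
# pd.toml — enable DR Auto-Sync between the two IDCs.
[replication-mode]
replication-mode = "dr-auto-sync"

[replication-mode.dr-auto-sync]
label-key = "dc"            # topology label distinguishing the two rooms
primary = "primary-idc"     # label value of the primary IDC
dr = "dr-idc"               # label value of the DR IDC
primary-replicas = 2        # 2 voters in the primary IDC
dr-replicas = 1             # 1 voter in the DR IDC (plus a learner)
wait-store-timeout = "1m"   # fall back to async after this outage window
```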
Features:
Sync mode provides RPO = 0; with scripted failover, RTO can be kept short, but it is not zero because the DR IDC cannot elect a leader by itself.
In async mode, RPO ≠ 0 and RTO is not guaranteed, because recovery requires manually promoting the DR replicas with tikv-ctl.
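The manual promotion mentioned above is an unsafe‑recovery operation whose exact procedure varies by TiDB version; in older releases it follows roughly this shape (the store IDs and data directory are placeholders, and the official disaster‑recovery runbook should always be followed instead of this sketch):

```shell
# After stopping the surviving TiKV instances in the DR IDC, remove the
# failed primary-IDC stores from every Raft group so the remaining
# replicas can elect leaders again (store IDs 1,2 are placeholders).
tikv-ctl --data-dir /path/to/tikv-data \
    unsafe-recover remove-fail-stores -s 1,2 --all-regions
```

Because this discards the failed stores' votes, it must only be run once the primary IDC is confirmed unrecoverable.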
Same‑City Triple‑Center
Deploy a TiDB cluster across three IDC rooms in the same city. All three data centers run TiDB/PD/TiKV and can serve read/write traffic. Using default 3‑replica Raft, any two centers hold the latest data, guaranteeing RPO = 0 and RTO < 30 s.
Best suited to read‑heavy, write‑light workloads: write latency increases by roughly one cross‑center replication round trip (≈3 ms). Read latency may also suffer when the Region leader resides in another center.
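One way to soften the cross‑center read penalty in recent TiDB versions is to let reads go to a nearby replica instead of always routing to the leader. A hedged example using the `tidb_replica_read` system variable (requires matching topology labels on the TiDB and TiKV instances; check your version's documentation for supported values):

```sql
-- Prefer replicas in the same zone as the TiDB server handling
-- the query, falling back to the leader when none is available.
SET GLOBAL tidb_replica_read = 'closest-replicas';
```

Note that follower reads trade a little staleness‑checking overhead for locality; they do not weaken consistency.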
Two‑Region Three‑Center
Two local read/write data centers plus a remote disaster‑recovery center provide city‑level HA. Example: Beijing IDC1, Beijing IDC2, and Xi’an IDC3. Five‑replica setup places two replicas in each local IDC and one in the remote IDC.
All replicas use Raft for consistency. This design ensures RPO = 0 and minute‑level RTO when a single local center fails, since the remaining three voters still form a majority. If both local centers fail, however, the remote center holds only one of the five replicas and may not have acknowledged the latest writes, so RPO ≠ 0 and RTO depends on manual recovery.
Additional cost: higher storage due to five replicas.
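The 2+2+1 placement above can be expressed with PD placement rules. A sketch with illustrative rule IDs and label values (loaded via pd-ctl; see the placement‑rules documentation for the exact save/load commands in your version):

```json
[
  { "group_id": "pd", "id": "bj-idc1", "role": "voter", "count": 2,
    "label_constraints": [{ "key": "dc", "op": "in", "values": ["bj-idc1"] }],
    "location_labels": ["dc", "rack", "host"] },
  { "group_id": "pd", "id": "bj-idc2", "role": "voter", "count": 2,
    "label_constraints": [{ "key": "dc", "op": "in", "values": ["bj-idc2"] }],
    "location_labels": ["dc", "rack", "host"] },
  { "group_id": "pd", "id": "xa-idc3", "role": "voter", "count": 1,
    "label_constraints": [{ "key": "dc", "op": "in", "values": ["xa-idc3"] }],
    "location_labels": ["dc", "rack", "host"] }
]
```

The three rules together pin five voters across the three IDCs; losing either Beijing IDC still leaves three voters, preserving the Raft majority.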
Summary
For a single city with two IDCs, TiCDC‑based primary‑backup offers only limited multi‑active capability; true multi‑active is achieved with a same‑city triple‑center deployment.
Note: This article references TiDB official solutions; all diagrams are from the TiDB website.
Xiaolei Talks DB
Sharing daily database‑operations insights, from distributed databases to cloud migration. Author: Dai Xiaolei, with 10+ years of DB operations and development experience. Your support is appreciated.