
Accelerating MySQL Data Repair with pt-table-sync: A Self‑Healing Solution

This article explains how the Tencent Game DBA team uses Percona's pt-table-sync to detect and automatically repair MySQL replication inconsistencies, achieving up to 30‑fold speed improvements, reducing resource usage, and enabling a data self‑healing service for large‑scale gaming databases.

ITPUB

1. Introduction

Results from two gray-release (canary) services illustrate the benefit of automated data self-healing. For service A (330 GB), pt-table-sync cut repair time from 150 minutes to 5 minutes, a 30-fold speed-up. For service B (93 GB), it cut repair time from 35 minutes to 3 minutes, roughly a 12-fold speed-up.

2. Background

MySQL replication normally keeps the master and its replicas consistent, but unsafe statements, hardware failures, and other anomalies can cause the data to diverge. Checksums detect the divergence, but the traditional fix is to rebuild the hot standby, which can take more than 10 hours even when only a few rows differ. Repairing only the differing rows online is therefore far more efficient.

3. Benefits

Shorter hot‑standby recovery time, improving data safety.

Reduced server resources by avoiding full hot‑standby rebuilds.

Higher DBA efficiency.

Lower communication overhead, ensuring stable business operation.

4. Data‑Self‑Healing Solution

The team adopted Percona’s pt-table-sync, an open‑source tool launched in 2007, and extended it for internal use.

4.1 Process Overview

Two modes exist: replicate and non‑replicate; the replicate mode is recommended. The workflow per chunk is:

On the master, lock the chunk with SELECT ... FOR UPDATE and record the binlog position reported by SHOW MASTER STATUS.

On the replica, call SELECT MASTER_POS_WAIT(...) to wait until the replica reaches the same position.

Run a checksum on the chunk on both master and replica.

If checksums match, proceed to the next chunk.

If they differ, drill down to row‑level checksums, mark mismatched rows, and continue until the chunk is fully examined.

Repair all mismatched rows before moving to the next chunk.
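The compare-then-drill-down part of the steps above can be sketched in Python. This is only an illustration on in-memory rows, not pt-table-sync's code: the function names are hypothetical, rows are tuples whose first element is the primary key, and the locking and MASTER_POS_WAIT steps are left out.

```python
import zlib

def row_crc(row):
    # Stand-in for the per-row CRC32: join the column values with '#'.
    return zlib.crc32("#".join(str(v) for v in row).encode("utf-8"))

def chunk_crc(rows):
    # Chunk-level summary: row count plus the XOR of all row checksums.
    crc = 0
    for row in rows:
        crc ^= row_crc(row)
    return len(rows), crc

def diff_chunk(master_rows, replica_rows):
    """Return the primary keys that need repair on the replica."""
    # Cheap check first: identical chunk summaries mean "move on".
    if chunk_crc(master_rows) == chunk_crc(replica_rows):
        return []
    # Otherwise drill down to row-level checksums keyed by primary key.
    master = {r[0]: row_crc(r) for r in master_rows}
    replica = {r[0]: row_crc(r) for r in replica_rows}
    return sorted(k for k in master.keys() | replica.keys()
                  if master.get(k) != replica.get(k))

master_rows = [(1, "a"), (2, "b"), (3, "c")]
replica_rows = [(1, "a"), (2, "x")]
keys_to_repair = diff_chunk(master_rows, replica_rows)
```

Here the replica has one changed row and one missing row, so both keys end up in keys_to_repair, while an identical chunk returns an empty list without any row-level work.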

Example master query:

SELECT /*water2.t:1/1*/ 0 AS chunk_num, COUNT(*) AS cnt, COALESCE(LOWER(CONV(BIT_XOR(CAST(CRC32(CONCAT_WS('#', id, name, CONCAT(ISNULL(name)))) AS UNSIGNED)), 10, 16)), 0) AS crc FROM water2.t FORCE INDEX (PRIMARY) WHERE (1=1) AND ((1=1)) FOR UPDATE

Replica wait command:

SELECT MASTER_POS_WAIT('binlog3306.000014', 139672350, 60)

Row‑level checksum example:

SELECT /*rows in chunk*/ id, name, CRC32(CONCAT_WS('#', id, name, CONCAT(ISNULL(name)))) AS __crc FROM water2.t FORCE INDEX (PRIMARY) WHERE (1=1) AND (1=1) ORDER BY id FOR UPDATE;
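To make the checksum expression in the two queries above easier to follow, here is a rough Python emulation for the two-column table (id, name). It assumes that zlib.crc32 yields the same value as MySQL's CRC32() for the same bytes, that strings are UTF-8 encoded, and that id is never NULL; the function names are illustrative.

```python
import zlib

def row_string(id_, name):
    # CONCAT_WS('#', id, name, CONCAT(ISNULL(name))):
    # CONCAT_WS skips NULL arguments, so a NULL name adds nothing itself,
    # and the trailing ISNULL flag is '1' for NULL and '0' otherwise.
    parts = [str(id_)]
    if name is not None:
        parts.append(name)
    parts.append("1" if name is None else "0")
    return "#".join(parts)

def row_crc(id_, name):
    # Row-level checksum: CRC32 of the concatenated string.
    return zlib.crc32(row_string(id_, name).encode("utf-8"))

def chunk_checksum(rows):
    # Chunk-level checksum: COUNT(*) plus BIT_XOR of the row CRCs,
    # rendered as lowercase hex like LOWER(CONV(..., 10, 16)).
    crc = 0
    for id_, name in rows:
        crc ^= row_crc(id_, name)
    return len(rows), format(crc, "x")

null_name = row_string(1, None)
empty_name = row_string(1, "")
```

The ISNULL flag is what keeps a NULL name and an empty name apart: the two rows produce different strings before hashing. Because XOR is commutative, the chunk checksum does not depend on the order in which rows are read.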

4.2 Checksum Algorithms

Two levels are used: row‑level (CRC32 of the concatenated column values) and chunk‑level (the BIT_XOR of the row CRC32 values, together with the row count, as in the example master query in section 4.1). The team improved chunk splitting by customizing recursive_dynamic_calculate_chunks to produce evenly sized chunks, reducing lock time to about one second per chunk.
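The article does not show how the customized recursive_dynamic_calculate_chunks works, so the following is only a sketch of the general idea of evenly sized chunks: split by row count rather than by key range, so that sparse keys do not produce lopsided chunks. The function names and the integer key column are assumptions.

```python
def split_into_chunks(sorted_keys, chunk_size):
    """Yield inclusive (lower, upper) key bounds, chunk_size rows at a time."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(sorted_keys), chunk_size):
        block = sorted_keys[start:start + chunk_size]
        yield block[0], block[-1]

def chunk_where(column, lower, upper):
    # Build the range predicate for one chunk (integer keys assumed).
    return f"`{column}` >= {int(lower)} AND `{column}` <= {int(upper)}"

keys = [1, 2, 3, 10, 11, 50, 51]  # sparse primary keys
bounds = list(split_into_chunks(keys, 3))
predicates = [chunk_where("id", lo, hi) for lo, hi in bounds]
```

Each chunk here covers at most three rows regardless of how wide its key range is, which is what keeps the per-chunk lock time predictable.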

4.3 Replicate Mode and Latency Control

Replicate mode generates compensation SQL on the master, writes it to the binlog, and replays it on the replica, which keeps the repair safe. To avoid executing potentially risky statements on the master, the team modified get_change_dbh so that the final compensation SQL runs only on the replica.

4.4 Index Requirements

Repair uses REPLACE INTO, which needs a primary or unique key to locate the row it overwrites; without one, REPLACE simply inserts a new row. When a table lacks such a key, the team falls back to a delete‑plus‑replace strategy on the replica.
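The key requirement can be seen in a small experiment. The sketch below uses Python's built-in sqlite3 module rather than MySQL, purely so it runs without a server; SQLite's REPLACE INTO shows the same behaviour on this point, but it is a stand-in, and the table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keyed (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE unkeyed (id INTEGER, name TEXT)")

for table in ("keyed", "unkeyed"):
    # Start with a stale row, then apply the "repair".
    conn.execute(f"INSERT INTO {table} VALUES (1, 'stale')")
    conn.execute(f"REPLACE INTO {table} VALUES (1, 'fresh')")

keyed_rows = conn.execute("SELECT id, name FROM keyed").fetchall()
unkeyed_rows = conn.execute(
    "SELECT id, name FROM unkeyed ORDER BY rowid").fetchall()
conn.close()
```

With a primary key the stale row is overwritten; without one the stale row survives next to the new row, which is why a table without a key needs the old row deleted explicitly.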

4.5 Compatibility and Extensions

The solution integrates features from both mk-table-checksum and pt-table-checksum, preserving the latest pt-table-sync functionality while adding custom chunk‑splitting and timeout handling.

4.6 Timeout Handling

The built‑in --wait parameter is ineffective in both modes. The team replaced it with an external parameter that dynamically controls timeout behavior.
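The team's replacement parameter is not described in detail, so the sketch below only illustrates the general shape of externally controlled timeout handling around MASTER_POS_WAIT. The setting name SYNC_WAIT_TIMEOUT and all function names are hypothetical. The result handling relies on MASTER_POS_WAIT returning NULL on error, -1 when the timeout is exceeded, and a non-negative event count otherwise.

```python
def wait_timeout(settings, default=60):
    # Read the timeout from an external mapping (for example os.environ).
    # SYNC_WAIT_TIMEOUT is a made-up name used only for this sketch.
    return int(settings.get("SYNC_WAIT_TIMEOUT", default))

def wait_statement(binlog_file, binlog_pos, timeout_s):
    # Parameterized statement to run on the replica.
    sql = "SELECT MASTER_POS_WAIT(%s, %s, %s)"
    return sql, (binlog_file, binlog_pos, timeout_s)

def classify_wait(result):
    # Interpret the single value returned by MASTER_POS_WAIT.
    if result is None:
        return "error"      # e.g. replica SQL thread not running
    if result == -1:
        return "timeout"    # replica did not catch up in time
    return "caught_up"      # number of events waited for (may be 0)

timeout = wait_timeout({"SYNC_WAIT_TIMEOUT": "15"})
sql, params = wait_statement("binlog3306.000014", 139672350, timeout)
```

A caller could skip or retry the chunk on "timeout" and abort on "error", instead of checksumming a replica that has not reached the recorded position.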

4.7 Large‑Table Difference Repair

For tables larger than 10 GB, a full checksum pass is costly. The team removed the restriction that makes --where and replicate mode mutually exclusive, so repairs can target only the ranges that pre‑computed checksum tables flag as different.
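As a rough illustration of a targeted repair, the sketch below picks the differing chunks out of pre-computed checksum records and turns each one into a range predicate that could be passed as a --where condition. The field names follow the checksum table written by pt-table-checksum (this_crc, this_cnt, master_crc, master_cnt, lower_boundary, upper_boundary); the integer boundaries, the key column id, and the function names are assumptions.

```python
def differing_chunks(checksum_rows):
    # A chunk differs when either the row count or the CRC disagrees.
    return [r for r in checksum_rows
            if r["this_cnt"] != r["master_cnt"]
            or r["this_crc"] != r["master_crc"]]

def where_for_chunk(row, key_column="id"):
    # Range predicate covering exactly one checksum chunk.
    lower = int(row["lower_boundary"])
    upper = int(row["upper_boundary"])
    return f"`{key_column}` BETWEEN {lower} AND {upper}"

checksum_rows = [
    {"chunk": 1, "lower_boundary": "1", "upper_boundary": "1000",
     "this_crc": "a1b2", "this_cnt": 1000,
     "master_crc": "a1b2", "master_cnt": 1000},
    {"chunk": 2, "lower_boundary": "1001", "upper_boundary": "2000",
     "this_crc": "ffff", "this_cnt": 999,
     "master_crc": "c3d4", "master_cnt": 1000},
]
targets = [where_for_chunk(r) for r in differing_chunks(checksum_rows)]
```

Only the second chunk is flagged, so the repair would re-examine a thousand-row range instead of the whole table.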

5. Conclusion

Data self‑healing extends traditional checksum verification into an automated repair service, reducing downtime, resource consumption, and DBA workload while improving data safety for large‑scale gaming databases.

Reposted from: Tencent Game DBA team
