
End-to-End Data Consistency Verification for MySQL in DTS

The Vivo Internet Storage R&D team's article describes an end‑to‑end MySQL data‑consistency verification tool for DTS that uses fixed‑size chunking and CRC32/MD5 fingerprint aggregation to quickly compare source and target tables, pinpoint mismatched rows, and enable automated or manual correction while minimizing impact on replication.

vivo Internet Technology

This article, authored by the Vivo Internet Storage R&D team, introduces the data‑consistency verification tool used in the Data Transmission Service (DTS) for MySQL synchronization and shares its implementation ideas.

Background: During MySQL usage, data replication can be affected by cluster splitting, data transfer, aggregation, uncontrolled writes, unknown service issues, or human error, making it difficult to guarantee that replicated data matches the source. Besides improving processes and service reliability, a fast and effective verification mechanism is needed to locate inconsistent data for possible automatic retry or manual correction.

For DTS‑MySQL, two verification requirements are highlighted:

End‑to‑end external data verification from source cluster to target cluster.

Internal verification to ensure the correctness of synchronized data.

Consistency verification principle: Compare the data produced on the target side with the original data on the source side and report any mismatched data blocks without affecting the synchronization process or the business database.

The verification must cover table structures, rows, indexes, views, stored procedures, etc., be minimally invasive, complete quickly, and precisely locate inconsistent blocks.

Challenges of full-row comparison:

Long execution time for large tables.

Poor timeliness for incremental scenarios.

Partial verification reduces time but may miss some inconsistencies and has coverage trade‑offs.

Solution overview: Adopt a chunk-based approach combined with feature calculation (CRC32/MD5) to balance accuracy and performance.

1. Data chunking: Retrieve rows in fixed-size chunks bounded by the primary key (e.g., 10 rows per chunk). Example query:

SELECT * FROM sbtest1 FORCE INDEX(`PRIMARY`) WHERE (`id` >= 1) AND (`id` <= 10);

The result set is shown in the original article (truncated for brevity).
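The chunking step can be sketched as follows. This is a minimal illustration, not the tool's actual code; the table and column names (`sbtest1`, `id`) follow the article's example, and an integer, monotonically increasing primary key is assumed:

```python
def chunk_ranges(min_id, max_id, chunk_size=10):
    """Yield inclusive (lo, hi) primary-key ranges of at most chunk_size ids."""
    lo = min_id
    while lo <= max_id:
        hi = min(lo + chunk_size - 1, max_id)
        yield (lo, hi)
        lo = hi + 1

def chunk_query(table, pk, lo, hi):
    """Build a per-chunk SELECT in the shape of the article's example."""
    return (f"SELECT * FROM {table} FORCE INDEX(`PRIMARY`) "
            f"WHERE (`{pk}` >= {lo}) AND (`{pk}` <= {hi});")

# Example: split ids 1..25 into chunks of 10.
ranges = list(chunk_ranges(1, 25, 10))
print(ranges)                                 # -> [(1, 10), (11, 20), (21, 25)]
print(chunk_query("sbtest1", "id", *ranges[0]))
```

In practice the chunk boundaries would come from sampling the table's actual primary-key values rather than from a fixed arithmetic stride, since ids may have gaps.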

2. Column-level aggregation: Build a per-row feature string by concatenating (with `#`) the primary key `id`, column `k`, the CRC32 of columns `c` and `pad`, and a null flag for `pad` (so NULL and empty values produce different features):

SELECT CONCAT_WS('#', `id`, `k`, CRC32(`c`), CRC32(`pad`), CONCAT(ISNULL(`pad`))) AS CRC FROM `sbtest1` FORCE INDEX(`PRIMARY`) WHERE (`id` >= 1) AND (`id` <= 10);

Sample output:

1#3230682#4041434652#3764881225#0
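The same feature string can be reproduced outside MySQL, which is useful when one side of the comparison is computed by the verification tool itself. A minimal sketch, assuming non-NULL text columns and hypothetical `c`/`pad` values (the article does not show the row's actual column contents); `zlib.crc32` implements the same standard CRC-32 as MySQL's CRC32():

```python
import zlib

def row_feature(id_, k, c, pad):
    """Build the step-2 feature string:
    CONCAT_WS('#', id, k, CRC32(c), CRC32(pad), ISNULL(pad))."""
    crc_c = zlib.crc32(c.encode())      # CRC32 of column c
    crc_pad = zlib.crc32(pad.encode())  # CRC32 of column pad
    is_null = 0                         # this sketch assumes pad is non-NULL
    return f"{id_}#{k}#{crc_c}#{crc_pad}#{is_null}"

# Hypothetical column values for illustration only.
print(row_feature(1, 3230682, "example-c", "example-pad"))
```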

3. Row-level aggregation: Compress each row's feature string into a single CRC32 value per row, shrinking the data that must be transferred and compared:

SELECT CRC32(CONCAT_WS('#', `id`, `k`, CRC32(`c`), CRC32(`pad`), CONCAT(ISNULL(`pad`)))) AS CRC FROM `sbtest1` FORCE INDEX(`PRIMARY`) WHERE (`id` >= 1) AND (`id` <= 10);

Result example:

501470676

Or concatenate all the per-row CRCs into a single comma-separated list with GROUP_CONCAT:

SELECT GROUP_CONCAT(CRC32(CONCAT_WS('#', `id`, `k`, CRC32(`c`), CRC32(`pad`), CONCAT(ISNULL(`pad`))))) AS CRC FROM `sbtest1` FORCE INDEX(`PRIMARY`) WHERE (`id` >= 1) AND (`id` <= 10);

Yielding a comma‑separated list of CRCs.

4. Feature calculation: The chunk's CRC list can be further compressed with another CRC32 or with MD5, producing a compact fingerprint for comparison:

SELECT CRC32(GROUP_CONCAT(CRC32(CONCAT_WS('#', `id`, `k`, CRC32(`c`), CRC32(`pad`), CONCAT(ISNULL(`pad`)))))) AS CRC FROM `sbtest1` ...;

or

SELECT MD5(GROUP_CONCAT(CRC32(CONCAT_WS('#', `id`, `k`, CRC32(`c`), CRC32(`pad`), CONCAT(ISNULL(`pad`)))))) AS md5 FROM `sbtest1` ...;

These fingerprints are compared between source and target. If they differ, the corresponding chunk is split further to pinpoint the exact rows that are inconsistent.
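The compare-and-split loop can be sketched in Python. This is a hedged illustration, not the tool's implementation: it fingerprints a list of in-memory per-row feature strings (steps 3-4: CRC32 per row, GROUP_CONCAT-style join, MD5 of the list) and recursively bisects any mismatched range; it assumes both sides hold the same number of rows in the same primary-key order:

```python
import hashlib
import zlib

def chunk_fingerprint(features):
    """Steps 3-4: CRC32 each row feature string, join with ',' like
    GROUP_CONCAT, then compress the list into one MD5 fingerprint."""
    row_crcs = [str(zlib.crc32(f.encode())) for f in features]
    return hashlib.md5(",".join(row_crcs).encode()).hexdigest()

def find_mismatched_rows(src, dst, lo, hi):
    """Binary-split a mismatched row-index range [lo, hi] until the
    individual differing rows are isolated."""
    if chunk_fingerprint(src[lo:hi + 1]) == chunk_fingerprint(dst[lo:hi + 1]):
        return []              # fingerprints agree: nothing to report here
    if lo == hi:
        return [lo]            # narrowed down to a single inconsistent row
    mid = (lo + hi) // 2
    return (find_mismatched_rows(src, dst, lo, mid) +
            find_mismatched_rows(src, dst, mid + 1, hi))

src = ["1#a", "2#b", "3#c", "4#d"]   # per-row feature strings (source)
dst = ["1#a", "2#X", "3#c", "4#d"]   # target differs at index 1
print(find_mismatched_rows(src, dst, 0, 3))  # -> [1]
```

In the real tool each recursive step would re-query a narrower primary-key range on both databases rather than slice in-memory lists.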

Trade-offs:

Larger chunks → faster verification, but coarser localization of mismatched data.

Smaller chunks → more precise localization, but higher load on the source and target databases.

Potential issues include hash collisions (different data yielding the same fingerprint) and reliance on primary‑key ordering, which may not work when target primary keys are overwritten.

Final consistency handling : When a mismatch is detected, the system can re‑verify the problematic chunk during low‑traffic windows (e.g., within 1 second of binlog delay) and either record all differing rows or sample them. Re‑verification after a time interval confirms eventual consistency.
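The re-verification step described above might be sketched as follows. All names here (`reverify`, `fetch_fingerprints`, `replication_delay`) are hypothetical stand-ins, not the tool's API; the idea is simply to re-check a mismatched chunk only while binlog delay is low, so transient replication lag is not reported as real inconsistency:

```python
import time

def reverify(chunk_id, fetch_fingerprints, replication_delay,
             max_retries=3, delay_threshold=1.0, wait_seconds=5):
    """Re-check a mismatched chunk, but only when binlog delay is low
    (e.g., under 1 second), confirming eventual consistency."""
    for _ in range(max_retries):
        if replication_delay() <= delay_threshold:
            src_fp, dst_fp = fetch_fingerprints(chunk_id)
            if src_fp == dst_fp:
                return True          # transient lag: now consistent
        time.sleep(wait_seconds)     # wait for a quieter window, retry
    return False                     # persistent mismatch: record or correct

# Stubbed demo: delay is low and the fingerprints now match on re-check.
ok = reverify("sbtest1:1-10",
              fetch_fingerprints=lambda _: ("abc", "abc"),
              replication_delay=lambda: 0.5,
              wait_seconds=0)
print(ok)  # -> True
```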

Summary and reflections: The described approach solves over 99% of verification needs for both full-load and incremental scenarios. Remaining improvement areas are reducing false-positive collisions and extending support to heterogeneous databases (e.g., MySQL ↔ PostgreSQL, Redis ↔ KV stores).
