
How to Speed Up Full‑Scale Data Comparison for Massive Migration Projects

This article details the challenges of comparing billions of rows during large‑scale data migrations, presents a multi‑step solution using union‑all grouping, hash‑based aggregation, and intelligent primary‑key detection, and explains platform features, performance optimizations, and future enhancements that reduced comparison time by up to 70%.

DeWu Technology

Background

DeWu has accumulated over 10 million compute tasks and more than 200 PB of data. To cut costs, the company migrated its compute engine from a cloud Spark service to a self‑built Spark cluster and moved storage from ODPS to OSS. Ensuring data consistency before and after migration became critical, and manual row‑by‑row comparison was infeasible.

Key Challenges and Goals

Speed: Complete full-scale data comparison far faster than manual row‑by‑row checks.

Accuracy: Precisely locate abnormal data.

Core objectives for the comparison platform were:

High‑concurrency processing for massive daily workloads (100,000+ tasks, tens of thousands of tables).

Fully automated comparison workflow.

Intelligent diff localization with field‑level highlighting.

Visual analysis UI supporting large JSON structures.

Performance improvement from hour‑level to minute‑level per task.

Horizontally scalable architecture.

Solution Architecture

1. Fast Full‑Text Comparison via UNION‑ALL Grouping

Data from tables a and b are combined with a flag indicating the source, then aggregated:

SELECT 1 AS _t1_count, 0 AS _t2_count, id, name, age, score FROM a
UNION ALL
SELECT 0 AS _t1_count, 1 AS _t2_count, id, name, age, score FROM b

After the union, the platform runs:

SELECT SUM(_t1_count) AS sum_t1_count,
       SUM(_t2_count) AS sum_t2_count,
       id, name, age, score
FROM ( ...union query... ) AS union_table
GROUP BY id, name, age, score

If sum_t1_count differs from sum_t2_count, the row is written to a diff table for further analysis.
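The UNION‑ALL comparison can be sketched end to end with SQLite standing in for Spark SQL. This is a minimal illustration, not the platform's implementation; the table and column names (`a`, `b`, `id`, `name`, `age`, `score`) follow the article's example.

```python
import sqlite3

# In-memory SQLite stands in for the Spark SQL engine.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE a (id INTEGER, name TEXT, age INTEGER, score REAL);
CREATE TABLE b (id INTEGER, name TEXT, age INTEGER, score REAL);
INSERT INTO a VALUES (1, 'alice', 30, 90.0), (2, 'bob', 25, 80.0);
INSERT INTO b VALUES (1, 'alice', 30, 90.0), (2, 'bob', 26, 80.0);  -- age differs
""")

# Rows whose per-source counts disagree are exactly the diffs;
# identical rows get sum_t1_count == sum_t2_count and are filtered out.
diff_rows = cur.execute("""
SELECT SUM(_t1_count) AS sum_t1_count,
       SUM(_t2_count) AS sum_t2_count,
       id, name, age, score
FROM (
    SELECT 1 AS _t1_count, 0 AS _t2_count, id, name, age, score FROM a
    UNION ALL
    SELECT 0 AS _t1_count, 1 AS _t2_count, id, name, age, score FROM b
) AS union_table
GROUP BY id, name, age, score
HAVING SUM(_t1_count) <> SUM(_t2_count)
""").fetchall()

for row in diff_rows:
    print(row)
```

On this toy data the query surfaces two diff rows: bob's row as it appears in `a` (counts 1/0) and as it appears in `b` (counts 0/1), which is the pair the platform would write to its diff table.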

2. Hash‑Based Aggregation for Early Filtering

Because ~70% of tasks turn out to involve identical tables, the platform first computes an aggregate hash per table (the XOR of per‑column hashes for each row, summed over all rows) to quickly filter out matching tables:

SELECT COUNT(*),
       SUM(xxhash64(col1) ^ xxhash64(col2) ^ ...) FROM tableA
EXCEPT
SELECT COUNT(*),
       SUM(xxhash64(col1) ^ xxhash64(col2) ^ ...) FROM tableB

If the result is empty, the tables are considered identical; otherwise the UNION‑ALL method is applied to pinpoint differences.
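The signature computation can be sketched in Python, using a stable 64‑bit hash in place of Spark's `xxhash64`. The helper names (`col_hash`, `table_signature`) are illustrative, not from the platform.

```python
import hashlib

def col_hash(value):
    # Stand-in for Spark's xxhash64: a stable 64-bit hash of a column value.
    digest = hashlib.blake2b(str(value).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def table_signature(rows):
    # (row count, sum over rows of the XOR of per-column hashes).
    # Summing makes the signature insensitive to row order, so no sort is needed.
    total = 0
    for row in rows:
        row_hash = 0
        for value in row:
            row_hash ^= col_hash(value)
        total += row_hash
    return len(rows), total

table_a = [(1, "alice", 30), (2, "bob", 25)]
table_b = [(2, "bob", 25), (1, "alice", 30)]   # same rows, different order
table_c = [(1, "alice", 30), (2, "bob", 26)]   # one value differs

assert table_signature(table_a) == table_signature(table_b)  # skip full compare
assert table_signature(table_a) != table_signature(table_c)  # run UNION ALL compare
```

Matching signatures end the task immediately; a mismatch only tells you the tables differ somewhere, so the UNION‑ALL method then localizes the rows.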

3. Automatic Primary‑Key Discovery

Accurate diff localization requires a primary key. The platform first attempts a hash‑sum comparison of each column to find candidates (e.g., id, name). If that fails, it falls back to distinct‑count checks on the first N columns, iterating until a unique combination is found. When both methods fail, a full‑field sort is performed and the first few rows are examined manually.
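The distinct‑count fallback can be sketched as a search over column combinations of increasing width; the real platform would run `COUNT(DISTINCT ...)` in SQL rather than in memory, and the function name `find_key` is hypothetical.

```python
from itertools import combinations

def find_key(rows, columns, max_width=3):
    """Return the first column combination whose values are unique across rows,
    i.e. whose distinct count equals the row count."""
    n = len(rows)
    for width in range(1, max_width + 1):
        for combo in combinations(range(len(columns)), width):
            values = {tuple(row[i] for i in combo) for row in rows}
            if len(values) == n:   # unique -> usable as a primary key
                return tuple(columns[i] for i in combo)
    return None  # no key found: fall back to full-field sort and manual inspection

columns = ["id", "name", "age"]
rows = [(1, "alice", 30), (1, "bob", 30), (2, "alice", 25)]
print(find_key(rows, columns))  # -> ('id', 'name')
```

Here no single column is unique, so the search widens to pairs and settles on `(id, name)`.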

4. Platform Features

Basic Comparison Flow: Data ingestion → pre‑validation → comparison → diff extraction → root‑cause analysis.

Task Generation Modes: Direct table‑to‑table comparison, task‑node comparison (auto‑extracts all output tables from a Spark job), and SQL‑query comparison (executes the same query on both engines and compares results).

Pre‑validation Checks: Metadata consistency, function availability, syntax validation, and other schema checks to avoid wasted compute.
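The metadata check can be sketched as a simple schema diff run before any data is scanned; this is a hedged sketch with a hypothetical `schema_diff` helper, and the platform's real pre‑validation also covers function availability and syntax, which are omitted here.

```python
def schema_diff(schema_a, schema_b):
    """Compare column-name -> type maps of two tables before comparing data.
    Returns (columns missing from b, extra columns in b, type mismatches)."""
    missing = set(schema_a) - set(schema_b)
    extra = set(schema_b) - set(schema_a)
    type_mismatch = {c for c in set(schema_a) & set(schema_b)
                     if schema_a[c] != schema_b[c]}
    return missing, extra, type_mismatch

a = {"id": "bigint", "name": "string", "score": "double"}
b = {"id": "bigint", "name": "string", "score": "decimal(10,2)"}
print(schema_diff(a, b))  # -> (set(), set(), {'score'})
```

Any non‑empty result fails the task up front, so no compute is spent comparing tables that cannot match.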

Resource & Scheduling Optimizations: Separate queues per business line, dedicated root‑cause analysis queue, priority‑based batch scheduling, and tuned Spark parameters.

Performance Improvements

By introducing hash‑based early filtering, the average per‑task comparison time dropped from ~500 s to ~160 s (≈70 % reduction). Resource‑queue isolation and priority scheduling moved daily completion time from 22:00 to 18:00 and cut primary‑key detection latency from 58.5 s to 26.2 s.

Benefits

Stable operation for >500 days, handling >2,000 tasks per day and >1.28 million comparisons with zero false positives.

Saved >45 person‑days per month for the compute migration team.

Enabled >20 % of storage data migration and >80 % of SDK migrations.

Future Directions

Integrate large‑model AI for JSON‑nested field root‑cause analysis.

Automatic table‑splitting strategies for massive tables.

Codify common comparison patterns into reusable solutions across teams.

Tags: data comparison · hash aggregation · primary key detection