How to Speed Up Full‑Scale Data Comparison for Massive Migration Projects
This article details the challenges of comparing billions of rows during large‑scale data migrations, presents a multi‑step solution using union‑all grouping, hash‑based aggregation, and intelligent primary‑key detection, and explains platform features, performance optimizations, and future enhancements that reduced comparison time by up to 70%.
Background
DeWu has accumulated over 10 million compute tasks and more than 200 PB of data. To cut costs, the company migrated its compute engine from a cloud Spark service to a self‑built Spark cluster and moved storage from ODPS to OSS. Ensuring data consistency before and after migration became critical, and manual row‑by‑row comparison was infeasible.
Key Challenges and Goals
Speed: complete full‑scale data comparison much faster than row‑by‑row checking allows.
Accuracy: Precisely locate abnormal data.
Core objectives for the comparison platform were:
High‑concurrency processing of thousands of daily tasks (100k+ tasks in total, spanning tens of thousands of tables).
Fully automated comparison workflow.
Intelligent diff localization with field‑level highlighting.
Visual analysis UI supporting large JSON structures.
Performance improvement from hour‑level to minute‑level per task.
Horizontally scalable architecture.
Solution Architecture
1. Fast Full‑Scale Comparison via UNION‑ALL Grouping
Data from tables a and b are combined with a flag indicating the source, then aggregated:
SELECT 1 AS _t1_count, 0 AS _t2_count, id, name, age, score FROM a
UNION ALL
SELECT 0 AS _t1_count, 1 AS _t2_count, id, name, age, score FROM b

After the union, the platform runs:
SELECT SUM(_t1_count) AS sum_t1_count,
SUM(_t2_count) AS sum_t2_count,
id, name, age, score
FROM ( ...union query... ) AS union_table
GROUP BY id, name, age, score

If sum_t1_count differs from sum_t2_count, the row is written to a diff table for further analysis.
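The two queries above can be run end to end. Below is a minimal sketch using an in‑memory SQLite database with toy data; the table and column names follow the example, and the HAVING clause is added to pull out only the mismatched groups:

```python
import sqlite3

# Two toy tables standing in for the pre- and post-migration copies.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INT, name TEXT, age INT, score INT);
    CREATE TABLE b (id INT, name TEXT, age INT, score INT);
    INSERT INTO a VALUES (1, 'ann', 20, 90), (2, 'bob', 30, 80);
    INSERT INTO b VALUES (1, 'ann', 20, 90), (2, 'bob', 30, 85);
""")

# UNION ALL with source flags, then GROUP BY every column; rows whose
# per-source counts disagree exist (or repeat) in only one table.
diff_rows = conn.execute("""
    SELECT SUM(_t1_count) AS sum_t1_count,
           SUM(_t2_count) AS sum_t2_count,
           id, name, age, score
    FROM (
        SELECT 1 AS _t1_count, 0 AS _t2_count, id, name, age, score FROM a
        UNION ALL
        SELECT 0 AS _t1_count, 1 AS _t2_count, id, name, age, score FROM b
    ) AS union_table
    GROUP BY id, name, age, score
    HAVING SUM(_t1_count) <> SUM(_t2_count)
    ORDER BY score
""").fetchall()

for row in diff_rows:
    print(row)  # only bob's row differs: score 80 in a, 85 in b
```

Note that the GROUP BY covers every column, so the comparison needs no primary key for this step; identical rows from both sides collapse into a single group with equal counts.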
2. Hash‑Based Aggregation for Early Filtering
Because roughly 70% of tasks turn out to compare identical tables, the platform first computes a row count and an aggregate hash (the SUM of each row's XORed column hashes) to filter out matching tables cheaply:
SELECT COUNT(*),
SUM(xxhash64(col1) ^ xxhash64(col2) ^ ...) FROM tableA
EXCEPT
SELECT COUNT(*),
SUM(xxhash64(col1) ^ xxhash64(col2) ^ ...) FROM tableB

If the result is empty, the tables are considered identical; otherwise the UNION‑ALL method is applied to pinpoint differences.
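The fingerprint idea can be sketched outside SQL as well. The snippet below substitutes a truncated MD5 digest as a stand‑in for xxhash64 (any stable 64‑bit hash works); the key property is that SUM over XORed column hashes is order‑independent, so the two engines need not return rows in the same order:

```python
import hashlib
from functools import reduce

def col_hash(value):
    # Stand-in for xxhash64: any stable 64-bit hash of the column value.
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")

def table_fingerprint(rows):
    # (row count, SUM of per-row XORed column hashes) — commutative
    # aggregates, so row order does not affect the result.
    hash_sum = sum(reduce(lambda a, b: a ^ b, map(col_hash, row)) for row in rows)
    return len(rows), hash_sum

table_a = [(1, "ann", 20, 90), (2, "bob", 30, 80)]
table_b = [(2, "bob", 30, 80), (1, "ann", 20, 90)]  # same rows, shuffled

if table_fingerprint(table_a) == table_fingerprint(table_b):
    print("fingerprints match — skip the expensive UNION ALL comparison")
```

As with any hash‑sum scheme, a collision could in principle mask a real difference; the probability is negligible for 64‑bit hashes, which is why the platform accepts this trade‑off for the fast path.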
3. Automatic Primary‑Key Discovery
Accurate diff localization requires a primary key. The platform first attempts a hash‑sum comparison of each column to find candidates (e.g., id, name). If that fails, it falls back to distinct‑count checks on the first N columns, iterating until a unique combination is found. When both methods fail, a full‑field sort is performed and the first few rows are examined manually.
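The distinct‑count fallback can be illustrated with a simplified sketch (my own simplification, not the platform's exact search order): try single columns first, then wider combinations, accepting the first combination whose distinct count equals the row count:

```python
from itertools import combinations

def find_key(rows, columns, max_width=3):
    """Return the first column combination whose values are unique
    across all rows, i.e. COUNT(DISTINCT combo) == COUNT(*)."""
    n = len(rows)
    for width in range(1, max_width + 1):
        for combo in combinations(range(len(columns)), width):
            values = {tuple(row[i] for i in combo) for row in rows}
            if len(values) == n:
                return tuple(columns[i] for i in combo)
    return None  # no key found: fall back to full-field sort + manual review

rows = [(1, "ann", 20), (2, "ann", 30), (2, "bob", 20)]
print(find_key(rows, ["id", "name", "age"]))  # no single column is unique;
                                              # ('id', 'name') is the first
                                              # unique combination
```

In SQL terms, each candidate check is a `SELECT COUNT(DISTINCT c1, c2, ...) = COUNT(*)` probe; widening the combination only when narrower ones fail keeps the number of probes small for well‑keyed tables.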
4. Platform Features
Basic Comparison Flow: Data ingestion → pre‑validation → comparison → diff extraction → root‑cause analysis.
Task Generation Modes: Direct table‑to‑table comparison, task‑node comparison (auto‑extracts all output tables from a Spark job), and SQL‑query comparison (executes the same query on both engines and compares results).
Pre‑validation Checks: Metadata consistency, function availability, syntax validation, and other schema checks to avoid wasted compute.
Resource & Scheduling Optimizations: Separate queues per business line, dedicated root‑cause analysis queue, priority‑based batch scheduling, and tuned Spark parameters.
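The pre‑validation step is essentially a cheap schema diff run before any compute is spent. A minimal sketch of the metadata‑consistency check, using hypothetical column‑to‑type dictionaries (the real platform reads these from the two engines' catalogs):

```python
def prevalidate_schemas(schema_a, schema_b):
    """Compare column names and types so obviously mismatched tables
    fail fast instead of launching a full comparison job."""
    issues = []
    only_a = schema_a.keys() - schema_b.keys()
    only_b = schema_b.keys() - schema_a.keys()
    if only_a:
        issues.append(f"columns missing from target: {sorted(only_a)}")
    if only_b:
        issues.append(f"columns missing from source: {sorted(only_b)}")
    for col in schema_a.keys() & schema_b.keys():
        if schema_a[col] != schema_b[col]:
            issues.append(f"type mismatch on {col}: {schema_a[col]} vs {schema_b[col]}")
    return issues

# Hypothetical schemas for the same table on the two engines.
odps = {"id": "BIGINT", "name": "STRING", "score": "DOUBLE"}
spark = {"id": "BIGINT", "name": "STRING", "score": "FLOAT"}
print(prevalidate_schemas(odps, spark))  # flags the score type mismatch
```

Catching a type drift like DOUBLE vs FLOAT here costs milliseconds; discovering it as millions of spurious row diffs downstream would waste an entire comparison run.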
Performance Improvements
By introducing hash‑based early filtering, the average per‑task comparison time dropped from ~500 s to ~160 s (≈70% reduction). Resource‑queue isolation and priority scheduling moved daily completion time from 22:00 to 18:00 and cut primary‑key detection latency from 58.5 s to 26.2 s.
Benefits
Stable operation for >500 days, handling >2,000 tasks per day and >1.28 million comparisons, with zero false positives.
Saved >45 person‑days per month for the compute migration team.
Enabled >20% of storage data migration and >80% of SDK migrations.
Future Directions
Integrate large‑model AI for JSON‑nested field root‑cause analysis.
Automatic table‑splitting strategies for massive tables.
Codify common comparison patterns into reusable solutions across teams.
DeWu Technology
A platform for sharing and discussing technical knowledge.