
How We Migrated 40 PB of Hive Data Across Clouds with Zero Downtime

This article details the end-to-end design, implementation, and verification of a cross-cloud migration of more than 200,000 Hive tables and nearly 40 PB of data with the self-developed Kirk service, covering the architecture, the verification pipeline, and the lessons learned in reaching 100 % data consistency with zero impact on production services.


Data Migration Background

As described in "Huolala Offline Big-Data Cross-Cloud Migration: An Overview", the company decided at the end of 2023 to launch an offline big-data cross-cloud migration project. Our team was responsible for executing the migration of offline data.

Facing this volume of data, we needed a high-throughput, high-performance, scalable migration tool with flexible file-consistency comparison. Existing tools such as COSDistCp and AzCopy did not meet our performance or flexibility requirements, so we upgraded our self-developed Kirk data-migration service.

Data Migration Challenges

Huolala has accumulated over 200,000 tables and nearly 40 PB of Hive data. The migration must preserve schema consistency, keep the source big-data services stable, and absorb continuous incremental and real-time updates, which requires an incremental sync mechanism on top of the full migration.

The migration must achieve 100 % data transfer, 100 % consistency, and zero impact on business, which demands careful planning, tooling, and automation.

Specific challenges include:

Challenge 1: 100 % data migration

Massive volume: roughly 40 PB to be migrated within three months.

Metadata migration: the schemas, partitions, and permissions of more than 200,000 tables must be migrated automatically and remain identical.

High‑performance requirements: network bandwidth, storage throughput, tool concurrency, and component stability.

Migration strategy: batch, time‑windowed, parallel migration to avoid overloading the source.

Challenge 2: 100 % data consistency

Full + incremental sync: fast bulk migration plus continuous dual‑write sync.

Dynamic consistency: source data keeps changing, requiring real‑time alignment.

Structural consistency: field types, partitioning, storage format must match exactly.

Challenge 3: Zero business impact

Zero business awareness: services must remain online and unaffected throughout the migration.

Smooth switch-over: gray (canary) release, with immediate rollback if anomalies occur.

Architecture Design

1. Overall Architecture

Kirk supports cross‑cloud table‑schema and Hive data migration and consists of four modules:

Table-management module: schema comparison and automatic table creation/modification for metadata alignment.

Data-comparison module: Hive metadata and row-count consistency checks, generating copy or delete tasks for any differences.

Task-scheduling module: executes the Hive copy/delete tasks that synchronize source to target.

Back-track verification module: the final safeguard after switch-over, verifying data before the source is deleted.

2. Table‑management Module

Table schema comparison

Process:

Metadata comparison: storage format, location, column names, types, comments, order, etc.

Most tables (small or non-core) are handled automatically:

Delete and recreate the target table.

Or execute ALTER to add columns.

Some large or core tables require manual intervention:

Determine whether the table can be dropped and rebuilt, then generate high-priority copy tasks.

Or manually execute DDL to modify the schema.
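As a concrete illustration, here is a minimal sketch of the comparison-and-fix decision, assuming a `Column` tuple pulled from the Hive metastore; the helper names and the auto-fix policy are illustrative reconstructions, not Kirk's actual API.

```python
from typing import List, NamedTuple

class Column(NamedTuple):
    name: str
    type: str
    comment: str

def diff_schema(src: List[Column], dst: List[Column]) -> List[str]:
    """Return human-readable differences between source and target schemas."""
    diffs = []
    if len(src) != len(dst):
        diffs.append(f"column count: {len(src)} vs {len(dst)}")
    for i, (s, d) in enumerate(zip(src, dst)):
        if (s.name, s.type) != (d.name, d.type):
            diffs.append(f"position {i}: {s.name} {s.type} vs {d.name} {d.type}")
    return diffs

def plan_fix(diffs: List[str], is_core: bool) -> str:
    """Decide how to realign the target table (policy is illustrative)."""
    if not diffs:
        return "NOOP"
    if is_core:
        return "MANUAL_REVIEW"       # large/core tables: a human decides
    # Only the column count differs and all shared positions match, i.e. new
    # columns were appended at the source: ALTER TABLE ... ADD COLUMNS.
    if all(d.startswith("column count") for d in diffs):
        return "ALTER_ADD_COLUMNS"
    return "DROP_AND_RECREATE"       # then enqueue a high-priority copy task
```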

3. Data‑comparison Module

Data consistency is ensured through a three-step, coarse-to-fine verification process that filters out inconsistencies layer by layer.

Why three‑step verification?

In distributed storage (HDFS, OSS, S3, COS), matching metadata (size, file count, modification time) does not guarantee content equality. Common anomalies include:

Partition missing: some partitions were never synchronized.

File inconsistency: size or timestamp differences indicate updates or corruption.

Metadata matches but content corrupted: network or storage failures cause unreadable files.

A layered verification flow quickly discovers issues and enables precise fixes.
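A minimal sketch of such a layered flow, under the assumption that `list_partitions`, `list_files`, and `checksum` wrap the respective filesystem APIs (they are placeholders, not Kirk internals): cheap partition checks first, file-metadata checks next, expensive content checksums last.

```python
def verify_table(src_fs, dst_fs, table_path: str) -> list:
    """Coarse-to-fine verification; returns copy/delete tasks for differences."""
    tasks = []
    # Step 1 (cheapest): partition-level check catches missing partitions fast.
    src_parts = set(src_fs.list_partitions(table_path))
    dst_parts = set(dst_fs.list_partitions(table_path))
    tasks += [("COPY", p) for p in src_parts - dst_parts]    # missing on target
    tasks += [("DELETE", p) for p in dst_parts - src_parts]  # dropped at source
    # Step 2: file-metadata check (name, size) on partitions both sides share.
    suspects = []
    for p in src_parts & dst_parts:
        src_files = {f.path: f.size for f in src_fs.list_files(p)}
        dst_files = {f.path: f.size for f in dst_fs.list_files(p)}
        if src_files != dst_files:
            tasks.append(("COPY", p))
        else:
            suspects.append(p)
    # Step 3 (most expensive): checksum files whose metadata matched, to catch
    # corruption that size and timestamps cannot reveal.
    for p in suspects:
        for f in src_fs.list_files(p):
            if src_fs.checksum(f.path) != dst_fs.checksum(f.path):
                tasks.append(("COPY", p))
                break
    return tasks
```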

Comparison report

We provide database-level and table-level reports with a global consistency rate and an intersection consistency rate, assessing overall data quality and the quality of the data present on both sides.
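The article does not spell out how the two rates are defined; one plausible reading, sketched here as an assumption, is that the global rate is measured against all tables while the intersection rate only considers tables present on both sides.

```python
def consistency_rates(src_tables: set, dst_tables: set, consistent: set):
    """consistent: tables already verified identical on both sides."""
    both = src_tables & dst_tables
    # Global rate: penalizes tables missing from either side.
    global_rate = len(consistent) / max(len(src_tables | dst_tables), 1)
    # Intersection rate: quality of the data that exists on both sides.
    intersection_rate = len(consistent & both) / max(len(both), 1)
    return global_rate, intersection_rate
```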

4. Task‑scheduling Module

Task acquisition: poll the task database by priority.

Task execution: DistCp copy in two modes: copy to a temporary directory then rename (for stable offline tables), or a direct update copy (for large, frequently changing real-time tables); both modes are sketched after this list.

After a successful DistCp run, the partition is registered via HQL; failures trigger automatic retries.

Deleted source partitions trigger corresponding target partition drops.

Concurrency is throttled during peak hours to avoid impacting core pipelines.
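A hedged sketch of the two copy modes, driven as shell commands from Python. DistCp's `-update` flag and Hive's `ADD PARTITION` DDL are standard; the temporary-directory suffix and the exact wiring are our reconstruction, not Kirk's actual commands.

```python
import subprocess

def copy_partition(src: str, dst: str, stable: bool) -> None:
    if stable:
        # Mode 1 (stable offline tables): copy into a temporary directory,
        # then rename, so readers never see a half-written partition.
        tmp = dst + "__kirk_tmp"  # illustrative suffix; dst must not exist yet
        subprocess.run(["hadoop", "distcp", src, tmp], check=True)
        subprocess.run(["hadoop", "fs", "-mv", tmp, dst], check=True)
    else:
        # Mode 2 (large, frequently changing tables): in-place update copy
        # that skips files whose size and checksum already match.
        subprocess.run(["hadoop", "distcp", "-update", src, dst], check=True)

def register_partition(table: str, spec: str, location: str) -> str:
    """HQL to attach the copied partition, run after a successful DistCp."""
    return (f"ALTER TABLE {table} ADD IF NOT EXISTS "
            f"PARTITION ({spec}) LOCATION '{location}'")
```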

5. Back‑track Verification Module

This module acts as the final safeguard before the physical deletion of source data, ensuring the migrated data has no quality issues.

Why back‑track verification?

After switch-over, the source buckets enter a deletion countdown; once data is missing or corrupted there, recovery is impossible. Even with verification running throughout the migration, risks such as network-induced corruption or accidental overwrites remain.

Core logic

We treat data as a mix of migrated, newly added, and back‑filled data. Using the switch‑over timestamp (2024‑05‑18 20:00:00) as an anchor, we classify files:

Files written before the anchor are treated as migrated data and undergo back-track comparison.

Target files written after the anchor are new target-side data, recorded for manual analysis.

Source files written after the anchor are source-side back-fill data, also recorded for analysis.

Verification logs record path, discrepancy type, file owner, timestamps, etc., without automatic copying.

Exceptions are filtered along two dimensions:

Data completeness: the source file exists but the target file is missing; the anchor determines whether the loss occurred before or after switch-over.

Data consistency: both sides have the file but the metadata differs (size, timestamp, owner).

The resulting exception-candidate list narrows the scope of manual inspection, as sketched below.
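A sketch of the anchor-based classification along both dimensions; the anchor timestamp comes from the article, while the metadata triple and the type labels are illustrative.

```python
from datetime import datetime

ANCHOR = datetime(2024, 5, 18, 20, 0, 0)  # switch-over timestamp

def classify(path, src_meta, dst_meta):
    """src_meta / dst_meta: (size, mtime, owner) tuples, or None if absent.
    Returns a log record for the exception list; nothing is copied here."""
    if src_meta is None and dst_meta is None:
        return None
    if src_meta and not dst_meta:
        # Completeness dimension: did the file go missing during migration,
        # or was it back-filled at the source after switch-over?
        kind = "MISSING_MIGRATED" if src_meta[1] < ANCHOR else "SOURCE_BACKFILL"
        return {"path": path, "type": kind, "owner": src_meta[2]}
    if dst_meta and not src_meta:
        return {"path": path, "type": "TARGET_NEW_DATA", "owner": dst_meta[2]}
    # Consistency dimension: both sides exist but metadata differs for a file
    # that should have been a faithful pre-anchor copy.
    if src_meta[:2] != dst_meta[:2] and src_meta[1] < ANCHOR:
        return {"path": path, "type": "METADATA_MISMATCH", "owner": src_meta[2]}
    return None  # consistent, or expected post-anchor churn
```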

Manual verification involves data owners confirming anomalies; confirmed issues trigger manual copy tasks, while false positives are marked as “no action”.

Migration Implementation

The migration consists of three phases:

Phase 1 – Bulk data migration (schema migration and bulk data transfer) before the day of switch‑over.

Phase 2 – Switch‑over on 2024‑05‑18, ensuring all data and schema are aligned.

Phase 3 – Post‑switch‑over back‑track verification before deleting the original cloud data.

1. Phase 1 – Bulk Data Migration

1.1 Table schema migration

Hive schema migration precedes data migration. The table‑management module periodically or manually compares source and target schemas, creates missing tables, and aligns existing ones. Most inconsistencies are auto‑fixed via add/alter/drop; a few require manual DDL.

1.2 Data migration

Data migration relies on the comparison and scheduling modules. The comparison module discovers new or changed data and generates copy tasks; the scheduler runs DistCp or partition-drop tasks. To handle billions of partitions, the comparison module scales horizontally, grouping databases by size so each worker carries a balanced load.
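For the balanced grouping, one simple scheme (our sketch; the article does not name Kirk's exact algorithm) is greedy longest-processing-time assignment: sort databases by size descending and always hand the next one to the least-loaded comparison worker.

```python
import heapq

def group_databases(db_sizes: dict, workers: int) -> list:
    """Assign databases to comparison workers so total sizes stay balanced."""
    heap = [(0, i, []) for i in range(workers)]  # (assigned_size, worker_id, dbs)
    heapq.heapify(heap)
    for db, size in sorted(db_sizes.items(), key=lambda kv: -kv[1]):
        total, i, dbs = heapq.heappop(heap)      # least-loaded worker so far
        dbs.append(db)
        heapq.heappush(heap, (total + size, i, dbs))
    return [dbs for _, _, dbs in sorted(heap, key=lambda g: g[1])]

# e.g. group_databases({"ods": 12_000, "dwd": 9_000, "dim": 800}, 2)
#      -> [["ods"], ["dwd", "dim"]]
```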

Some tables are excluded via a blacklist. Additional count‑based checks before switch‑over identified a few corrupted copies, which were re‑copied.

For real‑time Hive data, most are written by Flink jobs consuming Kafka; a few are written by both offline and online pipelines. Kirk synchronizes data until two weeks before switch‑over, after which real‑time tasks take over.

2. Phase 2 – Switch‑over

On the switch‑over day, after core reporting tasks finish, both source and target platforms stop writes. Full data and schema alignment is performed, first for the current year, then for historical data back to 1990‑01‑01. After alignment, comparison and copy instances are stopped.
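The current-year-first, then-backwards alignment order can be expressed as a sequence of verification windows; this sketch assumes per-year windows down to the 1990-01-01 floor mentioned above.

```python
from datetime import date

def alignment_windows(switch_day: date, floor_year: int = 1990):
    """Yield (start, end) alignment windows: the current year first, then
    each earlier year, walking back to the historical floor."""
    yield date(switch_day.year, 1, 1), switch_day
    for year in range(switch_day.year - 1, floor_year - 1, -1):
        yield date(year, 1, 1), date(year, 12, 31)

# alignment_windows(date(2024, 5, 18)) ->
#   (2024-01-01, 2024-05-18), (2023-01-01, 2023-12-31), ..., (1990-01-01, 1990-12-31)
```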

3. Phase 3 – Data Back‑track Verification

After switch-over, incremental data quality continues to be verified, but the source buckets are scheduled for physical deletion. Back-track verification ensures there is no data loss or corruption before that deletion, using the anchor timestamp to distinguish migrated data from target-side new writes.

Our verification found no inconsistencies or missing data.

Migration Issues

During migration we faced difficulties with the Kirk tool itself and with underlying components such as Hive and YARN, but the team overcame them to complete the migration successfully.

Migration Summary

Using the self‑developed Kirk system, we successfully migrated over 200 k tables and nearly 40 PB of data with no data‑quality or schema‑consistency issues, ensuring a successful cloud migration.

Post‑migration back‑track comparison confirmed 100 % historical data consistency and no copy‑induced corruption.

The half‑year project involved cross‑cloud migration, performance tuning, and collaborative problem solving across the big‑data ecosystem.
