
How We Migrated 40PB of Offline Big Data Across Clouds with Zero Downtime

Over a year after completing a five‑month, cross‑cloud migration of Huolala’s 40 PB offline big‑data platform—spanning storage, compute, services, and infrastructure—the team details the architecture, verification methods, high‑throughput migration tools, network isolation strategies, and lessons learned to guide similar large‑scale data migrations.

Huolala Tech

Preface

At the end of 2023, the company decided to launch a cross-cloud migration project for its offline freight big-data platform. After five months of coordinated effort, the project completed a full-scale cross-cloud migration of the offline links (tasks, data, services, and infrastructure) in May 2024, involving more than ten departments. One year later, the team reflects on the challenges overcome and the solid foundation laid for stable operation.

While many cloud-migration case studies exist, few focus on big-data implementation details. This article documents the complete offline big-data migration process to provide a reference for similar industry projects. It first introduces the overall migration design and execution flow; follow-up articles in this series will deep-dive into data-migration techniques and validation methods.

Background

1. Multi‑cloud big‑data architecture

Huolala’s big‑data IT architecture follows a “multi‑cloud + self‑built on cloud” model. Core big‑data services rely only on the IaaS layer of cloud providers, which required significant early investment but offers strong controllability, deep optimization potential, and ease of migration and replication.

Before 2020: online services and big‑data services were deployed on the same cloud.

After 2020: offline big‑data services migrated to an offline cloud, entering a multi‑cloud stage, providing better bargaining power and complementary cloud advantages.

After May 2024: the offline big‑data services were migrated from the original offline cloud to a new cloud.

2. Offline big‑data scale

2.1 Offline storage

The migration involved roughly 40 PB of data accumulated over ten years and more than 40,000 data-processing tasks, a leading volume in the freight industry.

| Business line | Data volume | File count | Task count | Departments involved |
| --- | --- | --- | --- | --- |
| HLL | 40 PB | 10 billion+ | 40,000+ | 17 |

2.2 Offline compute

Huolala’s offline big‑data cluster approaches a thousand nodes, including Presto hybrid engine clusters, business‑specific compute clusters, distributed scheduling service nodes, and GPU/CPU heterogeneous compute pools. Migration required careful coordination with online low‑latency services and real‑time streaming clusters, presenting significant network and permission challenges.

Migration Scheme Design

The migration plan demanded high technical assurance: data must be accurate and timely, downtime minimal, and business impact negligible. Building on previous migration experience, a “verifiable and rollback‑able” scheme was redesigned.

Verifiable:

Performance verification: conduct extensive POC testing of storage and compute performance in the new environment; run dual-run tasks in the new cloud and track performance metrics (a tracking sketch follows this list).

Data verification: compare large tables and files between old and new environments to ensure data quality.

Rollback-able: adopt a primary-secondary dual-run approach; if issues arise during switch-over, revert to the primary link.
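
For illustration, here is a minimal Python sketch of the dual-run performance check referenced above: flag tasks whose new-cloud runtime regresses past a threshold relative to the old cloud. The `TaskRun` shape, the 20% threshold, and the assumption that both schedulers export per-run durations are illustrative, not the team's actual tooling.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    task_id: str
    duration_s: float  # wall-clock runtime of one scheduled run


def flag_regressions(primary: list[TaskRun], secondary: list[TaskRun],
                     tolerance: float = 0.2) -> list[str]:
    """Return task IDs whose new-cloud (secondary) runtime exceeds the
    old-cloud (primary) runtime by more than `tolerance` (20% default)."""
    baseline = {run.task_id: run.duration_s for run in primary}
    flagged = []
    for run in secondary:
        base = baseline.get(run.task_id)
        if base is not None and run.duration_s > base * (1 + tolerance):
            flagged.append(run.task_id)
    return flagged
```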

The overall scheme proceeds in four phases:

Infrastructure setup: complete network planning, component adaptation, and big‑data cluster delivery in the new cloud.

Data migration: migrate stored data, metadata, and offline link tasks.

Dual-run and data verification: enable sampling tasks (MySQL and other sources to Hive) on the new cloud and verify results daily.

Switch‑over and resource decommission: after data quality acceptance, switch the new cloud link to primary, keep the old link as backup, and eventually decommission old resources.

Migration Scheme Implementation

Implementation began in December 2023. Below are selected challenges and solutions.

1. Network isolation

The big‑data network spans “offline cloud (old) – offline cloud (new) – online cloud”. Isolation is performed at component‑port granularity without affecting existing business networks. Key steps:

Topology mapping and fine‑grained port‑level detailing for four clusters, 30+ components, and the IDP offline scheduling platform.

Primary-secondary link isolation using network whitelists, allowing only primary-to-backup data sync (a rule-generation sketch follows this list).

Backup‑online cloud isolation via network blacklists to prevent data leakage during dual‑run.

Pre-switch network validation by temporarily enabling the isolation policies and confirming that tasks that should be blocked cannot in fact run.

Post‑switch configuration: retain new‑cloud/old‑cloud policies, add isolation between old‑cloud and online cloud.
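
As referenced in the whitelist step above, here is a hypothetical sketch of how a component-port topology map can be expanded into port-level allow rules. The component names, ports, CIDRs, and rule format are invented for illustration; a real deployment would target the cloud provider's security-group or firewall API.

```python
# Hypothetical component -> port map; the real topology covers 30+ components.
TOPOLOGY = {
    "hdfs-namenode": [8020],
    "hdfs-datanode": [50010, 50020],
    "hive-metastore": [9083],
}


def whitelist_rules(src_cidr: str, dst_cidr: str) -> list[str]:
    """Emit one allow rule per component port for the primary -> backup
    sync path; a default-deny policy between the two environments blocks
    everything else."""
    rules = []
    for component, ports in TOPOLOGY.items():
        for port in ports:
            rules.append(f"ALLOW tcp {src_cidr} -> {dst_cidr}:{port}  # {component}")
    return rules


for rule in whitelist_rules("10.0.0.0/16", "10.1.0.0/16"):
    print(rule)
```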

2. Massive data migration

Moving 40 PB of constantly changing data required a systematic approach. The team built a high‑throughput, scalable migration tool that runs on thousands of machines, supports Hive‑level, partition‑level, and file‑level comparisons, and synchronizes metadata.

High throughput: sustained 100 Gbps cross‑cloud bandwidth.

High performance: reduced the full-partition row-count comparison from 18 days to 2 days, and the comparison of 1.4 billion file-metadata entries from 5 hours to 1.5 hours (a simplified comparison worker is sketched below).
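
The sketch below illustrates the partition-level comparison pattern such a tool relies on: list file metadata on both sides in parallel and diff the sets, so that billion-file scans never touch file bytes. `list_files` is a stand-in for a real HDFS or object-store client, and the cluster names and worker counts are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def list_files(cluster: str, partition: str) -> set[tuple[str, int]]:
    """Stand-in for a real listing call: return the set of
    (relative_path, size_bytes) pairs for one partition."""
    raise NotImplementedError("wire to WebHDFS or a cloud SDK in practice")


def compare_partition(partition: str) -> dict:
    # List both sides concurrently, then diff the metadata sets; comparing
    # (path, size) pairs instead of file contents keeps the scan cheap.
    with ThreadPoolExecutor(max_workers=2) as pool:
        old_future = pool.submit(list_files, "old-cloud", partition)
        new_future = pool.submit(list_files, "new-cloud", partition)
        old, new = old_future.result(), new_future.result()
    return {
        "partition": partition,
        "missing_on_new": sorted(path for path, _ in old - new),
        "extra_on_new": sorted(path for path, _ in new - old),
    }


def compare_partitions(partitions: list[str], workers: int = 64) -> list[dict]:
    # Fan out across partitions; at 40 PB scale this layer would shard
    # across many machines rather than threads in a single process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(compare_partition, partitions))
```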

3. Data consistency assurance

Data consistency: copy >500 TB of daily incremental data, generate database-level and table-level comparison reports.

Table metadata consistency: automated schema diff, generate DDL for added fields or reordered columns, auto-sync based on importance and size (a DDL-generation sketch follows this list).

Code consistency: automatic synchronization of user‑modified data and compute tasks between primary and backup environments.
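
A minimal sketch of the schema-diff idea from the metadata-consistency step: compare ordered (name, type) column lists from the two metastores and emit Hive `ALTER TABLE ... ADD COLUMNS` DDL for fields missing on the backup. The table and column names are illustrative, and the actual platform's DDL generation is certainly more involved.

```python
def schema_diff_ddl(table: str,
                    primary_cols: list[tuple[str, str]],
                    backup_cols: list[tuple[str, str]]) -> list[str]:
    """Compare ordered (name, type) column lists and emit Hive DDL for
    columns missing on the backup; flag column-order drift for review."""
    backup_names = {name for name, _ in backup_cols}
    primary_names = {name for name, _ in primary_cols}

    added = [(n, t) for n, t in primary_cols if n not in backup_names]
    ddl = []
    if added:
        cols = ", ".join(f"`{name}` {ctype}" for name, ctype in added)
        ddl.append(f"ALTER TABLE {table} ADD COLUMNS ({cols});")

    # The same shared columns in a different order cannot be fixed by
    # ADD COLUMNS, so surface them for manual handling.
    shared_primary = [n for n, _ in primary_cols if n in backup_names]
    shared_backup = [n for n, _ in backup_cols if n in primary_names]
    if shared_primary != shared_backup:
        ddl.append(f"-- {table}: column order differs; flag for manual review")
    return ddl


print(schema_diff_ddl(
    "dw.orders",
    [("order_id", "BIGINT"), ("amount", "DECIMAL(18,2)"), ("city", "STRING")],
    [("order_id", "BIGINT"), ("amount", "DECIMAL(18,2)")],
))
```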

4. Data validation

After full data migration and activation of sampling tasks, the new environment entered dual‑run. Validation covered ~20 departments and tens of thousands of Hive tables.

A self-developed automated data-comparison platform reduced verification time by over 90%, enabling a single person to validate ~1,500 critical tables in a few days.

Coarse validation: compare row counts of tables and partitions; proceed to fine validation only if counts match.

Fine validation: compare field values, with error tolerance for metrics such as order amounts (sketched after this list).

Prioritized validation: verify infrastructure layer first, then application layer, producing daily reports and closing all discrepancies.
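
The two-stage gate can be sketched as follows; the tolerant-column set, the tolerance value, and the row shapes are assumptions for illustration, not the platform's actual configuration.

```python
import math


def coarse_match(old_count: int, new_count: int) -> bool:
    # Stage 1: exact row-count match gates the expensive field-level pass.
    return old_count == new_count


def fine_match(old_row: dict, new_row: dict,
               tolerant_cols: frozenset = frozenset({"order_amount"}),
               rel_tol: float = 1e-6) -> bool:
    # Stage 2: exact equality for most fields; small relative tolerance
    # for monetary metrics whose representation may differ across engines.
    for col, old_val in old_row.items():
        new_val = new_row.get(col)
        if col in tolerant_cols:
            if not math.isclose(float(old_val), float(new_val), rel_tol=rel_tol):
                return False
        elif old_val != new_val:
            return False
    return True


# Usage: rows matched by key on both sides reach the fine comparison.
assert fine_match({"order_id": 1, "order_amount": "99.990001"},
                  {"order_id": 1, "order_amount": "99.990002"})
```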

5. Primary‑backup switch‑over

After successful dual‑run and validation, the team prepared a detailed SOP and a dedicated “link switch‑over guarantee team” to ensure a smooth transition.

Reflection and Summary

The project concluded successfully with accurate data output in the new environment. The team reflected on improvements for future migrations:

Continuous iteration of migration plans: comprehensive coverage of risks and multi-round refinement are essential for large-scale data pipelines.

Automation is critical:

The data-migration tool greatly improves efficiency and can be reused for disaster recovery.

Automated infrastructure delivery shortens setup time.

Automated data comparison saved over 20 person‑months of manual effort.

Cloud technology selection experience: mature cost estimation, performance testing, and stability assessment processes enable rapid POC on new clouds.

Acknowledgments

The offline big‑data cloud migration was a joint effort of Huolala’s technology center and more than ten business units. The success would not have been possible without the tireless work of the project team and the support of cloud‑provider experts.
