
How We Migrated Hundreds of Petabytes of Hadoop Data Without Downtime

This article details the background, challenges, and step‑by‑step solutions for migrating over a hundred petabytes of Hadoop HDFS data across data centers within a month, covering strategy selection, code modifications, balance optimization, and tool enhancements.

dbaplus Community

Background

As the business and data volume of Tongcheng Travel grew, the existing data center could no longer meet expansion needs for the next few years, and the old facility was also a lower priority for maintenance, so the company decided to migrate to a new data center. The goal was to move more than one hundred petabytes of Hadoop cluster data within one month without stopping services.

Current Hadoop Landscape

The environment consists of multiple Hadoop 2.x clusters that were upgraded to a federated mode in 2019, resulting in dozens of namespaces. About 80% of workloads rely on HDFS, with YARN handling resource scheduling for data warehousing, algorithm analysis, and machine learning.

Migration Strategies

Two main approaches were evaluated:

Strategy A (data-center-aware): Modify the NameNode core code to add a data-center attribute to nodes, isolate tenant directories with dedicated YARN instances, and migrate by directory/tenant to control progress and monitor cross-data-center network stability. Pros: fine-grained control. Cons: requires extensive testing of the NameNode changes and a longer migration time.

Strategy B (replica-priority, chosen): Adjust replica placement so that new writes prefer low-utilization DataNodes in the new data center, and move historical data with the balancer against specified IP lists. Pros: simpler workflow and the ability to migrate rack by rack. Cons: potential hotspots on low-utilization nodes, slower balancing, and high bandwidth and RPC pressure.

The team selected Strategy B because it allowed faster completion with a simpler process, despite requiring source-code changes to the replica-selection policy and performance tuning of the balancer.
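As a minimal sketch of how Strategy B can be wired up, the snippet below swaps the default block placement policy via the standard dfs.block.replicator.classname key; the policy class name is hypothetical, and in practice the setting lives in hdfs-site.xml on the NameNode rather than in client code.

```java
import org.apache.hadoop.conf.Configuration;

// Minimal sketch of plugging in a custom, space-aware placement policy for Strategy B.
// The policy class name is hypothetical; in production this key is set in hdfs-site.xml.
public class ReplicaPriorityConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Standard HDFS key for replacing the default block placement policy.
        conf.set("dfs.block.replicator.classname",
                 "com.example.hdfs.NewDcSpaceAwareBlockPlacementPolicy"); // hypothetical class
        return conf;
    }
}
```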

Experiences and Lessons Learned

1. Slow DataNode Decommission

Initial decommission of a single rack took more than six hours. By tuning the replication parameters (dfs.namenode.replication.max-streams, dfs.namenode.replication.max-streams-hard-limit, and dfs.namenode.replication.work.multiplier.per.iteration), the time was reduced to roughly one hour.
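The sketch below shows the three throttles in question; the values are illustrative only and must be sized against available network headroom, and in practice they are set in hdfs-site.xml on the NameNode rather than programmatically.

```java
import org.apache.hadoop.conf.Configuration;

// Illustrative values only -- a sketch of the replication throttles raised to speed up
// rack decommission. Real values depend on cluster size and network headroom.
public class DecommissionTuning {
    public static void apply(Configuration conf) {
        // Max concurrent replication streams per DataNode for normal-priority blocks.
        conf.setInt("dfs.namenode.replication.max-streams", 20);
        // Hard ceiling that also covers highest-priority (e.g. decommission) blocks.
        conf.setInt("dfs.namenode.replication.max-streams-hard-limit", 40);
        // Multiplier controlling how many block transfers the NameNode schedules
        // per live DataNode each heartbeat interval.
        conf.setInt("dfs.namenode.replication.work.multiplier.per.iteration", 10);
    }
}
```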

2. DataNode Imbalance

During expansion, newly written data tended to target high-utilization nodes, causing hotspots and unhealthy NodeManager states. To mitigate this, a space-aware replica policy was introduced, preferring low-utilization nodes. The policy was later refined to also consider heartbeat and thread load, preventing new hotspots on the low-utilization nodes themselves.
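The self-contained sketch below illustrates the idea (class, field, and weight choices are ours, not the production policy): candidate DataNodes are ranked by a score that combines disk utilization, active transfer threads, and heartbeat staleness, so nearly-empty but already busy or stale nodes are not blindly preferred.

```java
import java.util.Comparator;
import java.util.List;

// Sketch of space-aware target ranking: prefer low disk utilization, but penalize nodes
// with many active transfer threads or stale heartbeats so they don't become hotspots.
public class SpaceAwareNodeRanker {

    /** Simplified stand-in for the per-DataNode stats the NameNode already tracks. */
    public static class NodeStats {
        final String host;
        final double usedPercent;        // disk utilization, 0..100
        final int activeTransferThreads; // e.g. xceiver count
        final long msSinceLastHeartbeat;

        public NodeStats(String host, double usedPercent,
                         int activeTransferThreads, long msSinceLastHeartbeat) {
            this.host = host;
            this.usedPercent = usedPercent;
            this.activeTransferThreads = activeTransferThreads;
            this.msSinceLastHeartbeat = msSinceLastHeartbeat;
        }
    }

    /** Lower score = better placement target. Weights and thresholds are illustrative. */
    static double score(NodeStats n) {
        double stalePenalty = n.msSinceLastHeartbeat > 30_000 ? 100.0 : 0.0;
        return n.usedPercent + 0.5 * n.activeTransferThreads + stalePenalty;
    }

    /** Sort candidates so the best placement targets come first. */
    public static void rank(List<NodeStats> candidates) {
        candidates.sort(Comparator.comparingDouble(SpaceAwareNodeRanker::score));
    }
}
```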

3. Short‑circuit Read Optimization

When client and DataNode reside on the same host, short‑circuit reads can be used. The code was modified to avoid selecting overloaded preferred nodes, improving read performance.
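The sketch below captures the decision the modified read path makes (all names and the threshold are hypothetical): prefer the co-located replica, which is eligible for short-circuit reads, only while that DataNode is not overloaded; otherwise fall back to a remote replica instead of queueing behind a busy local node.

```java
import java.util.List;

// Hypothetical sketch of the read-path decision: use the local (short-circuit capable)
// replica only when the co-located DataNode is not overloaded.
public class LocalReadSelector {

    public static class ReplicaLocation {
        final String host;
        final boolean sameHostAsClient;
        final int activeTransferThreads;

        public ReplicaLocation(String host, boolean sameHostAsClient, int activeTransferThreads) {
            this.host = host;
            this.sameHostAsClient = sameHostAsClient;
            this.activeTransferThreads = activeTransferThreads;
        }
    }

    /** Illustrative cut-off; a real threshold should come from observed DataNode load. */
    private static final int MAX_LOCAL_THREADS = 256;

    public static ReplicaLocation choose(List<ReplicaLocation> replicas) {
        for (ReplicaLocation r : replicas) {
            if (r.sameHostAsClient && r.activeTransferThreads < MAX_LOCAL_THREADS) {
                return r; // local, short-circuit read path
            }
        }
        return replicas.isEmpty() ? null : replicas.get(0); // fall back to a remote replica
    }
}
```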

4. Balance as a Double‑Edged Sword

The custom balance program accepts IP-based source/destination selection, reducing unnecessary node participation.

Support for file‑size‑based balancing improves efficiency, roughly doubling throughput.

A monitoring daemon tracks the RPC load generated by balancing and pauses it during peak ETL windows, as sketched below.
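A minimal sketch of the pause-during-ETL idea follows. The production daemon also watches NameNode RPC load; here the gate is just a fixed nightly ETL window, and the start command is a placeholder for the custom balance program (which additionally takes source/target IP lists).

```java
import java.time.LocalTime;

// Sketch only: gate the balance process on a fixed ETL window. The real daemon also
// checks NameNode RPC load, and "hdfs balancer" stands in for the custom balance program.
public class BalanceScheduler {

    // Illustrative ETL window: 00:00 - 08:00, during which balancing stays paused.
    private static final LocalTime ETL_START = LocalTime.of(0, 0);
    private static final LocalTime ETL_END = LocalTime.of(8, 0);

    static boolean inEtlWindow(LocalTime now) {
        return !now.isBefore(ETL_START) && now.isBefore(ETL_END);
    }

    public static void main(String[] args) throws Exception {
        Process balancer = null;
        while (true) {
            boolean etl = inEtlWindow(LocalTime.now());
            if (etl && balancer != null) {
                balancer.destroy();            // pause: stop the balance process
                balancer = null;
            } else if (!etl && balancer == null) {
                balancer = new ProcessBuilder("hdfs", "balancer", "-threshold", "10")
                        .inheritIO().start();  // placeholder invocation
            }
            Thread.sleep(60_000);              // re-check every minute
        }
    }
}
```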

5. Cluster‑wide Monitoring

Metrics for DataNode health, network bandwidth, and RPC rates are collected daily, with alerts for performance bottlenecks.
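One common way to collect such metrics, assumed here for illustration, is to scrape the JMX JSON servlet that Hadoop daemons expose at /jmx; the hostname, port (50070 for a 2.x NameNode web UI), and bean name below are examples, and the real collector would parse the JSON and raise alerts on values such as RPC queue time.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: fetch NameNode RPC metrics from the built-in /jmx servlet. Host, port, and
// bean name are examples; a real collector parses the JSON and feeds dashboards/alerts.
public class JmxMetricsProbe {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://namenode.example.com:50070/jmx"
                + "?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setConnectTimeout(5_000);
        conn.setReadTimeout(5_000);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);   // raw JSON; extract RpcQueueTimeAvgTime etc. in practice
            }
        }
    }
}
```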

Cross‑Cluster Migration Tool Enhancements

The native Hadoop distcp tool has several limitations (client‑only execution, poor concurrency, lack of consistency checks, permission mismatches, etc.). The team wrapped and extended it with the following features:

Concurrent job submission by multiple users 24/7.

Job‑level status tracking.

Pre‑ and post‑migration data consistency verification.

Intelligent permission matching and inheritance.

Real‑time progress refresh.

Fuzzy path matching for batch migrations.

Blacklist support for directories that change in real time.

Scheduled migrations with macro‑variable date handling.

Automatic handling of special files (e.g., hidden files).

Automatic splitting of large tasks into smaller jobs.

One‑click Hive metadata updates after HDFS migration.

These enhancements address the earlier listed distcp shortcomings and enable efficient, reliable cross‑cluster data movement.
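The sketch below shows the basic wrapper pattern under the Hadoop 2.x API (paths and cluster URIs are examples): submit a copy through the DistCp Java API, then verify consistency by comparing file checksums on both clusters. The actual tool layers queueing, permission handling, progress tracking, and Hive metadata updates on top of this.

```java
import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileChecksum;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

// Sketch of the wrapper pattern: run a DistCp job, then do a post-copy consistency check.
// Paths are examples; the real tool walks whole trees and handles permissions, retries, etc.
public class DistCpWrapperSketch {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs://old-cluster/warehouse/db/table_x");   // example path
        Path dst = new Path("hdfs://new-cluster/warehouse/db/table_x");   // example path

        // 1. Submit the copy (Hadoop 2.x constructor; 3.x uses DistCpOptions.Builder).
        DistCpOptions options = new DistCpOptions(Collections.singletonList(src), dst);
        Job job = new DistCp(conf, options).execute();
        if (!job.isSuccessful()) {
            throw new IllegalStateException("distcp job failed: " + job.getJobID());
        }

        // 2. Post-migration consistency check on a sample file.
        Path sampleName = new Path("part-00000");                         // example file
        FileChecksum a = FileSystem.get(src.toUri(), conf)
                .getFileChecksum(new Path(src, sampleName.getName()));
        FileChecksum b = FileSystem.get(dst.toUri(), conf)
                .getFileChecksum(new Path(dst, sampleName.getName()));
        System.out.println("checksums match: " + (a != null && a.equals(b)));
    }
}
```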

Conclusion and Future Work

The migration project delivered:

Optimized space‑aware replica strategy.

Improved short‑circuit read handling.

Mitigated DataNode hotspot issues.

Enhanced migration tooling with monitoring, consistency checks, and permission management.

Multi‑dimensional migration monitoring.

Future plans include upgrading Hadoop versions to unify the multi‑version landscape and leveraging Erasure Coding (EC) in newer releases to reduce cold‑data storage costs.


Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
