Hadoop Cluster Cross-Data Center Migration Practice at Tongcheng Travel
This article details Tongcheng Travel’s month‑long, zero‑downtime migration of over a hundred petabytes of Hadoop HDFS and YARN clusters across data centers, describing the background, migration strategies, lessons learned, tool enhancements, and future plans to improve data locality, balance, and monitoring.
1. Background
As Tongcheng Travel's business and data volume grew, the existing data center could no longer support the expansion needs for the next few years. The old data center also had a lower priority for maintenance, so the company decided to migrate to a new data center. The goal was to move over a hundred petabytes of Hadoop clusters within one month without stopping services.
The Hadoop environment consists of multiple 2.x clusters that were upgraded to federation mode in 2019, resulting in dozens of namespaces. About 80% of the workloads rely on HDFS, while YARN handles resource scheduling for data warehouses, algorithm analysis, and machine learning.
2. Migration Plan
Because building multiple new HDFS clusters in the new data center would be costly and require heavy data synchronization, the initial multi‑cluster approach was discarded. Instead, a single‑cluster scaling strategy was chosen.
Two implementation schemes were considered:
A: Data‑center awareness strategy
B: Replica‑selection strategy favoring the new data center
A: Add Data‑Center Awareness Strategy
1. Modify NameNode core code to store a data‑center attribute for each node.
2. For tenants or directories that need migration, create isolated YARN clusters to fully exploit HDFS locality and reduce cross‑data‑center bandwidth.
3. Migrate by directory/tenant to control progress and monitor network stability.
Advantages: Fine‑grained migration; controllable process.
Disadvantages: Requires NameNode code changes, testing, and results in a longer migration window.
B: Replica‑Selection Strategy Favoring New Data‑Center
New writes are steered to low‑utilization DataNodes in the new data center, stemming data growth in the old data center at the source. Historical data can be moved faster by running the balance command against specified IP lists.
Advantages: No need to rely on specific directories/tenants; migration can be performed rack‑wise.
Disadvantages: May cause hotspots on low‑utilization DataNodes; balance is slow and cannot meet rapid migration needs; high bandwidth pressure may increase RPC load during peak periods.
We chose scheme B because it offers a simpler workflow and meets the tight deadline, despite requiring source code changes to the replica‑selection policy and further optimization of the balance performance.
3. Experience and Lessons
3.1 Slow DataNode Decommission
Initially, decommissioning a rack of DataNodes took over 6 hours. By tuning parameters to increase replication streams, the time was reduced to about 1 hour, satisfying migration requirements.
dfs.namenode.replication.max-streams = 64
dfs.namenode.replication.max-streams-hard-limit = 128
dfs.namenode.replication.work.multiplier.per.iteration = 32

3.2 Data Skew and Hotspots
During the migration, decommissioning DataNodes shifted load onto the remaining nodes, leading to two typical problems.
3.2.1 New Writes Still Prefer High‑Utilization Nodes
When new data is written to heavily used DataNodes, it creates hotspots that affect read/write performance and can cause NodeManagers to become unhealthy.
To mitigate this, we enabled Hadoop's space‑aware replica placement policy, which prefers low‑utilization DataNodes. Early versions of the policy had null‑pointer bugs (see JIRA HDFS‑10715, HDFS‑14578, HDFS‑8131), which we fixed and contributed back to the community.
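For reference, switching to the space‑aware policy is a configuration change along these lines (a sketch for hdfs-site.xml; verify the property names and the preference‑fraction default against your Hadoop release):

```xml
<!-- hdfs-site.xml: prefer lower-utilization DataNodes when placing new replicas -->
<property>
  <name>dfs.block.replicator.classname</name>
  <value>org.apache.hadoop.hdfs.server.blockmanagement.AvailableSpaceBlockPlacementPolicy</value>
</property>
<property>
  <!-- probability weight toward the lower-utilization candidate (between 0.5 and 1.0) -->
  <name>dfs.namenode.available-space-block-placement-policy.balanced-space-preference-fraction</name>
  <value>0.6</value>
</property>
```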
3.2.2 Over‑Selection of Low‑Utilization Nodes
After enabling the space‑aware policy, excessive writes to low‑utilization nodes could create new hotspots. The replica‑selection logic was further refined to consider heartbeat reports and thread counts, avoiding overly busy nodes.
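The refinement above can be sketched as a pre‑filter over heartbeat data: before a replica target is chosen, drop any DataNode whose active transfer‑thread (xceiver) count is far above the cluster mean. The names, fields, and the 2x threshold below are illustrative, not Hadoop's actual internals.

```python
# Hypothetical sketch of load-aware candidate filtering before replica selection.
from dataclasses import dataclass

@dataclass
class NodeReport:
    host: str
    xceivers: int        # active data-transfer threads from the last heartbeat
    used_percent: float  # disk utilization

def filter_busy(nodes, factor=2.0):
    """Keep only nodes whose xceiver count is <= factor * cluster mean."""
    mean = sum(n.xceivers for n in nodes) / len(nodes)
    return [n for n in nodes if n.xceivers <= mean * factor]

nodes = [NodeReport("dn1", 10, 0.40),
         NodeReport("dn2", 200, 0.40),   # overloaded hot node
         NodeReport("dn3", 12, 0.45)]
print([n.host for n in filter_busy(nodes)])  # dn2 is excluded
```

The same idea generalizes to other heartbeat signals (disk I/O, network throughput) by adding further predicates to the filter.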
3.2.3 Short‑Circuit Read Optimization
Short‑circuit reads exploit client‑DataNode co‑location. We modified the logic so that when the preferred DataNode is under heavy load, the client can select a less busy node, improving overall throughput.
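The read‑path change can be sketched as follows: prefer the co‑located replica for a short‑circuit read, but fall back to the least‑busy remote replica when the local DataNode is overloaded. The function name, load metric, and threshold are illustrative assumptions, not the actual client code.

```python
# Hedged sketch: local short-circuit read unless the local node is too busy.
def pick_read_node(replicas, local_host, max_load=100):
    """replicas: list of (host, current_load) pairs holding the block."""
    by_host = dict(replicas)
    local_load = by_host.get(local_host)
    if local_load is not None and local_load <= max_load:
        return local_host                        # normal short-circuit read
    return min(replicas, key=lambda r: r[1])[0]  # least-busy fallback

replicas = [("dn1", 250), ("dn2", 30), ("dn3", 80)]
print(pick_read_node(replicas, "dn1"))  # dn1 too busy, falls back to dn2
```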
3.3 Balance as a Double‑Edged Sword
3.3.1 Custom Balance Program
The default balancer considers every node, wasting effort shuffling data among already low‑utilization nodes. We wrapped the program so that source (high‑utilization) and destination (low‑utilization) IP lists can be specified, cutting unnecessary RPC traffic.
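The wrapper's core idea can be sketched as explicitly partitioning nodes into source and target sets by utilization; the 70%/40% cutoffs and function name are made‑up examples, not the production values.

```python
# Illustrative sketch: pick explicit source/target sets instead of letting the
# balancer consider every node in the cluster.
def pick_balance_sets(nodes, high=0.70, low=0.40):
    """nodes: dict of host -> disk utilization (0.0-1.0)."""
    sources = sorted(h for h, u in nodes.items() if u >= high)
    targets = sorted(h for h, u in nodes.items() if u <= low)
    return sources, targets

nodes = {"dn1": 0.85, "dn2": 0.35, "dn3": 0.72, "dn4": 0.20, "dn5": 0.55}
src, dst = pick_balance_sets(nodes)
print(src, dst)  # ['dn1', 'dn3'] ['dn2', 'dn4']
```

Newer Hadoop releases expose related knobs directly on the CLI (for example, the balancer's -source and -include/-exclude host-list options), which such a wrapper can drive.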
3.3.2 Size‑Based Balancing
Moving a small block still costs a dispatcher thread and several RPCs while transferring little data. We added a size filter to the balance process, roughly doubling its efficiency; the community is discussing this enhancement.
Related JIRA issues: HDFS‑8824, HDFS‑9412, HDFS‑13222, HDFS‑13356.
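A minimal sketch of the size filter described above: skip blocks below a cutoff so that each balancer move (one thread plus its RPCs) transfers a worthwhile amount of data. The 64 MB cutoff is an illustrative choice, not the article's actual setting.

```python
# Hypothetical size filter applied before scheduling balancer moves.
MIN_BLOCK_BYTES = 64 * 1024 * 1024  # illustrative cutoff

def worth_moving(block_sizes, min_bytes=MIN_BLOCK_BYTES):
    """Return only the block sizes large enough to justify a move."""
    return [s for s in block_sizes if s >= min_bytes]

blocks = [128 * 1024**2, 4 * 1024, 256 * 1024**2, 1 * 1024**2]
print(len(worth_moving(blocks)))  # only the two large blocks qualify
```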
3.3.3 Balance Monitoring
Balance can generate massive RPC traffic that interferes with nightly ETL jobs. We built a 24‑hour monitoring tool to ensure balance runs only during low‑load periods.
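The scheduling guard behind that tool can be sketched as a simple time‑window check: balance is only allowed to run inside configured low‑load windows, outside the nightly ETL peak. The window hours below are illustrative assumptions.

```python
# Hedged sketch: gate balancer runs on configured low-load time windows.
from datetime import time

LOW_LOAD_WINDOWS = [(time(10, 0), time(18, 0))]  # e.g. a daytime lull

def balance_allowed(now, windows=LOW_LOAD_WINDOWS):
    """True if the current time of day falls inside any allowed window."""
    return any(start <= now <= end for start, end in windows)

print(balance_allowed(time(12, 30)))  # True: inside the window
print(balance_allowed(time(2, 0)))    # False: nightly ETL period
```

In practice the real tool also consults live RPC and bandwidth metrics, not just the clock.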
3.4 Cluster‑Wide Monitoring
We also monitor DataNode health, network bandwidth, and RPC metrics daily, issuing alerts when performance thresholds are approached.
4. Cross‑Cluster Migration Tool Enhancements
While Hadoop provides the distcp tool, it has several limitations: client‑only execution, poor concurrency, cumbersome state management, lack of final consistency checks, and permission mismatches.
We refactored and extended distcp with the following features:
Concurrent Job Submission: Multiple users can submit migration jobs 24/7 without coordinator involvement.
Multi‑Job State Management: Each job’s status is recorded for easy tracking.
Data Consistency Verification: Pre‑ and post‑migration statistics are compared, with alerts on mismatches.
Intelligent Permission Matching: Permissions are preserved or inherited from target directories.
Multi‑Job Progress Management: Automatic progress refresh and UI display.
Fuzzy Matching: Allows batch selection of paths based on patterns.
Blacklist Migration: Filters out real‑time changing directories to avoid inconsistency.
Scheduled & Macro‑Variable Migration: Automates daily changing data transfers.
Special File Auto‑Handling: Detects and fixes hidden or corrupted files.
Large Task Smart Splitting: Breaks massive jobs into smaller sub‑jobs for efficiency.
Metadata One‑Click Update: Updates Hive partitions and other metadata alongside data migration.
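The consistency‑verification feature above can be sketched as comparing file count and total byte size of the source and target trees after a job finishes. Real HDFS usage would call getContentSummary() on each path; here a plain dict stands in for the listing, and all names are illustrative.

```python
# Illustrative post-migration consistency check: compare (file count, total bytes).
def summarize(listing):
    """listing: dict of path -> size in bytes."""
    return (len(listing), sum(listing.values()))

def verify(src_listing, dst_listing):
    """True when source and target trees agree on count and total size."""
    return summarize(src_listing) == summarize(dst_listing)

src = {"/warehouse/a": 100, "/warehouse/b": 2048}
dst = {"/warehouse/a": 100, "/warehouse/b": 2048}
print(verify(src, dst))  # True: counts and sizes match
```

On mismatch, the real tool raises an alert and flags the job for re‑copy rather than silently completing.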
5. Summary and Future Outlook
The migration project delivered the following key improvements:
Support and optimization of space‑aware replica strategy.
Short‑circuit read enhancements.
Hotspot mitigation for DataNodes.
Extended functionality of the cluster migration tool.
Multi‑dimensional migration monitoring.
Future work includes upgrading Hadoop versions to unify the current multi‑version environment and leveraging erasure coding (EC) in newer releases to reduce cold‑data storage costs, further driving cost‑efficiency.
Tongcheng Travel Technology Center