Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar
This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that cuts the estimated migration time from about 20 hours to about 3.5 hours.
Background
Qunar's Hadoop cluster has been optimized continuously as the business grew, but the single-NameNode setup hit a bottleneck: with a 180 GB heap, the safe upper limit (the "red line") for metadata is about 700 million entries, and RPC response time and QPS degrade as the cluster expands.
HDFS Federation
HDFS Federation, introduced in Hadoop‑0.23.0, provides a horizontal scaling solution by creating multiple independent NameSpaces (each a separate NameNode with its own metadata), thereby improving scalability and isolation.
FastCopy Overview
FastCopy is an open‑source data‑copy solution from Facebook (see HDFS‑2139). It reads file information and block mappings from the source NameNode, creates files on the target NameNode, and copies blocks using Linux hard‑link copying, which avoids additional storage consumption.
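The hard-link idea can be illustrated on a local filesystem. The following is a minimal sketch, not FastCopy's own code: it uses Python's `os.link` on a temporary file, and the block file names are hypothetical. It shows that a hard link gives a second name for the same inode, so no data is duplicated.

```python
import os
import tempfile


def hard_link_copy(src_path, dst_path):
    """Create dst_path as a hard link to src_path instead of copying bytes."""
    os.link(src_path, dst_path)
    return os.stat(src_path), os.stat(dst_path)


def demo():
    """Link a small file and report whether both names share one inode."""
    with tempfile.TemporaryDirectory() as workdir:
        src = os.path.join(workdir, "blk_0001")       # hypothetical block file
        dst = os.path.join(workdir, "blk_0001.copy")  # hypothetical target name
        with open(src, "wb") as f:
            f.write(b"x" * 4096)
        src_stat, dst_stat = hard_link_copy(src, dst)
        same_inode = src_stat.st_ino == dst_stat.st_ino
        link_count = dst_stat.st_nlink  # number of names pointing at the inode
        return same_inode, link_count


if __name__ == "__main__":
    print(demo())
```

Because both names refer to the same inode, the data occupies disk space only once, which is why this style of copy adds no storage cost when source and target live on the same filesystem.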
Advantages: high speed and no extra storage. Drawbacks: it does not copy file permissions or ownership, requiring a post‑copy fix that adds roughly one‑third to one‑half of the total copy time.
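One way to picture the post-copy fix is as a pass that replays ownership and permissions onto the copied paths. The sketch below is an assumption for illustration only, since the article does not say how the fix was carried out: it builds `hdfs dfs -chown` and `hdfs dfs -chmod` command lines from a hypothetical list of `(path, owner, group, mode)` records without executing them.

```python
def permission_fix_commands(entries):
    """Build chown/chmod commands for each (path, owner, group, mode) record.

    The record layout is hypothetical; commands are returned, not executed.
    """
    commands = []
    for path, owner, group, mode in entries:
        commands.append(["hdfs", "dfs", "-chown", f"{owner}:{group}", path])
        commands.append(["hdfs", "dfs", "-chmod", mode, path])
    return commands


if __name__ == "__main__":
    sample = [("/user/example/table1", "etl", "hadoop", "750")]  # made-up record
    for cmd in permission_fix_commands(sample):
        print(" ".join(cmd))
```

Issuing two commands per path is slow at the scale of hundreds of millions of entries, which is consistent with the fix being a significant share of the total copy time.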
FastCopy vs. DistCp Test
Test environment: 2 NameSpaces and 50 DataNodes, with metadata volume ranging from 1 million to 100 million entries. Across that range, FastCopy took between 0.68 and 90 minutes, while DistCp took between 5 and 830 minutes.
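To make the gap concrete, the speedup can be computed from the reported endpoints. This small sketch assumes the lower times correspond to 1 million entries and the upper times to 100 million entries; intermediate data points are not given in the text.

```python
# Endpoint timings reported above, in minutes (pairing with sizes is assumed).
TIMINGS = {
    "1 million entries": {"fastcopy": 0.68, "distcp": 5.0},
    "100 million entries": {"fastcopy": 90.0, "distcp": 830.0},
}


def speedup(distcp_minutes, fastcopy_minutes):
    """How many times faster FastCopy is than DistCp."""
    return distcp_minutes / fastcopy_minutes


def speedups(timings):
    """Speedup per data size, rounded to one decimal place."""
    return {
        label: round(speedup(t["distcp"], t["fastcopy"]), 1)
        for label, t in timings.items()
    }


if __name__ == "__main__":
    for label, ratio in speedups(TIMINGS).items():
        print(f"{label}: FastCopy is about {ratio}x faster")
```

Under that pairing, FastCopy is roughly 7 to 9 times faster than DistCp at both ends of the tested range.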
Test Results
Figure: metadata amount vs. copy time (line chart).
Test Analysis
Copying 500 million metadata entries with DistCp would take about 4 days, causing unacceptable downtime for reporting and model training. Scaling the 100-million-entry result, FastCopy is estimated to need 90 × 5 / 60 × 1.8 ≈ 13.5 hours for the data copy plus roughly 6 hours for permission fixing, about 20 hours in total.
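The estimate can be reproduced with a few lines of arithmetic. The sketch below follows the formula as written: the measured minutes at 100 million entries, scaled by 5, converted to hours, then multiplied by 1.8. The function and parameter names are illustrative, and the 1.8 factor is taken from the formula without further interpretation.

```python
def estimate_copy_hours(minutes_per_100m, scale, factor=1.8):
    """Extrapolate copy time in hours from a 100-million-entry measurement.

    minutes_per_100m: measured minutes for 100 million entries
    scale: how many times larger the target data set is
    factor: multiplier used in the article's formula
    """
    return minutes_per_100m * scale / 60 * factor


def estimate_total_hours(minutes_per_100m, scale, permission_fix_hours):
    """Copy estimate plus the separate permission-fixing pass."""
    return estimate_copy_hours(minutes_per_100m, scale) + permission_fix_hours


if __name__ == "__main__":
    copy_hours = estimate_copy_hours(90, 5)
    total_hours = estimate_total_hours(90, 5, permission_fix_hours=6)
    print(f"copy: {copy_hours:.1f} h, total: {total_hours:.1f} h")
```

With 90 minutes as input the copy estimate is 13.5 hours, and adding 6 hours of permission fixing gives 19.5 hours, which is rounded to about 20 hours.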
During testing, FastCopy's bottleneck was identified as concurrency on the active NameNode: the source code issues multiple requests per metadata entry, which suggests an optimization point.
FastCopy Optimization (qFastCopy)
FastCopy is broadly applicable in Federation clusters, but our immediate problem is the long migration time when splitting a single NameNode into multiple NameNodes. For this scenario we stop write services and create a snapshot so that the fsimage stays unchanged, then apply optimizations on top of that. The resulting enhanced version is called qFastCopy.
Original FastCopy Process and Steps
Resource and Performance Analysis of Original FastCopy
Optimization Scheme
qFastCopy Process
Specific steps of qFastCopy are illustrated below:
qFastCopy Limitations
Applicable only when splitting a NameNode for Federation; the fsimage must be copied to the target cluster beforehand.
Source and target absolute paths must be identical.
The cluster must be read‑only (no write operations) throughout the process.
qFastCopy Test
FastCopy vs. qFastCopy Comparison
Metadata vs. copy time chart demonstrates qFastCopy’s superior speed.
Analysis and Conclusion
For copying 500 million metadata entries, qFastCopy requires approximately 22 × 5 / 60 × 1.8 ≈ 3.3 hours, or roughly 3.5 hours. This reduces the migration time from the 20 hours needed by FastCopy to about 3.5 hours, a substantial improvement for production clusters.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.