
Optimizing HDFS Federation Data Migration with FastCopy and qFastCopy at Qunar

This article describes the challenges of scaling Qunar's Hadoop NameNode, introduces HDFS Federation and the FastCopy tool, presents performance tests comparing FastCopy with DistCp, and details the development and evaluation of an optimized qFastCopy solution that reduces a multi‑petabyte migration from roughly 20 hours to about 3.5.


Background

Qunar's Hadoop cluster has been continuously optimized as the business grew, but the single‑NameNode architecture hit a bottleneck: with a 180 GB heap, the metadata red line is about 700 million entries, and RPC response time and QPS degrade as the cluster expands.

HDFS Federation

HDFS Federation, introduced in Hadoop‑0.23.0, provides a horizontal scaling solution by creating multiple independent NameSpaces (each a separate NameNode with its own metadata), thereby improving scalability and isolation.
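A minimal hdfs-site.xml sketch of a two‑NameSpace Federation layout (the nameservice IDs and hostnames here are illustrative, not Qunar's actual topology):

```xml
<configuration>
  <!-- Two independent namespaces, each served by its own NameNode -->
  <property>
    <name>dfs.nameservices</name>
    <value>ns1,ns2</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns1</name>
    <value>nn1.example.com:8020</value>
  </property>
  <property>
    <name>dfs.namenode.rpc-address.ns2</name>
    <value>nn2.example.com:8020</value>
  </property>
</configuration>
```

In Federation, the DataNodes are shared: each one registers with every NameNode listed, while each NameNode manages only its own slice of the metadata.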

FastCopy Overview

FastCopy is an open‑source data‑copy solution from Facebook (see HDFS‑2139). It reads file information and block mappings from the source NameNode, creates the files on the target NameNode, and copies blocks by creating Linux hard links on the DataNodes, which avoids consuming additional storage.
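The reason hard‑link copying consumes no extra storage is that a hard link is just a second directory entry for the same inode, so the block's bytes exist on disk only once. A small stand‑alone illustration (the block filename is made up; FastCopy applies the same idea to HDFS block files on each DataNode):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    block = os.path.join(d, "blk_1073741825")        # a fake HDFS block file
    linked = os.path.join(d, "blk_1073741825.copy")  # the "copied" block
    with open(block, "wb") as f:
        f.write(b"\x00" * 1024)                      # 1 KiB of block data

    os.link(block, linked)                           # hard link, not a byte copy

    src, dst = os.stat(block), os.stat(linked)
    assert src.st_ino == dst.st_ino                  # same inode: same bytes on disk
    assert src.st_nlink == 2                         # two names, one copy of the data
    print("same inode:", src.st_ino == dst.st_ino)   # → same inode: True
```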

Advantages: high speed and no extra storage. Drawbacks: it does not copy file permissions or ownership, requiring a post‑copy fix that adds roughly one‑third to one‑half of the total copy time.
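Because permissions and ownership are not copied, a post‑copy pass has to read each entry from the source namespace and reapply it on the target. A minimal sketch of that pass, with plain dicts standing in for the real NameNode metadata and `setOwner`/`setPermission` RPCs (the field names are illustrative, not FastCopy's actual API):

```python
def fix_permissions(source_meta, target_fs):
    """Reapply owner/group/mode from source metadata to target paths.

    One round trip per entry is why this pass adds roughly one-third
    to one-half of the total copy time on large namespaces.
    """
    fixed = 0
    for path, meta in source_meta.items():
        target_fs[path] = {"owner": meta["owner"],
                           "group": meta["group"],
                           "mode": meta["mode"]}
        fixed += 1
    return fixed

source_meta = {
    "/user/report/day=20200101": {"owner": "report", "group": "hadoop", "mode": 0o750},
    "/user/model/train.parquet": {"owner": "algo", "group": "hadoop", "mode": 0o640},
}
target_fs = {}
print(fix_permissions(source_meta, target_fs))  # → 2
```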

FastCopy vs. DistCp Test

Test environment: 2 NameSpaces, 50 DataNodes. Metadata volume ranged from 1 million to 100 million entries. FastCopy took 0.68 minutes to 90 minutes, while DistCp took 5 minutes to 830 minutes.

Test Results

Figure: metadata amount vs. copy time for FastCopy and DistCp (line chart).

Test Analysis

Copying 500 million metadata entries with DistCp would take about 4 days, an unacceptable outage for reporting and model training. FastCopy is estimated at 90 min × 5 (scaling linearly from the 100‑million‑entry test point) ÷ 60 × 1.8 (safety margin) ≈ 13.5 hours for the copy, plus roughly 6 hours for permission fixing, about 20 hours in total.
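The estimate above extrapolates linearly from the 100‑million‑entry measurement and applies a 1.8× safety margin; both the linear scaling and the margin are the article's own assumptions. As a quick sanity check:

```python
def estimate_hours(minutes_per_100m, scale, margin=1.8):
    """Extrapolate copy time: measured minutes at 100M entries, scaled
    linearly to scale x 100M entries, with a safety margin."""
    return minutes_per_100m * scale / 60 * margin

fastcopy = estimate_hours(90, 5)    # 90 min at 100M entries, 5x the data
qfastcopy = estimate_hours(22, 5)   # 22 min at 100M entries (qFastCopy)
print(round(fastcopy, 1))   # → 13.5 hours for the copy itself
print(round(qfastcopy, 1))  # → 3.3 hours (the article rounds to about 3.5)
```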

During testing, FastCopy's bottleneck was identified as concurrency on the active NameNode; the source code shows that FastCopy issues multiple RPC requests per metadata entry, which is an obvious optimization point.

FastCopy Optimization (qFastCopy)

FastCopy is broadly applicable in Federation clusters, but migration still takes too long when splitting a single NameNode into several. By stopping write services, taking a snapshot so the fsimage stays unchanged, and then optimizing the copy path, we developed an enhanced version called qFastCopy.

Original FastCopy Process and Steps

Resource and Performance Analysis of Original FastCopy

Optimization Scheme

qFastCopy Process

Figure: specific steps of qFastCopy.

qFastCopy Limitations

Applicable only when splitting NameNodes for Federation; the fsimage must be pre‑copied to the target cluster.

Source and target absolute paths must be identical.

The cluster must be read‑only (no write operations) throughout the process.
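The limitations above translate naturally into pre‑flight checks before a migration starts. A hypothetical sketch (the function and its inputs are illustrative; in practice the read‑only state would come from something like `hdfs dfsadmin -safemode get`):

```python
def preflight(src_path, dst_path, fsimage_copied, read_only):
    """Validate qFastCopy's preconditions; return a list of violations."""
    errors = []
    if src_path != dst_path:
        errors.append("source and target absolute paths must be identical")
    if not fsimage_copied:
        errors.append("fsimage must be pre-copied to the target cluster")
    if not read_only:
        errors.append("cluster must be read-only for the whole migration")
    return errors

print(preflight("/user/data", "/user/data", True, True))   # → [] (safe to start)
print(preflight("/user/data", "/data", False, True))       # two violations
```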

qFastCopy Test

FastCopy vs. qFastCopy Comparison

A metadata volume vs. copy time chart shows qFastCopy is consistently faster; at the 100‑million‑entry point, qFastCopy finishes in about 22 minutes versus FastCopy's 90.

Analysis and Conclusion

For copying 500 million entries, qFastCopy requires approximately 22 min × 5 ÷ 60 × 1.8 ≈ 3.5 hours, reducing the migration time from the roughly 20 hours needed by FastCopy to about 3.5 hours, a substantial improvement for production clusters.

Tags: data migration, Performance Optimization, Big Data, HDFS, FastCopy, Federation, qFastCopy
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.