Big Data 16 min read

How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.

dbaplus Community
dbaplus Community
dbaplus Community
How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours

Background

Hadoop Distributed File System (HDFS) is designed for commodity hardware. eBay's ADI Hadoop team faced a sudden surge in RPC traffic—2–3 times the normal load—due to rapid data growth, threatening SLA compliance. To relieve pressure they decided to migrate roughly 10 million files (≈10 PB) to a new HDFS namespace using Federation.

Challenges of Ultra‑Large Data Migration

The primary difficulty was moving petabyte‑scale data within a strict 2–3 hour maintenance window. A naïve DistCp copy of 10 PB would take about ten days, far exceeding the allowed downtime.

Migration Plan Selection

Two strategies were evaluated:

Initial full copy + multiple incremental diffs : Perform a large initial copy, then repeatedly sync incremental changes. This leverages DistCp’s snapshot‑based incremental copy but requires complex automation and extra storage for both old and new data.

Static data copy + final dynamic copy : Copy all static data once, then perform a single final copy of only the changed data. Simpler to execute and ultimately chosen for the migration.

The team selected the second approach and focused on dramatically speeding up DistCp.

Comprehensive DistCp Optimizations

Five key enhancements were implemented:

Integration of HDFS Fastcopy : Back‑ported community patch HDFS‑2139 to create hard‑links instead of physically copying data, reducing copy time by a factor of five. An option useFastcopy was added to enable this mode.

Data‑skew mitigation : Adjusted UniformSizeInputFormat to treat files smaller than one block as a full block, balancing task load and eliminating long‑tail tasks caused by mixed file sizes.

Pre‑emptive ACL preservation : Moved directory ACL handling from the final commit phase into individual map tasks, cutting commit time.

Failure‑task handling : Added a dedicated reduce task to collect failure records, enabling automatic retry of failed files after the main copy.

Parameter tuning : Increased ipc.client.connect.max.retries.on.timeouts to 3, extended mapreduce.task.timeout to 30 minutes, and redirected DistCp metadata to a faster internal HDFS cluster to avoid metadata bottlenecks.

Fastcopy architecture diagram
Fastcopy architecture diagram
DistCp with additional reduce task
DistCp with additional reduce task

Support for HDFS Migration

Two additional issues were addressed:

Viewfs nested mapping : Implemented community patch HADOOP‑13055 to support fallback mapping, enabling nested Viewfs configurations required by the Federation setup.

Open‑file handling : Utilized HDFS‑10480 to list and forcibly close open files before copy, ensuring metadata consistency.

Multi‑Round Incremental Testing

Four testing phases validated the solution:

Performance baseline for DistCp.

Functional correctness (metadata, ACLs, file contents).

Scale‑up simulation to expose hidden bottlenecks.

Production‑grade Snapshot copy rehearsal, revealing real‑world issues such as network timeouts and slow nodes.

Results showed that increasing DistCp map tasks beyond 5 000 did not improve throughput because the NameNode RPC call queue became saturated.

DistCp task count vs. throughput
DistCp task count vs. throughput
NameNode call queue saturation
NameNode call queue saturation

Results and Impact

On the migration day, the team pre‑copied roughly half of the historical data, then transferred the remaining 5–6 million files within the two‑hour window without SLA violations. The migration reduced daily RPC traffic by ~5 billion requests and freed ~20 GB of heap memory. Post‑migration RPC latency for the moved data dropped below 1 ms.

The experience highlighted the need for better automation (e.g., leveraging HDFS RBF) to reduce manual steps in future migrations.

References

https://issues.apache.org/jira/browse/HDFS-2139

https://issues.apache.org/jira/browse/HDFS-10467

https://issues.apache.org/jira/browse/HDFS-10480

https://issues.apache.org/jira/browse/HADOOP-13055

https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html

https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

Performance OptimizationBig DataHDFSHadoopDistcp
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.