How eBay Migrated 10 PB of HDFS Data Across Namespaces in Just 2 Hours
This article details how eBay's ADI Hadoop team tackled a massive 10 PB, 10‑million‑file migration by optimizing DistCp with Fastcopy, load‑balancing, ACL handling, and failure recovery, ultimately completing the transfer within a two‑hour window while preserving cluster stability and performance.
Background
Hadoop Distributed File System (HDFS) is designed for commodity hardware. eBay's ADI Hadoop team faced a sudden surge in RPC traffic—2–3 times the normal load—due to rapid data growth, threatening SLA compliance. To relieve pressure they decided to migrate roughly 10 million files (≈10 PB) to a new HDFS namespace using Federation.
Challenges of Ultra‑Large Data Migration
The primary difficulty was moving petabyte‑scale data within a strict 2–3 hour maintenance window. A naïve DistCp copy of 10 PB would take about ten days, far exceeding the allowed downtime.
Migration Plan Selection
Two strategies were evaluated:
Initial full copy + multiple incremental diffs : Perform a large initial copy, then repeatedly sync incremental changes. This leverages DistCp’s snapshot‑based incremental copy but requires complex automation and extra storage for both old and new data.
Static data copy + final dynamic copy : Copy all static data once, then perform a single final copy of only the changed data. Simpler to execute and ultimately chosen for the migration.
The team selected the second approach and focused on dramatically speeding up DistCp.
Comprehensive DistCp Optimizations
Five key enhancements were implemented:
Integration of HDFS Fastcopy : Back‑ported community patch HDFS‑2139 to create hard‑links instead of physically copying data, reducing copy time by a factor of five. An option useFastcopy was added to enable this mode.
Data‑skew mitigation : Adjusted UniformSizeInputFormat to treat files smaller than one block as a full block, balancing task load and eliminating long‑tail tasks caused by mixed file sizes.
Pre‑emptive ACL preservation : Moved directory ACL handling from the final commit phase into individual map tasks, cutting commit time.
Failure‑task handling : Added a dedicated reduce task to collect failure records, enabling automatic retry of failed files after the main copy.
Parameter tuning : Increased ipc.client.connect.max.retries.on.timeouts to 3, extended mapreduce.task.timeout to 30 minutes, and redirected DistCp metadata to a faster internal HDFS cluster to avoid metadata bottlenecks.
Support for HDFS Migration
Two additional issues were addressed:
Viewfs nested mapping : Implemented community patch HADOOP‑13055 to support fallback mapping, enabling nested Viewfs configurations required by the Federation setup.
Open‑file handling : Utilized HDFS‑10480 to list and forcibly close open files before copy, ensuring metadata consistency.
Multi‑Round Incremental Testing
Four testing phases validated the solution:
Performance baseline for DistCp.
Functional correctness (metadata, ACLs, file contents).
Scale‑up simulation to expose hidden bottlenecks.
Production‑grade Snapshot copy rehearsal, revealing real‑world issues such as network timeouts and slow nodes.
Results showed that increasing DistCp map tasks beyond 5 000 did not improve throughput because the NameNode RPC call queue became saturated.
Results and Impact
On the migration day, the team pre‑copied roughly half of the historical data, then transferred the remaining 5–6 million files within the two‑hour window without SLA violations. The migration reduced daily RPC traffic by ~5 billion requests and freed ~20 GB of heap memory. Post‑migration RPC latency for the moved data dropped below 1 ms.
The experience highlighted the need for better automation (e.g., leveraging HDFS RBF) to reduce manual steps in future migrations.
References
https://issues.apache.org/jira/browse/HDFS-2139
https://issues.apache.org/jira/browse/HDFS-10467
https://issues.apache.org/jira/browse/HDFS-10480
https://issues.apache.org/jira/browse/HADOOP-13055
https://hadoop.apache.org/docs/current/hadoop-distcp/DistCp.html
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/Federation.html
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
