How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations
Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud lakes by redesigning Hadoop Distcp, moving intensive tasks to the Application Master, parallelising copy‑listing and commit phases, and leveraging Uber‑mapper jobs to dramatically cut latency and improve resource efficiency.
Introduction
Uber operates a data lake spanning on‑premise HDFS clusters and cloud object stores. To keep data available for disaster‑recovery, Uber uses the HiveSync service, which relies on Apache Hadoop® Distcp (Distributed Copy) for bulk and incremental replication.
When the lake grew beyond 350 PB, Distcp’s original implementation became a bottleneck, prompting a set of engineering optimisations to support daily replication volumes that later exceeded 1 PB.
Distcp Overview
Distcp is an open‑source Hadoop MapReduce application that copies large data sets between source and destination file systems in parallel. Its main components are:
Distcp Tool – discovers files, builds a copy‑listing, defines mapper distribution and submits a Hadoop job.
Hadoop Client – creates the job environment and performs input‑splitting.
Resource Manager (RM) – schedules containers on YARN.
Application Master (AM) – monitors the MapReduce job, requests containers for copy mappers and runs the copy‑committer.
Copy Mapper – copies file blocks inside YARN containers.
Copy Committer – concatenates the copied blocks into final files on the destination.
HiveSync’s Use of Distcp
HiveSync, originally built on Airbnb’s ReAir project, submits Distcp jobs for any dataset larger than 256 MB. Multiple asynchronous workers prepare the job, submit it through the Hadoop client to YARN, and a monitoring thread tracks progress. Successful jobs make data instantly visible in the target cluster.
Scalability Challenges (Q3 2022)
Daily replication volume jumped from 250 TB to 1 PB after Uber switched to an active‑passive data‑lake architecture, concentrating writes in a single region. The number of managed datasets grew from 30 k to 144 k, causing daily Distcp job count to rise from ~10 k to an average of 374 k. This broke the SLA targets (P100 = 4 h, P99.9 = 20 min).
Key Optimisations
1. Move Distcp Preparation to the Application Master
Copy‑listing and input‑splitting originally ran on the HiveSync server, contending for HDFS client RPC locks and accounting for ~90 % of job‑submission latency. By relocating these tasks to the AM container, lock contention was eliminated and submission latency dropped by roughly 90 %.
2. Parallelise Copy‑Listing
The original implementation listed files sequentially, issuing a NameNode call for each block. Uber introduced a multi‑threaded pipeline:
Each file is assigned a dedicated thread that calls getFileBlockLocations and creates block descriptors.
Block descriptors are placed into a blocking queue.
A single writer thread consumes the queue and writes the descriptors to a sequence file in order.
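The pipeline above can be sketched with plain JDK concurrency primitives. This is a minimal illustration, not Uber's code: fetchBlockLocations stands in for the NameNode getFileBlockLocations RPC, and ordering is preserved here by draining per-file futures in submission order, a simplification of the queue-plus-single-writer design described above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of parallel copy-listing: a pool of worker threads fetches block
// locations concurrently while a single consumer emits descriptors in
// file order, mirroring the single writer thread that keeps the sequence
// file ordered. All names here are illustrative.
public class ParallelCopyListing {

    // Stand-in for the per-file NameNode call (getFileBlockLocations).
    static List<String> fetchBlockLocations(String file) {
        List<String> blocks = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            blocks.add(file + "#block" + i);
        }
        return blocks;
    }

    // Lists all files using a fixed worker pool; results are consumed in
    // submission order so the output listing stays deterministic.
    public static List<String> buildListing(List<String> files, int threads)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<List<String>>> futures = new ArrayList<>();
        for (String f : files) {
            futures.add(pool.submit(() -> fetchBlockLocations(f)));
        }
        List<String> listing = new ArrayList<>();
        for (Future<List<String>> fut : futures) {
            listing.addAll(fut.get()); // blocks until that file is listed
        }
        pool.shutdown();
        return listing;
    }

    public static void main(String[] args) throws Exception {
        List<String> files = List.of("/data/a", "/data/b", "/data/c");
        // Six worker threads, matching the thread count cited above.
        List<String> listing = buildListing(files, 6);
        System.out.println("descriptors: " + listing.size());
    }
}
```

The key property is that the expensive per-file RPCs overlap, while a single ordered consumer keeps the sequence-file write serial.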
Using six threads across all HiveSync servers reduced the P99 copy‑listing latency by 60 % (maximum reduction 75 %).
3. Parallelise Copy Committer
For directories containing >500 k files, the commit phase could take up to 30 minutes in the upstream code because files were concatenated sequentially. Uber parallelised the concatenation step by launching a thread per file (up to ten concurrent threads). This cut average commit latency by 97.29 %.
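A bounded-pool version of that commit phase can be sketched as follows. This is an illustrative simplification: concatBlocks is a placeholder for the real destination-file-system concatenation (e.g. a FileSystem.concat call), and the cap of ten concurrent threads follows the figure quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch of the parallelised commit phase: each file's block
// concatenation runs as a task on a pool capped at ten threads, instead
// of concatenating files one by one.
public class ParallelCommitter {

    // Placeholder for merging a file's copied block files into the final
    // file on the destination file system.
    static void concatBlocks(String file) {
        // no-op in this sketch
    }

    // Commits all files concurrently; returns the number committed.
    public static int commitAll(List<String> files) throws Exception {
        if (files.isEmpty()) {
            return 0;
        }
        ExecutorService pool =
                Executors.newFixedThreadPool(Math.min(10, files.size()));
        List<Future<?>> tasks = new ArrayList<>();
        for (String f : files) {
            tasks.add(pool.submit(() -> concatBlocks(f)));
        }
        int done = 0;
        for (Future<?> t : tasks) {
            t.get(); // propagate any concat failure
            done++;
        }
        pool.shutdown();
        return done;
    }
}
```

Waiting on each future also surfaces per-file failures, so a failed concatenation fails the commit rather than being silently skipped.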
4. Uber‑Mapper Jobs for Small Transfers
Approximately 52 % of Distcp jobs copy ≤512 MB and <200 files, yet each job incurred a full YARN container allocation. Enabling Hadoop’s “Uber task” mode runs the copy mapper inside the AM JVM, avoiding container launch overhead. The required configuration is:
mapreduce.job.ubertask.enable=true
mapreduce.job.ubertask.maxmaps=1
mapreduce.job.ubertask.maxbytes=512M
This optimisation saved roughly 268 000 container starts per day.
Results
The combined changes increased incremental replication capacity five‑fold within a year, moving more than 306 PB from on‑premise to cloud without scaling‑related failures. New observability metrics (job submission time, copy‑listing latency, commit latency, container memory usage, P99 copy speed) were added to aid capacity planning and fault diagnosis.
Operational Challenges and Mitigations
AM Out‑Of‑Memory (OOM) crashes were addressed by adding OOM detection metrics and tuning YARN memory/core allocations.
High job‑submission rates caused YARN queue saturation; a circuit‑breaker throttles submissions until the queue recovers.
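The throttling idea can be illustrated with a minimal counter-based breaker. This is a hypothetical sketch, not Uber's implementation: the threshold and the way pending-job counts are fed in are assumptions.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Minimal sketch of a submission circuit breaker: refuse new Distcp job
// submissions while too many jobs are pending in the YARN queue, and
// admit them again as the queue drains.
public class SubmissionCircuitBreaker {
    private final int maxPending;
    private final AtomicInteger pending = new AtomicInteger();

    public SubmissionCircuitBreaker(int maxPending) {
        this.maxPending = maxPending;
    }

    // Called before submitting a job; false means "hold and retry later".
    public boolean trySubmit() {
        while (true) {
            int p = pending.get();
            if (p >= maxPending) {
                return false; // breaker open: queue is saturated
            }
            if (pending.compareAndSet(p, p + 1)) {
                return true;  // admitted
            }
        }
    }

    // Called when a submitted job leaves the queue (finished or failed).
    public void onJobComplete() {
        pending.decrementAndGet();
    }
}
```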
Long‑running copy‑listing tasks could exceed the AM timeout because they ran before the AM heartbeat had started. Deferring the copy‑listing phase until after the heartbeat was active eliminated these timeouts.
Future Work
Planned enhancements include:
Parallelising file‑permission setting.
Parallelising input‑splitting.
Moving compute‑intensive commit work to the Reduce phase for better scalability.
Implementing a dynamic bandwidth throttler.
Uber intends to upstream these patches to the open‑source Distcp project.