
How Uber Scaled Data Replication to Petabytes Daily with Distcp Optimizations

Uber tackled the challenge of replicating over 350 PB of data across on‑premise and cloud data lakes by redesigning Hadoop Distcp: moving intensive preparation work into the Application Master, parallelising the copy‑listing and commit phases, and running small transfers as Hadoop uber‑task jobs, dramatically cutting latency and improving resource efficiency.


Introduction

Uber operates a data lake spanning on‑premise HDFS clusters and cloud object stores. To keep data available for disaster recovery, Uber uses the HiveSync service, which relies on Apache Hadoop® Distcp (Distributed Copy) for bulk and incremental replication.

When the lake grew beyond 350 PB, Distcp’s original implementation became a bottleneck, prompting a set of engineering optimisations to support daily replication volumes that later exceeded 1 PB.

Distcp Overview

Distcp is an open‑source Hadoop MapReduce application that copies large data sets between source and destination file systems in parallel. Its main components are:

Distcp Tool – discovers files, builds a copy‑listing, defines mapper distribution and submits a Hadoop job.

Hadoop Client – creates the job environment and performs input‑splitting.

Resource Manager (RM) – schedules containers on YARN.

Application Master (AM) – monitors the MapReduce job, requests containers for copy mappers and runs the copy‑committer.

Copy Mapper – copies file blocks inside YARN containers.

Copy Committer – concatenates the copied blocks into final files on the destination.
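
For orientation, this is what a plain command‑line invocation of Distcp looks like; the paths and mapper count below are placeholders, not Uber's actual clusters:

hadoop distcp -update -m 20 hdfs://src-nn:8020/data/trips hdfs://dst-nn:8020/data/trips

Here -update copies only files that have changed on the source, and -m caps the number of concurrent copy mappers.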

HiveSync’s Use of Distcp

HiveSync, originally built on Airbnb’s ReAir project, submits Distcp jobs for any dataset larger than 256 MB. Multiple asynchronous workers prepare each job, submit it through the Hadoop client to YARN, and a monitoring thread tracks its progress; once a job succeeds, the data becomes immediately visible in the target cluster.
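
For illustration, a service can submit a Distcp job through the Hadoop client along the lines of the sketch below, which uses the org.apache.hadoop.tools API from Hadoop 3.x; the class name is illustrative, and this is not HiveSync’s actual code:

import java.util.Collections;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.tools.DistCpOptions;

// Sketch: submitting a Distcp job programmatically through the
// Hadoop client, the way a replication service can drive it.
public class DistcpSubmitter {
  public static Job submit(Configuration conf, Path src, Path dst)
      throws Exception {
    DistCpOptions options =
        new DistCpOptions.Builder(Collections.singletonList(src), dst)
            .build();
    // execute() submits the MapReduce job; the returned Job handle
    // can be polled by a monitoring thread for progress and completion.
    return new DistCp(conf, options).execute();
  }
}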

Scalability Challenges (Q3 2022)

Daily replication volume jumped from 250 TB to 1 PB after Uber switched to an active‑passive data‑lake architecture, concentrating writes in a single region. The number of managed datasets grew from 30 k to 144 k, pushing the daily Distcp job count from ~10 k to an average of 374 k and breaking the SLA targets (P100 = 4 h, P99.9 = 20 min).

Key Optimisations

1. Move Distcp Preparation to the Application Master

Copy‑listing and input‑splitting originally ran on the HiveSync server, where they contended for HDFS client RPC locks and accounted for ~90 % of job‑submission latency. Relocating these tasks to the AM container eliminated the lock contention and cut submission latency by roughly 90 %.

2. Parallelise Copy‑Listing

The original implementation listed files sequentially, issuing a NameNode call for each block. Uber introduced a multi‑threaded pipeline (a sketch follows the list):

Each file is assigned a dedicated thread that calls getFileBlockLocations and creates block descriptors.

Block descriptors are placed into a blocking queue.

A single writer thread consumes the queue and writes the descriptors to a sequence file in order.
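
A minimal Java sketch of this pipeline, assuming Hadoop’s FileSystem and SequenceFile APIs; the class name, descriptor format, and queue capacity are illustrative, not Uber’s actual implementation:

import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Sketch of the parallel copy-listing pipeline: worker threads resolve
// block locations via the NameNode; a single writer thread serializes
// the descriptors into a sequence file. (Error handling elided.)
public class ParallelCopyListing {

  private static final int NUM_THREADS = 6;  // thread count from the article
  private static final Text POISON = new Text("__END_OF_LISTING__");

  public static void buildListing(FileSystem fs, List<FileStatus> files,
                                  Path listingPath) throws Exception {
    BlockingQueue<Text> queue = new LinkedBlockingQueue<>(10_000);
    ExecutorService pool = Executors.newFixedThreadPool(NUM_THREADS);

    // Worker threads: one getFileBlockLocations call per file, one
    // descriptor per block (path, block index, offset, length).
    for (FileStatus file : files) {
      pool.submit(() -> {
        BlockLocation[] blocks =
            fs.getFileBlockLocations(file, 0, file.getLen());
        for (int i = 0; i < blocks.length; i++) {
          queue.put(new Text(file.getPath() + "," + i + ","
              + blocks[i].getOffset() + "," + blocks[i].getLength()));
        }
        return null;
      });
    }

    Configuration conf = fs.getConf();
    try (SequenceFile.Writer out = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(listingPath),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(Text.class))) {
      // Single writer thread: drains the queue so sequence-file writes
      // stay serialized while the listing calls proceed in parallel.
      Thread writer = new Thread(() -> {
        try {
          Text descriptor;
          while (!(descriptor = queue.take()).equals(POISON)) {
            out.append(descriptor, new Text(""));
          }
        } catch (Exception e) {
          throw new RuntimeException(e);
        }
      });
      writer.start();
      pool.shutdown();
      pool.awaitTermination(1, TimeUnit.HOURS);
      queue.put(POISON);  // all workers done; signal the writer to stop
      writer.join();
    }
  }
}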

With six listing threads, the P99 copy‑listing latency across all HiveSync servers fell by 60 % (up to 75 % in the best case).

3. Parallelise Copy Committer

For directories containing more than 500 k files, the commit phase could take up to 30 minutes in the upstream code because files were concatenated sequentially. Uber parallelised the concatenation step by submitting one concatenation task per file to a pool capped at ten concurrent threads (a sketch follows). This cut average commit latency by 97.29 %.
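
A minimal sketch of such a committer, assuming the destination is HDFS (FileSystem.concat is HDFS‑specific) and that the mapping from each final file to its ordered chunk paths is supplied by the caller; the class name is illustrative:

import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the parallel commit: concatenate each destination file's
// copied chunks with a bounded thread pool instead of a single loop.
public class ParallelCommitter {

  private static final int MAX_THREADS = 10;  // concurrency cap from the article

  // chunksByFile maps each final destination file to its ordered chunk paths.
  public static void commit(FileSystem destFs,
                            Map<Path, Path[]> chunksByFile) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(MAX_THREADS);
    for (Map.Entry<Path, Path[]> entry : chunksByFile.entrySet()) {
      pool.submit(() -> {
        // HDFS concat stitches the chunk files onto the target without
        // rewriting data; the target file must already exist.
        destFs.concat(entry.getKey(), entry.getValue());
        return null;
      });
    }
    pool.shutdown();
    pool.awaitTermination(30, TimeUnit.MINUTES);
  }
}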

4. Uber‑Mapper Jobs for Small Transfers

Approximately 52 % of Distcp jobs copy ≤512 MB and fewer than 200 files, yet each incurred a full YARN container allocation. Enabling Hadoop’s uber‑task mode runs the copy mapper inside the AM JVM, avoiding the container‑launch overhead. The required configuration is:

mapreduce.job.ubertask.enable=true
mapreduce.job.ubertask.maxmaps=1
mapreduce.job.ubertask.maxbytes=512M

This optimisation saved roughly 268 000 container starts per day.
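
The same knobs can also be set programmatically when constructing the job; a minimal sketch, where the class and job name are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class UberTaskConfig {
  public static Job smallTransferJob() throws Exception {
    Configuration conf = new Configuration();
    // Run the copy mapper inside the AM JVM (Hadoop uber mode) for
    // jobs small enough to qualify under the limits below.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 1);
    conf.set("mapreduce.job.ubertask.maxbytes", "512M");
    return Job.getInstance(conf, "distcp-small-transfer");
  }
}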

Results

The combined changes increased incremental replication capacity five‑fold within a year, moving more than 306 PB from on‑premise to cloud without scaling‑related failures. New observability metrics (job submission time, copy‑listing latency, commit latency, container memory usage, P99 copy speed) were added to aid capacity planning and fault diagnosis.

Operational Challenges and Mitigations

AM Out‑Of‑Memory (OOM) crashes were addressed by adding OOM detection metrics and tuning YARN memory/core allocations.

High job‑submission rates caused YARN queue saturation; a circuit‑breaker throttles submissions until the queue recovers.
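
One way such a circuit breaker can be structured is sketched below; the queue‑depth supplier and watermark thresholds are assumptions, not Uber’s implementation:

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntSupplier;

// Conceptual circuit breaker: stop submitting Distcp jobs while the
// YARN queue is saturated, resume once it drains below a low mark.
public class SubmissionCircuitBreaker {

  private final IntSupplier pendingApps;  // e.g. polled from the RM
  private final int highWaterMark;        // trip above this depth
  private final int lowWaterMark;         // reset below this depth
  private final AtomicBoolean open = new AtomicBoolean(false);

  public SubmissionCircuitBreaker(IntSupplier pendingApps,
                                  int high, int low) {
    this.pendingApps = pendingApps;
    this.highWaterMark = high;
    this.lowWaterMark = low;
  }

  // Returns true when it is safe to submit another job.
  public boolean allowSubmission() {
    int depth = pendingApps.getAsInt();
    if (depth > highWaterMark) {
      open.set(true);          // trip: queue saturated
    } else if (depth < lowWaterMark) {
      open.set(false);         // recover: queue has drained
    }
    return !open.get();
  }
}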

Long‑running copy‑listing tasks timed out because they ran before the AM heartbeat started; deferring the copy‑listing phase until the heartbeat was active prevented the AM timeouts.

Future Work

Planned enhancements include:

Parallelising file‑permission setting.

Parallelising input‑splitting.

Moving compute‑intensive commit work to the Reduce phase for better scalability.

Implementing a dynamic bandwidth throttler.

Uber intends to upstream these patches to the open‑source Distcp project.

Tags: big data, data replication, Hadoop, Uber, Distcp, HiveSync
Written by Radish, Keep Going!