
How Uber Scales Database Backup and Recovery to Petabytes

This article explains how Uber built a robust, continuously backed‑up and recoverable database platform that handles tens of petabytes of data, detailing the challenges, the architecture, the scheduling, backup, and restore frameworks, and the technology‑specific snapshot logic that together enable tight RPO and RTO targets at massive scale.

dbaplus Community

Introduction

Uber runs its real‑time services on a mix of open‑source databases (MySQL, Apache Cassandra, etcd, Apache Zookeeper) and internally built storage solutions (Docstore, Schemaless). Reliable backup and recovery are essential for business continuity, disaster recovery, compliance, and testing.

Challenges

Original backup scheduling: Simple periodic jobs ignored network and host resources, priorities, rate limits, and observability, causing load spikes and inefficient recovery.

Ad‑hoc restore processes: Scripts and outdated manuals broke as databases were upgraded.

Lack of restore drills: No defined processes or regular testing existed.

Recovery objectives: Early RPOs of 7‑21 days and undefined RTOs were improved, through layered optimizations, to RPOs of 4‑24 hours and a restore throughput of roughly 300 TB per hour.

Architecture Overview

Uber’s stateful platform hosts a continuous backup‑recovery (CBCR) framework that abstracts and manages all aspects of backup and restore across clusters. It provides centralized, adaptive scheduling for backup workloads, snapshot creation, state propagation, and continuous verification of restores.


Continuous Backup

The continuous backup component runs a global scheduler that periodically triggers backup workloads. The Time Machine coordinator uses an optimal‑selection engine with client‑side rate limiting, considering freshness, network/host availability, historical consumption, peak utilization, storage policies, geography, and more to evenly distribute backup tasks without impacting production traffic.

The backup cycle consists of three stages:

Discovery – scans the entire stateful cluster to list candidate databases.

Selection – applies multi‑criteria filters and ranking to pick the final set of databases for backup.

Trigger – decides between full or incremental backup and launches the appropriate workload.
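The three‑stage cycle above can be sketched in a few lines of Python. This is a minimal illustration, not Uber's actual implementation: the database attributes, scoring weights, load cap, and batch limit are all assumptions standing in for the many criteria (freshness, network/host availability, geography, and so on) the real selection engine weighs.

```python
from dataclasses import dataclass

@dataclass
class Database:
    name: str
    last_backup_ts: float   # epoch seconds of the last successful backup
    host_load: float        # 0.0-1.0 current host utilization
    priority: int           # lower value = more critical

def discover(cluster):
    """Discovery: scan the stateful cluster and list candidate databases."""
    return list(cluster)

def select(candidates, max_batch=2, load_cap=0.8):
    """Selection: filter out busy hosts, then rank by priority and staleness."""
    eligible = [db for db in candidates if db.host_load < load_cap]
    ranked = sorted(eligible, key=lambda db: (db.priority, db.last_backup_ts))
    return ranked[:max_batch]   # client-side rate limit: cap the batch size

def trigger(db, now, full_interval=7 * 86400):
    """Trigger: full backup if the last one is older than the full interval."""
    mode = "full" if now - db.last_backup_ts > full_interval else "incremental"
    return (db.name, mode)
```

A scheduler loop would run these three functions on a timer, spreading the selected batch across hosts so backup traffic never competes with production load.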


Backup Framework

The backup framework is a generic driver that loads technology‑specific plugins to perform snapshot logic and upload data to Uber’s Blobstore. It runs backup side‑car containers alongside database containers, enforces rate limits, validates integrity, and records state in a backup index. The Blobstore provides configurable policies and deduplication for incremental/differential backups.
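A generic driver with technology‑specific plugins might look like the following sketch. The class and method names are hypothetical, a dict stands in for the Blobstore, and the SHA‑256 digest stands in for whatever integrity validation the real framework performs:

```python
import hashlib
from abc import ABC, abstractmethod

class SnapshotPlugin(ABC):
    """Technology-specific snapshot logic loaded by the generic driver."""
    @abstractmethod
    def snapshot(self, db_name: str) -> bytes: ...

class BackupDriver:
    def __init__(self, blobstore: dict):
        self.plugins = {}
        self.blobstore = blobstore   # stand-in for Uber's Blobstore
        self.index = []              # backup index: state records per backup

    def register(self, tech: str, plugin: SnapshotPlugin):
        self.plugins[tech] = plugin

    def backup(self, tech: str, db_name: str) -> str:
        data = self.plugins[tech].snapshot(db_name)
        digest = hashlib.sha256(data).hexdigest()   # integrity check value
        key = f"{tech}/{db_name}/{digest[:12]}"
        self.blobstore[key] = data                  # upload to blob storage
        self.index.append({"db": db_name, "key": key, "sha256": digest})
        return key

class FakeMySQLPlugin(SnapshotPlugin):
    """Toy plugin; a real one would invoke Percona XtraBackup."""
    def snapshot(self, db_name):
        return f"xtrabackup-stream:{db_name}".encode()
```

The plugin boundary is what keeps the driver technology‑agnostic: adding a new storage engine means registering one new plugin, not changing the driver.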

Snapshot Logic per Technology

MySQL‑based stores use Percona XtraBackup for efficient differential snapshots, covering MySQL and Uber’s Docstore/Schemaless.

Cassandra employs a Medusa‑style differential backup with nodetool snapshot.

etcd uses etcd‑clientv3 to obtain point‑in‑time snapshots.

Zookeeper backs up the latest snapshot.<zxid> file.
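The per‑technology tools named above could sit behind a single dispatch table. The tool names below come from the text; the exact flags and paths are illustrative assumptions, and the ZooKeeper entry keeps the `snapshot.<zxid>` filename as a placeholder to be resolved from the data directory at runtime:

```python
def snapshot_command(tech: str, target_dir: str) -> list[str]:
    """Map each database technology to its snapshot tool (flags illustrative)."""
    commands = {
        # Percona XtraBackup takes a physical MySQL backup
        "mysql": ["xtrabackup", "--backup", f"--target-dir={target_dir}"],
        # Cassandra: nodetool snapshot creates hard-linked SSTable snapshots
        "cassandra": ["nodetool", "snapshot", "-t", "backup"],
        # etcd: save a point-in-time snapshot of the keyspace
        "etcd": ["etcdctl", "snapshot", "save", f"{target_dir}/etcd.db"],
        # ZooKeeper: archive the latest snapshot.<zxid> file from the data dir
        "zookeeper": ["cp", "snapshot.<zxid>", target_dir],
    }
    return commands[tech]
```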

Continuous Restore

The continuous restore framework periodically validates restored backups, running both dedicated and random database tests. It schedules tests based on hardware availability to avoid production impact, performs end‑to‑end restores, and conducts byte‑level data comparisons for dedicated databases.

Restore testing generates detailed metrics—success rates, recovery ratios, integrity results, and performance data—and feeds them to monitoring and analysis teams.
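The byte‑level comparison and the metrics it feeds could be sketched as follows. This is a simplified model, with in‑memory byte strings standing in for table data and a per‑table digest standing in for whatever comparison primitive the real framework uses:

```python
import hashlib

def table_checksums(tables: dict[str, bytes]) -> dict[str, str]:
    """Compute a content digest per table (bytes stand in for table data)."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in tables.items()}

def verify_restore(source: dict[str, bytes], restored: dict[str, bytes]) -> dict:
    """Byte-level comparison between a source database and its restored copy."""
    src, dst = table_checksums(source), table_checksums(restored)
    missing = [t for t in src if t not in dst]
    mismatched = [t for t in src if t in dst and dst[t] != src[t]]
    return {
        "success": not missing and not mismatched,
        # recovery ratio: fraction of source tables present in the restore
        "recovery_ratio": (len(src) - len(missing)) / len(src) if src else 1.0,
        "mismatched": mismatched,
    }
```

In a continuous pipeline, the returned record is what gets emitted as the success‑rate, recovery‑ratio, and integrity metrics described above.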

Restore Framework

Similar to the backup side, the restore framework is technology‑agnostic, using a modular driver with plugins for each database type. It builds a backup index, loads snapshots (e.g., Percona XtraBackup for MySQL, SSTable download for Cassandra, snapshot placement for etcd/Zookeeper), and restores databases to a usable state.
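One concrete piece of the restore flow is assembling a restore chain from the backup index: an incremental backup is only usable on top of the full backup it was taken against. The index layout here is a hypothetical simplification:

```python
def restore_chain(index: list[dict]) -> list[dict]:
    """From a time-ordered backup index, pick the latest full backup plus
    every incremental taken after it, in the order a restore must apply them."""
    last_full = max(
        (i for i, entry in enumerate(index) if entry["type"] == "full"),
        default=None,
    )
    if last_full is None:
        raise ValueError("no full backup available to anchor the restore")
    return [index[last_full]] + [
        e for e in index[last_full + 1:] if e["type"] == "incremental"
    ]
```

A technology plugin would then apply each element of the chain in order, e.g. `xtrabackup --prepare` with incremental directories for MySQL, or sequential SSTable downloads for Cassandra.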


Continuous Restore Framework Benefits

Operational resilience: Reduces downtime risk through automated recovery.

Compliance & audit support: Auto‑generated reports satisfy regulatory requirements.

Data assurance: Validates integrity and correctness of restored data.

Actionable insights: Provides visibility into recovery performance and highlights improvement areas.

By continuously verifying backup and restore pipelines, Uber’s framework strengthens disaster‑recovery readiness, protects critical data, and scales recovery capabilities across petabytes of production workloads.

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
