How Uber Scales Database Backup and Recovery to Petabytes
This article explains how Uber built a robust, continuously backed‑up and recoverable database platform that handles tens of petabytes of data. It covers the challenges, the architecture, backup scheduling, the backup and restore frameworks, and the technology‑specific snapshot logic that together meet tight RPO/RTO targets at massive scale.
Introduction
Uber runs its real‑time services on a mix of open‑source databases (MySQL, Apache Cassandra, etcd, Apache Zookeeper) and internally built storage solutions (Docstore, Schemaless). Reliable backup and recovery are essential for business continuity, disaster recovery, compliance, and testing.
Challenges
Original backup scheduling: simple periodic jobs ignored network and host resources, priorities, rate limits, and observability, causing load spikes and inefficient recovery.
Ad‑hoc restore processes: scripts and outdated runbooks broke as databases were upgraded.
Lack of restore drills: no defined processes or regular testing.
Recovery objectives: early RPOs of 7–21 days and unknown RTOs were improved through layered optimizations to RPOs of 4–24 hours and a restore throughput of roughly 300 TB per hour.
Architecture Overview
Uber’s stateful platform hosts a continuous backup‑recovery (CBCR) framework that abstracts and manages all aspects of backup and restore across clusters. It provides centralized, adaptive scheduling for backup workloads, snapshot creation, state propagation, and continuous verification of restores.
Continuous Backup
The continuous backup component runs a global scheduler that periodically triggers backup workloads. The Time Machine coordinator uses an optimal‑selection engine with client‑side rate limiting, considering freshness, network/host availability, historical consumption, peak utilization, storage policies, geography, and more to evenly distribute backup tasks without impacting production traffic.
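The client‑side rate limiting mentioned above can be sketched as a token bucket that each backup agent applies to its own uploads; the class and parameter names below are illustrative, not Uber's actual implementation:

```python
import time

class TokenBucket:
    """Client-side rate limiter sketch: each backup agent throttles its
    own upload bandwidth so the aggregate stays below a configured cap."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s      # sustained refill rate
        self.capacity = burst_bytes       # maximum burst size
        self.tokens = burst_bytes         # start with a full bucket
        self.last = time.monotonic()

    def try_consume(self, nbytes: float) -> bool:
        """Refill tokens for the elapsed time, then spend them if available."""
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should back off before retrying the chunk
```

An agent would call `try_consume(len(chunk))` before each upload and sleep briefly on failure, which smooths backup traffic without any central coordination.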
The backup cycle consists of three stages:
Discovery – scans the entire stateful cluster to list candidate databases.
Selection – applies multi‑criteria filters and ranking to pick the final set of databases for backup.
Trigger – decides between full or incremental backup and launches the appropriate workload.
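The three stages above can be sketched as plain functions over a candidate record; the fields, the health‑filter threshold, and the weekly full‑backup policy are assumptions for illustration, not the article's actual criteria:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    hours_since_backup: float   # freshness of the last backup
    host_load: float            # 0..1, current host utilization
    priority: int               # storage-policy priority

def discover(registry):
    """Stage 1: scan the cluster registry; skip overloaded hosts (assumed filter)."""
    return [c for c in registry if c.host_load < 0.9]

def select(candidates, budget):
    """Stage 2: rank by priority, then staleness; keep within a cycle budget."""
    ranked = sorted(candidates,
                    key=lambda c: (c.priority, c.hours_since_backup),
                    reverse=True)
    return ranked[:budget]

def trigger(candidate, full_every_h=168.0):
    """Stage 3: take a full backup weekly, incremental otherwise (assumed policy)."""
    return "full" if candidate.hours_since_backup >= full_every_h else "incremental"
```

In the real scheduler the ranking would fold in many more signals (network, geography, historical consumption); the point is that selection is a pure scoring step between discovery and trigger.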
Backup Framework
The backup framework is a generic driver that loads technology‑specific plugins to perform snapshot logic and upload data to Uber’s Blobstore. It runs backup side‑car containers alongside database containers, enforces rate limits, validates integrity, and records state in a backup index. The Blobstore provides configurable policies and deduplication for incremental/differential backups.
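The plugin pattern described above, with a generic driver dispatching to technology‑specific snapshot logic, might look roughly like this; the interface and function names are hypothetical:

```python
from abc import ABC, abstractmethod

class BackupPlugin(ABC):
    """Technology-specific snapshot logic behind a common driver interface."""

    @abstractmethod
    def snapshot(self, db: str) -> str:
        """Take a snapshot of the database; return the local artifact path."""

    @abstractmethod
    def upload(self, path: str) -> None:
        """Ship the snapshot artifact to blob storage."""

PLUGINS: dict[str, BackupPlugin] = {}

def register(tech: str, plugin: BackupPlugin) -> None:
    PLUGINS[tech] = plugin

def run_backup(tech: str, db: str) -> str:
    """Generic driver: pick the plugin, snapshot, upload, record the artifact."""
    plugin = PLUGINS[tech]
    path = plugin.snapshot(db)
    plugin.upload(path)
    return path  # in the real system this entry would go into the backup index
```

Adding a new database technology then only requires implementing the two plugin methods; the driver, rate limiting, and index bookkeeping stay shared.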
Snapshot Logic per Technology
MySQL‑based stores use Percona XtraBackup for efficient differential snapshots, covering MySQL and Uber’s Docstore/Schemaless.
Cassandra employs a Medusa‑style differential backup with nodetool snapshot.
etcd uses etcd‑clientv3 to obtain point‑in‑time snapshots.
Zookeeper backs up the latest snapshot.<zxid> file.
Continuous Restore
The continuous restore framework periodically validates restored backups, running both dedicated and random database tests. It schedules tests based on hardware availability to avoid production impact, performs end‑to‑end restores, and conducts byte‑level data comparisons for dedicated databases.
Restore testing generates detailed metrics—success rates, recovery ratios, integrity results, and performance data—and feeds them to monitoring and analysis teams.
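A byte‑level comparison over multi‑terabyte restores is typically done by streaming content hashes rather than holding data in memory; a minimal sketch (function names are illustrative):

```python
import hashlib

def file_digest(path: str, chunk: int = 1 << 20) -> str:
    """Hash the file in 1 MiB chunks so huge restores never need much RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_restore(source: str, restored: str) -> bool:
    """Byte-level check: the restored copy must hash identically to the source."""
    return file_digest(source) == file_digest(restored)
```

A real verifier would compare per‑table or per‑SSTable artifacts and report mismatches into the metrics pipeline rather than returning a single boolean.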
Restore Framework
Similar to the backup side, the restore framework is technology‑agnostic, using a modular driver with plugins for each database type. It builds a backup index, loads snapshots (e.g., Percona XtraBackup for MySQL, SSTable download for Cassandra, snapshot placement for etcd/Zookeeper), and restores databases to a usable state.
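One core use of the backup index during restore is picking the chain of artifacts to replay: the latest full backup at or before the target time, plus every incremental after it. A sketch under an assumed index schema (`ts` timestamp, `kind` of backup):

```python
def restore_chain(index: list[dict], target_ts: float) -> list[dict]:
    """Select the latest full backup at/before target_ts and all
    incrementals between it and target_ts, in replay order."""
    fulls = [b for b in index if b["kind"] == "full" and b["ts"] <= target_ts]
    if not fulls:
        raise ValueError("no full backup available before target time")
    base = max(fulls, key=lambda b: b["ts"])
    incs = [b for b in index
            if b["kind"] == "incremental" and base["ts"] < b["ts"] <= target_ts]
    return [base] + sorted(incs, key=lambda b: b["ts"])
```

The technology plugin then applies each entry in order, e.g. preparing an XtraBackup base and applying incrementals for MySQL, or placing downloaded SSTables for Cassandra.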
Continuous Restore Framework Benefits
Operational resilience : Reduces downtime risk through automated recovery.
Compliance & audit support : Auto‑generated reports satisfy regulatory requirements.
Data assurance : Validates integrity and correctness of restored data.
Actionable insights : Provides visibility into recovery performance and highlights improvement areas.
By continuously verifying backup and restore pipelines, Uber’s framework strengthens disaster‑recovery readiness, protects critical data, and scales recovery capabilities across petabytes of production workloads.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.