Databases 15 min read

How Facebook Scales MySQL Backups: Strategies, Storage, and Incremental Techniques

This article explains Facebook's large‑scale MySQL backup architecture, covering the Python‑based automation framework, master‑slave deployment, logical mysqldump backups, warm and cold storage locations, source selection heuristics, full and incremental backup pipelines, verification processes, and future RBR‑based improvements.

ITPUB
ITPUB
ITPUB
How Facebook Scales MySQL Backups: Strategies, Storage, and Incremental Techniques

1. Preparation Knowledge

Facebook implements most database automation with Python; every documented manual operation has a corresponding Python library. Communication between services uses Thrift, and the BaseController framework (open‑source as sparts on GitHub) underpins the backup‑agent implementation.

2. Deployment Model

The MySQL clusters run in a master/slave topology, with each shard replicated across at least two of five data‑centers. The article refers to this collection as a Replicaset , illustrated in the first diagram.

3. Backup Forms

Facebook uses logical backups via mysqldump for all production MySQL data, despite the common expectation of faster physical backups (e.g., xtrabackup). The reasons are:

Compression ratio: Logical backups compressed with gzip achieve 2‑3× better size reduction than physical backups because indexes remain uncompressed in the latter.

Service dependencies: Logical dumps serve as the source for non‑real‑time data‑warehouse analytics; parsing physical backups would add significant effort.

Operational flexibility: Directly readable dump files simplify debugging, external tool development, and damage analysis.

4. Backup Storage Locations

Two storage tiers satisfy audit requirements and disaster‑recovery drills:

Warm Backup: Each data‑center runs an independent HDFS cluster storing the most recent ten days of shard backups.

Cold Backup: Older backups are migrated to Isilon devices for long‑term archival.

Approximately 99% of restore requests are satisfied from warm HDFS storage, typically needing data from the past week.

5. Selecting Backup Sources

Because a Replicaset contains multiple MySQL instances, Facebook hashes each shard name to a bucket and assigns the bucket to an HDFS cluster. When the backup‑agent receives a request, it determines the target HDFS address via the hash, then selects a source instance based on:

Network traffic load

Broken slave status

Data consistency requirements

6. Backup Strategies

Backup frequency depends on data change patterns:

Hot‑spot data: full backup every three days with incremental backups in between.

Highly volatile or unpredictable data: daily full backups.

A Python‑implemented backup agent runs on each MySQL server, reading configuration from a central config service, scheduling jobs, handling concurrency limits, prioritizing instances, and retrying failures.

7. Full Backup Implementation

Full backups are performed with a pipeline of mysqldump, qpress, and the HDFS client, operating per shard. Parallelism varies by hardware (e.g., 2‑3 concurrent dumps on flash storage). Logical Read‑Ahead options are added to mysqldump to reduce impact on the serving workload. See Yoshinori Matsunobu’s blog for performance details.

8. Incremental Backup Evolution

Facebook’s incremental backup differs from traditional binlog‑based approaches. Because full backups are shard‑level, using a single binlog would intertwine increments across shards, creating dependencies. Instead, Facebook built a differential backup system:

Version 1: Daily full dumps are written to HDFS; a Hadoop job computes differences and stores incremental files, then deletes the older full dump. This incurred high HDFS I/O and could not finish within a day.

Version 2: The system streams full data from MySQL, keeps the latest full dump in memory for comparison, and writes only the delta to HDFS, eliminating redundant full‑dump writes. This version is stable in production but still adds comparison overhead and consumes a full‑dump read each day.

Future work aims to leverage row‑based replication (RBR) to produce logical incremental backups directly from binlog streams.

9. Backup Verification

All backups undergo continuous verification. A daemon selects the highest‑priority backup (based on age and failure count) from warm HDFS, restores it on an idle machine, and runs a series of steps: download, decompress, load, validate, and binlog replay. Any step failure marks the backup invalid. If three consecutive validations fail for a shard, an alarm is raised. Cold backups are validated via checksum comparison, assuming they were already verified when warm.

10. Open Issues and Future Outlook

With RBR becoming production‑ready, Facebook envisions a logical incremental backup that captures only the non‑redundant portion of daily changes. Assuming the 80/20 rule (20% of users generate 80% of changes), a 100 GB database could reduce daily raw backup size to ~4 GB, and after compression to ~500 MB, saving millions of dollars.

References:

Massively Distributed Backups at Facebook Scale

Under the Hood: Automated backups

Making full table scan 10x faster in InnoDB (http://yoshinorimatsunobu.blogspot.com/2013/10/making-full-table-scan-10x-faster-in.html)

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

mysqlFacebookHDFSlogical backupRBRIncremental Backup
ITPUB
Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.