How Google Music Recovered 1.5 PB of Lost Data After a Massive Deletion Bug
In March 2012, a privacy‑driven deletion pipeline mistakenly erased hundreds of thousands of Google Music files, prompting SREs to launch a massive data‑recovery effort that involved MapReduce impact analysis, tape‑based backups, and a complete redesign of the deletion system.
The book SRE: Google Operations Secrets, translated by senior Google SRE Sun Yucong, provides a deep dive into this incident.
Google Music – March 2012: Detection of an Accidental Deletion Incident
This incident highlighted logistical challenges of massive data storage, such as where to keep 5,000 tape cartridges and how to read data from offline media within a reasonable timeframe.
1. Discovering the Problem: Disaster Strikes
A Google Music user reported that previously playable songs could no longer be streamed. The support team escalated the issue to engineers, who initially fixed a missing pointer in the song metadata.
Engineers, however, continued probing and discovered that the data integrity damage was caused by a privacy‑focused deletion pipeline that had removed the audio files.
When the true cause of the data loss was found, engineers were shocked: the pipeline was designed to delete massive amounts of audio data as quickly as possible.
2. Assessing Severity
Google’s privacy policy requires that music files and their metadata be fully deleted within a reasonable time after a user requests removal.
As Google Music usage grew exponentially, the original deletion mechanism was replaced with a redesigned pipeline in 2012.
On February 6, the redesigned pipeline ran for the first time without apparent issues, so engineers approved its second stage, which actually removed the audio data.
When the problem resurfaced, a high‑priority alert was issued, the pipeline was halted, and a small SRE team was formed to investigate.
Manual inspection of millions of metadata records was impossible, so the team quickly wrote a MapReduce job to gauge the impact.
On March 8 the job finished, revealing that roughly 600,000 audio files (affecting about 21,000 users) had been mistakenly deleted.
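The kind of impact analysis described above can be sketched in a few lines. This is only an illustration of the map and reduce steps, not the job the team actually ran: the record layout (user_id, track_id, audio_key) and the function names are assumptions made for this example, and a real job would run across many machines rather than in one process.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

# Hypothetical record layout: (user_id, track_id, audio_key).
MetadataRecord = Tuple[str, str, str]


def map_missing(record: MetadataRecord,
                existing_audio: Set[str]) -> Iterator[Tuple[str, str]]:
    """Map step: emit (user_id, track_id) when the referenced audio is gone."""
    user_id, track_id, audio_key = record
    if audio_key not in existing_audio:
        yield user_id, track_id


def reduce_by_user(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Reduce step: group the missing tracks by affected user."""
    missing: Dict[str, List[str]] = defaultdict(list)
    for user_id, track_id in pairs:
        missing[user_id].append(track_id)
    return dict(missing)


def assess_impact(records: Iterable[MetadataRecord],
                  existing_audio: Set[str]) -> Dict[str, int]:
    """Count missing files and affected users across all metadata records."""
    pairs = (pair
             for record in records
             for pair in map_missing(record, existing_audio))
    by_user = reduce_by_user(pairs)
    return {
        "missing_files": sum(len(tracks) for tracks in by_user.values()),
        "affected_users": len(by_user),
    }
```

Fed with every metadata record and the set of audio keys still present in storage, a job of this shape yields the two headline numbers: how many files are missing and how many users are affected.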
Is there any hope of restoring the data?
3. Solving the Problem
· Locating the Bug and Parallel Data Recovery
The first step was to pinpoint the root cause; without fixing it, any recovery effort would be futile.
While there was pressure to re-enable the deletion pipeline so that users' deletion requests would still be honored, the team also had to recover the lost data from tape backups, which were stored off-site on thousands of cartridges.
Two groups were formed: senior SREs handled data recovery, while developers analyzed the deletion logic.
The first batch of over 500,000 files was selected for recovery; the tape‑backup team received the recovery notice on March 8 at 4:34 PM PST.
Fortunately, a recent disaster‑recovery drill had produced a new recovery tool that accelerated the process.
Using this new tool, the team began matching each audio file to its corresponding tape and then to the physical cartridge.
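The matching step can be pictured as inverting a backup index so that each tape is recalled once and all of its files are read in a single pass. The sketch below is an assumption-laden illustration, not the recovery tool itself: the index format (audio key mapped to a list of tape IDs, primary copy first) and the name plan_recall are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple


def plan_recall(
    missing_files: Iterable[str],
    backup_index: Dict[str, List[str]],
    unavailable_tapes: FrozenSet[str] = frozenset(),
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Group missing files by the tape to recall for them.

    backup_index is a hypothetical mapping: audio key -> tape IDs holding
    a copy, primary first. Files with no usable copy are returned
    separately so they can be handled some other way.
    """
    files_by_tape: Dict[str, List[str]] = defaultdict(list)
    not_restorable: List[str] = []
    for audio_key in missing_files:
        copies = backup_index.get(audio_key, [])
        # Pick the first copy whose tape is not known to be lost or damaged.
        tape = next((t for t in copies if t not in unavailable_tapes), None)
        if tape is None:
            not_restorable.append(audio_key)
        else:
            files_by_tape[tape].append(audio_key)
    return dict(files_by_tape), not_restorable
```

Running the plan again with a tape added to unavailable_tapes moves its files onto a redundant copy where one exists, which is roughly what a second trip for replacement tapes amounts to.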
The first batch required transporting about 5,000 tapes by truck back to the data center, clearing space for them, and manually loading them into the tape library.
Because of the prior drill, manual loading proved faster than the robotic system.
After three hours the tape library resumed operation and began writing the recovered files to distributed storage.
By March 10, 74 % of the 436,223 files from 3,475 tapes had been successfully transferred.
Some tapes were lost or damaged, requiring additional trips to retrieve redundant copies.
By March 11, over 99.95 % of the recovery tasks were complete, and the remaining redundant tapes were being fetched.
Production alerts unrelated to the data loss still consumed two days of the recovery team’s time.
By March 13 the full 436,223 files were accessible to users, completing the seven‑day recovery effort (five days of actual data restoration).
· Second Batch Recovery
After the first batch, the team focused on the remaining 161,000 files, which had been deleted before any backup of them existed. Most were store-bought or promotional tracks and were quickly reinstated.
A small subset of user‑uploaded files required the client software to automatically re‑upload the missing content over the following week.
· Fixing the Root Cause
The team eventually identified a bug in the redesigned deletion pipeline. Large‑scale offline data processing systems delete data in multiple stages, each touching different storage services.
When many machines run these stages in parallel, they can create data‑race conditions, especially if short‑term data lifetimes are not carefully tuned.
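For example, two stages might be deliberately separated by three hours to simplify the logic and enable parallelism, on the assumption that the earlier stage has always finished by the time the later one starts.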
As data volumes grew, the original timing assumptions broke down, increasing the likelihood of race‑induced deletions.
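The article does not describe the exact bug, so the following is only a generic illustration of this class of race, not a reconstruction of the Google Music pipeline. It assumes a hypothetical second stage that receives a list of deletion candidates computed hours earlier, and contrasts trusting that stale list with re-checking current references just before each delete. All names here are invented for the example.

```python
from typing import Dict, Iterable, List


def stage2_unsafe(candidates: Iterable[str],
                  audio_store: Dict[str, bytes]) -> None:
    """Trusts a candidate list computed hours earlier by the first stage."""
    for key in candidates:
        audio_store.pop(key, None)


def stage2_guarded(candidates: Iterable[str],
                   audio_store: Dict[str, bytes],
                   live_references: Dict[str, int]) -> List[str]:
    """Re-checks the current reference count right before each delete.

    Returns the keys that were skipped because something still
    references them.
    """
    skipped: List[str] = []
    for key in candidates:
        if live_references.get(key, 0) > 0:
            skipped.append(key)  # state changed since the first stage ran
            continue
        audio_store.pop(key, None)
    return skipped
```

In a real distributed system the check and the delete would also need to be atomic, or the delete would need to be reversible for a grace period. The sketch only shows why a fixed delay between stages is a weaker guarantee than validating state at the moment of deletion.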
After the incident, Google Music rebuilt the pipeline to eliminate the race condition and strengthened monitoring and alerting to catch similar issues before they affect users.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.