
From Data Deletion to Not Running Away – Building a Reliable Database Backup Platform

After costly data-deletion mishaps, 37 Interactive Entertainment built a multi-region backup platform. It evolved from simple cron scripts to streaming xtrabackup backups dispatched through Celery task queues, with encrypted HDFS/S3 storage, automated rotation, and restore verification to protect against high-impact data loss.

37 Interactive Technology Team

In February 2017, a GitLab.com engineer accidentally deleted about 300 GB of production data. The site's five layers of backup all failed, and the recovery effort was live-streamed on YouTube. In September 2018, a senior engineer at the data center of Shunfeng (SF) Technology mistakenly deleted a production database, causing a 590-minute service outage.

These incidents highlight three common causes of data loss: lack of proper approval in change processes, insufficient testing, and failure to enforce separation of duties. Effective backups, together with regular verification that they can actually be restored, are what let a team recover from a deletion instead of “running away.”

This article describes how the database backup platform at 37 Interactive Entertainment was built and evolved.

01 From Data Deletion to Not Running Away

The article first outlines the seriousness of database deletions and the need for robust backup verification.

02 Evolution of the Backup System

Early backups relied on crontab + shell scripts with email alerts. This simple approach worked for fewer than 100 instances but showed limitations as the number of databases grew:

Inflexible backup scheduling and frequency.

Operational overhead of deploying separate scripts for each IDC.

Difficulty providing audit logs for compliance.

After multiple iterations, the current platform takes full backups as xtrabackup streaming backups and, when needed, combines xtrabackup incremental backups with binlog replay for point-in-time recovery.
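As a rough illustration of how these pieces fit together, the sketch below builds the command lines for a streamed full backup, an LSN-based incremental backup, and a binlog replay. The flags are standard Percona XtraBackup and mysqlbinlog options, but the paths, LSN value, and helper names are assumptions made for this example, not details taken from the platform.

```python
from typing import List, Optional


def xtrabackup_cmd(target_dir: str, incremental_lsn: Optional[int] = None) -> List[str]:
    """Build an xtrabackup command that streams the backup to stdout."""
    cmd = ["xtrabackup", "--backup", "--stream=xbstream", f"--target-dir={target_dir}"]
    if incremental_lsn is not None:
        # Incremental: copy only pages changed after this log sequence number.
        cmd.append(f"--incremental-lsn={incremental_lsn}")
    return cmd


def binlog_replay_cmd(binlog_files: List[str], start_position: int, stop_datetime: str) -> List[str]:
    """Build a mysqlbinlog command whose output can be piped into mysql."""
    return [
        "mysqlbinlog",
        f"--start-position={start_position}",  # applies to the first file listed
        f"--stop-datetime={stop_datetime}",    # stop just before the bad statement
        *binlog_files,
    ]
```

In practice the stdout of the xtrabackup process would be piped into whatever compression, encryption, and upload steps the platform uses, for example through `subprocess.Popen`.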

03 Overall Backup Architecture

The business runs across multiple IDC locations and cloud providers (Tencent Cloud, Alibaba Cloud, AWS), spanning several time zones. Compliance requirements demand encrypted, compressed backups with multiple recent versions retained, and sometimes permanent retention, leading to storage needs exceeding 100 TB per region.

To address storage constraints, the team deployed HDFS in its IDCs and integrated with AWS S3, Alibaba OSS, and Tencent CAS.
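One way to picture the multi-backend setup is a small routing table that maps where an instance runs to where its backup is written. This is only a sketch: the rule that each provider writes to its own storage service, the provider keys, the URI schemes, and the path layout are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical mapping from hosting location to storage backend.
STORAGE_BY_PROVIDER = {
    "idc": "hdfs",
    "aws": "s3",
    "alibaba": "oss",
    "tencent": "cas",
}


@dataclass(frozen=True)
class BackupTarget:
    scheme: str
    path: str

    def uri(self) -> str:
        return f"{self.scheme}://{self.path}"


def target_for(provider: str, region: str, instance: str, version: str) -> BackupTarget:
    """Pick the storage backend for a backup; raises KeyError for unknown providers."""
    scheme = STORAGE_BY_PROVIDER[provider]
    return BackupTarget(scheme, f"db-backup/{region}/{instance}/{version}.xbstream")
```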

3.1 Backup Dispatch Process

A distributed task queue (Celery) implements a producer‑consumer model to handle concurrent backups, dynamic scaling, and error callbacks. Celery uses RabbitMQ as the broker; an earlier Redis broker caused task hangs, so RabbitMQ was adopted for stability.
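The dispatch model can be sketched with the standard library alone. Note that this is a stand-in, not Celery: a thread pool plays the role of the workers, `max_workers` plays the role of worker concurrency, and `on_error` plays the role of an error callback. The function and parameter names are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_backups(
    instances: List[str],
    backup: Callable[[str], str],
    on_error: Callable[[str, BaseException], None],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Dispatch one backup job per instance and collect the results.

    max_workers caps how many backups run at once; a failure is passed to
    on_error instead of aborting the whole batch.
    """
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Producer side: enqueue every job. Leaving the block waits for all of them.
        futures = {name: pool.submit(backup, name) for name in instances}
    for name, future in futures.items():
        error = future.exception()
        if error is not None:
            on_error(name, error)
        else:
            results[name] = future.result()
    return results
```

With a real broker, the producer and the consumers would run in separate processes or hosts, which is what makes dynamic scaling possible; the thread pool above only mimics the shape of the flow.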

3.2 Backup Rotation Process

Backup versions are periodically rotated according to cleanup rules. Changes to backup paths trigger notifications to the OPS platform to mark versions as “invalid,” which is crucial for automated restore verification.
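A minimal sketch of one possible cleanup rule follows: keep the N most recent versions of an instance, never expire versions flagged for permanent retention, and return the expired ones so they can be deleted and reported as invalid. The `BackupVersion` fields and the rule itself are assumptions; the platform's actual rules are not described in the source.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class BackupVersion:
    instance: str
    taken_at: str          # ISO 8601 timestamp, so string order matches time order
    path: str
    permanent: bool = False


def rotate(versions: List[BackupVersion], keep_recent: int) -> Tuple[List[BackupVersion], List[BackupVersion]]:
    """Split one instance's versions into (kept, expired), newest first."""
    newest_first = sorted(versions, key=lambda v: v.taken_at, reverse=True)
    kept: List[BackupVersion] = []
    expired: List[BackupVersion] = []
    recent_kept = 0
    for version in newest_first:
        if version.permanent:
            kept.append(version)      # permanent versions never count against the quota
        elif recent_kept < keep_recent:
            kept.append(version)
            recent_kept += 1
        else:
            expired.append(version)
    return kept, expired
```

The `expired` list is what a rotation job would remove from storage and then send to the OPS platform, so that restore verification never picks a version whose files are gone.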

3.3 Automated Backup Restore Verification

Beyond creating backups, the team schedules periodic restores to ensure backup validity. Full restores of large instances can be time‑consuming; a future article will present a faster recovery method.
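The verification loop might look roughly like the sketch below: pick the newest version that has not been marked invalid, attempt a restore, and record the outcome. The `restore` callable is a placeholder for the real restore procedure (for example, preparing the backup and starting a scratch MySQL instance), and all names here are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Backup:
    instance: str
    taken_at: str          # ISO 8601 timestamp
    path: str
    invalid: bool = False  # set when rotation has removed or moved the files


def pick_for_verification(backups: Iterable[Backup]) -> Optional[Backup]:
    """Return the newest backup that is still marked valid, or None."""
    valid = [b for b in backups if not b.invalid]
    return max(valid, key=lambda b: b.taken_at, default=None)


def verify(backup: Backup, restore: Callable[[str], bool]) -> dict:
    """Run a restore attempt and report the outcome without raising."""
    try:
        ok = restore(backup.path)
        detail = "restored" if ok else "restore check failed"
    except Exception as exc:  # a crashed restore counts as a failed verification
        ok, detail = False, f"restore error: {exc}"
    return {"instance": backup.instance, "path": backup.path, "ok": ok, "detail": detail}
```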

04 Conclusion

The article summarizes the evolution from simple scripts to a full‑featured backup platform, addressing challenges such as network bandwidth limits (by implementing concurrency control and off‑peak scheduling), storage bottlenecks (by adopting distributed storage), and MySQL version upgrades (by supporting multiple versions). The platform also now backs up other components like Redis and TTServer.
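Because the instances span several time zones, off-peak scheduling has to be evaluated in each region's local time. The check below is a minimal sketch under assumed values: the 02:00 to 06:00 window and the use of fixed UTC offsets (rather than named time zones with daylight saving rules) are illustrative choices, not the platform's settings.

```python
from datetime import datetime, time, timedelta, timezone


def in_off_peak(
    now_utc: datetime,
    utc_offset_hours: int,
    start: time = time(2, 0),
    end: time = time(6, 0),
) -> bool:
    """True if the region's local time falls inside [start, end).

    The window must not wrap past midnight (start < end).
    """
    local = now_utc.astimezone(timezone(timedelta(hours=utc_offset_hours)))
    return start <= local.time() < end
```

A scheduler could call this before enqueueing a backup and defer the job when it returns False, which complements the concurrency cap in limiting bandwidth use during business hours.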

Although the investment in backup infrastructure often shows little direct ROI, it is essential for mitigating the low‑probability but high‑impact risk of data loss. As the author puts it, “Respect production, respect Murphy’s law.”
