Mastering Database Backup: Strategies, Retention, and Rapid Recovery
This article explains a comprehensive database backup workflow—including daily full backups with xtrabackup, real‑time binlog incremental backups, multi‑level retention policies, automated failure detection, disaster‑recovery across multiple data centers, and fast table‑level restoration—to help prevent prolonged outages.
Backup Mechanism
We perform a full backup every 24 hours combined with real‑time binlog incremental backups.
Full Backup
The full backup uses the mature xtrabackup tool. Before each backup, a strategy‑update program examines the current cluster topology and slave status to select the appropriate instance and dynamically chooses between local backup and remote streaming backup.
Local backup: After the full backup completes, the data is encrypted, compressed, and transferred to storage.
Remote streaming backup: Uses xbstream to stream the backup directly to remote storage.
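As a sketch of how a strategy program might choose between the two modes, the following builds the corresponding xtrabackup command line. The 2x disk-headroom rule, the paths, and the ssh transport are illustrative assumptions, not the platform's actual policy:

```python
import shlex

def build_backup_command(disk_free_bytes, backup_size_estimate, remote_host=None):
    """Build an xtrabackup command, choosing local vs. streaming mode.

    Assumption: local backup is viable when free disk exceeds twice the
    estimated backup size; otherwise stream to a remote host via xbstream.
    """
    base = ["xtrabackup", "--backup", "--slave-info"]
    if remote_host is None or disk_free_bytes > 2 * backup_size_estimate:
        # Enough local disk: back up locally, then encrypt, compress, and ship.
        return " ".join(base + ["--target-dir=/backup/full"])
    # Tight on disk: stream the backup straight to remote storage.
    return (" ".join(base + ["--stream=xbstream", "--target-dir=/tmp"])
            + f" | ssh {shlex.quote(remote_host)} 'xbstream -x -C /backup/full'")
```

The `--slave-info` flag records the replication position alongside the backup, which is what later makes point-in-time replay possible.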
Incremental Backup
Real‑time binlog backups are performed with mysqlbinlog --read-from-remote-server, streaming binlog data to storage as it is generated.
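A minimal sketch of assembling that invocation is shown below; the host, user, and destination directory are placeholders, and credential handling is omitted:

```python
def build_binlog_command(host, user, first_binlog, dest_dir):
    """Build a mysqlbinlog command that streams binlogs from a remote server.

    --raw writes binlog files verbatim (no SQL decoding) into dest_dir,
    and --stop-never keeps the stream open as the server rotates logs.
    """
    return [
        "mysqlbinlog",
        "--read-from-remote-server",
        "--raw",
        "--stop-never",
        f"--host={host}",
        f"--user={user}",
        f"--result-file={dest_dir}/",
        first_binlog,  # e.g. "mysql-bin.000001": the first log to fetch
    ]
```

The returned list can be handed to `subprocess.Popen` and supervised so the stream is restarted if the connection drops.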
Backup Retention Policy
Full backups follow a 4‑2‑2‑1 retention scheme, keeping the most recent 4 days, 2 weeks, 2 months, and 1 year (a total of 9 copies) to maximize recovery capability.
Incremental binlog backups retain the last 60 days of binlog data, enabling point‑in‑time recovery to any moment within that window.
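One way to express the 4‑2‑2‑1 scheme in code is the sketch below; the exact retention anchor points (7, 14, 30, 60, 365 days) are assumptions chosen to yield the article's nine copies:

```python
from datetime import date, timedelta

def backups_to_keep(today, all_backup_dates):
    """Select full backups to retain under a 4-2-2-1 scheme:
    the 4 most recent dailies, plus one copy near each of
    2 weekly, 2 monthly, and 1 yearly retention points."""
    dates = sorted(set(all_backup_dates), reverse=True)
    keep = set(dates[:4])  # most recent 4 daily backups

    def newest_at_or_before(target):
        # Newest backup taken on or before the target date, if any.
        for d in dates:
            if d <= target:
                return d
        return None

    for days_back in (7, 14, 30, 60, 365):  # 2 weekly, 2 monthly, 1 yearly
        d = newest_at_or_before(today - timedelta(days=days_back))
        if d is not None:
            keep.add(d)
    return sorted(keep, reverse=True)
```

With a full daily history this retains exactly nine copies; with gaps it falls back to the nearest older backup at each retention point.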
Backup Failure Detection
Each backup step is monitored; any failure records an error code in the database for DBA troubleshooting. Key checks include:
Confirming that the backup instance is a slave, so the backup does not impact the production master.
Verifying replication sync status.
Detecting page‑corruption errors.
Comparing MD5 checksums of randomly sampled data files before and after transfer.
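The MD5 spot-check could look like the following sketch; the sample size, the error-code string, and the shape of the checksum maps are assumed interfaces, not the platform's actual ones:

```python
import hashlib
import random

def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large data files fit in memory."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_sample(local_md5s, remote_md5s, sample_size=3, rng=random):
    """Compare checksums of a random sample of transferred files.

    local_md5s / remote_md5s map file path -> hex digest.
    Returns an error code on mismatch, None on success.
    """
    sample = rng.sample(sorted(local_md5s), min(sample_size, len(local_md5s)))
    for path in sample:
        if local_md5s[path] != remote_md5s.get(path):
            return "ERR_MD5_MISMATCH"
    return None
```

Returning an error code rather than raising mirrors the article's approach of recording codes in a database for DBA troubleshooting.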
Daily statistics are compiled, and any database instance that experiences two consecutive backup failures is highlighted on the HULK platform for immediate handling by on‑call staff.
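The "two consecutive failures" rule is simple to express; in this sketch, `history` maps an instance name to its backup results in chronological order (True = success), which is an assumed data shape:

```python
def instances_to_escalate(history):
    """Return instances whose two most recent backups both failed.

    history: dict of instance name -> list of booleans, newest last.
    """
    flagged = []
    for instance, results in history.items():
        if len(results) >= 2 and not results[-1] and not results[-2]:
            flagged.append(instance)
    return sorted(flagged)
```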
Disaster‑Recovery Strategy
To guard against regional incidents such as fiber cuts or power outages, backup targets are distributed across multiple IDC locations nationwide, ensuring data remains recoverable even if a single site fails.
Rapid Recovery
Backups are stored per table in compressed packages. During restoration, only the required table files and essential metadata are transferred and decompressed, dramatically reducing recovery time.
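A sketch of selecting only one table's files from a per-table archive is shown below; the `schema/table.*` layout and the `backup-my.cnf` metadata name are assumptions about the archive format:

```python
import tarfile

def members_to_extract(member_names, schema, table):
    """Keep only the target table's files plus essential metadata."""
    wanted_prefix = f"{schema}/{table}."
    return [n for n in member_names
            if n.startswith(wanted_prefix) or n == "backup-my.cnf"]

def restore_table(archive_path, schema, table, dest_dir):
    """Extract a single table (and metadata) from a backup archive."""
    with tarfile.open(archive_path, "r:*") as tar:
        all_names = [m.name for m in tar.getmembers()]
        wanted = set(members_to_extract(all_names, schema, table))
        members = [m for m in tar.getmembers() if m.name in wanted]
        tar.extractall(path=dest_dir, members=members)
    return sorted(wanted)
```

Because only the needed members are decompressed, restore time scales with the size of the affected table rather than the whole instance.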
The restoration process is fully automated via the HULK platform’s command‑execution system: users submit a self‑service restore request for a specific point in time, and the system restores the data into a temporary instance that the business can use to replace the failed production instance.
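Conceptually, a point-in-time restore starts from the newest full backup taken before the target time and replays binlogs forward. The planning step can be sketched as below, where `binlogs` is a list of `(start_time, name)` tuples, an assumed shape:

```python
def binlogs_to_replay(full_backup_time, target_time, binlogs):
    """Select the binlogs needed to roll forward from a full backup.

    A binlog is needed if it may contain events after the backup time
    and it starts at or before the restore target.
    """
    if full_backup_time > target_time:
        raise ValueError("full backup is newer than the restore target")
    ordered = sorted(binlogs)
    needed = []
    for i, (start, name) in enumerate(ordered):
        next_start = ordered[i + 1][0] if i + 1 < len(ordered) else None
        ends_after_backup = next_start is None or next_start > full_backup_time
        if start <= target_time and ends_after_backup:
            needed.append(name)
    return needed
```

The actual replay would then pipe the selected files through `mysqlbinlog --stop-datetime=<target>` into the temporary instance.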
Conclusion
A robust backup strategy, thorough detection mechanisms, and reliable monitoring are essential, but they must be complemented by administrators’ awareness and regular restore drills. Future posts will dive deeper into the technical details of the implementation.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.