
Design and Implementation of a High‑Performance Database Backup and Recovery System

This article presents a comprehensive analysis of the shortcomings of an existing database backup solution and details the architecture, high‑performance backup and restore mechanisms, dynamic throttling, storage abstraction, and experimental results of a newly designed, scalable backup‑recovery platform for MySQL databases.

Qunar Tech Salon

Backup is essential for database systems to recover from accidental data loss, but the legacy backup pipeline suffered from reliability, performance, scalability, and security issues as data volumes grew.

Background: The old system used a middle‑node transfer model with a fixed 30 MB/s throttle, heavy reliance on SSH, and a post‑process that encrypted backups and stored them in MFS before uploading them to object storage.

Problems identified:

SSH instability caused batch failures.

Uniform speed limit ignored high‑performance servers.

Middle‑node bottleneck and limited storage capacity.

No lifecycle management for backup files.

Manual, slow recovery process.

Long end‑to‑end backup time due to many stages.

Poor extensibility for new database versions and storage back‑ends.

Two solution paths were considered: patching the old system (high cost, limited impact) or rebuilding the system from scratch. The redesign was chosen.

Design Overview:

Three‑layer platform architecture (plugin, plugin‑control, basic services) with independent deployment.

Backup‑restore module placed in the basic services layer.

High‑Performance Backup:

Backup flow: source → channel → storage. The source reads data, the channel streams it, and the storage writes it.
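
The flow above can be sketched as a pair of callback interfaces, so that sources, channels, and storage back‑ends can be swapped independently. This is an illustrative sketch, not the platform's actual code; `channel_pump` and the in‑memory stages are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

/* A source produces bytes (returning 0 at EOF); a storage consumes them. */
typedef struct {
    size_t (*read)(void *ctx, char *buf, size_t cap);
    void *ctx;
} source_t;

typedef struct {
    size_t (*write)(void *ctx, const char *buf, size_t len);
    void *ctx;
} storage_t;

/* The channel pumps data from source to storage until EOF,
   returning the total number of bytes moved. */
static size_t channel_pump(source_t *src, storage_t *dst,
                           char *buf, size_t cap)
{
    size_t total = 0, n;
    while ((n = src->read(src->ctx, buf, cap)) > 0)
        total += dst->write(dst->ctx, buf, n);
    return total;
}

/* Toy in-memory stages, standing in for XtraBackup and object storage. */
typedef struct { const char *data; size_t len, pos; } mem_src_t;
typedef struct { char out[256]; size_t len; } mem_dst_t;

static size_t mem_read(void *ctx, char *buf, size_t cap)
{
    mem_src_t *s = ctx;
    size_t n = s->len - s->pos;
    if (n > cap) n = cap;
    memcpy(buf, s->data + s->pos, n);
    s->pos += n;
    return n;
}

static size_t mem_write(void *ctx, const char *buf, size_t len)
{
    mem_dst_t *d = ctx;
    memcpy(d->out + d->len, buf, len);
    d->len += len;
    return len;
}
```

Because the channel only sees the two function pointers, adding a new engine or storage back‑end does not touch the transfer logic.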

Source side uses Percona XtraBackup. The backup proceeds through redo‑log copying, ibd file copying, the global FTWRL (FLUSH TABLES WITH READ LOCK), copying of non‑InnoDB files, metadata capture, and finalization.

Two backup file formats are supported:

/* tar Header Block, from POSIX 1003.1-1990. */
/* POSIX header. */
struct posix_header {
  char name[100];               /*   0 */
  char mode[8];                 /* 100 */
  char uid[8];                  /* 108 */
  char gid[8];                  /* 116 */
  char size[12];                /* 124 */
  char mtime[12];               /* 136 */
  char chksum[8];               /* 148 */
  char typeflag;                /* 156 */
  char linkname[100];           /* 157 */
  char magic[6];                /* 257 */
  char version[2];              /* 263 */
  char uname[32];               /* 265 */
  char gname[32];               /* 297 */
  char devmajor[8];             /* 329 */
  char devminor[8];             /* 337 */
  char prefix[155];             /* 345 */
};

The tar format is sequential; the xbstream format stores file chunks with offsets, enabling parallel read/write.

/* xbstream chunk descriptor, as defined in Percona XtraBackup. */
typedef struct {
  uchar flags;
  xb_chunk_type_t type;
  uint pathlen;
  char path[FN_REFLEN];
  size_t length;
  size_t raw_length;
  my_off_t offset;
  my_off_t checksum_offset;
  void *data;
  void *raw_data;
  ulong checksum;
  ulong checksum_part;
  size_t buflen;
  size_t sparse_map_alloc_size;
  size_t sparse_map_size;
  ds_sparse_chunk_t *sparse_map;
} xb_rstream_chunk_t;

Because xbstream allows parallel processing, it is chosen for high‑throughput backups.
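
A minimal sketch of why offset‑addressed chunks permit out‑of‑order, parallel application (assuming POSIX `pwrite`; `chunk_t` and `apply_chunk` are hypothetical names, not Percona's implementation):

```c
#include <fcntl.h>
#include <unistd.h>

/* A simplified chunk: each one knows where it lands in the target file,
   so no global write cursor is needed. */
typedef struct {
    off_t offset;      /* destination offset inside the target file */
    size_t length;     /* payload length */
    const char *data;  /* payload */
} chunk_t;

/* Apply one chunk. Safe to call concurrently for non-overlapping
   chunks, since pwrite() does not share a file position. */
static ssize_t apply_chunk(int fd, const chunk_t *c)
{
    return pwrite(fd, c->data, c->length, c->offset);
}
```

A tar stream, by contrast, must be consumed strictly front to back, which is why it cannot be parallelized the same way.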

Compression and Encryption:

XtraBackup native quicklz (or zstd/lz4 in newer versions) with AES‑128/192/256 encryption.

Channel‑side compression/encryption for engines lacking native support.

Optional storage‑side transparent compression/encryption.

Given security requirements (keys managed by the security team) and the need for fast compression, the native XtraBackup approach was selected.
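
Under these choices, a backup command might look like the following sketch. The paths, thread counts, and the `backup-channel` uploader are placeholders; the `--stream`, `--compress`, and `--encrypt` options are standard Percona XtraBackup flags.

```shell
# Hypothetical invocation: stream an xbstream backup with native
# compression and AES-256 encryption into the upload channel.
xtrabackup --backup \
  --stream=xbstream \
  --parallel=8 \
  --compress --compress-threads=8 \
  --encrypt=AES256 --encrypt-key-file=/etc/backup/keyfile \
  --encrypt-threads=8 \
  --target-dir=/tmp/backup \
| backup-channel --dest=object-storage   # hypothetical channel uploader
```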

Efficient Upload Channel:

The channel splits the incoming stream into numbered blocks, stores them in a ring buffer, and uploads them concurrently to object storage, allowing out‑of‑order block transmission.
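
A single‑threaded sketch of the numbered‑block ring buffer (hypothetical names; the real channel would guard `ring_put`/`ring_get` with locks or atomics for concurrent uploaders):

```c
#include <stddef.h>
#include <string.h>

#define RING_SLOTS 8
#define BLOCK_SIZE 4096

/* Each block carries a sequence number, so uploaders may drain
   blocks out of order and storage can still reassemble the stream. */
typedef struct {
    unsigned long seq;   /* monotonically increasing block number */
    size_t len;
    char data[BLOCK_SIZE];
} block_t;

typedef struct {
    block_t slots[RING_SLOTS];
    unsigned long head;  /* next block number to produce */
    unsigned long tail;  /* next block number to consume */
} ring_t;

/* Enqueue one block; returns -1 if the ring is full (backpressure). */
static int ring_put(ring_t *r, const char *data, size_t len)
{
    if (r->head - r->tail == RING_SLOTS) return -1;
    block_t *b = &r->slots[r->head % RING_SLOTS];
    b->seq = r->head;
    b->len = len;
    memcpy(b->data, data, len);
    r->head++;
    return 0;
}

/* Dequeue the oldest block; returns -1 if the ring is empty. */
static int ring_get(ring_t *r, block_t *out)
{
    if (r->tail == r->head) return -1;
    *out = r->slots[r->tail % RING_SLOTS];
    r->tail++;
    return 0;
}
```

A full ring naturally throttles the producer, which is what keeps the channel's memory footprint bounded while uploads are in flight.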

To avoid impacting the database, a dynamic throttling mechanism monitors CPU, I/O, network, and memory usage. If a resource exceeds its threshold, the system computes a new expect time for each block and sleeps accordingly, effectively limiting the transfer rate.

Dynamic throttling uses multiple algorithms (rand, times, dichotomy, fixed) to adjust speed based on real‑time trends, ensuring the backup proceeds as fast as possible without harming the live service.
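
The core of the "expect time" computation can be sketched as follows (hypothetical helper; the real system derives the rate cap from its monitoring algorithms):

```c
#include <stddef.h>

/* A block of block_bytes at a cap of rate_bps (bytes per second)
   should take at least block_bytes / rate_bps seconds. If the block
   finished early, return the microseconds the sender must still
   sleep before moving on; otherwise return 0. */
static long throttle_wait_us(size_t block_bytes, double rate_bps,
                             double elapsed_s)
{
    double expect_s = (double)block_bytes / rate_bps;
    double remain_s = expect_s - elapsed_s;
    return remain_s > 0 ? (long)(remain_s * 1e6) : 0;
}
```

Lowering `rate_bps` when a monitored resource crosses its threshold lengthens the expect time, and the per‑block sleeps add up to the desired rate limit.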

Task Notification:

During long‑running backups, progress notifications are sent either by transferred data size or by elapsed time, allowing operators to track task status.
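
Both triggers can be folded into one check, sketched here with hypothetical names:

```c
/* Fire a progress notification either every notify_bytes of
   transferred data or every notify_secs of wall time, whichever
   comes first. The caller updates sent and now before each check. */
typedef struct {
    unsigned long long sent, last_sent;  /* bytes transferred */
    double now, last_time;               /* wall-clock seconds */
    unsigned long long notify_bytes;     /* size threshold */
    double notify_secs;                  /* time threshold */
} progress_t;

static int progress_due(progress_t *p)
{
    if (p->sent - p->last_sent >= p->notify_bytes ||
        p->now - p->last_time >= p->notify_secs) {
        p->last_sent = p->sent;          /* reset both baselines */
        p->last_time = p->now;
        return 1;
    }
    return 0;
}
```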

High‑Performance Restore:

Restore mirrors the backup flow; xbstream requires XtraBackup 8.0+ for streaming extraction. Phase‑wise timing metrics are recorded to identify further optimization opportunities.

Results:

Non‑encrypted backup throughput increased from ~30 MB/s (old system) to ~860 MB/s (new system), a >27× improvement.

Encrypted backup throughput rose to ~420 MB/s, roughly a 14× gain over the 30 MB/s baseline.

Restore speed grew from ~30 MB/s to ~300 MB/s, a 10× improvement.

Dynamic throttling kept resource usage within safe limits while maintaining high throughput.

Conclusion:

The redesigned backup‑recovery platform delivers orders‑of‑magnitude performance gains, full lifecycle management, storage‑agnostic interfaces, and adaptive throttling, making it suitable for large‑scale MySQL deployments and extensible to other database engines.

Tags: MySQL, database backup, high performance, XtraBackup, dynamic throttling, storage abstraction
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.