
How Qunar Revamped Its Database Backup System for 27× Speed Gains

This article describes how Qian Fangyuan, a senior DBA at Qunar, redesigned the company's database backup and recovery platform. It covers the shortcomings of the legacy system, the new architecture, high-performance backup with Xtrabackup, dynamic throttling, storage abstraction, and the resulting gains in backup and restore speed and reliability.

dbaplus Community

Background

Database backups are essential for protecting data against accidental modification or deletion. The legacy Qunar backup system used a multi‑step transfer through a middle‑node server and then uploaded the data to MFS storage before OPS encrypted and moved it to object storage. As data volume grew, this architecture exposed reliability, performance, scalability, and manageability problems.

Identified Issues

Heavy reliance on SSH stability; network jitter caused batch failures.

Uniform 30 MB/s speed limit prevented high‑performance servers from utilizing their bandwidth.

Middle‑node server became a bottleneck; insufficient disk space delayed backups.

No lifecycle management for backup files; cleanup policies were coarse.

Manual recovery was slow and required OPS intervention.

Long backup windows and risk of incomplete backups if the pipeline failed.

Solution Options

Patch the old system – fixes individual bugs, but the codebase is fragmented and costly to maintain.

Re‑architect the backup‑restore system – address root causes and support future growth.

The team chose the second option.

New System Design

Platform Architecture

The platform is layered into plugin, plugin‑control, basic services (including the BackupRestore module), advanced services, access, and presentation layers. The BackupRestore module is the first functional component of the new platform.

Backup/Restore Workflow

Scheduler – generates periodic backup and recovery‑drill tasks.

Task system – selects target instances, sends execution requests to the Plugin, records results.

Plugin – performs the actual backup or restore, monitors task status.

The scheduler is a generic framework; backup and restore tasks are specific implementations.
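
The division of labour described above can be sketched as follows. This is an illustration only: the Task interface, the two task types, and runAll are hypothetical names, and Run merely prints a line where the real system would send an execution request to the Plugin.

```go
package main

import "fmt"

// Task is all the generic scheduler needs to know about a job.
// Hypothetical interface, not Qunar's actual API.
type Task interface {
	Name() string
	Run(instance string) error
}

type backupTask struct{}

func (backupTask) Name() string { return "backup" }
func (backupTask) Run(instance string) error {
	fmt.Println("backup requested for", instance) // real system: call the Plugin
	return nil
}

type recoveryDrillTask struct{}

func (recoveryDrillTask) Name() string { return "recovery-drill" }
func (recoveryDrillTask) Run(instance string) error {
	fmt.Println("recovery drill requested for", instance)
	return nil
}

// Result records the outcome of one task on one instance.
type Result struct {
	Task, Instance string
	Err            error
}

// runAll plays the role of the task system: for every scheduled task it
// walks the target instances, dispatches the work, and records the result.
func runAll(tasks []Task, instances []string) []Result {
	var results []Result
	for _, t := range tasks {
		for _, inst := range instances {
			results = append(results, Result{Task: t.Name(), Instance: inst, Err: t.Run(inst)})
		}
	}
	return results
}

func main() {
	results := runAll([]Task{backupTask{}, recoveryDrillTask{}}, []string{"mysql-01", "mysql-02"})
	fmt.Println(len(results), "results recorded")
}
```

Because the scheduler depends only on the interface, adding a new task type would not require touching the dispatch loop.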

High‑Performance Backup

The backup process copies a snapshot from the source (MySQL, Redis, etc.) through a channel to storage. Using Xtrabackup for MySQL, the flow is:

/* Xtrabackup backup flow */
1. Check compatibility and start Redo copy thread.
2. Copy ibd files in parallel.
3. Issue FLUSH TABLES WITH READ LOCK (or a lighter backup lock such as LOCK INSTANCE FOR BACKUP on MySQL 8.0).
4. Copy non‑InnoDB files (MyISAM, .frm).
5. Record Binlog, GTID, LSN metadata.
6. Finish Redo copy, release lock, copy ib_buffer_pool.
7. Backup ends.

Two file formats are supported:

tar – simple concatenation of files; sequential read/write; limited parallelism.

xbstream – Xtrabackup‑specific format containing file name, offset, checksum; enables parallel read/write and higher transfer throughput.

Because xbstream allows parallel processing, it is the preferred format for high‑speed backups.

Compression and Encryption

Three implementation paths were evaluated:

Use Xtrabackup’s built‑in quicklz (or newer zstd/lz4) compression and AES‑128/192/256 encryption.

Compress/encrypt in the transmission channel after Xtrabackup output.

Leverage storage‑side transparent compression/encryption (if supported).

The final choice was Xtrabackup’s native quicklz compression and AES‑256 encryption, satisfying security requirements while keeping the pipeline simple.
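
As a rough sketch of what that choice means on the command line, the Go snippet below assembles an xtrabackup argument list for a streamed, compressed, encrypted backup. The flags are standard Percona XtraBackup options; the BackupOptions struct, the thread counts, and the file paths are assumptions made for illustration, not Qunar's actual configuration.

```go
package main

import (
	"fmt"
	"strings"
)

// BackupOptions holds the knobs for one backup run.
// All field names are hypothetical.
type BackupOptions struct {
	Parallel   int    // threads used for copying, compressing, encrypting
	Compress   bool   // enable xtrabackup's built-in compression
	EncryptKey string // path to the AES key file; empty disables encryption
	TargetDir  string // scratch directory passed to xtrabackup
}

// xtrabackupArgs builds the argument list for a streaming backup.
func xtrabackupArgs(o BackupOptions) []string {
	args := []string{
		"--backup",
		"--stream=xbstream",
		fmt.Sprintf("--parallel=%d", o.Parallel),
		"--target-dir=" + o.TargetDir,
	}
	if o.Compress {
		args = append(args,
			"--compress=quicklz",
			fmt.Sprintf("--compress-threads=%d", o.Parallel))
	}
	if o.EncryptKey != "" {
		args = append(args,
			"--encrypt=AES256",
			"--encrypt-key-file="+o.EncryptKey,
			fmt.Sprintf("--encrypt-threads=%d", o.Parallel))
	}
	return args
}

func main() {
	args := xtrabackupArgs(BackupOptions{
		Parallel:   4,
		Compress:   true,
		EncryptKey: "/etc/backup/aes.key", // illustrative path
		TargetDir:  "/tmp/xb",             // illustrative path
	})
	fmt.Println("xtrabackup " + strings.Join(args, " "))
}
```

The resulting stream is already compressed and encrypted when it leaves xtrabackup, so the transfer channel only has to move bytes.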

Efficient Upload

Object storage accepts chunked uploads. The new channel reads the Xtrabackup stream, splits it into fixed‑size blocks, assigns sequential IDs, buffers them in a ring buffer, and uploads blocks concurrently. This design decouples reading from uploading, maximises throughput, and prevents unbounded memory growth.
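
A minimal sketch of this read/upload decoupling is shown below. It is not the article's implementation: a buffered Go channel stands in for the ring buffer, the upload callback stands in for an object-storage part upload, and all names and sizes are assumptions.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"sync"
)

// Block is one fixed-size piece of the backup stream.
type Block struct {
	ID   int // sequential ID, starting at 1
	Data []byte
}

// uploadStream splits r into blockSize chunks and uploads them with the given
// number of worker goroutines. The buffered channel stands in for the ring
// buffer: when it is full the reader blocks, so memory use stays bounded.
func uploadStream(r io.Reader, blockSize, buffered, workers int, upload func(Block) error) (int, error) {
	blocks := make(chan Block, buffered)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for b := range blocks {
				if err := upload(b); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
				}
			}
		}()
	}

	count := 0
	var readErr error
	for {
		buf := make([]byte, blockSize)
		n, err := io.ReadFull(r, buf)
		if err != nil && err != io.EOF && err != io.ErrUnexpectedEOF {
			readErr = err
			break
		}
		if n > 0 {
			count++
			blocks <- Block{ID: count, Data: buf[:n]}
		}
		if err != nil {
			break // EOF, or a short final block
		}
	}
	close(blocks)
	wg.Wait()
	if readErr != nil {
		return count, readErr
	}
	return count, firstErr
}

// demo uploads 10000 bytes in 4096-byte blocks to a fake uploader that only
// counts bytes, and returns the block count and byte total.
func demo() (int, int, error) {
	var mu sync.Mutex
	total := 0
	n, err := uploadStream(bytes.NewReader(make([]byte, 10000)), 4096, 8, 4, func(b Block) error {
		mu.Lock()
		defer mu.Unlock()
		total += len(b.Data)
		return nil
	})
	return n, total, err
}

func main() {
	fmt.Println(demo())
}
```

The sequential IDs let the storage side reassemble blocks in order even though workers finish out of order. For brevity this sketch keeps reading after an upload error and reports only the first one; a production channel would likely cancel early and retry failed blocks.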

Dynamic Throttling (Rate Limiting)

To keep the backup's impact invisible to online services, the system monitors CPU, I/O, network, and MySQL thread metrics. For each metric it computes a threshold, a unit, and a trend. For each block, the expected duration at the current speed (expect) is compared with the actual duration (spend). If spend < expect, the system sleeps for expect − spend; otherwise it proceeds immediately. The overall speed is the minimum of the speeds calculated for the individual metrics.
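
The expect/spend comparison can be written down in a few lines. The sketch below assumes speeds are expressed in bytes per second and that each metric has already produced its own speed; the function names and metric keys are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// pause returns how long to sleep after transferring one block so that the
// effective rate does not exceed speed (bytes per second). expect is the time
// the block should take at that speed; spend is the time it actually took.
func pause(blockBytes int, speed float64, spend time.Duration) time.Duration {
	if speed <= 0 {
		return 0 // treat a non-positive speed as "no limit" in this sketch
	}
	expect := time.Duration(float64(blockBytes) / speed * float64(time.Second))
	if spend < expect {
		return expect - spend
	}
	return 0
}

// overallSpeed picks the most restrictive of the per-metric speeds.
func overallSpeed(perMetric map[string]float64) float64 {
	first := true
	lowest := 0.0
	for _, s := range perMetric {
		if first || s < lowest {
			lowest, first = s, false
		}
	}
	return lowest
}

func main() {
	speed := overallSpeed(map[string]float64{
		"cpu":           200e6,
		"io":            120e6,
		"network":       80e6,
		"mysql_threads": 150e6,
	})
	// An 8 MB block at 80 MB/s should take 100 ms. If it took only 40 ms,
	// the sender sleeps for the remaining 60 ms.
	fmt.Println(pause(8000000, speed, 40*time.Millisecond))
}
```

Pacing per block in this way keeps the limiter cheap: it needs no background timer, only one subtraction and an optional sleep per block.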

Four algorithms can compute the new speed:

rand – random value between current speed and threshold (not used in production).

times – multiply current speed by a factor when decreasing.

dichotomy – binary search between current speed and threshold.

fixed – add/subtract a fixed increment when increasing.

The system uses fixed for acceleration and times for deceleration, achieving rapid response to resource spikes.
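
The combination of a fixed increment on the way up and a multiplicative cut on the way down resembles the additive-increase/multiplicative-decrease pattern used in TCP congestion control. A minimal sketch follows; the Limiter type, its fields, and the numbers in main are illustrative assumptions, not values from the article.

```go
package main

import "fmt"

// Limiter adjusts the transfer speed for one metric.
type Limiter struct {
	Threshold float64 // metric value above which the backup must slow down
	Step      float64 // "fixed": amount added to the speed while under threshold
	Factor    float64 // "times": multiplier (< 1) applied while over threshold
	Min, Max  float64 // bounds on the speed
}

// Next returns the speed for the next interval, given the current speed and
// the latest metric sample.
func (l Limiter) Next(speed, metric float64) float64 {
	if metric > l.Threshold {
		speed *= l.Factor // decelerate multiplicatively
	} else {
		speed += l.Step // accelerate by a fixed increment
	}
	if speed < l.Min {
		speed = l.Min
	}
	if speed > l.Max {
		speed = l.Max
	}
	return speed
}

func main() {
	l := Limiter{Threshold: 80, Step: 10, Factor: 0.5, Min: 5, Max: 100}
	speed := 60.0
	// Simulated metric samples: a load spike followed by recovery.
	for _, metric := range []float64{50, 90, 95, 60, 40, 30} {
		speed = l.Next(speed, metric)
		fmt.Printf("metric=%.0f -> speed=%.0f\n", metric, speed)
	}
}
```

Halving on every over-threshold sample backs off quickly under a spike, while the fixed step ramps up gently once the load subsides.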

Storage Abstraction

A generic Storage interface allows plugging in object storage, non‑object storage, or a transit gateway without changing the backup logic. Implementations only need to provide methods for opening/closing clients, reading/writing files, directory statistics, creation, deletion, and cleanup.

type Storage interface {
    // Open and close the storage client
    OpenClient() error
    CloseClient() error

    // File operations with optional configuration
    OpenFileReaderConfig(filePath string, config interface{}) error
    Read(p []byte) (int, error)
    CloseFileReader() error
    OpenFileWriterConfig(filePath string, config interface{}) error
    Write(p []byte) (int, error)
    Flush() error
    CloseFileWriter() error

    // Directory utilities
    DirectoryStats(dirPath string, arg interface{}) (DirectoryStat, error)
    MakeDirectory(dirPath string) error
    Remove(dirPath string) error
    Clean() error
}

Any storage backend that implements this interface can be used transparently.
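
To illustrate why this keeps the backup logic backend-agnostic, the sketch below writes a stream through the write-side methods only. The fileWriter subset, the saveStream helper, and the in-memory memStorage backend are all hypothetical and exist only for this example; the method signatures are copied from the interface above.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
)

// fileWriter is the write-side subset of the Storage interface. Any type that
// implements the full interface satisfies it automatically.
type fileWriter interface {
	OpenFileWriterConfig(filePath string, config interface{}) error
	Write(p []byte) (int, error)
	Flush() error
	CloseFileWriter() error
}

// saveStream copies r into filePath on the given backend without referring to
// any concrete storage type.
func saveStream(s fileWriter, filePath string, r io.Reader) (n int64, err error) {
	if err = s.OpenFileWriterConfig(filePath, nil); err != nil {
		return 0, err
	}
	defer func() {
		// Always close; keep the first error if there already is one.
		if cerr := s.CloseFileWriter(); err == nil {
			err = cerr
		}
	}()
	if n, err = io.Copy(s, r); err != nil {
		return n, err
	}
	return n, s.Flush()
}

// memStorage is a toy in-memory backend used only for this demo.
type memStorage struct {
	files   map[string][]byte
	current string
}

func (m *memStorage) OpenFileWriterConfig(filePath string, _ interface{}) error {
	if m.files == nil {
		m.files = map[string][]byte{}
	}
	m.current = filePath
	m.files[filePath] = nil
	return nil
}

func (m *memStorage) Write(p []byte) (int, error) {
	m.files[m.current] = append(m.files[m.current], p...)
	return len(p), nil
}

func (m *memStorage) Flush() error           { return nil }
func (m *memStorage) CloseFileWriter() error { m.current = ""; return nil }

// demo stores a 12-byte payload and returns bytes copied and bytes stored.
func demo() (int64, int, error) {
	store := &memStorage{}
	n, err := saveStream(store, "backups/demo.xbstream", bytes.NewReader([]byte("backup bytes")))
	return n, len(store.files["backups/demo.xbstream"]), err
}

func main() {
	fmt.Println(demo())
}
```

Swapping memStorage for an object-storage or gateway implementation would leave saveStream unchanged.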

Task Notification

During long‑running backup or restore, the Plugin periodically sends progress updates containing transferred bytes and estimated completion. Notifications are triggered either after a configured data volume or after a time interval, allowing external systems to monitor task health without polling.
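
The two triggers can be combined in a small helper, sketched below under stated assumptions: the Notifier type and eta function are hypothetical, time is measured as elapsed duration since the task started, and the estimate simply assumes that the average speed so far continues.

```go
package main

import (
	"fmt"
	"time"
)

// Notifier decides when a progress update is due: after every ByteStep bytes
// or every Interval, whichever comes first.
type Notifier struct {
	ByteStep int64
	Interval time.Duration

	lastBytes int64         // bytes transferred at the last notification
	lastSent  time.Duration // elapsed time at the last notification
}

// Due reports whether a notification should be sent now and, if so, records
// the current position as the new baseline.
func (n *Notifier) Due(totalBytes int64, elapsed time.Duration) bool {
	if totalBytes-n.lastBytes >= n.ByteStep || elapsed-n.lastSent >= n.Interval {
		n.lastBytes, n.lastSent = totalBytes, elapsed
		return true
	}
	return false
}

// eta estimates the remaining time from the average speed so far.
func eta(totalBytes, doneBytes int64, elapsed time.Duration) time.Duration {
	if doneBytes <= 0 {
		return 0 // nothing transferred yet, no basis for an estimate
	}
	return time.Duration(float64(elapsed) * float64(totalBytes-doneBytes) / float64(doneBytes))
}

func main() {
	n := &Notifier{ByteStep: 1 << 30, Interval: 30 * time.Second}
	const total = 10 << 30
	// Simulated samples: bytes done and seconds elapsed.
	samples := []struct {
		done int64
		secs int
	}{{200 << 20, 5}, {1100 << 20, 12}, {1300 << 20, 20}, {1500 << 20, 45}}
	for _, s := range samples {
		elapsed := time.Duration(s.secs) * time.Second
		if n.Due(s.done, elapsed) {
			fmt.Printf("progress: %d bytes, about %v remaining\n", s.done, eta(total, s.done, elapsed))
		}
	}
}
```

The volume trigger keeps updates frequent on fast transfers, while the time trigger guarantees a heartbeat when a transfer stalls or is heavily throttled.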

Performance Results

Benchmarks on a production‑grade server showed:

Non‑encrypted backup throughput ≈ 860 MB/s (≈ 2 TB/h), a 27× improvement over the legacy 30 MB/s.

Encrypted backup throughput ≈ 420 MB/s, a 13× improvement.

Encrypted restore throughput ≈ 300 MB/s, a 10× improvement.

Dynamic throttling tests demonstrated immediate speed reduction when simulated network load exceeded the 80 MB/s threshold, and rapid recovery to the target speed once the load subsided.

Additional Considerations

Sparse Cleanup Policy – configurable retention windows (e.g., keep all backups for the last 10 days, then keep one every 3 days for days 10‑90, etc.) to extend recoverable history with minimal storage cost.

Generic Storage Layer – enables seamless migration between cloud providers or on‑premise storage without code changes.

Applicability of Dynamic Throttling – the throttling module can be reused for any workload that requires adaptive speed control while preserving service stability.
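
The sparse cleanup policy mentioned above can be expressed as a list of retention tiers. The sketch below encodes only the two windows given as examples (everything for 10 days, then one backup every 3 days up to day 90) and, as an assumption, deletes anything older. The Tier type and keep function are hypothetical.

```go
package main

import "fmt"

// Tier keeps one backup every EveryDays for backups up to MaxAgeDays old.
// EveryDays must be positive.
type Tier struct {
	MaxAgeDays int
	EveryDays  int
}

// keep reports whether a backup taken on backupDay survives cleanup on today.
// Days are counted from any fixed epoch. Thinning is keyed on the backup's own
// day number rather than its age, so the decision stays stable for as long as
// the backup remains in the same tier.
func keep(backupDay, today int, tiers []Tier) bool {
	age := today - backupDay
	for _, t := range tiers {
		if age <= t.MaxAgeDays {
			return backupDay%t.EveryDays == 0
		}
	}
	return false // older than every tier: assumed to be deleted in this sketch
}

func main() {
	tiers := []Tier{
		{MaxAgeDays: 10, EveryDays: 1}, // keep every daily backup
		{MaxAgeDays: 90, EveryDays: 3}, // then one backup every 3 days
	}
	today := 1000
	kept := 0
	for day := today - 120; day <= today; day++ {
		if keep(day, today, tiers) {
			kept++
		}
	}
	fmt.Println("daily backups kept out of the last 121:", kept)
}
```

Further tiers, such as one backup per month beyond day 90, would be additional entries in the slice rather than new code.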

Overall, the rebuilt backup‑restore system delivers high‑performance, low‑impact backups and fast restores, meeting Qunar’s scalability and reliability goals.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
