How Qunar Revamped Its Database Backup System for 27× Speed Gains
Qunar senior DBA Qian Fangyuan describes the redesign of the company's database backup and recovery platform: the shortcomings of the legacy system, the new architecture, high‑performance backup with Xtrabackup, dynamic throttling, storage abstraction, and the resulting gains in backup and restore speed and reliability.
Background
Database backups are essential for protecting data against accidental modification or deletion. The legacy Qunar backup system moved data in several hops: backups were first transferred to a middle‑node server, then uploaded to MFS storage, and finally encrypted by OPS and moved to object storage. As data volume grew, this architecture exposed reliability, performance, scalability, and manageability problems.
Identified Issues
Heavy reliance on SSH stability; network jitter caused batch failures.
Uniform 30 MB/s speed limit prevented high‑performance servers from utilizing their bandwidth.
Middle‑node server became a bottleneck; insufficient disk space delayed backups.
No lifecycle management for backup files; cleanup policies were coarse.
Manual recovery was slow and required OPS intervention.
Long backup windows and risk of incomplete backups if the pipeline failed.
Solution Options
Patch the old system – fix bugs but the codebase is fragmented and costly to maintain.
Re‑architect the backup‑restore system – address root causes and support future growth.
The team chose the second option.
New System Design
Platform Architecture
The platform is layered into plugin, plugin‑control, basic services (including the BackupRestore module), advanced services, access, and presentation layers. The BackupRestore module is the first functional component of the new platform.
Backup/Restore Workflow
Scheduler – generates periodic backup and recovery‑drill tasks.
Task system – selects target instances, sends execution requests to the Plugin, records results.
Plugin – performs the actual backup or restore, monitors task status.
The scheduler is a generic framework; backup and restore tasks are specific implementations.
High‑Performance Backup
The backup process copies a snapshot from the source (MySQL, Redis, etc.) through a channel to storage. Using Xtrabackup for MySQL, the flow is:
/* Xtrabackup backup flow */
1. Check compatibility and start Redo copy thread.
2. Copy ibd files in parallel.
3. Issue FLUSH TABLES WITH READ LOCK (or a lighter‑weight backup lock such as LOCK INSTANCE FOR BACKUP on MySQL 8.0).
4. Copy non‑InnoDB files (MyISAM, .frm).
5. Record Binlog, GTID, LSN metadata.
6. Finish Redo copy, release lock, copy ib_buffer_pool.
7. Backup ends.
Two file formats are supported:
tar – simple concatenation of files; sequential read/write; limited parallelism.
xbstream – Xtrabackup‑specific format containing file name, offset, checksum; enables parallel read/write and higher transfer throughput.
Because xbstream allows parallel processing, it is the preferred format for high‑speed backups.
Compression and Encryption
Three implementation paths were evaluated:
Use Xtrabackup’s built‑in quicklz (or newer zstd/lz4) compression and AES‑128/192/256 encryption.
Compress/encrypt in the transmission channel after Xtrabackup output.
Leverage storage‑side transparent compression/encryption (if supported).
The final choice was Xtrabackup’s native quicklz compression and AES‑256 encryption, satisfying security requirements while keeping the pipeline simple.
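As a rough illustration of the chosen path, the sketch below assembles a command line for a streamed, compressed, encrypted backup. The `BackupOptions` type and `XtrabackupArgs` function are hypothetical names introduced here, not part of Qunar's platform. The option names (`--backup`, `--stream`, `--target-dir`, `--parallel`, `--compress`, `--encrypt`, `--encrypt-key-file`) are Percona XtraBackup options, but the available compression algorithms and defaults vary by release, so they should be checked against the installed version.

```go
package main

import "strconv"

// BackupOptions holds hypothetical settings for one streamed backup.
type BackupOptions struct {
	TargetDir      string // scratch directory used by xtrabackup while streaming
	Parallel       int    // number of parallel copy threads; <= 1 leaves the default
	Compress       bool   // enable xtrabackup's built-in compression
	EncryptKeyFile string // path to the AES key file; empty disables encryption
}

// XtrabackupArgs builds the argument list for an xbstream backup.
// A bare --compress selects the release's default algorithm
// (quicklz on older releases); this is an assumption to verify locally.
func XtrabackupArgs(o BackupOptions) []string {
	args := []string{"--backup", "--stream=xbstream", "--target-dir=" + o.TargetDir}
	if o.Parallel > 1 {
		args = append(args, "--parallel="+strconv.Itoa(o.Parallel))
	}
	if o.Compress {
		args = append(args, "--compress")
	}
	if o.EncryptKeyFile != "" {
		args = append(args, "--encrypt=AES256", "--encrypt-key-file="+o.EncryptKeyFile)
	}
	return args
}
```

The resulting slice could be passed to `exec.Command("xtrabackup", args...)`, with the command's standard output serving as the stream that the upload channel consumes.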
Efficient Upload
Object storage accepts chunked uploads. The new channel reads the Xtrabackup stream, splits it into fixed‑size blocks, assigns sequential IDs, buffers them in a ring buffer, and uploads blocks concurrently. This design decouples reading from uploading, maximises throughput, and prevents unbounded memory growth.
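The read/upload decoupling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not Qunar's implementation: a buffered Go channel stands in for the ring buffer (the reader blocks when it is full, which bounds memory), and `UploadFunc` is a hypothetical placeholder for whatever "upload one part" call the object store provides. `Block`, `UploadStream`, and `UploadBytes` are likewise names invented for this example.

```go
package main

import (
	"bytes"
	"io"
	"sync"
)

// Block is one fixed-size piece of the backup stream with a sequential ID.
type Block struct {
	ID   int
	Data []byte
}

// UploadFunc is a hypothetical stand-in for the object store's part-upload call.
type UploadFunc func(b Block) error

// UploadStream splits src into blockSize pieces and uploads them with
// `workers` goroutines. At most `buffered` blocks wait in the queue, so the
// reader stalls instead of growing memory when uploads fall behind.
// It returns the number of blocks read and the first error seen.
func UploadStream(src io.Reader, blockSize, buffered, workers int, upload UploadFunc) (int, error) {
	blocks := make(chan Block, buffered)
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		firstErr error
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			// Keep draining after an error so the reader never deadlocks.
			for b := range blocks {
				if err := upload(b); err != nil {
					mu.Lock()
					if firstErr == nil {
						firstErr = err
					}
					mu.Unlock()
				}
			}
		}()
	}

	count := 0
	var readErr error
	for {
		buf := make([]byte, blockSize)
		n, err := io.ReadFull(src, buf)
		if n > 0 {
			count++
			blocks <- Block{ID: count, Data: buf[:n]}
		}
		if err == io.EOF || err == io.ErrUnexpectedEOF {
			break // end of stream; the last block may be short
		}
		if err != nil {
			readErr = err
			break
		}
	}
	close(blocks)
	wg.Wait()
	if readErr != nil {
		return count, readErr
	}
	return count, firstErr
}

// UploadBytes is a convenience wrapper for trying the sketch with in-memory data.
func UploadBytes(data []byte, blockSize, buffered, workers int, upload UploadFunc) (int, error) {
	return UploadStream(bytes.NewReader(data), blockSize, buffered, workers, upload)
}
```

The sequential `ID` lets the storage side reassemble parts in order even though uploads finish out of order. A production version would likely stop reading after the first upload failure and reuse block buffers instead of allocating one per block.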
Dynamic Throttling (Rate Limiting)
To keep backup impact invisible to online services, the system monitors CPU, I/O, network, and MySQL thread metrics. For each metric, a threshold, unit, and trend are computed. The expected duration for the next block (expect) is compared with the actual duration (spend). If spend < expect, the system sleeps for expect − spend; otherwise it proceeds immediately. The overall speed is the minimum of the speeds calculated for all metrics.
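The pacing rule above can be written down in a few lines. The sketch below is an illustration only: `EffectiveSpeed` and `PaceDelay` are hypothetical names, speeds are assumed to be in bytes per second, and treating a non-positive speed as "no limit" is an assumption of this example, not something the source specifies.

```go
package main

import "time"

// EffectiveSpeed returns the lowest of the per-metric speed limits
// (bytes per second). An empty map yields 0, treated here as "no limit".
func EffectiveSpeed(perMetric map[string]float64) float64 {
	lowest, found := 0.0, false
	for _, v := range perMetric {
		if !found || v < lowest {
			lowest, found = v, true
		}
	}
	return lowest
}

// PaceDelay returns how long to sleep after a block of blockBytes took
// `spend` to transfer, so that the average rate does not exceed speed.
// expect is the time the block should have taken at the target speed.
func PaceDelay(blockBytes int, speed float64, spend time.Duration) time.Duration {
	if speed <= 0 {
		return 0 // assumption: non-positive speed means unthrottled
	}
	expect := time.Duration(float64(blockBytes) / speed * float64(time.Second))
	if spend < expect {
		return expect - spend
	}
	return 0
}
```

A transfer loop would call `time.Sleep(PaceDelay(len(block), speed, elapsed))` after each block, recomputing `speed` with `EffectiveSpeed` whenever fresh metrics arrive.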
Four algorithms can compute the new speed:
rand – random value between current speed and threshold (not used in production).
times – multiply current speed by a factor when decreasing.
dichotomy – binary search between current speed and threshold.
fixed – add/subtract a fixed increment when increasing.
The system uses fixed for acceleration and times for deceleration, achieving rapid response to resource spikes.
Storage Abstraction
A generic Storage interface allows plugging in object storage, non‑object storage, or a transit gateway without changing the backup logic. Implementations only need to provide methods for opening/closing clients, reading/writing files, directory statistics, creation, deletion, and cleanup.
type Storage interface {
    // Open and close the storage client
    OpenClient() error
    CloseClient() error

    // File operations with optional configuration
    OpenFileReaderConfig(filePath string, config interface{}) error
    Read(p []byte) (int, error)
    CloseFileReader() error
    OpenFileWriterConfig(filePath string, config interface{}) error
    Write(p []byte) (int, error)
    Flush() error
    CloseFileWriter() error

    // Directory utilities
    DirectoryStats(dirPath string, arg interface{}) (DirectoryStat, error)
    MakeDirectory(dirPath string) error
    Remove(dirPath string) error
    Clean() error
}
Any storage backend that implements this interface can be used transparently.
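To make the idea concrete, here is a sketch of a local-disk backend that covers the file read/write half of the interface plus `Remove`. `LocalStorage` and `NewTempLocalStorage` are hypothetical names for this example; the client, directory-statistics, and cleanup methods are omitted, so this type does not yet satisfy the full `Storage` interface, and the `config` argument is simply ignored.

```go
package main

import (
	"bufio"
	"errors"
	"os"
	"path/filepath"
)

// LocalStorage is a hypothetical backend that keeps backup files under Root.
type LocalStorage struct {
	Root string
	in   *os.File
	out  *os.File
	buf  *bufio.Writer
}

// NewTempLocalStorage roots the backend in a fresh temporary directory.
func NewTempLocalStorage() (*LocalStorage, error) {
	dir, err := os.MkdirTemp("", "backup-demo-")
	if err != nil {
		return nil, err
	}
	return &LocalStorage{Root: dir}, nil
}

// OpenFileWriterConfig creates the file (and parent directories) for writing.
func (s *LocalStorage) OpenFileWriterConfig(filePath string, config interface{}) error {
	full := filepath.Join(s.Root, filePath)
	if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
		return err
	}
	f, err := os.Create(full)
	if err != nil {
		return err
	}
	s.out, s.buf = f, bufio.NewWriter(f)
	return nil
}

func (s *LocalStorage) Write(p []byte) (int, error) {
	if s.buf == nil {
		return 0, errors.New("no file open for writing")
	}
	return s.buf.Write(p)
}

func (s *LocalStorage) Flush() error {
	if s.buf == nil {
		return nil
	}
	return s.buf.Flush()
}

// CloseFileWriter flushes buffered data, then closes the file.
func (s *LocalStorage) CloseFileWriter() error {
	if s.out == nil {
		return nil
	}
	flushErr := s.buf.Flush()
	closeErr := s.out.Close()
	s.out, s.buf = nil, nil
	if flushErr != nil {
		return flushErr
	}
	return closeErr
}

func (s *LocalStorage) OpenFileReaderConfig(filePath string, config interface{}) error {
	f, err := os.Open(filepath.Join(s.Root, filePath))
	if err != nil {
		return err
	}
	s.in = f
	return nil
}

func (s *LocalStorage) Read(p []byte) (int, error) {
	if s.in == nil {
		return 0, errors.New("no file open for reading")
	}
	return s.in.Read(p)
}

func (s *LocalStorage) CloseFileReader() error {
	if s.in == nil {
		return nil
	}
	err := s.in.Close()
	s.in = nil
	return err
}

// Remove deletes a file or directory tree below Root.
func (s *LocalStorage) Remove(dirPath string) error {
	return os.RemoveAll(filepath.Join(s.Root, dirPath))
}
```

Because the backup logic only calls the interface methods, an object-storage backend could replace this one by implementing the same method set with multipart uploads behind `Write` and `CloseFileWriter`.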
Task Notification
During long‑running backup or restore, the Plugin periodically sends progress updates containing transferred bytes and estimated completion. Notifications are triggered either after a configured data volume or after a time interval, allowing external systems to monitor task health without polling.
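The two triggers (data volume or elapsed time) can be combined in a small helper. The sketch below is illustrative: `ProgressNotifier`, `ShouldNotify`, and `ETA` are hypothetical names, and estimating completion from the average speed so far is an assumption, since the source does not say how the estimate is derived.

```go
package main

import "time"

// ProgressNotifier decides when a progress update is due.
type ProgressNotifier struct {
	EveryBytes    int64         // notify after this many new bytes (0 disables)
	EveryInterval time.Duration // notify after this much time (0 disables)
	lastBytes     int64
	lastAt        time.Duration
}

// ShouldNotify reports whether an update is due, given the total bytes
// transferred and the time elapsed since the task started. When it returns
// true, both triggers are reset.
func (p *ProgressNotifier) ShouldNotify(totalBytes int64, elapsed time.Duration) bool {
	dueByVolume := p.EveryBytes > 0 && totalBytes-p.lastBytes >= p.EveryBytes
	dueByTime := p.EveryInterval > 0 && elapsed-p.lastAt >= p.EveryInterval
	if !dueByVolume && !dueByTime {
		return false
	}
	p.lastBytes, p.lastAt = totalBytes, elapsed
	return true
}

// ETA estimates the remaining time from the average speed so far.
// The boolean is false when no estimate is possible yet.
func ETA(totalBytes, doneBytes int64, elapsed time.Duration) (time.Duration, bool) {
	if doneBytes <= 0 || totalBytes < doneBytes {
		return 0, false
	}
	perByte := float64(elapsed) / float64(doneBytes)
	return time.Duration(perByte * float64(totalBytes-doneBytes)), true
}
```

The plugin would call `ShouldNotify` after each uploaded block and push a message only when it returns true, so slow transfers still report on the time trigger and fast ones are not flooded with updates.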
Performance Results
Benchmarks on a production‑grade server showed:
Non‑encrypted backup throughput ≈ 860 MB/s (≈ 2 TB/h), a 27× improvement over the legacy 30 MB/s.
Encrypted backup throughput ≈ 420 MB/s, a 13× improvement.
Encrypted restore throughput ≈ 300 MB/s, a 10× improvement.
Dynamic throttling tests demonstrated immediate speed reduction when simulated network load exceeded the 80 MB/s threshold, and rapid recovery to the target speed once the load subsided.
Additional Considerations
Sparse Cleanup Policy – configurable retention windows (e.g., keep all backups for the last 10 days, then keep one every 3 days for days 10‑90, etc.) to extend recoverable history with minimal storage cost.
Generic Storage Layer – enables seamless migration between cloud providers or on‑premise storage without code changes.
Applicability of Dynamic Throttling – the throttling module can be reused for any workload that requires adaptive speed control while preserving service stability.
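The sparse cleanup policy mentioned above can be expressed as a list of retention tiers. The sketch below is one possible formulation, not the platform's actual rule set: `Tier` and `Keep` are hypothetical names, and days are represented as plain day numbers (for example, days since some fixed epoch). The thinning decision uses the backup's own day number rather than its age, so a backup that survives today is not deleted tomorrow merely because its age changed.

```go
package main

// Tier says: for backups up to MaxAgeDays old, keep one per EveryDays days.
type Tier struct {
	MaxAgeDays int
	EveryDays  int // 1 keeps every backup
}

// Keep reports whether the backup taken on backupDay should be retained
// when evaluated on day `today`. Tiers must be ordered by MaxAgeDays.
// Backups older than the last tier are dropped.
func Keep(backupDay, today int, tiers []Tier) bool {
	age := today - backupDay
	if age < 0 {
		return true // clock skew: never delete a backup from the "future"
	}
	for _, t := range tiers {
		if age <= t.MaxAgeDays {
			every := t.EveryDays
			if every < 1 {
				every = 1
			}
			return backupDay%every == 0
		}
	}
	return false
}
```

With tiers `{10, 1}` and `{90, 3}`, every backup from the last 10 days is kept and only backups taken on day numbers divisible by 3 are kept up to day 90. If further tiers are added, choosing each `EveryDays` as a multiple of the previous one keeps the retained sets nested.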
Overall, the rebuilt backup‑restore system delivers high‑performance, low‑impact backups and fast restores, meeting Qunar’s scalability and reliability goals.
dbaplus Community
