
Qunar.com Redis Automation Operations System: Architecture, Deployment, Migration, Scaling, and Inspection

This article details Qunar.com's Redis automation operations system, covering background challenges, the high‑availability cluster architecture, resource management, automated deployment, various migration strategies, scaling mechanisms with RedisGate, inspection processes, and future AI‑driven enhancements.

Qunar Tech Salon

Background: The COVID‑19 pandemic reduced traffic, prompting Qunar to shrink its Redis resources. The rapid recovery that followed put those resources under pressure and created the need for automated operations.

Current Redis usage: more than 15,000 instances at the hundred‑TB scale, with memory utilization above 60% and an instance count that has grown 50% since the pandemic.

Challenges: rapid traffic growth, increased latency from cloud migration, containerization leading to more connections, frequent machine‑level failures, and the necessity for cross‑datacenter disaster recovery.

Automation system architecture: Resources are managed via an OPS service tree, and each Redis server node runs a master‑slave pair. A ZooKeeper cluster stores cluster names and notifies clients of configuration changes. A Sentinel cluster provides high availability, failover, and configuration updates. A MySQL PXC configuration center stores shard information. Client applications read the topology from ZooKeeper to connect to the appropriate shard.
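The notification path in this architecture can be illustrated with a small sketch. It uses an in-memory registry as a stand-in for ZooKeeper so that it runs without a server; the class names (TopologyRegistry, ShardedClient), the cluster name, and the layout of the topology data are all assumptions made for illustration, not Qunar's actual client or schema.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Address = Tuple[str, int]  # (host, port) of a shard's current master
Callback = Callable[[List[Address]], None]


@dataclass
class TopologyRegistry:
    """In-memory stand-in for the ZooKeeper nodes that hold cluster topology."""
    shards: Dict[str, List[Address]] = field(default_factory=dict)
    watchers: Dict[str, List[Callback]] = field(default_factory=dict)

    def watch(self, cluster: str, callback: Callback) -> None:
        """Register a client callback and deliver the current topology once."""
        self.watchers.setdefault(cluster, []).append(callback)
        callback(list(self.shards.get(cluster, [])))

    def publish(self, cluster: str, masters: List[Address]) -> None:
        """Store a new topology (for example after a failover) and notify clients."""
        self.shards[cluster] = list(masters)
        for callback in self.watchers.get(cluster, []):
            callback(list(masters))


class ShardedClient:
    """Hypothetical client that keeps its view of shard masters up to date."""

    def __init__(self, registry: TopologyRegistry, cluster: str) -> None:
        self.masters: List[Address] = []
        registry.watch(cluster, self._on_change)

    def _on_change(self, masters: List[Address]) -> None:
        # A real client would also rebuild its connection pools here.
        self.masters = masters


registry = TopologyRegistry()
registry.publish("order_cache", [("10.0.0.1", 6379), ("10.0.0.2", 6379)])
client = ShardedClient(registry, "order_cache")

# Simulate a failover of shard 0: the new master is published and pushed to the client.
registry.publish("order_cache", [("10.0.0.3", 6379), ("10.0.0.2", 6379)])
```

In a real deployment the registry role would be played by ZooKeeper watches, and the publish step would be triggered by the Sentinel cluster's configuration update rather than by a direct call.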

Key automation modules:

Resource management – the Redis resource pool is managed through OPS; dbaAgent collects machine information, updates instance deployment status, and provides operational scripts.

Cluster deployment – a custom agent replaces salt‑minion, enabling parameterized, rule‑based deployment (e.g., memory limits, instance‑to‑CPU ratios, machine selection criteria).

Automated migration – supports single‑instance, multi‑instance, resource‑balancing, cluster‑wide, and whole‑machine migrations, all adhering to the same deployment rules.

Scaling (expansion/shrink) – data migration between old and new clusters using client‑side sharding; RedisGate middleware synchronizes data and facilitates a seamless switch.

Inspection system – aggregates metrics from monitoring and agents, analyzes risk thresholds, and provides early warnings to prevent failures.
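To make the rule-based placement used by cluster deployment and migration in the modules above more concrete, here is a minimal sketch of a machine selection step. The threshold values, the Host fields, and the choice to put a master and its replica on two different hosts are assumptions for illustration; the article does not state the actual rules.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Host:
    name: str
    cpu_cores: int
    mem_total_gb: float
    mem_used_gb: float = 0.0
    instances: int = 0


# Hypothetical rule values; the real limits are not given in the article.
MAX_MEM_RATIO = 0.6            # planned memory must stay within 60% of the host
MAX_INSTANCES_PER_CORE = 1.0   # instance-to-CPU ratio


def fits(host: Host, mem_gb: float) -> bool:
    """Check the memory limit and the instance-to-CPU ratio for one more instance."""
    mem_ok = host.mem_used_gb + mem_gb <= host.mem_total_gb * MAX_MEM_RATIO
    cpu_ok = host.instances + 1 <= host.cpu_cores * MAX_INSTANCES_PER_CORE
    return mem_ok and cpu_ok


def place_pair(hosts: List[Host], mem_gb: float) -> Optional[Tuple[str, str]]:
    """Pick two distinct hosts for a master and its replica, least loaded first."""
    candidates = sorted(
        (h for h in hosts if fits(h, mem_gb)),
        key=lambda h: h.mem_used_gb / h.mem_total_gb,
    )
    if len(candidates) < 2:
        return None  # not enough capacity; the pool needs more machines
    master, replica = candidates[0], candidates[1]
    for host in (master, replica):
        host.mem_used_gb += mem_gb
        host.instances += 1
    return master.name, replica.name


pool = [
    Host("a", cpu_cores=8, mem_total_gb=64, mem_used_gb=30),
    Host("b", cpu_cores=8, mem_total_gb=64, mem_used_gb=10),
    Host("c", cpu_cores=2, mem_total_gb=64, mem_used_gb=0, instances=2),
    Host("d", cpu_cores=8, mem_total_gb=64, mem_used_gb=20),
]
first = place_pair(pool, mem_gb=8)
```

Because migration is described as following the same deployment rules, a migration planner could reuse the same fits check when choosing a destination host.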

RedisGate middleware: acts as a replica of the source cluster and as the master of the target cluster, implements the client‑side sharding algorithm, synchronizes data, and enables consistent expansion or contraction of the Redis fleet.
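The re-sharding role described here can be sketched as follows. This is a simplified model, not RedisGate itself: shards are plain dictionaries, only SET and DEL are handled, and the sharding rule (CRC32 of the key modulo the shard count) is an assumed placeholder, since the article does not say which algorithm Qunar's clients use.

```python
import zlib
from typing import Dict, List


def shard_for(key: str, shard_count: int) -> int:
    """Assumed client-side sharding rule: CRC32 of the key modulo shard count."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


class GateSketch:
    """Replays commands replicated from source shards onto the target cluster."""

    def __init__(self, target_shards: List[Dict[str, str]]) -> None:
        self.target = target_shards

    def apply(self, op: str, key: str, value: str = "") -> None:
        # Route with the target shard count, so clients that switch to the
        # new cluster find each key where their own sharding rule expects it.
        shard = self.target[shard_for(key, len(self.target))]
        if op == "SET":
            shard[key] = value
        elif op == "DEL":
            shard.pop(key, None)
        else:
            raise ValueError(f"unsupported command in this sketch: {op}")


# Old cluster with 2 shards, populated the way a sharding client would do it.
source: List[Dict[str, str]] = [{}, {}]
for i in range(100):
    key = f"user:{i}"
    source[shard_for(key, 2)][key] = str(i)

# Expand to 3 shards: the gate receives the replicated data and re-routes it.
target: List[Dict[str, str]] = [{}, {}, {}]
gate = GateSketch(target)
for shard in source:
    for key, value in shard.items():
        gate.apply("SET", key, value)

# A write that arrives during synchronization is forwarded the same way.
gate.apply("DEL", "user:7")
```

Once the target has caught up, clients can be pointed at the new cluster; how that switch is coordinated is outside the scope of this sketch.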

Future outlook: integrate AI/ML techniques for smarter monitoring, automated optimization, and self‑healing; address cross‑cloud deployment latency for latency‑sensitive services to further improve reliability and performance.

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
