Qunar's Multi-IDC Deployment and Fault Self‑Healing Architecture
This article describes how Qunar scaled its IDC infrastructure and moved to a multi-IDC deployment. It covers automated DNS-based load balancing, the open-source DNSDB system, and an IDC proxy built on Squid, which together provide rapid fault self-healing and transparent traffic switching for both user access and third-party access.
When the author joined Qunar in 2010, the company operated only a few hundred servers in a single IDC. As the business grew, the risk of a single-site failure became unacceptable, which prompted the design of a multi-IDC deployment and a backbone network that provides traffic interconnection and link redundancy between sites.
Traditional incident response required operators to receive an alert, connect to the internal network over VPN, assess the monitoring data, and switch traffic manually. The process took at least ten minutes, which means significant losses for an e-commerce business, so the goal was to automate the whole workflow.
Qunar's operations team therefore built a fault self‑healing system that addresses both upstream and downstream network links, eliminating manual intervention.
For downstream user access, all traffic passes through DNS and a load-balancing system deployed in each IDC. The original Nginx-based heartbeat architecture reached its limits, which led to the adoption of ECMP and later OpenResty Enterprise. The latter supports hot reloading of upstream and other configuration, so reloads no longer interrupt service.
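The role ECMP plays here can be illustrated with a small sketch. Routers implement equal-cost multipath in hardware with their own hash functions; the Python below is only an assumed model of the idea, in which a flow's 5-tuple is hashed so that every packet of the same flow reaches the same load balancer while different flows spread across the cluster. The function name `pick_next_hop` and the use of SHA-256 are illustrative choices, not details from the article.

```python
import hashlib
from typing import List


def pick_next_hop(src_ip: str, src_port: int, dst_ip: str, dst_port: int,
                  proto: str, next_hops: List[str]) -> str:
    """Map a flow's 5-tuple onto one of several equal-cost next hops.

    SHA-256 stands in for the router's hardware hash; it is an assumption
    made so the sketch is deterministic and easy to run.
    """
    if not next_hops:
        raise ValueError("no next hops available")
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]


# Hypothetical load balancer addresses advertised with equal cost.
balancers = ["10.0.0.11", "10.0.0.12", "10.0.0.13"]
first = pick_next_hop("203.0.113.7", 51000, "198.51.100.1", 443, "tcp", balancers)
second = pick_next_hop("203.0.113.7", 51000, "198.51.100.1", 443, "tcp", balancers)
```

Because the choice depends only on the flow tuple and the list of next hops, no heartbeat between load balancers is needed to decide which node owns a connection. Note that with plain modulo hashing, removing a node remaps many flows, which is one reason production deployments often use consistent or resilient hashing.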
The DNSDB system, which Qunar has open-sourced on GitHub, can switch thousands of domain records in one click through a web UI or an API. Combined with a nationwide monitoring platform, it can trigger DNS changes within 30 seconds of a threshold being breached, and it automatically restores the original configuration once monitoring has remained stable for a period.
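The switch-and-restore behaviour described above can be sketched as a small state machine. This is not DNSDB's actual code or API: the class name, the failure-ratio metric, and the 300-second stable period are assumptions chosen for illustration, and the 30-second figure is modelled here as the time a breach must persist before the switch fires.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class FailoverController:
    """Hypothetical controller that decides when to switch DNS and when to restore it."""
    fail_threshold: float = 0.5   # assumed: probe failure ratio that counts as a breach
    breach_seconds: float = 30    # breach must persist this long before switching
    stable_seconds: float = 300   # assumed: healthy period required before restoring
    switched: bool = False
    actions: List[Tuple[float, str]] = field(default_factory=list)
    _breach_since: Optional[float] = None
    _healthy_since: Optional[float] = None

    def observe(self, now: float, fail_ratio: float) -> None:
        """Feed one monitoring sample; record 'switch' or 'restore' when due."""
        if fail_ratio >= self.fail_threshold:
            self._healthy_since = None
            if self._breach_since is None:
                self._breach_since = now
            if not self.switched and now - self._breach_since >= self.breach_seconds:
                self.switched = True
                self.actions.append((now, "switch"))
        else:
            self._breach_since = None
            if self._healthy_since is None:
                self._healthy_since = now
            if self.switched and now - self._healthy_since >= self.stable_seconds:
                self.switched = False
                self.actions.append((now, "restore"))


controller = FailoverController()
for t, ratio in [(0, 0.9), (10, 0.9), (30, 0.9), (40, 0.0), (200, 0.0), (340, 0.0)]:
    controller.observe(t, ratio)
```

In a real deployment the "switch" and "restore" actions would call the DNS management API instead of appending to a list. Requiring a stable period before restoring avoids flapping when an IDC recovers only briefly.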
To ensure reliable upstream access to third-party services, Qunar implemented an IDC proxy solution. After evaluating Apache Traffic Server (ATS), the team chose Squid and deployed it as a multi-threaded, ECMP-enabled cluster that handles about 50,000 queries per second on a 32-core server. Business services integrate with the proxy through a qconfig module that supports switches, whitelists, and blacklists.
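One way a business service might consult such a configuration before sending an outbound request is sketched below. The article does not describe the qconfig module's real interface, so `ProxyPolicy`, `should_use_proxy`, and the precedence rules (switch first, then blacklist, then whitelist) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ProxyPolicy:
    """Hypothetical snapshot of the proxy settings a service reads from configuration."""
    enabled: bool = True                  # global switch
    whitelist: frozenset = frozenset()    # if non-empty, only these hosts use the proxy
    blacklist: frozenset = frozenset()    # these hosts never use the proxy


def should_use_proxy(url: str, policy: ProxyPolicy) -> bool:
    """Decide whether a request to `url` goes through the IDC proxy."""
    host = urlsplit(url).hostname or ""
    if not policy.enabled:
        return False
    if host in policy.blacklist:
        return False
    if policy.whitelist:
        return host in policy.whitelist
    return True


policy = ProxyPolicy(
    whitelist=frozenset({"pay.partner.example"}),
    blacklist=frozenset({"internal.partner.example"}),
)
use_proxy = should_use_proxy("https://pay.partner.example/charge", policy)
```

Keeping the decision in a pure function makes it straightforward to change behaviour at runtime: the service only needs to swap in a new `ProxyPolicy` when the configuration changes.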
Monitoring is integrated with the proxy and qconfig: when an IDC failure is detected, the system automatically updates the proxy rules and transparently redirects traffic to healthy sites, without service disruption.
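The rule update on failure can be sketched as a pure function that recomputes which IDC's proxy each third-party destination should use. The data shapes, the IDC names, and the function `rebuild_proxy_rules` are hypothetical; the article does not specify how the rules are stored or pushed.

```python
from typing import Dict, List, Optional, Set


def rebuild_proxy_rules(preferences: Dict[str, List[str]],
                        healthy: Set[str]) -> Dict[str, Optional[str]]:
    """For each destination, pick the first preferred IDC that is currently healthy.

    A value of None means no healthy IDC is available for that destination.
    """
    rules: Dict[str, Optional[str]] = {}
    for destination, idc_order in preferences.items():
        rules[destination] = next((idc for idc in idc_order if idc in healthy), None)
    return rules


# Assumed preference order per third-party destination.
preferences = {
    "api.airline.example": ["idc-a", "idc-b"],
    "pay.partner.example": ["idc-b", "idc-a"],
}

before = rebuild_proxy_rules(preferences, healthy={"idc-a", "idc-b"})
after = rebuild_proxy_rules(preferences, healthy={"idc-b"})  # monitoring marks idc-a as failed
```

Because callers only see the resulting rule table, a failed IDC disappears from their traffic path as soon as the new rules are published, and it returns automatically when monitoring reports it healthy again.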
The combined systems have reduced on‑call stress for operations engineers, allowing them to focus on advanced technology research and enjoy more personal time.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
