
Automated Network Failure Detection and Intelligent Switching System at Qunar

This article describes Qunar's automated network outage detection and intelligent traffic switching system, detailing the problem background, solution architecture, component functions, workflow, optimization steps, and future plans for more precise, multi‑level failover handling.

Qunar Tech Salon

Background

When a data center's outbound network link fails, every service hosted there becomes unreachable and alarms fire across the board. What can operators do in that situation?

Problem Solved

The goal is to detect such failures quickly and automatically, then switch traffic to redundant data centers using existing software and systems. Qunar's switching system serves as the example.

System Overview

The system detects outbound failures, automatically switches inbound (user) traffic by updating DNS records, and redirects outbound (service) traffic by changing proxy addresses, ensuring continuous access.

Intelligent Switching System

When IDC outbound anomalies are detected, the system automatically identifies the fault, switches traffic to redundant sites, and maintains service availability.

Inbound Traffic Switching

Users access services via DNS; upon detection of an outbound fault, the system modifies authoritative DNS to point to a backup data‑center.
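The inbound switch can be pictured with a minimal Python sketch. It is not Qunar's DNS manager: the `ARecord` type, the `IDC_ENTRY` table, the data-center names, the documentation-range IP addresses, and the short TTL are all assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ARecord:
    name: str      # hostname served to users
    idc: str       # data center that currently answers for this name
    address: str   # IP address returned by authoritative DNS
    ttl: int = 60  # assumed short TTL so resolvers pick up a switch quickly


# Hypothetical entry addresses per data center (documentation IP ranges).
IDC_ENTRY = {"idc-a": "192.0.2.10", "idc-b": "198.51.100.10"}


def switch_inbound(records, failed_idc, backup_idc):
    """Return a new record list with records on failed_idc repointed to backup_idc."""
    switched = []
    for rec in records:
        if rec.idc == failed_idc:
            # Build a new record instead of mutating, so the old state can be restored.
            rec = replace(rec, idc=backup_idc, address=IDC_ENTRY[backup_idc])
        switched.append(rec)
    return switched
```

Returning a new list rather than editing records in place keeps the pre-switch state available, which would make switching back after recovery a matter of re-publishing the original list.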

Outbound Traffic Switching

Internal services use proxy addresses; when a fault is detected, the proxy address is automatically updated to the backup site, redirecting outbound requests.
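A rough sketch of the outbound side follows, again under stated assumptions: the `ProxyManager` class here is a hypothetical stand-in, not the real proxy manager, and the proxy endpoints and backup mapping are invented for the example.

```python
# Hypothetical outbound proxy endpoints per data center.
PROXIES = {
    "idc-a": "proxy.idc-a.internal:3128",
    "idc-b": "proxy.idc-b.internal:3128",
}


class ProxyManager:
    """Resolve the proxy a service should use, honoring failed data centers."""

    def __init__(self, proxies, backup_of):
        self._proxies = dict(proxies)
        self._backup_of = dict(backup_of)  # idc -> its designated backup idc
        self._failed = set()

    def mark_failed(self, idc):
        self._failed.add(idc)

    def mark_recovered(self, idc):
        self._failed.discard(idc)

    def proxy_for(self, idc):
        """Return the local proxy, or the backup site's proxy if idc has failed."""
        if idc not in self._failed:
            return self._proxies[idc]
        backup = self._backup_of[idc]
        if backup in self._failed:
            raise RuntimeError(f"no healthy outbound proxy for {idc}")
        return self._proxies[backup]
```

For example, after `manager.mark_failed("idc-a")`, a service in `idc-a` that asks `manager.proxy_for("idc-a")` receives the `idc-b` endpoint until `mark_recovered` is called.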

Requirements for Building the Switching System

Dynamic periodic testing

Effective aggregation and classification of detection data

Multi‑data‑center deployment for services

Comprehensive backend support

Component Description

Layer 1: Network detection using Smokeping alerts.

Layer 2: Data aggregation and analysis with an internal monitoring tool (Watcher).

Layer 3: Application switching layer, including the DNS manager and proxy manager, which performs the actual traffic redirection based on the analysis results.

System Workflow

Smokeping monitors nationwide points.

Samples showing abnormal packet loss or latency are picked out.

Data is tagged and sent to Watcher.

Watcher classifies data by dimensions.

Network anomalies are identified after excluding host‑level issues.

Aggregated metrics per data‑center and ISP are calculated.

Automatic application-level switching is triggered.
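The classification and aggregation steps above might look roughly like the following Python sketch. The `Sample` fields, the three constants, and the rule that a single abnormal host counts as a host-level issue are assumptions chosen for illustration; the article does not give Watcher's actual rules.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    probe: str       # monitoring point that sent the pings
    host: str        # target host inside the data center
    idc: str         # data center of the target
    isp: str         # ISP of the monitoring point
    loss_pct: float  # packet loss measured in this cycle


LOSS_THRESHOLD = 5.0    # assumed: loss at or above this is abnormal
MIN_ABNORMAL_HOSTS = 2  # assumed: one bad host alone is a host-level issue
SWITCH_RATIO = 0.5      # assumed: share of abnormal hosts that triggers a switch


def find_switch_targets(samples):
    """Return the sorted (idc, isp) pairs whose aggregated loss justifies a switch."""
    all_hosts = defaultdict(set)
    bad_hosts = defaultdict(set)
    for s in samples:
        key = (s.idc, s.isp)  # classify by data center and ISP
        all_hosts[key].add(s.host)
        if s.loss_pct >= LOSS_THRESHOLD:
            bad_hosts[key].add(s.host)

    targets = []
    for key, hosts in all_hosts.items():
        bad = bad_hosts.get(key, set())
        if len(bad) < MIN_ABNORMAL_HOSTS:
            continue  # treat as a host-level problem, not a network fault
        if len(bad) / len(hosts) >= SWITCH_RATIO:
            targets.append(key)
    return sorted(targets)
```

Grouping by the `(idc, isp)` pair means a fault on one carrier's link can be flagged without implicating the other carriers serving the same data center.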

Visualization

Weathermap (a Cacti plugin) visualizes Smokeping detection results.

Component Analysis

This part defines how abnormal data is tagged, sets the packet‑loss thresholds (5% and 10%), and describes the use of multi‑point detection when monitoring points are numerous.
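As a small illustration of threshold tagging, the sketch below maps a loss percentage to a tag and then requires several monitoring points to agree before an anomaly counts. The 5% and 10% values come from the article; treating them as "warning" and "critical" levels, the tag names, and the `min_points` default are assumptions.

```python
WARN_LOSS = 5.0   # lower packet-loss threshold from the article (percent)
CRIT_LOSS = 10.0  # upper packet-loss threshold from the article (percent)


def tag_loss(loss_pct):
    """Map a packet-loss percentage to an assumed severity tag."""
    if loss_pct >= CRIT_LOSS:
        return "critical"
    if loss_pct >= WARN_LOSS:
        return "warning"
    return "normal"


def confirmed_by_multiple_points(loss_by_point, min_points=3):
    """Confirm an anomaly only if at least min_points monitoring points see it."""
    abnormal = [p for p, loss in loss_by_point.items() if tag_loss(loss) != "normal"]
    return len(abnormal) >= min_points
```

Requiring agreement between points guards against a single monitoring point with a bad local link being mistaken for a data-center fault.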

Optimization Process

Improvements include refining the ICMP anomaly detection thresholds, setting a 30‑second detection cycle, employing multi‑point detection, optimizing the aggregation thresholds, and adding robust logging and pre‑alerting mechanisms.
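One way the 30-second cycle, logging, and pre-alerting could fit together is sketched below. The 30-second cycle is from the article; the rule of pre-alerting on the first abnormal cycle and switching after two consecutive abnormal cycles is an assumption, as are the function and logger names.

```python
import logging

logger = logging.getLogger("switching")

CYCLE_SECONDS = 30        # detection cycle stated in the article
CYCLES_BEFORE_SWITCH = 2  # assumed: consecutive abnormal cycles before switching


def evaluate_cycles(abnormal_flags):
    """Return a list of (elapsed_seconds, action), one entry per detection cycle."""
    events = []
    streak = 0
    for cycle, abnormal in enumerate(abnormal_flags, start=1):
        streak = streak + 1 if abnormal else 0
        if streak == 0:
            action = "ok"
        elif streak < CYCLES_BEFORE_SWITCH:
            action = "pre-alert"
            logger.warning("cycle %d: anomaly seen, pre-alert raised", cycle)
        else:
            action = "switch"
            logger.error("cycle %d: anomaly persisted, switching traffic", cycle)
        events.append((cycle * CYCLE_SECONDS, action))
    return events
```

Under these assumptions a persistent fault would trigger a switch about 60 seconds after it first appears, while a one-cycle blip only produces a logged pre-alert.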

Future Plans

Three focus areas: more precise and broader detection covering more cities and ISPs; support for additional languages and service types; finer‑grained switching by province or region to reduce switch time and increase intelligence.

Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
