How Qunar Built an Automated Network Device Operations Platform to Boost Efficiency
This article explains how Qunar tackled a growing network device management workload, inefficient manual processes, and operational risk by designing an integrated platform that automates common tasks, enforces permission‑based controls, records an audit trail, and provides real‑time monitoring and scalable data collection.
Background
Qunar’s network device count has been increasing year over year while the netops team remains very small, leading to a continuously growing per‑person workload. Manual command‑line and script‑based changes are inefficient, repetitive, and risky because errors are hard to detect and no audit logs are kept.
Optimization Ideas
Integrate common tools and commands into a unified platform to execute frequent repetitive operations.
Automate operations by breaking them into basic instruction sets and exposing them as executable task lists.
Pre‑check each operation before it runs to prevent uncontrolled changes, keep actions atomic, and automatically roll back executions that fail.
Implement hierarchical permission control, assigning different privilege levels to users and tasks.
Record every operation’s executor, content, timestamp, and result for later audit and traceability.
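The pre‑check and audit ideas above can be sketched in a few lines of Python. This is a minimal illustration, not the platform's actual code: the names `Instruction`, `AuditRecord`, `run_instruction`, and the `send` callable are all hypothetical, and the device commands in the usage note are placeholders that differ between vendors.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, List


@dataclass
class AuditRecord:
    # The four fields named in the article: executor, content, timestamp, result.
    executor: str
    content: str
    timestamp: datetime
    result: str


@dataclass
class Instruction:
    # One basic instruction: a named group of device commands plus a pre-check.
    name: str
    commands: List[str]
    precheck: Callable[[], bool]


# In-memory stand-in for whatever store the real platform uses (assumption).
AUDIT_LOG: List[AuditRecord] = []


def run_instruction(executor: str, instruction: Instruction,
                    send: Callable[[str], None]) -> AuditRecord:
    """Run one instruction if its pre-check passes, and always record the outcome.

    `send` is a hypothetical callable that delivers a single command to a device.
    """
    if not instruction.precheck():
        result = "rejected by pre-check"
    else:
        try:
            for command in instruction.commands:
                send(command)
            result = "ok"
        except Exception as exc:  # record the failure instead of losing it
            result = f"failed: {exc}"
    record = AuditRecord(
        executor=executor,
        content="; ".join(instruction.commands),
        timestamp=datetime.now(timezone.utc),
        result=result,
    )
    AUDIT_LOG.append(record)
    return record
```

A call such as `run_instruction("alice", Instruction("port-down", ["interface Gi0/1", "shutdown"], lambda: True), sent.append)` appends the commands to a list `sent` and leaves one record in `AUDIT_LOG`, whether the run succeeds, fails, or is rejected.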
Platform Overview
The Qunar Network Device Operations Platform was built based on the above ideas. Screenshots of the platform are shown below.
1. Permission Control
The platform defines five permission levels: Visitor, Read‑Only, Read‑Write, Administrator, and Super‑Administrator. Each atomic operation is bound to a specific level, ensuring that users can only perform actions permitted by their role.
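One way to bind operations to levels is an ordered enumeration plus a lookup table, sketched below. Two things here are assumptions rather than facts from the article: that the five levels form a strict hierarchy in which a higher level includes everything below it, and the operation names in `REQUIRED_LEVEL`, which are invented for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    # The five levels from the article; the numeric ordering is an assumption.
    VISITOR = 1
    READ_ONLY = 2
    READ_WRITE = 3
    ADMINISTRATOR = 4
    SUPER_ADMINISTRATOR = 5


# Hypothetical binding of atomic operations to the minimum level they need.
REQUIRED_LEVEL = {
    "view_topology": Level.VISITOR,
    "fetch_config": Level.READ_ONLY,
    "set_port_description": Level.READ_WRITE,
    "assign_vlan": Level.ADMINISTRATOR,
    "manage_users": Level.SUPER_ADMINISTRATOR,
}


def is_allowed(user_level: Level, operation: str) -> bool:
    """Return True if the user's level meets the operation's required level."""
    required = REQUIRED_LEVEL.get(operation)
    if required is None:
        # Operations with no binding are denied rather than silently allowed.
        return False
    return user_level >= required
```

Denying unbound operations by default means a newly added operation cannot be run by anyone until someone assigns it a level explicitly.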
2. Operations & Tasks
The platform supports automated tasks such as:
Scanning the relationships between core and access switches.
Fetching, backing up, and synchronizing global and per‑port configurations.
Port up/down, description changes, speed adjustments, and trunk configuration.
VLAN assignment and port locking.
When a user confirms an operation, a Celery task is launched. There are two task types:
Immediate tasks: execute the operation at once; on failure, the platform rolls back automatically and raises a warning.
Scheduled tasks: run at a time the user chooses, either once or on a recurring schedule.
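The rollback‑and‑warn behaviour of an immediate task can be illustrated with plain Python. The sketch below is an assumption about how such logic might look, not the platform's implementation: `run_with_rollback` and the `(apply, undo)` step pairs are hypothetical names, and the code deliberately uses only the standard library so that it runs without a Celery worker.

```python
import logging
from typing import Callable, List, Tuple

logger = logging.getLogger("netops.sketch")

# Each step pairs the change itself with the action that undoes it.
Step = Tuple[Callable[[], None], Callable[[], None]]


def run_with_rollback(steps: List[Step]) -> bool:
    """Apply steps in order; if one fails, undo the completed ones in reverse.

    Returns True when every step succeeded, False after a rollback.
    """
    undo_stack: List[Callable[[], None]] = []
    for apply, undo in steps:
        try:
            apply()
        except Exception as exc:
            # Warn first, then restore the earlier state newest-change-first.
            logger.warning("step failed (%s); rolling back %d step(s)",
                           exc, len(undo_stack))
            for undo_done in reversed(undo_stack):
                undo_done()
            return False
        undo_stack.append(undo)
    return True
```

In a Celery deployment, a function like this would typically form the body of a task. Celery can defer a call to a given time through `apply_async(eta=...)` and run recurring jobs through its beat scheduler, although the article does not say which of these mechanisms the platform uses.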
3. Monitoring Management
The platform monitors two data layers:
Network layer: clusters of core switches and their associated access switches.
Device layer: individual switch ports.
Network topology and traffic load are visualized using weathermap diagrams. Device‑level metrics are collected via SNMP using collectd; users can configure metrics and templates through the platform, which automatically updates the Docker‑based collection cluster.
Monitoring configuration is flexible, allowing changes to metrics, templates, and matching rules, with automatic updates for new or removed devices.
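To make the matching‑rule idea concrete, the sketch below renders `<Host>` blocks for collectd's SNMP plugin from a device list and a set of name patterns. The device names, addresses, rule patterns, and the `render_snmp_hosts` function are hypothetical; the article does not describe the platform's template format. The data names passed to `Collect` are assumed to be defined in `<Data>` blocks elsewhere in the collectd configuration.

```python
from fnmatch import fnmatch
from typing import Dict, List, Tuple


def render_snmp_hosts(devices: List[Dict[str, str]],
                      rules: List[Tuple[str, List[str]]],
                      community: str) -> str:
    """Render collectd SNMP <Host> blocks for devices that match a rule.

    devices: dicts with "name" and "address" keys.
    rules:   (glob pattern on device name, data names to collect) pairs.
    """
    lines = ["<Plugin snmp>"]
    for device in devices:
        name = device["name"]
        collect: List[str] = []
        for pattern, data_names in rules:
            if fnmatch(name, pattern):
                # Keep first-seen order and skip names matched twice.
                for data_name in data_names:
                    if data_name not in collect:
                        collect.append(data_name)
        if not collect:
            continue  # no rule matched, so the device is not monitored
        quoted = " ".join(f'"{data_name}"' for data_name in collect)
        lines.append(f'  <Host "{name}">')
        lines.append(f'    Address "{device["address"]}"')
        lines.append("    Version 2")
        lines.append(f'    Community "{community}"')
        lines.append(f"    Collect {quoted}")
        lines.append("  </Host>")
    lines.append("</Plugin>")
    return "\n".join(lines)
```

Regenerating this text whenever the device inventory or the rules change is one simple way to get the automatic updates for new or removed devices that the article describes; how the result is distributed to the Docker‑based collectors is outside this sketch.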
Conclusion
The Qunar Network Device Operations Platform addresses the challenges of increasing workload, inefficient manual processes, and operational risk, improving efficiency, reducing errors, and streamlining workflows. Future work will focus on handling more complex netops scenarios with rapid iteration and continuous optimization.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.