Design and Implementation of Qunar Network Device Operations Platform
Facing a growing number of network devices and a small netops team, Qunar built a network device operations platform that combines command automation, permission-controlled tasks, monitoring, and dynamic scaling, using Docker, Marathon, and Celery. The platform improves efficiency, reduces change risk, and makes every operation auditable.
Background
Qunar’s network equipment has been increasing year by year, while the netops team remains very small, causing a continuous rise in per‑person workload. Existing change operations rely on manual command‑line and script execution, which is inefficient, error‑prone, and lacks audit trails.
Optimization Ideas
Integrate common tools and commands into a platform to execute repetitive operations automatically.
Decompose frequent tasks into atomic command sets and expose them as executable task lists.
Implement intelligent pre‑checks and automatic rollback to prevent unsafe changes.
Introduce hierarchical permission control so that only authorized users can perform specific actions.
Record every operation (executor, content, time, result) for later audit.
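The decomposition and audit ideas above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names `AtomicCommand` and `AuditRecord`, the helper functions, and the Cisco-style CLI lines are all hypothetical and are not taken from the platform itself.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class AtomicCommand:
    """One atomic change plus the lines that undo it (hypothetical model)."""
    name: str
    apply: List[str]     # CLI lines that make the change
    rollback: List[str]  # CLI lines that revert the change


@dataclass
class AuditRecord:
    """The four audit fields named in the article."""
    executor: str
    content: List[str]
    time: datetime
    result: str


def build_task(commands: List[AtomicCommand]) -> List[str]:
    """Flatten atomic commands into one ordered task list of CLI lines."""
    return [line for cmd in commands for line in cmd.apply]


def build_rollback(commands: List[AtomicCommand]) -> List[str]:
    """Undo the commands in reverse order of application."""
    return [line for cmd in reversed(commands) for line in cmd.rollback]


def audit(executor: str, lines: List[str], result: str) -> AuditRecord:
    """Create an audit record stamped with the current UTC time."""
    return AuditRecord(executor, list(lines), datetime.now(timezone.utc), result)


# Illustrative commands; the CLI syntax is an assumption.
SET_DESCRIPTION = AtomicCommand(
    "set_description",
    ["interface Gi0/1", "description uplink"],
    ["interface Gi0/1", "no description"],
)
DISABLE_PORT = AtomicCommand(
    "disable_port",
    ["interface Gi0/1", "shutdown"],
    ["interface Gi0/1", "no shutdown"],
)
```

Keeping the rollback lines next to the apply lines means a task list and its undo list are always derived from the same source, which is one way to make automatic rollback practical.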
Platform Overview
The Qunar Network Device Operations Platform was built on these ideas. It provides a web interface, permission management, task scheduling, and monitoring.
1. Permission Control
Permissions are divided into five levels: Visitor, Read‑Only, Read‑Write, Administrator, and Super Administrator. Each atomic operation is bound to a specific level, ensuring that users can only perform actions permitted for their role.
Note: A higher‑level user can grant lower‑level permissions to other users and can view the logs of users they have authorized. Users cannot view the logs of peers or of higher‑level users.
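The level rules described here could be modeled roughly as follows. The five level names come from the article; the numeric ordering, the `REQUIRED_LEVEL` bindings, and the function names are assumptions made for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """The five permission levels, ordered from lowest to highest."""
    VISITOR = 1
    READ_ONLY = 2
    READ_WRITE = 3
    ADMINISTRATOR = 4
    SUPER_ADMINISTRATOR = 5


# Hypothetical binding of atomic operations to the level they require.
REQUIRED_LEVEL = {
    "view_topology": Level.VISITOR,
    "backup_config": Level.READ_ONLY,
    "modify_description": Level.READ_WRITE,
    "assign_vlan": Level.ADMINISTRATOR,
}


def can_execute(user_level: Level, operation: str) -> bool:
    """A user may run an operation only at or above its bound level."""
    return user_level >= REQUIRED_LEVEL[operation]


def can_grant(granter_level: Level, granted_level: Level) -> bool:
    """Only strictly lower levels can be granted to others."""
    return granted_level < granter_level


def can_view_logs(viewer_level: Level, owner_level: Level,
                  authorized_by_viewer: bool) -> bool:
    """Logs are visible only for users the viewer authorized, and never
    for peers or higher-level users."""
    return authorized_by_viewer and owner_level < viewer_level
```

Using an ordered enum keeps every check a single comparison, so binding a new atomic operation only requires one more entry in the mapping.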
2. Operations and Tasks
The platform supports automated operations such as scanning core‑to‑access switch relationships, backing up configurations, toggling ports, modifying descriptions, setting speeds, assigning VLANs, and locking ports.
When a user confirms an operation, the platform launches a Celery task. There are two task types: immediate tasks, which execute the change over SSH and roll back on failure, and scheduled tasks, which can be triggered once or repeatedly.
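The apply-then-roll-back behavior of an immediate task might look like the sketch below. It is not the platform's code: the function name, the result dictionary, and the injected `send` callable are assumptions. In a real deployment `send` would write one line to an SSH session and the function would be registered as a Celery task; here it is a plain function so that it runs without any device or broker.

```python
from typing import Callable, Dict, List


def run_immediate_task(send: Callable[[str], None],
                       apply_lines: List[str],
                       rollback_lines: List[str]) -> Dict[str, object]:
    """Send each change line in order; on any error, send the rollback lines.

    `send` is expected to raise an exception when the device rejects a line.
    """
    applied = 0
    try:
        for line in apply_lines:
            send(line)
            applied += 1
    except Exception as exc:
        # The change failed part-way: try to restore the previous state.
        # Errors raised during rollback itself are not handled in this sketch.
        for line in rollback_lines:
            send(line)
        return {"status": "failed", "error": str(exc),
                "applied": applied, "rolled_back": True}
    return {"status": "success", "error": None,
            "applied": applied, "rolled_back": False}
```

The returned dictionary carries enough detail (status, error text, number of lines applied) to be written straight into the task log.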
All tasks automatically generate detailed logs that can be queried according to the user’s permission level.
3. Monitoring Management
The platform monitors two data layers: network‑level (clusters of core and access switches) and device‑level (individual switch ports). Network‑level data is visualized with weathermap diagrams; device‑level metrics are collected via SNMP using collectd.
Monitoring configuration is flexible: users can define metrics, templates, and matching rules, and the system automatically updates collectd instances across a Docker‑Marathon cluster, providing dynamic scaling and load balancing.
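One way to picture the matching rules and the spreading of devices across collectd instances is the sketch below. The article does not describe the actual rule syntax or the balancing algorithm, so everything here is an assumption: glob-style hostname patterns, the template names, and a simple hash-modulo assignment.

```python
import fnmatch
import hashlib
from typing import Dict, List, Optional, Tuple

# Hypothetical matching rules: (hostname pattern, template name).
RULES: List[Tuple[str, str]] = [
    ("core-*", "core_switch_template"),
    ("acc-*", "access_switch_template"),
]


def match_template(hostname: str, rules: List[Tuple[str, str]]) -> Optional[str]:
    """Return the template of the first rule whose pattern matches."""
    for pattern, template in rules:
        if fnmatch.fnmatchcase(hostname, pattern):
            return template
    return None


def assign_instance(hostname: str, instance_count: int) -> int:
    """Map a device to a collectd instance with a stable hash."""
    digest = hashlib.sha256(hostname.encode("utf-8")).hexdigest()
    return int(digest, 16) % instance_count


def build_plan(hostnames: List[str], rules: List[Tuple[str, str]],
               instance_count: int) -> Dict[int, List[Tuple[str, str]]]:
    """Group (hostname, template) pairs by the instance that should poll them."""
    plan: Dict[int, List[Tuple[str, str]]] = {i: [] for i in range(instance_count)}
    for host in sorted(hostnames):
        template = match_template(host, rules)
        if template is not None:  # devices without a matching rule are skipped
            plan[assign_instance(host, instance_count)].append((host, template))
    return plan
```

With hash-modulo assignment, changing the instance count moves many devices between instances; a consistent-hashing scheme would reduce that churn if rescaling is frequent.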
Watcher dashboards display real‑time port load, with colors indicating normal (green), overloaded (red), or idle (gray) states.
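The color coding can be illustrated with a short sketch. The utilization formula is the usual one for SNMP octet counters (octet delta times 8, divided by interval times link speed); the thresholds for "idle" and "overloaded" are assumed values, since the article does not state them, and counter wrap-around is not handled.

```python
def utilization(prev_octets: int, curr_octets: int,
                interval_s: float, speed_bps: float) -> float:
    """Fraction of link capacity used between two counter samples."""
    if interval_s <= 0 or speed_bps <= 0:
        raise ValueError("interval and speed must be positive")
    bits = (curr_octets - prev_octets) * 8
    # Clamp to [0, 1]; sampling jitter can push the raw value slightly over 1.
    return max(0.0, min(1.0, bits / (interval_s * speed_bps)))


def port_color(load: float, idle_below: float = 0.01,
               overload_above: float = 0.8) -> str:
    """Map a load fraction to a dashboard color (thresholds are assumptions)."""
    if not 0.0 <= load <= 1.0:
        raise ValueError("load must be between 0 and 1")
    if load >= overload_above:
        return "red"    # overloaded
    if load < idle_below:
        return "gray"   # idle
    return "green"      # normal
```

For example, 125,000,000 octets transferred in 10 seconds on a 1 Gbit/s port is 10% utilization, which falls in the green range under these thresholds.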
Conclusion
The Qunar Network Device Operations Platform addresses the combination of growing device scale and limited manpower by automating routine tasks, enforcing permission controls, providing comprehensive monitoring, and recording every operation for audit. This improves operational efficiency and reduces risk, and the platform will continue to be iterated on to meet future netops needs.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.