
Hardware Automation Operations System at Qunar: Design, Implementation, and Lessons Learned

This article details Qunar's hardware automation operations platform, covering the hardware scope, pain points of manual processes, a five‑stage lifecycle, automated testing, data collection, fault handling, and the underlying Mesos‑Marathon‑Docker infrastructure that together improve efficiency, reliability, and cost control.

Qunar Tech Salon

Author Introduction

Liang Liu is a senior operations development engineer at Qunar. After studying at Huazhong University of Science and Technology and the Chinese Academy of Sciences, Liu worked at Baidu and joined Qunar in 2014 to focus on server and network device automation.

Introduction

The talk shares Qunar's experience in building a hardware automation operations system. It is organized into four parts: background overview, work description, specific implementation, and summary review.

1. Background Overview

The hardware scope spans data‑center cabinets, servers, network devices, power supplies, fans, and other components. Before automation, a single engineer was responsible for tens of thousands of servers, handling rack‑up, migration, installation, and configuration manually.

Key pain points included massive manual workload, uncontrolled hardware quality, low fault‑handling efficiency, lack of performance data, and risky manual command execution.

2. Work Description

A five‑stage hardware lifecycle is defined: selection & testing, delivery & rack‑up, data collection, fault handling, and decommission.
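One way to make the five stages explicit in code is a small state machine. The sketch below is illustrative only: the stage names come from the list above, but the allowed transitions (for example, a repaired machine returning from fault handling to data collection) and the `advance` helper are assumptions, not something the talk specifies.

```python
from enum import Enum


class Stage(Enum):
    SELECTION_TESTING = 1
    DELIVERY_RACK_UP = 2
    DATA_COLLECTION = 3
    FAULT_HANDLING = 4
    DECOMMISSION = 5


# Hypothetical transition table: the article lists the stages but not the
# allowed moves between them, so these edges are an assumption.
ALLOWED = {
    Stage.SELECTION_TESTING: {Stage.DELIVERY_RACK_UP},
    Stage.DELIVERY_RACK_UP: {Stage.DATA_COLLECTION},
    Stage.DATA_COLLECTION: {Stage.FAULT_HANDLING, Stage.DECOMMISSION},
    Stage.FAULT_HANDLING: {Stage.DATA_COLLECTION, Stage.DECOMMISSION},
    Stage.DECOMMISSION: set(),
}


def advance(current: Stage, target: Stage) -> Stage:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target
```

Recording the stage per asset in this way would let tooling reject out-of-order operations, such as decommissioning a machine that was never racked.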

Automation targets include automated testing to verify configuration, performance, and reliability; batch arrival detection; centralized data collection via agents; anomaly monitoring; and automated ticketing.

3. Specific Implementation

Selection testing compares vendor specifications with actual measurements, emphasizing cost‑performance and configuration consistency.
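The comparison of vendor specifications against measurements could look roughly like the following. This is a minimal sketch under assumptions: the metric names, the 5% tolerance, and the `compare_to_spec` function are hypothetical and are not taken from Qunar's system.

```python
def compare_to_spec(spec: dict, measured: dict, tolerance: float = 0.05) -> list:
    """Return (metric, expected, actual) tuples for every metric that misses the spec.

    Numeric metrics fail when the measurement is missing, non-numeric, or more
    than `tolerance` below the vendor figure; other values must match exactly.
    """
    failures = []
    for metric, expected in spec.items():
        actual = measured.get(metric)
        is_number = isinstance(expected, (int, float)) and not isinstance(expected, bool)
        if is_number:
            if not isinstance(actual, (int, float)) or actual < expected * (1 - tolerance):
                failures.append((metric, expected, actual))
        elif actual != expected:
            failures.append((metric, expected, actual))
    return failures


# Made-up example values, for illustration only.
spec = {"disk_seq_read_mb_s": 500, "memory_gb": 256, "raid_level": "RAID10"}
measured = {"disk_seq_read_mb_s": 430, "memory_gb": 256, "raid_level": "RAID10"}
failures = compare_to_spec(spec, measured)
```

Here the disk throughput would be flagged because 430 falls below the 475 floor implied by the tolerance, while the memory size and RAID level pass.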

Batch arrival detection uses rule‑based format‑standardization to normalize hardware data from different vendors, ensuring identical configuration representations.
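To illustrate rule-based normalization, the sketch below applies per-field regular-expression rules so that two vendors' descriptions of the same part compare equal. The rules, field names, and sample strings are hypothetical; the actual rule set used at Qunar is not described in the talk.

```python
import re

# Hypothetical rules: (field, pattern, replacement). Real rules would be
# maintained per vendor and per hardware attribute.
RULES = [
    ("cpu", re.compile(r"\(R\)|\(TM\)|\bCPU\b", re.IGNORECASE), ""),
    ("memory", re.compile(r"^(\d+)\s*(?:GB|GiB|G)$", re.IGNORECASE), r"\1GB"),
]


def normalize(record: dict) -> dict:
    """Apply every matching rule to each field and collapse whitespace."""
    out = {}
    for field, value in record.items():
        text = " ".join(str(value).split())
        for rule_field, pattern, replacement in RULES:
            if rule_field == field:
                text = pattern.sub(replacement, text)
        out[field] = " ".join(text.split())
    return out


# Made-up inventory strings from two imaginary vendors.
vendor_a = {"cpu": "Intel(R) Xeon(R) CPU E5-2630 v3", "memory": "128 GB"}
vendor_b = {"cpu": "Intel Xeon E5-2630 v3", "memory": "128G"}
same_config = normalize(vendor_a) == normalize(vendor_b)
```

Once records are in one canonical form, a batch can be checked for consistency with a plain equality comparison.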

Data collection combines local daemons that periodically push metrics with remote pulls. Both feed a CMDB and a time‑series platform (Watcher) for real‑time monitoring of servers, containers, databases, and other services.
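The push side of this design can be sketched as a small agent loop. Everything here is an assumption made for illustration: the payload layout, the `build_payload` and `run_agent` names, and the injected `collect` and `send` callables. The talk does not describe Watcher's ingestion format or transport, so the example leaves the actual sending to the caller.

```python
import json
import time
from typing import Callable, Dict


def build_payload(host: str, metrics: Dict[str, float], timestamp: int) -> str:
    """Serialize one collection round as a JSON list of data points."""
    points = [
        {"host": host, "metric": name, "value": value, "ts": timestamp}
        for name, value in sorted(metrics.items())
    ]
    return json.dumps(points)


def run_agent(
    host: str,
    collect: Callable[[], Dict[str, float]],
    send: Callable[[str], None],
    interval_s: float,
    iterations: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> None:
    """Collect and push metrics `iterations` times, pausing between rounds."""
    for i in range(iterations):
        send(build_payload(host, collect(), int(clock())))
        if i + 1 < iterations:
            sleep(interval_s)


# Usage with stand-in collect/send functions and a fixed clock.
sent = []
run_agent(
    "node-1",
    collect=lambda: {"fan_rpm": 5400.0},
    send=sent.append,
    interval_s=60,
    iterations=2,
    sleep=lambda seconds: None,
    clock=lambda: 1700000000,
)
```

Injecting `sleep` and `clock` keeps the loop easy to exercise without waiting; a real daemon would use the defaults and run indefinitely.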

Fault handling classifies alerts into Critical, Warning, and OK, automatically gathers logs, generates formatted emails, and tracks repair tickets across suppliers.
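A simplified version of the classification and email step might look like this. The three levels come from the text above, but the threshold-based rule, the `Reading` fields, and the email layout are assumptions; the talk does not say how alerts are actually graded or formatted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    host: str
    component: str
    value: float
    warning_at: float   # assumed threshold, not from the article
    critical_at: float  # assumed threshold, not from the article


def classify(reading: Reading) -> str:
    """Map a reading to Critical, Warning, or OK using its thresholds."""
    if reading.value >= reading.critical_at:
        return "Critical"
    if reading.value >= reading.warning_at:
        return "Warning"
    return "OK"


def format_email(reading: Reading, log_lines: List[str]) -> str:
    """Build a plain-text email body with the alert level and gathered logs."""
    level = classify(reading)
    header = f"[{level}] {reading.host}: {reading.component} = {reading.value}"
    body = "\n".join(f"  {line}" for line in log_lines) or "  (no logs collected)"
    return f"{header}\nRecent logs:\n{body}"


# Made-up reading for illustration.
reading = Reading("node-1", "cpu_temp_c", 92.0, warning_at=80.0, critical_at=90.0)
email = format_email(reading, ["sensor CPU1 Temp upper critical going high"])
```

A ticketing step could then attach this body to a repair ticket addressed to the relevant supplier.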

The infrastructure runs on Mesos clusters, with Marathon orchestrating Docker containers, and uses a Celery‑based Beat/Worker architecture to provide high availability, elastic scaling, and asynchronous task processing.
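The Beat/Worker split separates deciding when a task is due from actually running it. The sketch below is a standard-library analogy of that pattern, not Celery's API: the `Beat` class, `drain` function, and task names are hypothetical, and an in-process queue stands in for the message broker.

```python
import queue
from typing import Callable, Dict, List


class Beat:
    """Scheduler side: enqueue the name of each task whose interval has elapsed."""

    def __init__(self, schedule: Dict[str, float], tasks: queue.Queue):
        self.schedule = schedule  # task name -> interval in seconds
        self.tasks = tasks
        self.last_run: Dict[str, float] = {}

    def tick(self, now: float) -> None:
        for name, interval in self.schedule.items():
            last = self.last_run.get(name)
            if last is None or now - last >= interval:
                self.tasks.put(name)
                self.last_run[name] = now


def drain(tasks: queue.Queue, handlers: Dict[str, Callable[[], str]]) -> List[str]:
    """Worker side: run every queued task and return the results in order."""
    results = []
    while True:
        try:
            name = tasks.get_nowait()
        except queue.Empty:
            return results
        results.append(handlers[name]())


# Usage: one periodic task, three scheduler ticks, one worker pass.
pending = queue.Queue()
beat = Beat({"collect_hardware_info": 60.0}, pending)
for now in (0.0, 30.0, 60.0):
    beat.tick(now)
results = drain(pending, {"collect_hardware_info": lambda: "collected"})
```

Because the scheduler only enqueues work, workers can be scaled out independently, which is the property that makes this split a natural fit for containers scheduled by Marathon.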

4. Summary Review

After deploying the automation system, Qunar achieved unattended operations, unified hardware lifecycle management, improved quality and risk control, and significantly reduced operational costs.
