
How Qunar Built an Automated Hardware Operations Platform to Boost Efficiency

This article details Qunar's end‑to‑end hardware automation system, covering background challenges, lifecycle management, automated testing, data collection, fault detection, and visualized monitoring, and explains how the integrated platform reduces manual effort, improves reliability, and cuts operational costs.

Efficient Ops

Aug 16, 2017


Preface

I am pleased to share Qunar's experience in hardware operations automation. The talk is divided into four parts: background overview, work description, specific implementation, and summary review.


1. Background Overview

Our hardware scope ranges from every rack in the data‑center to each individual device, including servers, network switches, routers, etc. The hardware can be grouped into four categories.

We further drill down to each device, such as a server, and examine components like CPU, memory, power supply, fans, and other peripherals. Before automation, Qunar faced several pain points: a single engineer had to manage tens of thousands of servers; operations such as rack‑mounting, migration, and provisioning were labor‑intensive; hardware quality was uncontrolled; fault handling was slow; and manual SSH access posed security risks.

To address these pain points we aimed for automation and intelligence: ensure operational safety, guarantee hardware quality, improve efficiency, and ultimately reduce costs.

2. Work Description

The core concept is the hardware lifecycle, which covers five stages: selection, procurement, rack‑mounting, operation, and decommissioning.

We perform targeted work for each stage: selection testing, arrival inspection, monitoring and alerting, and disposal handling.

3. Specific Implementation

Suppliers provide reference data that often overstates performance; our tests reveal the real gain is much lower. Therefore we focus on cost‑performance and choose configurations that match our actual workload.

We standardize BIOS, RAID, and OS settings to obtain peak performance, and we score each hardware configuration on CPU, memory, and I/O metrics.
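As a rough illustration of scoring a configuration on CPU, memory, and I/O, the sketch below normalizes each benchmark result against a reference configuration and combines them with weights. The weights, the `BenchmarkResult` fields, and the price-based cost‑performance ratio are assumptions made for this example; the article does not describe the actual scoring formula.

```python
from dataclasses import dataclass

# Hypothetical weights: the article does not say how the metrics are combined.
WEIGHTS = {"cpu": 0.4, "memory": 0.3, "io": 0.3}


@dataclass
class BenchmarkResult:
    name: str
    cpu: float     # benchmark result, higher is better
    memory: float  # benchmark result, higher is better
    io: float      # benchmark result, higher is better
    price: float   # purchase price, any consistent currency


def score(result: BenchmarkResult, reference: BenchmarkResult) -> float:
    """Weighted score relative to the reference (the reference scores 1.0)."""
    return sum(
        weight * getattr(result, metric) / getattr(reference, metric)
        for metric, weight in WEIGHTS.items()
    )


def cost_performance(result: BenchmarkResult, reference: BenchmarkResult) -> float:
    """Score per unit of relative price; above 1.0 beats the reference."""
    return score(result, reference) / (result.price / reference.price)
```

Under these assumed weights, a candidate with 20% more CPU throughput at a 20% higher price scores about 1.08 but has a cost‑performance of about 0.9, which is the kind of gap between raw performance and value that the selection stage is meant to expose.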

When bulk shipments arrive, we encounter five common issues: missing components, defective parts, batch‑level defects, configuration mismatches, and damage during transport. Our platform verifies that the delivered configuration matches the tested baseline, performance meets standards, and no faults exist before deployment.
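The per‑machine part of this arrival check could look roughly like the following. It is a minimal sketch under stated assumptions: the dictionary layout, the key names, and the 5% performance tolerance are all hypothetical, and batch‑level defect detection (which needs results from many machines) is not covered.

```python
def inspect_arrival(baseline: dict, delivered: dict, perf_tolerance: float = 0.05) -> list:
    """Compare one delivered machine with the tested baseline.

    Returns a list of human-readable issues; an empty list means the
    machine passed. The 5% tolerance is an assumed default.
    """
    issues = []

    # Configuration: every baseline item must be present and identical.
    for key, expected in baseline["config"].items():
        actual = delivered["config"].get(key)
        if actual is None:
            issues.append(f"missing component: {key}")
        elif actual != expected:
            issues.append(
                f"configuration mismatch: {key} expected {expected!r}, got {actual!r}"
            )

    # Performance: each metric must reach the baseline within the tolerance.
    for metric, expected in baseline["performance"].items():
        actual = delivered["performance"].get(metric, 0.0)
        if actual < expected * (1 - perf_tolerance):
            issues.append(f"performance below baseline: {metric}")

    # Faults reported by the hardware itself block deployment.
    for fault in delivered.get("faults", []):
        issues.append(f"fault reported: {fault}")

    return issues
```

A machine is released for deployment only when the returned list is empty.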

Data collection uses a hybrid approach: a daemon agent on each machine pushes data to the backend, while remote agents retrieve metrics that cannot be collected from the machine itself. Together they cover the diverse data needs of data‑center operations: hardware inventory, performance baselines, fault records, and time‑series system metrics.
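The push and pull paths can be pictured with a small in‑memory stand‑in for the backend. Everything here is hypothetical (the class name, the `fetch` callable, the record layout); a real deployment would use a network transport and persistent storage, neither of which the article describes.

```python
import time
from typing import Callable, Dict


class MetricsBackend:
    """In-memory stand-in for the backend that receives hardware data."""

    def __init__(self) -> None:
        # host -> source name -> latest record from that source
        self.records: Dict[str, Dict[str, dict]] = {}

    def push(self, host: str, source: str, payload: dict) -> None:
        """Push path: called by the agent running on the machine."""
        self.records.setdefault(host, {})[source] = {
            "ts": time.time(),
            "data": payload,
        }

    def pull(self, host: str, source: str, fetch: Callable[[str], dict]) -> None:
        """Pull path: a remote collector fetches what the host cannot report."""
        try:
            payload = fetch(host)
        except Exception as exc:  # keep the failure visible instead of dropping it
            payload = {"error": str(exc)}
        self.push(host, source, payload)

    def merged(self, host: str) -> dict:
        """Combine the latest data from every source for one host."""
        combined: dict = {}
        for record in self.records.get(host, {}).values():
            combined.update(record["data"])
        return combined
```

Keeping both paths behind one store means downstream consumers do not need to know which path produced a given value.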

We maintain an internal CMDB that stores hardware configuration, status (online, under repair, faulty), and metadata such as rack location, serial numbers, RAID and SSD details. A second system, Watcher, aggregates real‑time time‑series metrics from servers, containers, databases, and cloud services, providing both monitoring and alert configuration.
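A CMDB entry of the kind described above might be modeled as follows. This is only a sketch: the field names and the `by_status` helper are assumptions, and the real schema of Qunar's CMDB is not given in the article.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Status(Enum):
    ONLINE = "online"
    UNDER_REPAIR = "under_repair"
    FAULTY = "faulty"


@dataclass
class HardwareRecord:
    serial_number: str
    rack_location: str                         # e.g. "room/rack/unit", format assumed
    status: Status = Status.ONLINE
    raid: dict = field(default_factory=dict)   # RAID details
    ssds: list = field(default_factory=list)   # one entry per SSD


def by_status(records: List[HardwareRecord], status: Status) -> List[str]:
    """Return the serial numbers of all records in the given status."""
    return [r.serial_number for r in records if r.status is status]
```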

Automation includes rule‑based formatting of raw hardware data into a unified schema, enabling consistent downstream processing regardless of vendor or batch variations.
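One way to express such formatting rules is a per‑vendor table that maps each raw key to a unified field name and a converter. The vendor names, raw keys, and unified field names below are invented for illustration and do not come from the article.

```python
import re


def _to_gib(value) -> float:
    """Convert strings such as '64 GB' or '65536MB' to GiB as a float."""
    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*(MB|GB|TB)\s*", str(value), re.IGNORECASE)
    if match is None:
        raise ValueError(f"unrecognized size: {value!r}")
    number, unit = float(match.group(1)), match.group(2).upper()
    return number * {"MB": 1 / 1024, "GB": 1, "TB": 1024}[unit]


# Hypothetical rules: raw key -> (unified field name, converter).
RULES = {
    "vendor_a": {
        "SerialNo": ("serial_number", str.strip),
        "MemSize": ("memory_gib", _to_gib),
    },
    "vendor_b": {
        "sn": ("serial_number", str.strip),
        "total_memory": ("memory_gib", _to_gib),
    },
}


def normalize(vendor: str, raw: dict) -> dict:
    """Apply the vendor's rules; keys without a rule are dropped."""
    rules = RULES[vendor]
    unified = {}
    for key, value in raw.items():
        if key in rules:
            name, convert = rules[key]
            unified[name] = convert(value)
    return unified
```

With rules like these, two vendors that report the same machine differently produce identical unified records, so supporting a new vendor or batch means adding a rule table rather than changing downstream code.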

Fault handling is streamlined: alarms are classified into Critical, Warning, and OK; critical alerts trigger automated log collection, formatted email generation, and ticket creation. The system tracks repair progress and ensures closure.
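The critical‑alert path described above (log collection, formatted email, ticket, tracking until closure) could be sketched like this. The `FaultHandler` class, its `collect_logs` callable, and the email layout are hypothetical; sending the email and talking to a real ticketing system are left out.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import count
from typing import Callable, Dict, List, Optional


class Severity(Enum):
    CRITICAL = "Critical"
    WARNING = "Warning"
    OK = "OK"


@dataclass
class Ticket:
    id: int
    host: str
    summary: str
    logs: List[str]
    email_body: str
    closed: bool = False


class FaultHandler:
    def __init__(self, collect_logs: Callable[[str], List[str]]) -> None:
        self._collect_logs = collect_logs  # assumed hook that gathers logs for a host
        self._ids = count(1)
        self.tickets: Dict[int, Ticket] = {}

    def handle(self, host: str, severity: Severity, summary: str) -> Optional[Ticket]:
        """Open a ticket for critical alarms; other severities open nothing."""
        if severity is not Severity.CRITICAL:
            return None
        logs = self._collect_logs(host)
        body = "\n".join([f"Host: {host}", f"Fault: {summary}", "Logs:", *logs])
        ticket = Ticket(next(self._ids), host, summary, logs, body)
        self.tickets[ticket.id] = ticket
        return ticket

    def close(self, ticket_id: int) -> None:
        """Mark a repair as finished."""
        self.tickets[ticket_id].closed = True

    def open_tickets(self) -> List[Ticket]:
        """Tickets still awaiting repair, used to make sure every fault is closed."""
        return [t for t in self.tickets.values() if not t.closed]
```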

Visualization tools display rack layouts, temperature, power consumption, and fault statistics, allowing operators to quickly assess the health of the data‑center.

4. Summary Review

By implementing this automated hardware operations platform, Qunar achieved unattended operation, integrated testing, and fault tracking, freeing engineers to focus on higher‑value tasks, improving reliability, reducing risk, and significantly lowering operational costs.
