Open-falcon in Automotive Home: Application, Architecture, and Customizations

This article describes how the open‑falcon monitoring system is applied and customized at Automotive Home, covering its architecture, component roles, a comparison with other open‑source solutions, and the enhancements made for service‑tree based dynamic monitoring, alerting, self‑healing, and high‑availability deployment.

HomeTech

Dec 30, 2021

The article introduces the use of Xiaomi's open‑falcon monitoring system at Automotive Home, explaining its application scenarios and the improvements made to meet the platform's needs.

Basic Monitoring Solution Comparison

Traditional monitoring tools like Zabbix no longer meet the performance and scalability requirements of fast‑growing internet companies. A comparison of three open‑source monitoring solutions—Zabbix, Prometheus, and Open-falcon—is presented, evaluating installation complexity, data collection support, storage difficulty, and alarm support.

Criteria compared: Installation Complexity, Data Collection Support, Data Storage Difficulty, Alarm Support

Zabbix: Medium, Low, High

Prometheus: Low, High, Medium

Open-falcon: High, Medium, Low, Medium

The comparison shows that while Open-falcon is not the most feature‑rich, it offers the simplest deployment and low storage overhead, making it suitable for the company's scale and requirements.

Open-falcon Architecture Overview

Open-falcon is an open-source, highly available, and extensible monitoring solution developed by Xiaomi's operations team. Its frontend and backend are separated: the backend is written in Go and the frontend in Python. An agent installed on each monitored machine pushes metrics to the Transfer component.
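
As a rough illustration of what such a push carries, the sketch below builds metric items using the field names of open-falcon's push format (endpoint, metric, timestamp, step, value, counterType, tags). The helper `build_metric` and the host name `web-01.example` are hypothetical, and nothing is sent over the network here.

```python
import json
import time


def build_metric(endpoint, metric, value, step=60,
                 counter_type="GAUGE", tags="", timestamp=None):
    """Build one metric item (hypothetical helper, not part of open-falcon)."""
    return {
        "endpoint": endpoint,          # the monitored object, usually a host name
        "metric": metric,              # e.g. cpu.idle or load.1min
        "timestamp": int(time.time()) if timestamp is None else int(timestamp),
        "step": step,                  # reporting interval in seconds
        "value": value,
        "counterType": counter_type,   # GAUGE for instantaneous values
        "tags": tags,                  # optional "k1=v1,k2=v2" string
    }


# Two sample items for one host; the timestamp is fixed only to keep the example repeatable.
payload = [
    build_metric("web-01.example", "cpu.idle", 93.5, timestamp=1640822400),
    build_metric("web-01.example", "load.1min", 0.42, timestamp=1640822400),
]
body = json.dumps(payload)  # a JSON array, one object per metric
```

Batching several items into one array keeps the number of requests per reporting interval low, which matters when every monitored machine reports every 60 seconds.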

Key components:

A) Agent: Collects metrics (e.g., cpu.idle, load.1min) every 60 seconds and pushes them to Transfer over a long-lived connection. It runs on Linux; Automotive Home has also released a Windows agent.

B) Transfer: Receives data from agents, shards it by consistent hash, and forwards it to Graph and Judge.

C) Judge: Evaluates metrics against configured strategies and expressions to trigger alerts.

D) Alarm: Persists alert events to MySQL, pushes them to Redis queues, and sends notifications asynchronously.

E) Graph: Stores time-series data in memory and RRD files, and serves query requests for dashboards.

F) Query: Provides a single entry point for querying stored data, routing each request to the Graph instance that holds it.

G) HBS (Heartbeat Server): Caches data to speed up access by other components.

H) Dashboard: User-facing interface for visualizing metrics and trends.
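
The hash-based sharding done by Transfer can be sketched with a small consistent-hash ring. This is a minimal sketch under stated assumptions, not open-falcon's actual code: the MD5 hash, the 100 virtual nodes per instance, the `endpoint/metric` key, and the node names `judge-00`, `graph-00`, and so on are all illustrative choices.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative, not open-falcon's implementation)."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring `replicas` times to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First ring position clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[idx][1]


def route(item, judge_ring, graph_ring):
    """Pick the Judge and Graph instances that should receive one metric item."""
    key = f"{item['endpoint']}/{item['metric']}"  # assumed sharding key
    return judge_ring.node_for(key), graph_ring.node_for(key)


judge_ring = HashRing(["judge-00", "judge-01", "judge-02"])
graph_ring = HashRing(["graph-00", "graph-01"])
target = route({"endpoint": "web-01.example", "metric": "cpu.idle"}, judge_ring, graph_ring)
```

Because the same key always maps to the same instance, every data point of one series reaches the same Judge and the same Graph. Removing an instance only moves the keys that instance owned.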

Customizations for Automotive Home

To integrate with the platform's CMDB, the Dashboard and HBS components were rewritten to take their monitoring objects from a service tree. This enables automatic template binding and inheritance of alert strategies, and reduces manual configuration.

Dynamic service‑tree based monitoring templates allow automatic inheritance and independent configuration; new nodes automatically adopt appropriate templates.
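
One way to picture template inheritance on a service tree is to walk from a node up to the root and collect the templates bound along the way. The sketch below assumes a simple parent-pointer tree; the `Node` class, the `inherit` flag, and the template names are hypothetical and only illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One service-tree node (hypothetical model)."""
    name: str
    parent: Optional["Node"] = None
    templates: List[str] = field(default_factory=list)
    inherit: bool = True  # False means the node is configured independently


def effective_templates(node):
    """Return templates that apply to `node`, ancestors first, without duplicates."""
    chain = []
    current = node
    while current is not None:
        chain.append(current)
        if not current.inherit:
            break  # stop climbing: this node opted out of inheritance
        current = current.parent
    result = []
    for n in reversed(chain):
        for template in n.templates:
            if template not in result:
                result.append(template)
    return result


root = Node("company", templates=["base-linux"])
service = Node("web", parent=root, templates=["nginx"])
new_host = Node("web-03", parent=service)  # a new node with nothing bound yet
standalone = Node("legacy", parent=root, templates=["custom"], inherit=False)

inherited = effective_templates(new_host)    # ['base-linux', 'nginx']
independent = effective_templates(standalone)  # ['custom']
```

A newly added node needs no configuration of its own in this model, which matches the goal of reducing manual binding.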

Alert targets are no longer limited to servers; they can be services, hosts, or container nodes, and subscription configurations can be set per metric to notify different stakeholders.

A self‑healing feature was added: when an alert fires, a predefined scenario (composed of Salt‑executed tasks, scripts, or callbacks) runs automatically, with the ability to cancel the scenario before execution.
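
The cancel-before-execution behavior can be sketched as a grace period between the alert and the scenario run. This is a minimal sketch, assuming a 60-second window and a polling `tick`; the `SelfHealer` class is hypothetical, and the plain callables stand in for Salt tasks, scripts, or callbacks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PendingRun:
    alert_id: str
    steps: List[Callable[[], str]]  # stand-ins for Salt tasks, scripts, or callbacks
    run_at: float                   # earliest time the scenario may start


class SelfHealer:
    """Queues a scenario when an alert fires and runs it after a grace period."""

    def __init__(self, grace_seconds=60.0):
        self.grace_seconds = grace_seconds
        self.pending = {}
        self.log = []  # (alert_id, step result) pairs, in execution order

    def on_alert(self, alert_id, steps, now):
        self.pending[alert_id] = PendingRun(alert_id, steps, now + self.grace_seconds)

    def cancel(self, alert_id):
        """Cancel a queued scenario; returns False if it already ran or never existed."""
        return self.pending.pop(alert_id, None) is not None

    def tick(self, now):
        """Run every scenario whose grace period has elapsed."""
        due = [a for a, run in self.pending.items() if run.run_at <= now]
        for alert_id in due:
            run = self.pending.pop(alert_id)
            for step in run.steps:
                self.log.append((alert_id, step()))


healer = SelfHealer(grace_seconds=60)
healer.on_alert("disk-full", [lambda: "clean tmp", lambda: "restart service"], now=0)
healer.tick(now=30)   # still inside the grace period, nothing runs
healer.tick(now=60)   # grace period over, both steps run in order
```

Passing `now` explicitly keeps the logic testable without real timers; a production version would more likely be driven by a scheduler or a delayed queue.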

Alert components now support custom plugins, nodata handling, and multiple notification channels (DingTalk, SMS, phone) via internal notification interfaces.
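
Nodata handling is commonly done by emitting a placeholder value for any series that has stopped reporting, so that an ordinary alert rule can match it. The sketch below assumes that approach; the function `fill_nodata`, the three-missed-steps threshold, and the placeholder value `-1.0` are illustrative assumptions, not settings taken from the article.

```python
def fill_nodata(expected, last_seen, now, step=60, missed_steps=3, mock_value=-1.0):
    """Return placeholder items for expected series that have gone silent.

    expected  -- iterable of (endpoint, metric) pairs that should be reporting
    last_seen -- dict mapping (endpoint, metric) to the last report timestamp
    """
    filled = []
    for endpoint, metric in expected:
        seen = last_seen.get((endpoint, metric))
        if seen is None or now - seen > step * missed_steps:
            filled.append({
                "endpoint": endpoint,
                "metric": metric,
                "timestamp": now,
                "step": step,
                "value": mock_value,  # a value no healthy series would report
            })
    return filled


expected = [("web-01", "agent.alive"), ("web-02", "agent.alive")]
last_seen = {("web-01", "agent.alive"): 1000}  # web-02 has never reported
missing = fill_nodata(expected, last_seen, now=1060)
```

An alert strategy such as "value equals -1" can then fire on the placeholder, which turns silence into a normal alert event.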

Global and service‑tree based alert silencing is implemented, automatically suppressing alerts for assets in non‑operational states (e.g., installation, decommissioning).
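
The silencing rules can be read as a check that runs before any notification is sent. The sketch below is one possible shape for it; the state names in `NON_OPERATIONAL`, the slash-separated service-tree paths, and the function `is_silenced` are assumptions made for illustration.

```python
# Asset states treated as non-operational (hypothetical names).
NON_OPERATIONAL = {"installing", "decommissioning"}


def is_silenced(node_path, asset_state, global_silence=False, silenced_paths=()):
    """Decide whether an alert for the given service-tree node should be suppressed."""
    if global_silence:
        return True
    if asset_state in NON_OPERATIONAL:
        return True
    # A silenced node also silences everything beneath it in the tree.
    return any(
        node_path == path or node_path.startswith(path + "/")
        for path in silenced_paths
    )


suppressed = is_silenced("/auto/web/web-03", "online", silenced_paths=["/auto/web"])
delivered = not is_silenced("/auto/webapp/app-01", "online", silenced_paths=["/auto/web"])
```

Matching on `path + "/"` rather than a bare prefix keeps `/auto/web` from accidentally silencing an unrelated sibling such as `/auto/webapp`.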

High availability is achieved by deploying most components in active-active mode across data centers. Judge, Graph, and Nodata rely on standby failover instead: when an instance fails, a standby takes over and Transfer and Query are reconfigured to point to it.
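
The reconfiguration step of a standby failover can be sketched as swapping the address behind a logical node name. This assumes that Transfer and Query hold a map from node names to addresses and shard by node name; the function `failover` and all names and addresses below are hypothetical.

```python
def failover(cluster, failed_name, standby_pool):
    """Return a new cluster map with `failed_name` pointing at a standby address.

    cluster      -- dict mapping logical node name to "host:port"
    standby_pool -- list of spare "host:port" addresses
    """
    if failed_name not in cluster:
        raise KeyError(failed_name)
    if not standby_pool:
        raise RuntimeError("no standby available")
    new_cluster = dict(cluster)          # leave the original map untouched
    new_cluster[failed_name] = standby_pool[0]
    return new_cluster, standby_pool[1:]


judge_cluster = {"judge-00": "10.0.0.1:6080", "judge-01": "10.0.0.2:6080"}
standbys = ["10.0.1.1:6080"]
new_cluster, remaining = failover(judge_cluster, "judge-00", standbys)
```

Keeping the logical name stable means the set of shard names does not change, so every series stays assigned to the same logical node and only the address behind it moves.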

Future Outlook

The current monitoring stack relies on open‑falcon‑v0.1, which has not been upgraded for years. Plans include migrating to Nightingale for better query performance and automatic fault isolation, as well as developing richer hardware monitoring agents in collaboration with server manufacturers.
