Improvements and Architecture of Mt-Falcon Monitoring System

Mt‑Falcon, Meituan’s re‑engineered successor to Zabbix, introduces a modular architecture—Agent, Transfer, HBS, Judge, Graph, Alarm, Portal—and extensive refactorings that boost memory efficiency, asynchronous data handling, multi‑condition alerts, and API exposure, enabling over one million QPS, 200 million metrics, and robust, scalable monitoring across the company.

Meituan Technology Team

Feb 24, 2017

Monitoring is a critical component of any business system, acting like the eyes that continuously observe data centers, networks, servers, and applications, and trigger timely responses when issues arise.

Meituan initially used Zabbix for monitoring, which scaled to over 20,000 machines and 4.5 million metrics, but faced serious limitations such as single‑point bottlenecks, difficult customization, complex configuration, and insufficient APIs.

To overcome these problems, Meituan evaluated alternatives and adopted the open‑source Open‑Falcon project originally released by Xiaomi, subsequently evolving it into Mt‑Falcon with a series of architectural and functional enhancements.

Mt‑Falcon Architecture

The system consists of Agent, Transfer, HBS (HeartBeat Server), Judge, Graph, Alarm, and Portal/Dashboard components. Images of the original Open‑Falcon and the enhanced Mt‑Falcon architectures are provided in the original article.

Key Improvements

1. Agent Refactoring

• Asynchronous data forwarding: metrics are cached locally and reported in batches (up to 10,000 items every 0.5 s), eliminating data loss under high load.

• NIC type tagging: the agent automatically adds a tag indicating whether the network card is 1 GbE, 10 GbE, dual‑1 GbE, or dual‑10 GbE, enabling precise alarm thresholds.

• Process‑level coredump monitoring: a dedicated metric is generated when a watched process generates a core dump.

• Automatic log rotation: the new Go‑Logger library splits logs by size or date and retains a configurable number of files.

• Hostname conflict resolution: the agent now reads the hostname from /etc/sysconfig/network to avoid accidental hostname changes, and can report the machine’s IP as an auxiliary metric.

• Agent heartbeat monitoring: the agent maintains a heartbeat with HBS; missing heartbeats for more than five minutes trigger an alarm.
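The asynchronous forwarding described in the first bullet can be sketched as a local queue that collectors append to and a periodic flusher drains. This is a minimal illustration, not the agent's actual code: the class name `BatchForwarder`, the `send` callback, and the re-queue-on-failure behavior are assumptions; only the 10,000-item cap and the 0.5 s interval come from the text above.

```python
from collections import deque
from typing import Callable, Deque, Dict, List

MAX_BATCH = 10_000     # upper bound per report, as stated in the article
FLUSH_INTERVAL = 0.5   # seconds between reports, as stated in the article

Metric = Dict[str, object]


class BatchForwarder:
    """Hypothetical local cache that decouples collection from reporting."""

    def __init__(self, send: Callable[[List[Metric]], None],
                 max_batch: int = MAX_BATCH) -> None:
        self._queue: Deque[Metric] = deque()
        self._send = send            # assumed transport callback (e.g. an RPC call)
        self._max_batch = max_batch

    def push(self, metric: Metric) -> None:
        # Collectors only append; they never wait on the network.
        self._queue.append(metric)

    def flush_once(self) -> int:
        # Take at most max_batch items; the rest wait for the next tick.
        n = min(len(self._queue), self._max_batch)
        if n == 0:
            return 0
        batch = [self._queue.popleft() for _ in range(n)]
        try:
            self._send(batch)
        except Exception:
            # Assumption: on failure, restore the batch in its original order.
            self._queue.extendleft(reversed(batch))
            raise
        return n
```

In a real agent, a background timer would call `flush_once()` every `FLUSH_INTERVAL` seconds, so a slow Transfer only delays the queue instead of blocking collectors.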

2. HBS Refactoring

• Memory optimization: replaced JSON‑RPC (encoding/json) with RPC + MessagePack, reducing peak memory usage from >50 GB to <6 GB.

• API for aggregated monitoring policies: an endpoint allows querying the final set of policies applied to a specific host, handling group‑template‑policy inheritance.

• Template inheritance fix: policy deduplication now uses both policy ID and action ID, ensuring that multiple child templates derived from the same parent can coexist.
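The template inheritance fix amounts to changing the deduplication key. The sketch below assumes a simplified `Policy` record (the field names are hypothetical) and shows why keying on the pair (policy ID, action ID) keeps policies that two child templates inherit from the same parent but bind to different actions.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Policy(NamedTuple):
    # Hypothetical, simplified shape of a monitoring policy.
    policy_id: int
    action_id: int
    metric: str


def dedup_policies(policies: Iterable[Policy]) -> List[Policy]:
    """Keep the first policy seen for each (policy_id, action_id) pair."""
    seen: Dict[Tuple[int, int], Policy] = {}
    for p in policies:
        # Keying on policy_id alone would drop the second child's copy here.
        seen.setdefault((p.policy_id, p.action_id), p)
    return list(seen.values())
```

With this key, an exact duplicate is still collapsed, while the same parent policy routed to two different actions survives as two entries.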

3. Transfer Refactoring

• Endpoint blacklist: supports disabling an entire endpoint or metrics with a specific prefix, preventing runaway metric explosion.

• OpenTSDB forwarding: selected critical metrics are duplicated to OpenTSDB for long‑term raw data storage.
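A blacklist check of the kind described above could look like the following sketch. The class and parameter names are illustrative, and it assumes that prefix rules are scoped to a single endpoint; the article does not specify the rule format.

```python
from typing import Dict, Iterable, Mapping, Optional, Set, Tuple


class Blacklist:
    """Hypothetical filter applied by Transfer before forwarding a metric."""

    def __init__(self, endpoints: Iterable[str] = (),
                 prefixes: Optional[Mapping[str, Iterable[str]]] = None) -> None:
        # Endpoints that are disabled entirely.
        self._endpoints: Set[str] = set(endpoints)
        # Per-endpoint metric prefixes that are dropped (assumed scoping).
        self._prefixes: Dict[str, Tuple[str, ...]] = {
            ep: tuple(p) for ep, p in (prefixes or {}).items()
        }

    def is_blocked(self, endpoint: str, metric: str) -> bool:
        if endpoint in self._endpoints:
            return True
        # str.startswith accepts a tuple; an empty tuple never matches.
        return metric.startswith(self._prefixes.get(endpoint, ()))
```

Dropping blocked items at Transfer keeps a misbehaving reporter from flooding Judge and Graph downstream.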

4. Judge Refactoring

• Memory optimization: only metrics with configured alarm strategies are cached, reducing memory footprint.

• Persistent alarm state: alarm status is flushed to local disk on shutdown and reloaded on startup, preventing duplicate alerts after a restart.

• Alarm escalation: if an event is not resolved within 20 minutes, it is escalated from the primary to the secondary alarm group.

• ACK support: similar to Zabbix, users can acknowledge an alarm via a generated link that propagates through Transfer to Judge.

• Tag reverse selection: allows exclusion of specific tags (e.g., ^mount=/dev/shm) while keeping a generic monitoring rule.

• Multi‑condition forwarding: events marked as multi‑condition are routed to the plus_judge module for combined evaluation.
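Tag reverse selection can be illustrated with a small matcher. This sketch assumes a rule is a comma-separated list of `key=value` pairs in which a leading `^` marks an exclusion, as in the `^mount=/dev/shm` example above; the function names and the exact rule grammar are assumptions.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def parse_tag_rule(rule: str) -> Tuple[List[Pair], List[Pair]]:
    """Split a rule into (include, exclude) lists of key/value pairs."""
    include: List[Pair] = []
    exclude: List[Pair] = []
    for part in (p.strip() for p in rule.split(",")):
        if not part:
            continue
        negated = part.startswith("^")
        key, _, value = (part[1:] if negated else part).partition("=")
        (exclude if negated else include).append((key, value))
    return include, exclude


def rule_matches(rule: str, tags: Dict[str, str]) -> bool:
    """True if the metric's tags satisfy every include and no exclude."""
    include, exclude = parse_tag_rule(rule)
    if any(tags.get(k) == v for k, v in exclude):
        return False
    return all(tags.get(k) == v for k, v in include)
```

For example, `rule_matches("^mount=/dev/shm", {"mount": "/data"})` is true, while the same rule rejects a metric tagged `mount=/dev/shm`, so one generic disk rule can cover every mount point except the excluded one.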

5. Graph Refactoring

• Index storage migration: moved from MySQL to a Redis + Tair solution, with optional BoltDB support.

• Automatic expiration and recreation of stale indexes.

• Historical data retrieval fix: for queries beyond 12 hours, missing points in RRD files are filled with cached raw data, improving completeness.
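The historical data fix can be pictured as a merge step between the points read from an RRD file and the raw points still held in cache. The sketch below assumes both sources use the same timestamps and represents a missing RRD point as `None`; the function name and data shapes are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[int, Optional[float]]  # (unix timestamp, value or None if missing)


def fill_from_cache(rrd_points: List[Point],
                    cached: Dict[int, float]) -> List[Point]:
    """Replace missing RRD values with cached raw values where available."""
    filled: List[Point] = []
    for ts, value in rrd_points:
        if value is None and ts in cached:
            value = cached[ts]
        # Points missing from both sources stay None.
        filled.append((ts, value))
    return filled
```

Values already present in the RRD result are left untouched, so the cache only closes gaps and never overrides stored data.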

6. Alarm Refactoring

• Alarm aggregation: alarms are merged by metric, with the first three sent immediately and subsequent ones batched per minute.

• Distributed consumption: alarm tasks are stored in Redis, enabling multiple Alarm instances to process them concurrently.

• Priority‑based delivery: supports priority levels p0 to p9, with configurable channels (SMS, IM, email, phone).

• Persistence and statistics: alarms are now stored in MySQL, with daily top‑10 statistics by service, host, and owner, as well as a 7‑day trend chart and “red board” for high‑severity services.

• Responsible‑person routing: actions can be configured to notify the owner directly, and base‑monitoring items automatically trigger alerts to the responsible party.
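The aggregation rule in the first bullet (first three alarms sent immediately, later ones batched per minute) can be sketched as a per-metric counter plus a pending list. The class name is illustrative, and resetting the counters on each flush is an assumption; the article does not say when the count of three starts over.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List

IMMEDIATE_LIMIT = 3  # first three alarms per metric go out at once (from the article)


class AlarmAggregator:
    """Hypothetical per-metric alarm merger."""

    def __init__(self, immediate_limit: int = IMMEDIATE_LIMIT) -> None:
        self._limit = immediate_limit
        self._count: DefaultDict[str, int] = defaultdict(int)
        self._pending: DefaultDict[str, List[str]] = defaultdict(list)

    def on_alarm(self, metric: str, message: str) -> bool:
        """Return True if the alarm should be sent right away."""
        self._count[metric] += 1
        if self._count[metric] <= self._limit:
            return True
        self._pending[metric].append(message)
        return False

    def flush(self) -> Dict[str, List[str]]:
        """Intended to run once per minute: return merged alarms per metric."""
        merged = {m: msgs for m, msgs in self._pending.items() if msgs}
        self._pending.clear()
        self._count.clear()  # assumption: the immediate quota resets each window
        return merged
```

Each entry returned by `flush()` would be delivered as a single merged notification, which is what keeps a large outage from producing one message per host.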

7. Portal/Dashboard Refactoring

• Service‑tree binding for template creation and policy assignment.

• Full API exposure with authentication for external integration.

• Operation audit logging.

• Shift‑key multi‑selection, unified dark‑theme line colors, index self‑maintenance UI, refresh button for latest data, single‑chart refresh, and environment‑specific template application.

8. New Modules

• Ping monitoring using fping across data centers.

• String‑type metric handling via a dedicated string_judge module.

• Period‑over‑period comparison monitoring with diff and pdiff functions, optionally backed by OpenTSDB.

• Multi‑condition monitoring via the plus_judge module, which aggregates strategies, generates a unique sequence number, and fires an alarm only when all conditions are satisfied.
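The all-conditions-must-hold behavior of plus_judge can be sketched as set bookkeeping keyed by the sequence number. The class and method names below are hypothetical, and the real module's state handling is not described in the article; the sketch only shows the firing rule.

```python
from typing import Dict, Iterable, Set


class MultiConditionJudge:
    """Hypothetical tracker: one sequence number groups several strategies."""

    def __init__(self) -> None:
        self._required: Dict[str, Set[int]] = {}
        self._triggered: Dict[str, Set[int]] = {}

    def register(self, seq: str, strategy_ids: Iterable[int]) -> None:
        # seq stands in for the unique sequence number of a combined rule.
        self._required[seq] = set(strategy_ids)
        self._triggered[seq] = set()

    def update(self, seq: str, strategy_id: int, triggered: bool) -> bool:
        """Record one condition's latest state; True means fire the alarm."""
        if strategy_id not in self._required[seq]:
            raise KeyError(f"strategy {strategy_id} not part of {seq}")
        active = self._triggered[seq]
        if triggered:
            active.add(strategy_id)
        else:
            active.discard(strategy_id)
        # Fire only when every registered condition currently holds.
        return active == self._required[seq]
```

A rule such as "CPU high and load high" would register two strategy IDs under one sequence number; the alarm fires only once both have reported a triggered state, and stops qualifying as soon as either recovers.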

Conclusion

Mt‑Falcon has completely replaced Zabbix at Meituan, handling over 1 million QPS and more than 200 million monitoring items. Future work focuses on unified monitoring across Meituan, configuration UI overhaul, automated alarm handling, and data operations.

Author Bio

Da Shan leads the SRE monitoring team at Meituan‑Dianping and previously worked at Gaode and Sina. He focuses on fault auto‑tracking, automated remediation, and data‑driven operations to continuously improve monitoring stability, usability, and extensibility.

