Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation
This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.
Monitoring is a critical part of operations, ensuring timely detection of issues and providing data for post‑incident analysis. In large‑scale environments with tens of thousands of devices across global data centers, guaranteeing real‑time, accurate data collection and long‑term storage presents unprecedented challenges.
Wonder Monitoring System Overview
To address these challenges, we deeply customized Open‑Falcon and built an end‑to‑end monitoring solution called Wonder, which is now live and provides comprehensive monitoring for all company services.
Why Build Our Own?
Performance bottlenecks in existing solutions
Need for extensive customization
Improved usability
Full automation of monitoring workflows
Key Features
1. Agent Enhancements
Automatic update: agents report their current version and upgrade themselves when a newer target version is configured.
Custom monitoring via HTTP scripts with multi‑key support.
Log monitoring with flexible string, numeric, or count‑based matching for alerts.
Remote port monitoring for TCP/UDP across multiple data‑center nodes; alerts trigger when all nodes fail to reach a port.
Agent resource usage monitoring (CPU, memory, disk).
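The remote port rule above (alert only when every probing node fails) can be sketched as follows. This is a minimal illustration, not Wonder's actual code: the function names `probe_tcp` and `port_alert`, the node names, and the choice to treat an empty result set as "no data" are all assumptions.

```python
import socket
from typing import Dict


def probe_tcp(host: str, port: int, timeout: float = 2.0) -> bool:
    """Hypothetical probe: True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def port_alert(results: Dict[str, bool]) -> bool:
    """Return True only if every probing node failed to reach the port.

    `results` maps a probe node name to whether that node reached the port.
    An empty result set is treated as missing data, not as a failure.
    """
    if not results:
        return False
    return not any(results.values())


# Two of three (made-up) nodes fail, but one still reaches the port.
probes = {"dc-a": False, "dc-b": True, "dc-c": False}
print(port_alert(probes))  # False
```

Requiring all nodes to fail keeps a single broken network path between one data center and the target from raising a false alarm.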
2. Liveness & Ping Monitoring
Wonder runs fping from multiple nodes to collect ping data and raises an alert when a host has been unreachable for a configurable number of minutes.
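One way to express the "unreachable for a set number of minutes" rule is to remember when a host first went down and compare the elapsed time with a threshold. The sketch below assumes the ping results from all nodes have already been combined into a single `reachable` flag; the class name `LivenessTracker` and its parameters are hypothetical.

```python
from typing import Optional


class LivenessTracker:
    """Tracks how long one host has been continuously unreachable."""

    def __init__(self, threshold_minutes: int) -> None:
        self.threshold_seconds = threshold_minutes * 60
        self.down_since: Optional[float] = None

    def observe(self, reachable: bool, now: float) -> bool:
        """Record one ping result; return True if an alert should fire."""
        if reachable:
            # Any successful ping resets the outage window.
            self.down_since = None
            return False
        if self.down_since is None:
            self.down_since = now
        return now - self.down_since >= self.threshold_seconds


tracker = LivenessTracker(threshold_minutes=3)
for second, reachable in [(0, False), (60, False), (180, False), (240, True)]:
    print(second, tracker.observe(reachable, now=second))
```

Passing `now` explicitly, rather than reading the clock inside the method, keeps the logic easy to test.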
3. Alarm Persistence
All alarm events (problem → OK → problem) are stored in MongoDB, with the latest status displayed on the UI for easy business‑level visibility.
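The idea of keeping the full event history while showing only the latest status can be sketched with an in-memory store. The lists and dictionaries here stand in for MongoDB collections; the class names, field names, and status strings are assumptions for illustration, not Wonder's schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class AlarmEvent:
    alarm_id: str
    status: str      # assumed values: "PROBLEM" or "OK"
    timestamp: int


@dataclass
class AlarmStore:
    """In-memory stand-in for a persistent alarm event collection."""
    history: List[AlarmEvent] = field(default_factory=list)
    latest: Dict[str, AlarmEvent] = field(default_factory=dict)

    def record(self, event: AlarmEvent) -> None:
        # Every transition is appended, so nothing is overwritten.
        self.history.append(event)
        current = self.latest.get(event.alarm_id)
        if current is None or event.timestamp >= current.timestamp:
            self.latest[event.alarm_id] = event

    def events_for(self, alarm_id: str) -> List[AlarmEvent]:
        return [e for e in self.history if e.alarm_id == alarm_id]

    def current_status(self, alarm_id: str) -> Optional[str]:
        event = self.latest.get(alarm_id)
        return event.status if event else None


store = AlarmStore()
for ts, status in [(1, "PROBLEM"), (2, "OK"), (3, "PROBLEM")]:
    store.record(AlarmEvent("disk-full@host1", status, ts))
print(store.current_status("disk-full@host1"))  # PROBLEM
print(len(store.events_for("disk-full@host1")))  # 3
```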
4. Judge Improvements
The judge component now persists the last event in Redis, so alarm state is not reset after a judge restart. Alarms also stop automatically when a strategy is deleted or disabled.
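The effect of persisting the last event can be illustrated with a judge that keeps its state in an external store instead of process memory. In this sketch a plain dictionary stands in for Redis, and the `Judge` class, its method names, and the status strings are hypothetical; the real judge tracks more than a single status per strategy.

```python
from typing import MutableMapping, Optional


class Judge:
    """Decides whether a status change is worth notifying about."""

    def __init__(self, store: MutableMapping[str, str]) -> None:
        # `store` outlives this object, the way Redis outlives a judge process.
        self.store = store

    def evaluate(self, strategy_id: str, status: str) -> Optional[str]:
        """Return the status to notify on, or None if nothing changed."""
        previous = self.store.get(strategy_id)
        self.store[strategy_id] = status
        if previous == status:
            return None
        if previous is None and status == "OK":
            return None  # never alarmed, nothing to recover from
        return status

    def drop_strategy(self, strategy_id: str) -> None:
        """Forget state when a strategy is deleted or disabled."""
        self.store.pop(strategy_id, None)


shared = {}                      # stand-in for Redis
judge = Judge(shared)
print(judge.evaluate("s1", "PROBLEM"))      # PROBLEM
restarted = Judge(shared)                   # simulate a judge restart
print(restarted.evaluate("s1", "PROBLEM"))  # None
```

Because the second `Judge` reads the same store, it recognizes that the problem was already reported and does not start the alarm over.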
5. HBS Improvements
The HBS (heartbeat server) refreshes alarm strategies from MySQL into Redis every 10 seconds, so strategy changes take effect within roughly one refresh interval.
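A periodic refresh of this kind can be sketched as a loop that reloads the full strategy set and mirrors it into a cache. Here `load` stands in for the MySQL query and `cache` for Redis; both, along with the function names, are assumptions made for illustration.

```python
import threading
from typing import Callable, Dict, MutableMapping

REFRESH_INTERVAL_SECONDS = 10  # interval stated in the article

Strategies = Dict[str, dict]


def sync_once(load: Callable[[], Strategies],
              cache: MutableMapping[str, dict]) -> None:
    """Make `cache` match the source of truth, including deletions."""
    fresh = load()
    for stale in set(cache) - set(fresh):
        del cache[stale]  # deleted strategies must disappear from the cache
    cache.update(fresh)


def run_refresher(load: Callable[[], Strategies],
                  cache: MutableMapping[str, dict],
                  stop: threading.Event,
                  interval: float = REFRESH_INTERVAL_SECONDS) -> None:
    """Refresh until `stop` is set; wait() doubles as an interruptible sleep."""
    while not stop.is_set():
        sync_once(load, cache)
        stop.wait(interval)


cache = {"cpu-high": {"threshold": 90}, "removed": {"threshold": 1}}
sync_once(lambda: {"cpu-high": {"threshold": 95}}, cache)
print(cache)  # {'cpu-high': {'threshold': 95}}
```

Removing stale keys matters as much as adding new ones; otherwise a deleted strategy would keep firing from the cache.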
6. Alert Content Enhancements
SMS alerts now include the business name alongside machine information, enabling quicker identification of the affected project.
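As a small illustration of the enriched alert text, the function below puts the business name ahead of the machine details. The field names, the separator, and the example values are all made up; the article does not describe Wonder's actual message template.

```python
def format_sms(business: str, host: str, metric: str, status: str) -> str:
    """Build a short alert line with the business name first."""
    return f"[{status}] {business} | {host} | {metric}"


print(format_sms("payments", "web-01", "cpu.idle < 10%", "PROBLEM"))
# [PROBLEM] payments | web-01 | cpu.idle < 10%
```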
Operational Benefits
Robust CMDB support for device inventory.
Customizable dashboards per role (operations, development, testing, management).
Diverse monitoring types: basic, port, log, custom, strategy templates, and script management.
Human‑friendly alert grouping, scheduling, and filtering.
Flexible alarm strategy inheritance and template configuration.
Historical data queries across thousands of devices with multi‑granularity visualizations.
Statistical reports that clearly show business health.
Fine‑grained user permission management.
App‑based alert delivery with SMS as a fallback, which reduces SMS costs and allows longer alert messages.
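The app-first delivery with SMS fallback listed above might look like the following sketch. The `send_app` and `send_sms` callables, the 70-character SMS limit, and the returned channel names are hypothetical; they only illustrate the fallback order.

```python
from typing import Callable

Sender = Callable[[str], bool]  # returns True when delivery succeeded


def deliver(message: str, send_app: Sender, send_sms: Sender,
            sms_limit: int = 70) -> str:
    """Try the app channel first; fall back to a truncated SMS.

    Returns the channel that delivered the message, or "failed".
    """
    try:
        if send_app(message):
            return "app"
    except Exception:
        # A broken app channel must not block the fallback (sketch only).
        pass
    # Assumed limit: SMS carries less text than an app notification.
    return "sms" if send_sms(message[:sms_limit]) else "failed"


sent = []
channel = deliver("disk usage 95% on web-01",
                  send_app=lambda m: False,
                  send_sms=lambda m: sent.append(m) is None)
print(channel)  # sms
```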
Summary
Wonder currently monitors more than 20,000 devices, collects more than 5 million metrics, sustains more than 80,000 QPS through the transfer component, and serves more than 1,000 business services. Planned enhancements include trend prediction and intelligent auto‑remediation, making monitoring increasingly smart and automated.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.