Wonder Monitoring: Scaling Ops with Open‑Falcon‑Powered Automation

This article explains how the internally built Wonder monitoring system, based on Open‑Falcon, tackles large‑scale operational challenges by offering automated agent updates, customizable metrics, log and port monitoring, persistent alarm storage, enhanced alert content, and comprehensive dashboards for thousands of devices.

360 Zhihui Cloud Developer

Oct 19, 2016

Monitoring is a critical part of operations: it surfaces problems quickly and provides the data needed for post‑incident analysis. At large scale, with tens of thousands of devices spread across data centers worldwide, collecting data accurately in real time and storing it long term becomes a serious challenge.

Wonder Monitoring System Overview

To address these challenges, we heavily customized Open‑Falcon and built an end‑to‑end monitoring solution called Wonder. It is now in production and monitors all of the company's services.

Why Build Our Own?

Performance bottlenecks in existing solutions

Need for extensive customization

Improved usability

Full automation of monitoring workflows

Key Features

1. Agent Enhancements

Automatic update: agents report their current version and upgrade automatically when a new version is set.

Custom monitoring via HTTP scripts with multi‑key support.

Log monitoring with flexible string, numeric, or count‑based matching for alerts.

Remote port monitoring for TCP/UDP from multiple data‑center nodes; an alert fires only when every node fails to reach the port.

Agent resource usage monitoring (CPU, memory, disk).
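
The three log-matching modes mentioned above (string, numeric, count) could be expressed roughly as follows. This is a minimal sketch, not Wonder's actual rule engine: the `LogRule` shape, the mode names, and the convention that the first capture group holds the number are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Iterable


@dataclass
class LogRule:
    """Hypothetical rule shape; the article does not describe Wonder's real schema."""
    pattern: str            # regular expression applied to each log line
    mode: str               # "string", "numeric", or "count"
    threshold: float = 0.0  # used by the "numeric" and "count" modes


def should_alert(rule: LogRule, lines: Iterable[str]) -> bool:
    """Decide whether one window of log lines violates the rule."""
    regex = re.compile(rule.pattern)
    matches = [m for m in (regex.search(line) for line in lines) if m]
    if rule.mode == "string":
        # Any matching line is enough to alert.
        return bool(matches)
    if rule.mode == "numeric":
        # Assumed convention: capture group 1 holds a number;
        # alert if any extracted value exceeds the threshold.
        return any(float(m.group(1)) > rule.threshold for m in matches)
    if rule.mode == "count":
        # Alert when the number of matching lines reaches the threshold.
        return len(matches) >= rule.threshold
    raise ValueError(f"unknown mode: {rule.mode}")
```

For example, `LogRule(r"latency_ms=(\d+)", "numeric", 500)` would alert on a window containing `latency_ms=850`, while `LogRule("ERROR", "count", 3)` would stay quiet until three matching lines appear in the window.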

2. Liveness & Ping Monitoring

Ping data is collected with fping from multiple nodes; an alert fires when a host has been unreachable for a configurable number of minutes.
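
One way to turn per-minute ping results into a liveness decision is sketched below. It assumes one probe round per minute and assumes a host counts as unreachable in a round only when no probe node reached it; the article states that rule for port monitoring, not explicitly for ping. The function name and data shape are hypothetical, and the fping invocation itself is left out.

```python
from typing import Dict, List


def host_is_down(samples: List[Dict[str, bool]], minutes: int) -> bool:
    """Return True if the host was unreachable for the last `minutes` rounds.

    samples: one entry per probe round (assumed one per minute), oldest first.
             Each entry maps a probe node name to whether it reached the host.
    """
    if minutes <= 0 or len(samples) < minutes:
        # Not enough history yet to make a decision.
        return False
    recent = samples[-minutes:]
    # A round counts as a failure only if it has results and none succeeded.
    return all(bool(r) and not any(r.values()) for r in recent)
```

Requiring every node to fail in every recent round trades a slower alert for fewer false alarms caused by a single flaky network path.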

3. Alarm Persistence

All alarm events (problem → OK → problem) are stored in MongoDB, with the latest status displayed on the UI for easy business‑level visibility.
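
The idea of keeping every transition while showing only the latest status can be sketched as follows. An in-memory list stands in for the MongoDB collection so the example runs on its own; the `AlarmLog` class, the field names, and the `alarm_id` format are assumptions, not Wonder's schema.

```python
import time
from typing import Dict, List, Optional


class AlarmLog:
    """Append-only alarm event log; a list stands in for a MongoDB collection."""

    def __init__(self) -> None:
        self.events: List[dict] = []

    def record(self, alarm_id: str, status: str, ts: Optional[float] = None) -> None:
        """Store one state transition ("PROBLEM" or "OK") without overwriting history."""
        if status not in ("PROBLEM", "OK"):
            raise ValueError(f"unknown status: {status}")
        self.events.append(
            {"alarm_id": alarm_id, "status": status,
             "ts": time.time() if ts is None else ts}
        )

    def history(self, alarm_id: str) -> List[dict]:
        """Full transition history for one alarm, oldest first."""
        own = [e for e in self.events if e["alarm_id"] == alarm_id]
        return sorted(own, key=lambda e: e["ts"])

    def latest_status(self) -> Dict[str, str]:
        """Latest status per alarm, which is what a status page would display."""
        latest: Dict[str, str] = {}
        for event in sorted(self.events, key=lambda e: e["ts"]):
            latest[event["alarm_id"]] = event["status"]
        return latest
```

Keeping the full history separate from the latest-status view is what allows both post-incident analysis and an at-a-glance status page from the same data.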

4. Judge Improvements

The judge component now persists the last event in Redis, so alarm state is not reset when a judge process restarts. Alarms also stop automatically when their strategy is deleted or disabled.
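
The effect of persisting the last event can be illustrated with a simplified judge. This is a sketch under stated assumptions: a plain dict stands in for Redis, the `Judge` class and its single-threshold rule are hypothetical, and the real Open‑Falcon judge logic is considerably more involved.

```python
from typing import MutableMapping, Optional


class Judge:
    """Simplified judge that emits an event only when the status changes."""

    def __init__(self, store: MutableMapping[str, str]) -> None:
        # `store` stands in for Redis: it outlives any single Judge instance.
        self.store = store

    def evaluate(self, key: str, value: float, threshold: float) -> Optional[str]:
        """Return "PROBLEM" or "OK" on a state change, otherwise None."""
        status = "PROBLEM" if value > threshold else "OK"
        previous = self.store.get(key, "OK")
        if status == previous:
            return None
        self.store[key] = status
        return status

    def drop_strategy(self, key: str) -> None:
        """Forget state for a deleted or disabled strategy so it stops alarming."""
        self.store.pop(key, None)
```

With a shared store, a second `Judge` created after a "restart" sees the previous `PROBLEM` state and stays silent, whereas a judge with an empty store would fire the same alarm again.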

5. HBS Improvements

HBS (Open‑Falcon's heartbeat server) refreshes alarm strategies from MySQL into Redis every 10 seconds, so any change takes effect within about 10 seconds.
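
A periodic refresh of this kind might look like the sketch below. The `load_strategies` callable stands in for the MySQL query and `cache` for Redis; both names, and the choice to represent a strategy as a plain string, are assumptions for illustration only.

```python
import threading
from typing import Callable, Dict, MutableMapping


def refresh_once(
    load_strategies: Callable[[], Dict[str, str]],
    cache: MutableMapping[str, str],
) -> None:
    """Copy the current strategies into the cache and drop ones that no longer exist."""
    fresh = load_strategies()
    for key in list(cache):
        if key not in fresh:
            # Strategy was deleted or disabled at the source.
            del cache[key]
    cache.update(fresh)


def run_refresher(
    load_strategies: Callable[[], Dict[str, str]],
    cache: MutableMapping[str, str],
    stop: threading.Event,
    interval: float = 10.0,
) -> None:
    """Refresh every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        refresh_once(load_strategies, cache)
        # Event.wait returns early if stop is set, so shutdown is prompt.
        stop.wait(interval)
```

In this design the worst-case delay before a change is visible is roughly one interval, which is why a short period such as 10 seconds matters.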

6. Alert Content Enhancements

SMS alerts now include the business name alongside machine information, enabling quicker identification of the affected project.

Operational Benefits

Robust CMDB support for device inventory.

Customizable dashboards per role (operations, development, testing, management).

Diverse monitoring types: basic, port, log, custom, strategy templates, and script management.

Human‑friendly alert grouping, scheduling, and filtering.

Flexible alarm strategy inheritance and template configuration.

Historical data queries across thousands of devices with multi‑granularity visualizations.

Statistical reports that clearly show business health.

Fine‑grained user permission management.

App‑based alert delivery with SMS fallback, which reduces cost and allows longer messages.

Summary

Wonder currently monitors more than 20,000 devices, collects more than 5 million metrics, sustains over 80,000 QPS at the transfer layer, and serves more than 1,000 business services. Planned enhancements include trend prediction and intelligent auto‑remediation, to make monitoring progressively smarter and more automated.
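
As a rough consistency check on these figures, the calculation below assumes that each metric produces one data point per reporting interval and that transfer QPS counts individual data points. Neither assumption is stated in the article, so treat the result as an estimate only.

```python
# Figures quoted in the summary (lower bounds).
metrics = 5_000_000     # distinct metrics collected
transfer_qps = 80_000   # data points per second through the transfer layer

# If every metric reports once per interval, the interval that produces
# this QPS is simply metrics / qps.
implied_interval_s = metrics / transfer_qps
print(implied_interval_s)  # 62.5
```

Under those assumptions the numbers imply a reporting interval of roughly one minute per metric.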
