Mastering IT Monitoring: Strategies, Challenges, and Best Practices

This article explores the fundamentals of IT monitoring, examines common challenges such as scalability, reliability, and alert fatigue, compares four implementation approaches—from open‑source to fully custom solutions—and presents practical techniques like alert convergence, suppression, and automation to build a robust, adaptable monitoring platform.

Efficient Ops

Jun 15, 2021

What Is Monitoring?

Monitoring (监控, literally "observe and control") combines measurement and control, with the emphasis on measurement. In IT operations it means sampling the state of a target at intervals to assess system health, focusing on key metrics such as CPU, memory, network, and application performance.
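As a minimal sketch of "sampling target states to assess health", the snippet below takes one reading per metric and compares it with a threshold. The `Check` class, the `assess` function, the metric names, and the fixed-value readers are all hypothetical stand-ins; a real agent would read values from the operating system or an exporter.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Check:
    """One metric to sample, plus the threshold above which it is unhealthy."""
    name: str
    read: Callable[[], float]  # hypothetical reader returning the current value
    threshold: float


def assess(checks: List[Check]) -> Dict[str, str]:
    """Sample every check once and label each metric 'ok' or 'breach'."""
    report = {}
    for check in checks:
        value = check.read()
        report[check.name] = "breach" if value > check.threshold else "ok"
    return report


# Stand-in readers with fixed values (assumption: percentages, 0-100).
checks = [
    Check("cpu_percent", lambda: 42.0, threshold=90.0),
    Check("memory_percent", lambda: 95.5, threshold=90.0),
]
print(assess(checks))  # {'cpu_percent': 'ok', 'memory_percent': 'breach'}
```

Running `assess` on a schedule and storing each report is the simplest form of the measurement loop; everything else in a monitoring platform builds on it.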

How Monitoring Platforms Are Built Today

Building a monitoring platform is a long‑term effort, typically following one of four paths:

Based on open-source platforms (e.g., Nagios, Zabbix, Prometheus, or the TIGK stack of Telegraf, InfluxDB, Grafana, and Kapacitor).

Using commercial platforms (e.g., IBM, HP, or Chinese vendors such as 云智慧, 监控易, and OneAPM).

Secondary development on open-source products (custom extensions of Zabbix, Prometheus, Open-Falcon).

Fully custom development from scratch.

Each approach has distinct characteristics and limitations, and the choice depends on resources, business needs, and organizational maturity.

Four Approaches Explained

1. Open‑Source Platforms – Deploy the software, then enrich data collection. Advantages: free, community support, high customizability.

2. Commercial Platforms – Pay for ready‑made services, reducing implementation effort. Suitable for complex scenarios with limited manpower.

3. Secondary Development – Build on an open-source product's APIs to add features it does not provide out of the box. This keeps flexibility while reusing the existing ecosystem.

4. Fully Custom Development – Build a system tailored to precise business requirements. This demands significant time and engineering talent, and the added risk has to be managed.

Key Challenges in Building a Monitoring Platform

Missing Critical Metrics – Default metrics often fail to meet specific use cases; systems must allow free expansion.

Difficulty Extending Functionality – Lack of native support for diverse data sources (e.g., SNMP) forces developers to write custom collectors.

Reliability and High Availability – Large-scale deployments (5k-10k nodes) must sustain high QPS, scale horizontally, and run a robust data pipeline (collect → clean → analyze → store).

Alert Quality – Avoid false positives, missed alerts, and latency; ensure alerts are timely, accurate, and actionable.

Alert Fatigue – Excessive alerts overwhelm users; alert convergence, suppression, aggregation, and analysis are essential to reduce noise.

Fault Correlation and Root‑Cause Analysis – Correlate related incidents (e.g., rack power loss causing multiple service failures) to pinpoint the underlying cause.

Performance Forecasting – Use historical data for capacity planning, trend prediction, and proactive scaling.

Permission Management – Support granular role‑based access (admin, operator, viewer) to adapt to evolving organizational structures.
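The fault-correlation challenge above (a rack power loss surfacing as many separate service failures) can be illustrated with a small grouping step. This is a sketch under stated assumptions, not a full root-cause engine: the `Alert` fields, the `host_to_rack` mapping (which would normally come from a CMDB), and the 60-second window are all hypothetical, and only the first burst of alerts per rack is examined.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alert:
    host: str
    service: str
    timestamp: float  # seconds since epoch


def correlate_by_rack(
    alerts: List[Alert],
    host_to_rack: Dict[str, str],
    window: float = 60.0,
    min_hosts: int = 2,
) -> Dict[str, List[Alert]]:
    """Flag racks where several distinct hosts alerted close together in time."""
    by_rack = defaultdict(list)
    for alert in alerts:
        rack = host_to_rack.get(alert.host)
        if rack is not None:  # hosts with unknown topology are skipped
            by_rack[rack].append(alert)

    suspects = {}
    for rack, rack_alerts in by_rack.items():
        rack_alerts.sort(key=lambda a: a.timestamp)
        first = rack_alerts[0].timestamp
        # Alerts within `window` seconds of the rack's first alert.
        burst = [a for a in rack_alerts if a.timestamp - first <= window]
        if len({a.host for a in burst}) >= min_hosts:
            suspects[rack] = burst
    return suspects


# Hypothetical topology and alerts.
host_to_rack = {"web-1": "rack-a", "db-1": "rack-a", "cache-1": "rack-b"}
alerts = [
    Alert("web-1", "nginx", 100.0),
    Alert("db-1", "mysql", 105.0),
    Alert("cache-1", "redis", 400.0),
]
suspects = correlate_by_rack(alerts, host_to_rack)
print(sorted(suspects))  # ['rack-a']
```

Here the two `rack-a` alerts arrive five seconds apart on different hosts, so the rack itself becomes the suspected cause, while the single `rack-b` alert is left as an ordinary service alert.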

Practical Techniques

1. Lower the Entry Barrier – Plug‑and‑Play – Deploy agents automatically; default alert policies trigger on threshold breaches.

2. Combat Alert Storms – Implement four “magic tricks”:

Alert convergence – merge identical alerts across many hosts.

Alert suppression – keep only the highest severity per metric.

Alert aggregation – combine alerts from different rules occurring simultaneously.

Alert analysis – mine historical alerts to refine thresholds and detect anomalies.

3. Easy Extension – Modular Plugins – Add new collectors or integrations without manual deployment.

4. Fine‑Grained Permission Control – Define roles for alert receivers, configurators, and platform administrators.

5. Automation First – Configure data collection, plugin deployment, and alert targeting through the UI, enabling dynamic scaling as hosts are added or removed.
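Two of the alert-storm techniques from practice 2, suppression and convergence, are simple enough to sketch. The code below is illustrative only: the `Alert` fields and the two severity levels are hypothetical, and suppression is interpreted here as "highest severity per host and metric", which is one reading of the rule stated above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple

SEVERITY_RANK = {"warning": 1, "critical": 2}  # hypothetical severity levels


@dataclass(frozen=True)
class Alert:
    host: str
    metric: str
    severity: str


def suppress(alerts: List[Alert]) -> List[Alert]:
    """Suppression: keep only the highest-severity alert per (host, metric)."""
    best: Dict[Tuple[str, str], Alert] = {}
    for alert in alerts:
        key = (alert.host, alert.metric)
        current = best.get(key)
        if current is None or (
            SEVERITY_RANK[alert.severity] > SEVERITY_RANK[current.severity]
        ):
            best[key] = alert
    return list(best.values())


def converge(alerts: List[Alert]) -> Dict[Tuple[str, str], List[str]]:
    """Convergence: merge identical alerts across hosts into one entry."""
    merged = defaultdict(list)
    for alert in alerts:
        merged[(alert.metric, alert.severity)].append(alert.host)
    return {key: sorted(hosts) for key, hosts in merged.items()}


raw = [
    Alert("web-1", "cpu", "warning"),
    Alert("web-1", "cpu", "critical"),
    Alert("web-2", "cpu", "critical"),
    Alert("web-3", "disk", "warning"),
]
notifications = converge(suppress(raw))
print(notifications)
# {('cpu', 'critical'): ['web-1', 'web-2'], ('disk', 'warning'): ['web-3']}
```

Four raw alerts become two notifications: the `web-1` warning is suppressed by its own critical alert, and the two critical CPU alerts are merged into a single message listing both hosts.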

Summary and Recommendations

Effective monitoring requires a clear business goal, selection of a mature platform (open‑source or commercial), and continuous iteration. Design the system as a data pipeline that can scale to petabyte‑level storage, support high concurrency, and provide reliable, low‑latency alerts. Treat monitoring as a core component of the DevOps workflow, integrating with CMDB, CI/CD pipelines, and automation tools to achieve higher efficiency, quality, and capability.
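The data-pipeline framing can be made concrete with chained generators, one per stage (collect, clean, analyze, store). This is a toy sketch: the "host metric value" line format, the function names, the thresholds, and the in-memory list standing in for a time-series store are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable, Iterator, List

Sample = Dict[str, Any]


def collect(raw_lines: Iterable[str]) -> Iterator[Sample]:
    """Parse 'host metric value' lines; skip lines with the wrong shape."""
    for line in raw_lines:
        parts = line.split()
        if len(parts) == 3:
            host, metric, value = parts
            yield {"host": host, "metric": metric, "value": value}


def clean(samples: Iterable[Sample]) -> Iterator[Sample]:
    """Drop samples whose value is not numeric; convert the rest to float."""
    for sample in samples:
        try:
            value = float(sample["value"])
        except ValueError:
            continue
        yield {**sample, "value": value}


def analyze(samples: Iterable[Sample], thresholds: Dict[str, float]) -> Iterator[Sample]:
    """Tag each sample with whether it exceeds its metric's threshold."""
    for sample in samples:
        limit = thresholds.get(sample["metric"])
        yield {**sample, "breach": limit is not None and sample["value"] > limit}


def store(samples: Iterable[Sample], sink: List[Sample]) -> int:
    """Append samples to the sink and return how many were written."""
    count = 0
    for sample in samples:
        sink.append(sample)
        count += 1
    return count


raw = ["web-1 cpu 97.5", "web-1 mem n/a", "garbage", "web-2 cpu 12.0"]
sink: List[Sample] = []
stored = store(analyze(clean(collect(raw)), {"cpu": 90.0}), sink)
print(stored, [s["breach"] for s in sink])  # 2 [True, False]
```

Because each stage only consumes an iterator and yields another, a stage can later be replaced by a queue consumer or a separate service without changing its neighbors, which is what makes horizontal scaling of the pipeline practical.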

