How to Elevate Your Monitoring System: Proven Practices from Top DevOps Models
This article explains why modern services depend on highly available, scalable monitoring, outlines a systematic way to assess and improve monitoring capabilities using open‑source tools and the DevOps Capability Maturity Model, and details concrete improvement points across data collection, management, and application.
Introduction
Monitoring systems aim to protect business SLAs, provide a comprehensive view of system health, detect risks early, and buy operators more time to resolve issues.
Both open‑source solutions (Nagios, Zabbix, Prometheus, Grafana) and custom‑built systems (e.g., Xiaomi's OpenFalcon, Tencent's internal tnm2 and CMS) are in common use.
As services rely more on monitoring, requirements for high availability and scalability increase, prompting a systematic review of monitoring capabilities.
Improvement Methods
One approach is to benchmark against the monitoring systems of top‑tier companies; another is to follow the "DevOps Capability Maturity Model" published by the China Academy of Information and Communications Technology together with major internet companies, which defines evaluation criteria for monitoring management.
Capability Items
a) Monitoring Collection
Capability: Support open, customizable data collection and reporting schemes.
Question: Why require a reporting scheme? Interpretation: Different collection methods (Agent, SDK, Kafka, ES) address varied scenarios. For example, an Agent watches file inodes for changes; an SDK can embed in business code to mask sensitive data before reporting; Kafka enables multiple consumers to read the same log stream.
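As a concrete illustration of the Agent case above, the sketch below tails a log file and uses the inode number to detect rotation, so the agent restarts from the beginning of the new file instead of reading stale offsets. The function and its state dictionary are hypothetical; real agents track similar fields but are not structured exactly like this.

```python
import os

def read_new_lines(path, state):
    """Read lines appended since the last call, detecting log
    rotation via a change in the file's inode number.

    `state` is a plain dict holding the last-seen inode and byte
    offset (an illustrative structure, not a real agent's API)."""
    st = os.stat(path)
    if st.st_ino != state.get("inode"):
        # Inode changed: the file was rotated or recreated,
        # so start reading the new file from the beginning.
        state["inode"], state["offset"] = st.st_ino, 0
    with open(path, "r") as f:
        f.seek(state["offset"])
        lines = f.readlines()
        state["offset"] = f.tell()  # remember where we stopped
    return lines
```

Tracking the inode rather than only the path is what lets the agent survive `logrotate`-style renames without losing or re-reading data.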
Capability: Support multiple transmission modes, both push and pull.
Question: Why need both? Interpretation: Pull (server‑initiated) offers control, while push is needed in network‑restricted environments, high‑performance cases, or when services lack external APIs (e.g., a scheduled job that only pushes heartbeat data).
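The heartbeat case above can be sketched as a push-side payload builder: a scheduled job with no external API constructs a small metric record and POSTs it to the monitoring gateway. The field names and metric name here are illustrative assumptions, not a specific product's schema.

```python
import json
import time

def make_heartbeat(job_name, status="ok"):
    """Build the JSON payload a scheduled job would push to a
    monitoring gateway as a liveness signal. Field names are
    hypothetical; real systems define their own schemas."""
    return json.dumps({
        "metric": "job.heartbeat",   # illustrative metric name
        "endpoint": job_name,
        "value": 1 if status == "ok" else 0,
        "timestamp": int(time.time()),
    })
```

The server side then alerts on the absence of recent heartbeats, which is exactly the scenario pull cannot cover: there is no endpoint for the server to scrape.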
b) Data Management
Capability: Perform rule‑based processing of raw data upon ingestion.
Question: Why process before storage? Interpretation: Early validation prevents malformed logs from polluting the database and reduces costly post‑processing.
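The validate-before-store idea can be sketched as a single gate function: a record either matches the expected schema and is parsed into structured fields, or it is dropped before it can pollute storage. The log format and field names here are illustrative assumptions.

```python
import re

# Illustrative log schema: "YYYY-MM-DD HH:MM:SS LEVEL message"
LOG_PATTERN = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>DEBUG|INFO|WARN|ERROR) (?P<msg>.+)$"
)

def validate_log(line):
    """Return a structured record for a well-formed line, or None
    so malformed input is rejected before storage."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None  # drop (or route to a dead-letter queue)
    return m.groupdict()
```

Rejected lines can be counted or diverted to a dead-letter stream, which is far cheaper than cleaning a polluted database after the fact.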
Capability: Correlate heterogeneous data sources.
Question: What does this mean? Interpretation: Enrich logs with external data such as IP‑to‑city mappings to enable richer analysis (e.g., geographic distribution of requests).
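The IP-to-city example above amounts to a join against an external lookup table at ingestion time. The toy table below stands in for a real GeoIP database (an assumption; the article does not name one), and the record fields are illustrative.

```python
import ipaddress

# Toy lookup table; a production system would query a GeoIP
# database instead (assumption, not from the article).
CITY_BY_NET = {
    ipaddress.ip_network("10.1.0.0/16"): "Beijing",
    ipaddress.ip_network("10.2.0.0/16"): "Shenzhen",
}

def enrich(record):
    """Attach a 'city' field to a log record based on its client IP,
    enabling analyses such as geographic request distribution."""
    ip = ipaddress.ip_address(record["client_ip"])
    record["city"] = next(
        (city for net, city in CITY_BY_NET.items() if ip in net),
        "unknown",
    )
    return record
```

Doing the join once at ingestion means every downstream dashboard and query gets the enriched field for free.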
Capability: Ensure data consistency, completeness, and availability with built‑in management features.
Question: What are the management features? Interpretation: Self‑monitoring at each pipeline stage (ingestion, formatting, aggregation, storage) provides visibility into data flow and quickly isolates anomalies.
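The per-stage self-monitoring described above can be reduced to in/out counters at each pipeline stage; comparing adjacent stages localizes where records are lost. This class is a sketch of the idea, not any specific product's API.

```python
from collections import Counter

class PipelineMeter:
    """Count records entering each pipeline stage (ingestion,
    formatting, aggregation, storage); the delta between adjacent
    stages shows where data is being dropped."""
    def __init__(self, stages):
        self.stages = list(stages)
        self.counts = Counter()

    def record(self, stage, n=1):
        """Note that `n` records passed through `stage`."""
        self.counts[stage] += n

    def loss_between(self, upstream, downstream):
        """Records seen upstream but never seen downstream."""
        return self.counts[upstream] - self.counts[downstream]
```

Alerting on `loss_between` exceeding a threshold turns the pipeline's own health into just another monitored metric.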
c) Data Application
Capability: Control alarm storms through suppression and convergence.
Question: What are common convergence methods? Interpretation: Time‑based, event‑based, and severity‑based convergence are widely used (e.g., time windows in Nagios/Zabbix, trigger dependencies for event‑based convergence, and alarm‑level filtering for severity‑based convergence).
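Time-based convergence in particular can be sketched as a per-key rate limit: forward an alert for a given key at most once per window, suppressing repeats inside it. The window length and key scheme are illustrative assumptions.

```python
class AlertConverger:
    """Time-window convergence: an alert with the same key is
    forwarded at most once per `window` seconds; duplicates
    inside the window are suppressed (illustrative parameters)."""
    def __init__(self, window=300):
        self.window = window
        self.last_sent = {}  # alert key -> last forwarded time

    def should_send(self, key, now):
        """Return True if this alert should be forwarded now."""
        last = self.last_sent.get(key)
        if last is not None and now - last < self.window:
            return False  # suppressed: still inside the window
        self.last_sent[key] = now
        return True
```

Event-based convergence layers on top of this by keying alerts to a root-cause trigger, so dependent alarms are silenced while the parent is firing.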
Conclusion
Improving monitoring capabilities can be guided by open‑source comparisons or the DevOps Capability Maturity Model; both provide valuable insights drawn from industry experts.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes popular original technical articles. We focus on operations transformation and aim to accompany you, and grow with you, throughout your operations career.