How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices

This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.

Liangxu Linux

Apr 29, 2020

Why Monitoring Matters

Monitoring is a critical component of operations and the whole product lifecycle; it enables early fault detection, provides detailed data for post‑mortem analysis, and helps keep services reliable.

Monitoring Objectives

Continuous real‑time monitoring : Keep the system under constant observation.

Instant status feedback : Show whether components are normal, abnormal, or failed.

Reliability and safety : Ensure services and business run smoothly.

Business continuity : Quickly receive and handle alarms to maintain stable operations.

Monitoring Methods

Understand the target : Know what you are monitoring (e.g., CPU operation).

Define performance metrics : Identify attributes such as CPU usage, load, user/kernel time, context switches.

Set alarm thresholds : Determine when a metric indicates a fault.

Fault‑handling process : Establish efficient procedures for responding to alerts.
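The "define metrics, then set thresholds" steps above can be sketched in a few lines. This is only an illustration: the Threshold class, the 90% limit, and the "three consecutive samples" rule are assumptions chosen for the example, not values from the article. Requiring several consecutive breaches is one common way to avoid alarms on a single spike.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alarm rule: fire when `limit` is exceeded `sustain` times in a row."""
    metric: str
    limit: float
    sustain: int = 3


def breaches(samples: Iterable[float], rule: Threshold) -> bool:
    """Return True if `samples` contains `rule.sustain` consecutive values above `rule.limit`."""
    run = 0
    for value in samples:
        # Extend the current run on a breach, reset it on a healthy sample.
        run = run + 1 if value > rule.limit else 0
        if run >= rule.sustain:
            return True
    return False


# Assumed example rule: CPU usage above 90% for three samples in a row.
cpu_rule = Threshold(metric="cpu_usage_percent", limit=90.0, sustain=3)
```

With this rule, the samples 95, 96, 97 raise an alarm, while 95, 50, 96, 97 do not, because the healthy sample resets the run.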

Core Monitoring Process

Problem detection : Receive fault alarms.

Problem localization : Analyze the alarm content to pinpoint the root cause (network fault, overload, firewall rule, etc.).

Problem resolution : Prioritize and fix the issue.

Post‑mortem summary : Document causes and preventive measures.

Monitoring Tools Overview

Traditional/Open‑Source Tools

MRTG : Perl‑based network traffic grapher using SNMP.

Ganglia : Scalable distributed monitoring for clusters, stores data with RRDtool.

Cacti : PHP/MySQL/SNMP tool for network graphing with templates and LDAP integration.

Nagios : Enterprise‑grade service and host monitoring with web UI.

Smokeping : Perl‑based network latency and packet loss visualizer using RRDtool.

OpenTSDB : Time‑series database on HBase for high‑resolution metrics.

Flagship Tools

Zabbix : Distributed monitoring system supporting SNMP, Agent, IPMI, JMX, SSH, etc.; stores data in MySQL or other databases; provides templated monitoring, visualization, and flexible alerting.

Open‑Falcon : Open‑source, internet‑grade monitoring platform from Xiaomi.

Third‑Party Services

Commercial monitoring platforms such as Jiankongbao (监控宝) and Tingyun are also available; they are not covered in detail here.

Zabbix‑Based Monitoring Workflow

Data collection : Via SNMP, Agent, ICMP, SSH, IPMI.

Data storage : Typically in MySQL, but other databases are supported.

Data analysis : Historical graphs and timelines help locate faults.

Data display : Web UI (or custom mobile/web apps).

Alerting : Phone, email, WeChat, SMS, escalation mechanisms.

Alert handling : Prioritize alerts (critical, non‑critical) and assign appropriate personnel.
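The last two steps of this workflow, alerting and alert handling, amount to sorting alerts by severity and mapping each severity to notification channels. The sketch below shows that idea in plain Python; the Severity levels and the ROUTES table are hypothetical and do not reflect Zabbix's own configuration format.

```python
from enum import IntEnum


class Severity(IntEnum):
    INFO = 1
    WARNING = 2
    CRITICAL = 3


# Hypothetical routing table: which channels each severity should reach.
ROUTES = {
    Severity.INFO: ["email"],
    Severity.WARNING: ["email", "wechat"],
    Severity.CRITICAL: ["email", "wechat", "sms", "phone"],
}


def triage(alerts):
    """Order alerts most severe first and attach the channels to notify."""
    ordered = sorted(alerts, key=lambda alert: alert["severity"], reverse=True)
    return [(alert["name"], ROUTES[alert["severity"]]) for alert in ordered]
```

Keeping the routing table as data rather than code makes it easy to review which channels a critical alert will actually reach.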

Monitoring Metric Categories

Hardware monitoring : Use IPMI to track CPU, memory, disk, temperature, fan speed, voltage, and set alarm thresholds.

System monitoring : Track CPU load, context switches, and the user/kernel time split (suggested target: roughly 70/30 user to kernel, with idle around 50%), plus memory usage, swap, disk I/O, and network I/O. Tools: htop, top, vmstat, mpstat, dstat, glances. Zabbix provides Linux OS templates.

Application monitoring : Monitor services such as Nginx, PHP‑FPM, Redis, MySQL, RabbitMQ, JVM, etc., using Zabbix Agent UserParameter, JMX interface, or vendor‑specific plugins (e.g., Percona MySQL monitoring).

Network monitoring : Use Smokeping for latency, packet loss, DNS and HTTP performance; visualize with graphs.

Traffic analysis : Collect web traffic data via Google Analytics, Baidu Tongji, or open‑source Piwik (Matomo).

Log monitoring : Collect, store, query, and visualize logs with the ELK stack (Elasticsearch, Logstash, Kibana) or Zabbix log triggers.

Security monitoring : Deploy iptables, WAF (Nginx+Lua/OpenResty), or third‑party vulnerability services.

API monitoring : Track request methods (GET, POST, etc.), availability, correctness, and response time.

Performance monitoring : Measure DNS response, HTTP connection time, page load index, element size via Zabbix web monitoring.

Business monitoring : Define key business KPIs (orders per minute, registrations, active users, traffic sources) and set thresholds for alerts.
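As one illustration of the categories above, the API monitoring item names three checks: availability, correctness, and response time. A minimal sketch of evaluating a single probe result against those three checks follows. The ProbeResult type, the 500 ms budget, and the substring-based correctness test are assumptions made for the example; how the probe itself is performed is left out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """Hypothetical record of one API probe."""
    status_code: int
    elapsed_ms: float
    body: str


def evaluate(result: ProbeResult, expected_substring: str, max_ms: float = 500.0) -> List[str]:
    """Return a list of problems; an empty list means the probe passed."""
    problems = []
    if not 200 <= result.status_code < 300:
        problems.append(f"availability: HTTP {result.status_code}")
    if expected_substring not in result.body:
        problems.append("correctness: expected content missing")
    if result.elapsed_ms > max_ms:
        problems.append(f"latency: {result.elapsed_ms:.0f} ms > {max_ms:.0f} ms")
    return problems
```

Returning every failed check, rather than stopping at the first, gives the on-call engineer the full picture in one alert.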

Alerting Channels

Common methods include SMS and email; escalation mechanisms can trigger automated actions (e.g., restart Nginx).

Alert Handling Process

Automatic escalation may restart services; for serious incidents, assign engineers based on severity and business impact. No single universal process fits all scenarios.
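An escalation ladder of this kind can be expressed as a list of (delay, action) steps. The sketch below is a plain-Python illustration; the delays and action descriptions are invented for the example, and a real deployment would define them in the monitoring tool's own escalation settings.

```python
# Hypothetical escalation ladder: (minutes without acknowledgement, action).
ESCALATION_STEPS = [
    (0, "run automated remediation (e.g. restart the service)"),
    (5, "notify the on-call engineer"),
    (15, "notify the team lead"),
]


def due_actions(minutes_unacknowledged: float):
    """Return every action whose delay has elapsed for an unacknowledged alert."""
    return [action for delay, action in ESCALATION_STEPS
            if minutes_unacknowledged >= delay]
```

An alert that is ten minutes old and still unacknowledged would therefore have triggered the automated step and the on-call notification, but not yet the team lead.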

Interview‑Ready Monitoring Topics

Hardware

Use SNMP for routers/switches, IPMI for server health; cloud environments may skip hardware monitoring.

System

Monitor CPU load, context switches, memory usage, disk I/O, and set appropriate trigger thresholds.
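To make the CPU figures concrete, the sketch below computes user, kernel, idle, and I/O-wait percentages from two samples of the aggregate "cpu" line in Linux's /proc/stat. The function names are illustrative, and grouping nice time with user time and IRQ time with kernel time is a simplification chosen for this example.

```python
def parse_cpu_line(line: str):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(x) for x in fields[1:]]
    # Older kernels expose fewer columns; pad so indexing below is safe.
    return values + [0] * (10 - len(values))


def cpu_percentages(before: str, after: str):
    """Return CPU time shares (percent) between two samples of the 'cpu' line."""
    delta = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    # Columns: user nice system idle iowait irq softirq steal (guest is part of user).
    total = sum(delta[:8])
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {
        "user": 100.0 * (delta[0] + delta[1]) / total,
        "kernel": 100.0 * (delta[2] + delta[5] + delta[6]) / total,
        "idle": 100.0 * delta[3] / total,
        "iowait": 100.0 * delta[4] / total,
    }
```

In practice the two lines would be read from /proc/stat a few seconds apart; tools such as mpstat and vmstat report the same kind of breakdown.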

Service

Leverage built‑in status modules (Nginx, PHP‑FPM) or vendor tools (Percona for MySQL) and custom scripts for other services.
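As an example of using a built-in status module, the sketch below parses the plain-text page produced by Nginx's stub_status module. The sample text follows the module's documented output layout; the function name and returned keys are illustrative, and fetching the page over HTTP is left out so the example stays self-contained.

```python
import re

# Sample output in the layout produced by Nginx's stub_status module.
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Extract the counters from a stub_status page."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    counters = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text)
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = (int(x) for x in counters.groups())
    reading, writing, waiting = (int(x) for x in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

A script like this could be wired into a Zabbix Agent UserParameter so that each counter becomes a monitored item.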

Network

For multi‑datacenter setups, use Smokeping; otherwise rely on cloud provider tools.
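The two numbers Smokeping-style monitoring tracks between sites are packet loss and latency. The following sketch summarizes one round of probes; it is not Smokeping's own algorithm, and representing a lost probe as None is an assumption of this example.

```python
from statistics import median
from typing import List, Optional


def summarize(rtts_ms: List[Optional[float]]) -> dict:
    """Summarize one probe round; `None` marks a probe that got no reply."""
    if not rtts_ms:
        raise ValueError("at least one probe is required")
    received = [rtt for rtt in rtts_ms if rtt is not None]
    loss = 100.0 * (len(rtts_ms) - len(received)) / len(rtts_ms)
    return {
        "loss_percent": loss,
        # Median latency is less sensitive to a single slow reply than the mean.
        "median_ms": median(received) if received else None,
    }
```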

Security

Use cloud security features, iptables, hardware firewalls, or third‑party services; consider DDoS protection.

Web

Monitor page latency, JS response, download time; commercial tools may be used for large‑scale deployments.

Log

Collect error logs (Nginx 4xx/5xx, PHP errors) with ELK stack or Zabbix log monitoring.
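For a small site, counting error responses does not need a full log pipeline. The sketch below counts 4xx and 5xx responses in access-log lines, assuming Nginx's default "combined" format, where the status code follows the quoted request line; the function name is illustrative.

```python
import re
from collections import Counter

# In the combined format the status code follows the quoted request line.
STATUS_RE = re.compile(r'"[^"]*"\s+(\d{3})\s')


def error_counts(lines) -> Counter:
    """Count 4xx and 5xx responses in an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match and match.group(1)[0] in "45":
            counts[match.group(1)[0] + "xx"] += 1
    return counts
```

The resulting counts can be compared against a trigger threshold, for example alerting when 5xx responses in the last minute exceed a chosen limit.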

Business

Identify critical business metrics, script simple monitors, and configure triggers.
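A simple business monitor of this kind might compare the current value of a metric with a recent baseline. The sketch below flags a sharp drop in orders per minute; the 50% drop ratio and the use of a plain mean as the baseline are assumptions for illustration, and how the order counts are obtained is left to the business system.

```python
from statistics import mean
from typing import Sequence


def orders_dropped(history: Sequence[float], current: float, max_drop: float = 0.5) -> bool:
    """Return True if `current` is more than `max_drop` below the mean of `history`.

    `history` holds order counts for the preceding minutes.
    """
    if not history:
        return False  # No baseline yet, so do not alert.
    baseline = mean(history)
    return current < baseline * (1.0 - max_drop)
```

For example, with the previous minutes at 100, 110, and 90 orders, a minute with 40 orders would trigger, while a minute with 80 would not.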

Traffic Analysis

Use awk/sed for raw logs or analytics platforms (Google, Baidu, Piwik) for easier reporting.
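The typical raw-log report is "which clients send the most requests", which in shell is usually an awk, sort, and uniq pipeline. A Python sketch of the same idea follows, assuming the client address is the first whitespace-separated field of each line, as in Nginx's default log format; the function name is illustrative.

```python
from collections import Counter


def top_clients(lines, n: int = 3):
    """Return the `n` most frequent client addresses as (address, count) pairs."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # Skip blank lines.
            counts[fields[0]] += 1
    return counts.most_common(n)
```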

Visualization

Use Zabbix screens or third‑party charting libraries to build dashboards that correlate traffic spikes with business events.

Automation

Implement Zabbix active/passive modes and API integration for large‑scale automation.
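API integration with Zabbix goes through its JSON-RPC 2.0 endpoint (api_jsonrpc.php). The sketch below only builds a request object and does not send it. The URL and token in the usage example are placeholders, and passing the token as a Bearer header is an assumption that holds for newer Zabbix releases; older releases expect an "auth" field in the request body instead, so check the API documentation for the version in use.

```python
import json
import urllib.request


def build_request(api_url: str, token: str, method: str, params: dict,
                  request_id: int = 1) -> urllib.request.Request:
    """Build (but do not send) a JSON-RPC 2.0 request for the Zabbix API."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json-rpc",
            # Assumption: Bearer-token auth as used by newer Zabbix releases.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder URL and token; host.get lists hosts known to the server.
example = build_request(
    "http://zabbix.example/api_jsonrpc.php",
    "PLACEHOLDER_TOKEN",
    "host.get",
    {"output": ["hostid", "host"]},
)
```

Passing the request to urllib.request.urlopen would send it; scripts built this way can register hosts or pull data in bulk instead of clicking through the web UI.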

Conclusion

While many open‑source monitoring solutions exist, they may not meet all enterprise needs; companies often develop custom platforms (e.g., Xiaomi’s Open‑Falcon) or combine tools like Sensu, InfluxDB, and Grafana to build a tailored monitoring ecosystem.
