How to Build a Complete Monitoring System: Goals, Methods, Tools & Best Practices
This guide explains why monitoring is essential for the entire operations lifecycle, outlines key monitoring objectives, describes practical methods and workflows, reviews a range of open‑source tools (including Zabbix, MRTG, Ganglia, Nagios, Smokeping, OpenTSDB), and details metric categories such as hardware, system, application, network, log, security, API, performance and business monitoring.
Why Monitoring Matters
Monitoring is a critical component of operations and the whole product lifecycle; it enables early fault detection, provides detailed data for post‑mortem analysis, and helps keep services reliable.
Monitoring Objectives
Continuous real‑time monitoring : Keep the system under constant observation.
Instant status feedback : Show whether components are normal, abnormal, or failed.
Reliability and safety : Ensure services and business run smoothly.
Business continuity : Quickly receive and handle alarms to maintain stable operations.
Monitoring Methods
Understand the target : Know what you are monitoring (e.g., CPU operation).
Define performance metrics : Identify attributes such as CPU usage, load, user/kernel time, context switches.
Set alarm thresholds : Determine when a metric indicates a fault.
Fault‑handling process : Establish efficient procedures for responding to alerts.
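The metric and threshold steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular tool implements triggers: the `Threshold` class, the `evaluate` function, the metric names, and the numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical alarm definition for one metric (higher value = worse)."""
    metric: str
    warning: float
    critical: float


def evaluate(sample, thresholds):
    """Map each configured metric to 'ok', 'warning', 'critical' or 'unknown'."""
    states = {}
    for t in thresholds:
        value = sample.get(t.metric)
        if value is None:
            # A missing value is reported separately instead of being treated as healthy.
            states[t.metric] = "unknown"
        elif value >= t.critical:
            states[t.metric] = "critical"
        elif value >= t.warning:
            states[t.metric] = "warning"
        else:
            states[t.metric] = "ok"
    return states


# Illustrative thresholds and one collected sample (values are made up).
thresholds = [
    Threshold("cpu_usage_percent", warning=80.0, critical=95.0),
    Threshold("load_1min", warning=4.0, critical=8.0),
    Threshold("context_switches_per_sec", warning=50000.0, critical=100000.0),
]
sample = {"cpu_usage_percent": 91.5, "load_1min": 2.1}
print(evaluate(sample, thresholds))
```

Reporting a missing metric as "unknown" is a design choice in this sketch: a collector that stops sending data is itself a condition worth alerting on.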
Core Monitoring Process
Problem detection : Receive fault alarms.
Problem localization : Analyze the alarm content to pinpoint the root cause (network, overload, firewall, etc.).
Problem resolution : Prioritize and fix the issue.
Post‑mortem summary : Document causes and preventive measures.
Monitoring Tools Overview
Traditional/Open‑Source Tools
MRTG : Perl‑based network traffic grapher using SNMP.
Ganglia : Scalable distributed monitoring for clusters, stores data with RRDtool.
Cacti : PHP/MySQL/SNMP tool for network graphing with templates and LDAP integration.
Nagios : Enterprise‑grade service and host monitoring with web UI.
Smokeping : Perl‑based network latency and packet loss visualizer using RRDtool.
OpenTSDB : Time‑series database on HBase for high‑resolution metrics.
Flagship Tools
Zabbix : Distributed monitoring system supporting SNMP, Agent, IPMI, JMX, SSH, etc.; stores data in MySQL or other databases; provides templated monitoring, visualization, and flexible alerting.
Open‑Falcon : Open‑source, internet‑grade monitoring platform from Xiaomi.
Third‑Party Services
Commercial monitoring platforms such as Jiankongbao and Tingyun are also available; they are not covered in detail here.
Zabbix‑Based Monitoring Workflow
Data collection : Via SNMP, Agent, ICMP, SSH, IPMI.
Data storage : Typically in MySQL, but other databases are supported.
Data analysis : Historical graphs and timelines help locate faults.
Data display : Web UI (or custom mobile/web apps).
Alerting : Phone, email, WeChat, SMS, escalation mechanisms.
Alert handling : Prioritize alerts (critical, non‑critical) and assign appropriate personnel.
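The alert handling step (prioritize, then assign personnel) could look roughly like the sketch below. It is independent of Zabbix and purely illustrative: the `Alert` class, the two severity labels, and the routing targets in `ROUTES` are assumptions made for this example.

```python
from dataclasses import dataclass

# Lower number = handled first. The two levels mirror the critical / non-critical split.
SEVERITY_ORDER = {"critical": 0, "non_critical": 1}

# Hypothetical assignment of each severity to a responder.
ROUTES = {
    "critical": "on-call engineer (phone)",
    "non_critical": "team queue (email)",
}


@dataclass(frozen=True)
class Alert:
    host: str
    message: str
    severity: str


def triage(alerts):
    """Return (alert, assignee) pairs, most severe first.

    sorted() is stable, so alerts of equal severity keep their arrival order.
    Unknown severities are sorted last and sent to the non-critical route.
    """
    ordered = sorted(
        alerts,
        key=lambda a: SEVERITY_ORDER.get(a.severity, len(SEVERITY_ORDER)),
    )
    return [(a, ROUTES.get(a.severity, ROUTES["non_critical"])) for a in ordered]


incoming = [
    Alert("web-01", "disk usage above 80%", "non_critical"),
    Alert("db-01", "MySQL is not responding", "critical"),
]
for alert, assignee in triage(incoming):
    print(f"{alert.severity:12} {alert.host:8} -> {assignee}")
```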
Monitoring Metric Categories
Hardware monitoring : Use IPMI to track CPU, memory, disk, temperature, fan speed, voltage, and set alarm thresholds.
System monitoring : Track CPU load, context switches, user/kernel CPU time, memory usage, swap, disk I/O, and network I/O. A commonly cited rule of thumb is a user-to-kernel time ratio of about 70/30 with roughly 50% idle; treat it as a starting point, not a hard target. Tools: htop, top, vmstat, mpstat, dstat, glances. Zabbix provides OS Linux templates.
Application monitoring : Monitor services such as Nginx, PHP‑FPM, Redis, MySQL, RabbitMQ, JVM, etc., using Zabbix Agent UserParameter, JMX interface, or vendor‑specific plugins (e.g., Percona MySQL monitoring).
Network monitoring : Use Smokeping for latency, packet loss, DNS and HTTP performance; visualize with graphs.
Traffic analysis : Collect web traffic data via Google Analytics, Baidu Tongji, or open‑source Piwik (Matomo).
Log monitoring : Collect, store, query, and visualize logs with the ELK stack (Elasticsearch, Logstash, Kibana) or Zabbix log triggers.
Security monitoring : Deploy iptables, WAF (Nginx+Lua/OpenResty), or third‑party vulnerability services.
API monitoring : Track request methods (GET, POST, etc.), availability, correctness, and response time.
Performance monitoring : Measure DNS response, HTTP connection time, page load index, element size via Zabbix web monitoring.
Business monitoring : Define key business KPIs (orders per minute, registrations, active users, traffic sources) and set thresholds for alerts.
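As one illustration of the categories above, the API monitoring item names three things to check: availability, correctness, and response time. The following is a minimal sketch of such a probe. The `check_api` function, the `CheckResult` type, and the injected `fetch` callable are hypothetical names introduced here; in practice `fetch` would wrap whatever HTTP client is in use.

```python
import time
from typing import Callable, NamedTuple, Tuple


class CheckResult(NamedTuple):
    available: bool      # the call completed with a 2xx status
    correct: bool        # the body contained the expected text
    fast_enough: bool    # the call finished within the time budget
    elapsed_seconds: float


def check_api(
    fetch: Callable[[str, str], Tuple[int, str]],
    method: str,
    url: str,
    expected_text: str,
    max_seconds: float,
) -> CheckResult:
    """Probe one endpoint. `fetch(method, url)` must return (status_code, body)."""
    start = time.perf_counter()
    try:
        status, body = fetch(method, url)
    except OSError:
        # Connection refused, DNS failure, timeout: treat the API as unavailable.
        return CheckResult(False, False, False, time.perf_counter() - start)
    elapsed = time.perf_counter() - start
    available = 200 <= status < 300
    correct = available and expected_text in body
    return CheckResult(available, correct, elapsed <= max_seconds, elapsed)


def fake_fetch(method, url):
    """Stand-in for a real HTTP client so the sketch runs without a network."""
    return 200, '{"status": "ok"}'


print(check_api(fake_fetch, "GET", "https://api.example.com/health", '"ok"', 0.5))
```

Passing `fetch` in as a parameter keeps the check logic testable offline; the URL shown is a placeholder.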
Alerting Channels
Common methods include SMS and email; escalation mechanisms can trigger automated actions (e.g., restart Nginx).
Alert Handling Process
Automatic escalation may restart services; for serious incidents, assign engineers based on severity and business impact. No single universal process fits all scenarios.
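One way to picture escalation is as a list of steps that become due as an unresolved problem ages: an automated action first, then people. The sketch below assumes this model; the `Step` class, the step timings, and the action descriptions are hypothetical and would differ per organization.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    after_minutes: int   # how long the problem must stay open before this step is due
    action: str


# Hypothetical escalation plan: try automation first, then involve people.
STEPS = (
    Step(0, "run automated action: restart the service"),
    Step(5, "notify the on-call engineer"),
    Step(30, "notify the team lead"),
)


def due_actions(minutes_open, resolved, steps=STEPS):
    """Return every action that should have fired by now for an open problem."""
    if resolved:
        return []
    ordered = sorted(steps, key=lambda s: s.after_minutes)
    return [s.action for s in ordered if s.after_minutes <= minutes_open]


for minutes in (0, 10, 45):
    print(minutes, due_actions(minutes, resolved=False))
```

If the automated restart clears the problem, the later steps never become due, which is the point of putting automation at the front of the chain.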
Interview‑Ready Monitoring Topics
Hardware
Use SNMP for routers and switches and IPMI for server health; in cloud environments the provider manages the hardware, so hardware monitoring can usually be skipped.
System
Monitor CPU load, context switches, memory usage, disk I/O, and set appropriate trigger thresholds.
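On Linux, user, kernel, and idle CPU percentages can be derived from two readings of the aggregate `cpu` line in `/proc/stat`. The sketch below works on the text of that line so it runs anywhere. The function names are hypothetical, and two grouping choices are assumptions of this example: `iowait` is counted as idle, and `irq`/`softirq` are counted as kernel time.

```python
def parse_cpu_line(line):
    """Return the first eight counters of the aggregate 'cpu' line of /proc/stat.

    Field order: user, nice, system, idle, iowait, irq, softirq, steal.
    """
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(x) for x in parts[1:9]]
    # Very old kernels report fewer fields; pad so the indexes below stay valid.
    return values + [0] * (8 - len(values))


def cpu_percentages(before, after):
    """Compute user/kernel/idle shares of CPU time between two samples.

    'steal' is part of the total but of none of the three groups, so the
    returned values can add up to slightly less than 100 on virtual machines.
    """
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    user = deltas[0] + deltas[1]                # user + nice
    kernel = deltas[2] + deltas[5] + deltas[6]  # system + irq + softirq
    idle = deltas[3] + deltas[4]                # idle + iowait
    return {
        "user": 100.0 * user / total,
        "kernel": 100.0 * kernel / total,
        "idle": 100.0 * idle / total,
    }


# Two made-up samples taken some seconds apart.
first = "cpu 100 0 50 800 0 0 0 0"
second = "cpu 135 0 65 850 0 0 0 0"
print(cpu_percentages(first, second))
```

With these made-up samples the busy half of the interval splits 35:15 between user and kernel time, which is a 70/30 ratio alongside 50% idle.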
Service
Leverage built‑in status modules (Nginx, PHP‑FPM) or vendor tools (Percona for MySQL) and custom scripts for other services.
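For Nginx, the built-in status module is `stub_status`, which returns a short plain-text report. A custom check script might parse it along these lines. This is a sketch: the `parse_stub_status` function is hypothetical, the sample numbers are made up, and fetching the page (for example from a locally exposed status URL) is left out.

```python
import re


def parse_stub_status(text):
    """Extract the counters from the plain-text output of Nginx stub_status."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    # The third line holds three totals: accepts, handled, requests.
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    return {
        "active": int(active.group(1)),
        "accepts": int(totals.group(1)),
        "handled": int(totals.group(2)),
        "requests": int(totals.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }


SAMPLE = (
    "Active connections: 291 \n"
    "server accepts handled requests\n"
    " 16630948 16630948 31070465 \n"
    "Reading: 6 Writing: 179 Waiting: 106 \n"
)
print(parse_stub_status(SAMPLE))
```

Each key could then be exposed as its own monitored item, for instance through a Zabbix Agent UserParameter that calls the script with the item name.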
Network
For multi‑datacenter setups, use Smokeping; otherwise rely on cloud provider tools.
Security
Use cloud security features, iptables, hardware firewalls, or third‑party services; consider DDoS protection.
Web
Monitor page latency, JS response, download time; commercial tools may be used for large‑scale deployments.
Log
Collect error logs (Nginx 4xx/5xx, PHP errors) with ELK stack or Zabbix log monitoring.
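Before a full log pipeline is in place, 4xx/5xx counts can be pulled from an Nginx access log with a short script. The sketch below assumes the default "combined" log format, where the status code follows the quoted request line; the function names and sample lines are made up for illustration.

```python
import re
from collections import Counter

# In the "combined" format the status code follows the quoted request line.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def status_classes(lines):
    """Count responses per status class ('2xx', '4xx', ...), skipping unparsable lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def error_rate(counts):
    """Share of parsed requests that ended in a 4xx or 5xx status."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return (counts["4xx"] + counts["5xx"]) / total


LOG_LINES = [
    '203.0.113.5 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
    '203.0.113.5 - - [10/Oct/2023:13:55:37 +0000] "GET /missing HTTP/1.1" 404 153 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:38 +0000] "POST /api HTTP/1.1" 502 0 "-" "curl/8.0"',
    '203.0.113.9 - - [10/Oct/2023:13:55:39 +0000] "GET /a HTTP/1.1" 200 10 "-" "-"',
]
counts = status_classes(LOG_LINES)
print(dict(counts), error_rate(counts))
```

The resulting error rate is a natural candidate for a trigger threshold.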
Business
Identify critical business metrics, script simple monitors, and configure triggers.
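A simple business monitor of this kind can be a script that counts recent events and compares the count to a floor. The sketch below uses orders per minute as the metric. The function names, the timestamps, and the minimum of 3 orders per minute are hypothetical; a real script would read order times from the order database or an application log.

```python
from datetime import datetime, timedelta


def orders_in_last_minute(order_times, now):
    """Count orders placed in the 60 seconds up to and including `now`."""
    start = now - timedelta(minutes=1)
    return sum(1 for t in order_times if start < t <= now)


def order_rate_alert(order_times, now, min_per_minute):
    """Return (should_alert, count): alert when orders fall below the expected floor."""
    count = orders_in_last_minute(order_times, now)
    return count < min_per_minute, count


now = datetime(2024, 1, 1, 12, 0, 0)
order_times = [
    now - timedelta(seconds=10),
    now - timedelta(seconds=45),
    now - timedelta(seconds=90),  # outside the one-minute window
]
print(order_rate_alert(order_times, now, min_per_minute=3))
```

Unlike most system metrics, business metrics often alert on a drop rather than a rise: too few orders is the symptom.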
Traffic Analysis
Use awk/sed for raw logs or analytics platforms (Google, Baidu, Piwik) for easier reporting.
Visualization
Use Zabbix screens or third-party charting libraries to build dashboards that correlate traffic spikes with business events.
Automation
Implement Zabbix active/passive modes and API integration for large‑scale automation.
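The Zabbix API is a JSON-RPC 2.0 interface served from `api_jsonrpc.php`, so automation scripts mostly build and post small JSON documents. The sketch below only constructs a `host.get` request and does not send it. The server URL and token are placeholders, `build_request` is a hypothetical helper, and the authentication style is an assumption: recent Zabbix releases accept a bearer token in the `Authorization` header, while older releases expect the token in an `auth` field of the payload, so check the API documentation for the version in use.

```python
import json
import urllib.request


def build_request(api_url, method, params, token=None, request_id=1):
    """Build (but do not send) a JSON-RPC 2.0 request for the Zabbix API."""
    payload = {
        "jsonrpc": "2.0",
        "method": method,
        "params": params,
        "id": request_id,
    }
    headers = {"Content-Type": "application/json-rpc"}
    if token is not None:
        # Assumption: header-based auth; older servers use an "auth" payload field.
        headers["Authorization"] = "Bearer " + token
    return urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Placeholder URL and token; replace with real values before sending.
request = build_request(
    "https://zabbix.example.com/api_jsonrpc.php",
    "host.get",
    {"output": ["hostid", "host"]},
    token="PLACEHOLDER_TOKEN",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending the request would be a call to `urllib.request.urlopen(request)` followed by decoding the JSON response; the same helper can be reused for other API methods when registering or updating hosts in bulk.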
Conclusion
While many open‑source monitoring solutions exist, they may not meet all enterprise needs; companies often develop custom platforms (e.g., Xiaomi’s Open‑Falcon) or combine tools like Sensu, InfluxDB, and Grafana to build a tailored monitoring ecosystem.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
