Mastering System Monitoring: Goals, Methods, Tools, and Best Practices

This guide explains why monitoring is vital for operations and covers monitoring objectives, methods, and core processes. It surveys open‑source and commercial tools, including Zabbix, Open‑Falcon, and MRTG, and discusses metrics, alert handling, and interview preparation.

MaGe Linux Operations

Apr 17, 2020

Monitoring Overview

Monitoring is a critical part of operations and the product lifecycle, providing early warnings before failures and detailed data for post‑incident analysis.

1. Monitoring Objectives

Continuous real‑time monitoring: Constantly observe system health.

Real‑time status feedback: Detect normal, abnormal, or fault states instantly.

Ensure reliability and safety: Keep services and business running smoothly.

Maintain business continuity: Promptly receive and handle alerts to sustain stable operations.

2. Monitoring Methods

Understand the monitoring target (e.g., how CPU works).

Define performance baseline metrics (CPU usage, load, user/kernel time, context switches).

Set alarm thresholds (what load constitutes a fault).

Establish fault‑handling procedures.
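
The baseline and threshold steps above can be sketched in a few lines of Python. This is only an illustration: the `Threshold` class, the `classify` and `load_threshold` helpers, and the rule of thumb that ties load limits to the CPU core count are assumptions made for the example, not values given in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Warning and critical limits for one metric (illustrative values)."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Map a metric reading to 'ok', 'warning', or 'critical'."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


def load_threshold(cpu_count: int) -> Threshold:
    # Assumption: warn when the load average reaches the core count,
    # and treat twice the core count as a fault. Tune this per system.
    return Threshold(warning=1.0 * cpu_count, critical=2.0 * cpu_count)
```

On a Linux host the reading could come from `os.getloadavg()`; each returned state would then map to one of the fault‑handling procedures defined in the last step.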

3. Core Monitoring Process

Detect problems: Receive fault alerts.

Locate problems: Analyze alert details to identify root causes (network, overload, firewall, etc.).

Resolve problems: Prioritize and fix based on severity.

Summarize problems: Document causes and preventive measures.

4. Monitoring Tools

Tools are categorized as follows:

1) Legacy Monitoring

MRTG – network traffic graphing (Perl, SNMP).

Ganglia – scalable distributed monitoring for clusters.

Cacti – PHP/MySQL/SNMP based graphing.

Nagios – enterprise‑level service and host monitoring.

Smokeping – network performance visualization.

OpenTSDB – distributed time‑series database on HBase.

2) Flagship Monitoring

Zabbix – distributed monitoring with agents, SNMP, IPMI, JMX, etc.

Open‑Falcon – open‑source internet‑grade monitoring platform.

3) Third‑Party Monitoring

Various commercial services (e.g., Jiankongbao, Jiankongyi) are available but not detailed here.

5. Zabbix Monitoring Workflow

Data collection: SNMP, Agent, ICMP, SSH, IPMI.

Data storage: MySQL or other databases.

Data analysis: Generate graphs and timelines for post‑mortem.

Data display: Web UI (or custom apps).

Alerting: Phone, email, WeChat, SMS, escalation.

Alert handling: Process based on severity and assign personnel.
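
To make the collect, store, analyze, and alert stages concrete, here is a minimal sketch of the same flow. It is not how Zabbix is implemented: the in‑memory SQLite database stands in for MySQL, and the `history` table, its columns, and every function name are hypothetical.

```python
import sqlite3
from statistics import mean


def open_store() -> sqlite3.Connection:
    """Create an in-memory table that stands in for the monitoring database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE history (host TEXT, item TEXT, clock INTEGER, value REAL)"
    )
    return conn


def store(conn, host, item, clock, value):
    """Storage stage: persist one collected sample."""
    conn.execute(
        "INSERT INTO history (host, item, clock, value) VALUES (?, ?, ?, ?)",
        (host, item, clock, value),
    )


def recent_average(conn, host, item, since):
    """Analysis stage: average of the samples taken at or after `since`."""
    rows = conn.execute(
        "SELECT value FROM history WHERE host = ? AND item = ? AND clock >= ?",
        (host, item, since),
    ).fetchall()
    return mean(value for (value,) in rows) if rows else None


def should_alert(average, limit):
    """Alerting stage: fire only when data exists and exceeds the limit."""
    return average is not None and average > limit
```

Averaging over a window instead of alerting on a single sample is one simple way to avoid noisy alerts from short spikes.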

6. Monitoring Metrics

Metrics are grouped into hardware, system, application, network, traffic analysis, log, security, API, performance, and business monitoring.

1) Hardware Monitoring

Use IPMI to monitor CPU, memory, disk, temperature, fan speed, and voltage, and set alarm thresholds for each.

2) System Monitoring

Track CPU usage, load, context switches, memory usage, swap, disk I/O, and network I/O using tools like htop, vmstat, iostat, sar, glances, and Zabbix templates.
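
As a rough illustration of where a CPU usage figure comes from, the sketch below derives a busy percentage from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. The function names are made up for this example, and counting `iowait` as idle time is a choice made here, not something the article specifies.

```python
from typing import Tuple


def parse_cpu_line(line: str) -> Tuple[int, int]:
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Columns: user nice system idle iowait irq softirq steal.
    # Guest time is already included in user time, so it is skipped.
    ticks = [int(x) for x in fields[1:9]]
    idle = ticks[3] + (ticks[4] if len(ticks) > 4 else 0)
    return idle, sum(ticks)


def cpu_usage_percent(before: str, after: str) -> float:
    """Busy percentage between two samples of the 'cpu' line."""
    idle_a, total_a = parse_cpu_line(before)
    idle_b, total_b = parse_cpu_line(after)
    total_delta = total_b - total_a
    if total_delta <= 0:
        return 0.0  # no time elapsed between the samples
    return 100.0 * (1 - (idle_b - idle_a) / total_delta)
```

In practice the two lines would be read from `/proc/stat` a second or more apart, since the counters are cumulative since boot.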

3) Application Monitoring

Monitor services such as LVS, HAProxy, Docker, Nginx, PHP, Memcached, Redis, MySQL, RabbitMQ via Zabbix agents, JMX, or custom scripts.
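
A custom script for application monitoring often just turns a service's status page into numbers. The sketch below parses the plain‑text output of Nginx's `stub_status` module; the `parse_stub_status` name and the returned dictionary keys are assumptions for this example, and how the values reach the monitoring system is left out.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Extract connection and request counters from stub_status output."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    # The counters line holds three integers: accepts, handled, requests.
    counters = re.search(
        r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$", text, re.MULTILINE
    )
    states = re.search(
        r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)", text
    )
    if not (active and counters and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = (int(g) for g in counters.groups())
    reading, writing, waiting = (int(g) for g in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

Because `accepts`, `handled`, and `requests` are cumulative counters, a rate such as requests per second would be computed from the difference between two polls.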

4) Network Monitoring

Use Smokeping for latency and packet loss visualization; commercial services can monitor CDN and inter‑datacenter links.

5) Traffic Analysis

Analyze visitor sources, conversion, and region statistics with tools like Baidu Tongji, Google Analytics, or the open‑source Piwik (renamed Matomo in 2018).

6) Log Monitoring

Collect, store, query, and visualize logs using the ELK stack (Elasticsearch, Logstash, Kibana) or Zabbix log filters.
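
The idea behind a log filter can be shown with a small sketch that counts log lines by severity and flags a level that exceeds a limit. The log layout, the level names in `LEVEL_PATTERN`, and both function names are hypothetical; a real deployment would rely on the ELK stack or Zabbix log items described above.

```python
import re
from collections import Counter

# Assumption: each line carries its level as a standalone upper-case word.
LEVEL_PATTERN = re.compile(r"\b(ERROR|WARN|CRITICAL)\b")


def count_levels(lines):
    """Count how many lines mention each monitored severity level."""
    counts = Counter()
    for line in lines:
        match = LEVEL_PATTERN.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def exceeds(counts, level, limit):
    """True when a level occurs more often than the allowed limit."""
    return counts[level] > limit
```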

7) Security Monitoring

Combine firewall rules, WAF (e.g., Nginx+Lua), and third‑party vulnerability services for comprehensive protection.

8) API Monitoring

Monitor API request methods, availability, correctness, and response time.
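
A single probe can cover availability, correctness, and response time together. The following sketch uses only the Python standard library; the `CheckResult` type, the `check_api` function, and the rule that a response is "correct" when its JSON body contains an expected key are assumptions chosen for illustration.

```python
import json
import time
import urllib.request
from typing import NamedTuple, Tuple


class CheckResult(NamedTuple):
    available: bool
    correct: bool
    elapsed_seconds: float


def http_get(url: str, timeout: float = 5.0) -> Tuple[int, bytes]:
    """Fetch a URL and return (status_code, body)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read()


def check_api(url: str, expected_key: str, fetch=http_get) -> CheckResult:
    """Probe one endpoint; `fetch` is injectable so it can be faked in tests."""
    start = time.monotonic()
    try:
        status, body = fetch(url)
    except OSError:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        return CheckResult(False, False, time.monotonic() - start)
    elapsed = time.monotonic() - start
    available = 200 <= status < 300
    try:
        payload = json.loads(body)
        correct = (
            available and isinstance(payload, dict) and expected_key in payload
        )
    except ValueError:  # body is not valid JSON
        correct = False
    return CheckResult(available, correct, elapsed)
```

The elapsed time can then be compared against a response‑time threshold in the same way as any other metric.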

9) Performance Monitoring

Measure website performance metrics (DNS response, connection time, page load, availability) via Zabbix web monitoring.

10) Business Monitoring

Track key business KPIs such as orders per minute, registrations, active users, and promotion impact, visualized on Zabbix screens.
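
A KPI such as orders per minute is a count of events grouped by time bucket. The sketch below assumes order timestamps are already available as `datetime` objects; the function names and the idea of a fixed minimum floor are illustrative, and feeding the result into Zabbix is not shown.

```python
from collections import Counter
from datetime import datetime


def orders_per_minute(timestamps):
    """Group order timestamps into one-minute buckets, sorted by time."""
    buckets = Counter(
        ts.replace(second=0, microsecond=0) for ts in timestamps
    )
    return dict(sorted(buckets.items()))


def minutes_below(counts, floor):
    """Return the minutes whose order count fell below the expected floor."""
    return [minute for minute, count in counts.items() if count < floor]
```

Note that minutes with no orders at all never appear in the result, so a real check would also need to detect missing buckets.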

7. Alert Handling

Automated escalation can restart services (e.g., Nginx) on failure; severe incidents are assigned to appropriate operators based on severity and business impact.
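
One way to picture this routing is as a small decision function: try an automatic restart for minor failures, and hand anything serious to a person. The severity levels, the restart limit, and the action strings below are all hypothetical and would map to whatever actions the monitoring system actually supports.

```python
from enum import IntEnum


class Severity(IntEnum):
    WARNING = 1
    HIGH = 2
    DISASTER = 3


def plan_response(service: str, severity: Severity,
                  restart_attempts: int, max_restarts: int = 2) -> str:
    """Choose the next action for an alert on `service`."""
    # Minor failures: retry an automatic restart a limited number of times.
    if severity <= Severity.WARNING and restart_attempts < max_restarts:
        return f"restart:{service}"
    # The most severe incidents go straight to the on-call lead.
    if severity >= Severity.DISASTER:
        return "page:on-call-lead"
    # Everything else is assigned to the operator on duty.
    return "assign:on-duty-operator"
```

Capping the number of automatic restarts keeps a service that fails repeatedly from hiding a problem that needs human attention.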

8. Interview Preparation

Typical interview questions cover hardware (SNMP, IPMI), system metrics (CPU load, memory, disk I/O), service monitoring (Nginx status, PHP‑FPM, MySQL), network monitoring, security measures, and scripting for custom checks.

Original source: http://www.yunweipai.com/archives/22459.html

