Essential Guide to Effective Monitoring in Operations: Goals, Methods, and Tools
This article outlines the essential components of operational monitoring: objectives, methods, core workflow, and key tools; metrics for the hardware, system, application, network, and business layers; and alerting, alarm handling, and best practices for building a comprehensive, reliable monitoring solution.
Monitoring Objectives
Understanding the importance of monitoring and the business goals it should achieve.
Real‑time monitoring of target systems.
Feedback on current status of hardware, software, and services.
Ensure reliability so issues are reported instantly for rapid response.
Monitoring Methods
Understand the monitored object (e.g., how a CPU works).
Define performance baseline metrics such as CPU usage, load, user/kernel time, context switches.
Set alarm thresholds (e.g., what CPU load is considered high).
Establish fault‑handling procedures for efficient resolution.
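The threshold step above can be sketched in a few lines of Python. This is a minimal illustration, not a recommended policy: the function name classify_load and the ratios 0.7 and 1.0 are assumptions chosen for the example, and a real threshold should come from the baseline measured on your own hosts.

```python
import os


def classify_load(load1: float, cpu_count: int,
                  warn_ratio: float = 0.7, crit_ratio: float = 1.0) -> str:
    """Classify a 1-minute load average relative to the number of CPUs.

    warn_ratio and crit_ratio are illustrative defaults, not standards.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    per_cpu = load1 / cpu_count  # normalize so hosts of different sizes compare
    if per_cpu >= crit_ratio:
        return "critical"
    if per_cpu >= warn_ratio:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load1, _, _ = os.getloadavg()
    print(classify_load(load1, os.cpu_count() or 1))
```

Normalizing by CPU count is one common way to answer "what load is high": a load of 4 is busy on a 4-core host but light on a 32-core one.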
Monitoring Core
Problem discovery.
Problem localization.
Problem resolution.
Summarize causes and preventive measures to avoid recurrence.
Monitoring Tools
Traditional tools: Cacti, Nagios, Smokeping.
Popular tools: Zabbix, OpenFalcon, Prometheus + Grafana, Nightingale, smartping, LEPUS, custom solutions.
Third‑party services: Jiankongbao (监控宝), Tingyun, New Relic.
Monitoring Process
Collect: Gather data via SNMP, agents, ICMP, SSH, IPMI, etc.
Store: Persist the data in a database such as MySQL or PostgreSQL.
Analyze: Provide graphs and timelines to help locate faults.
Display: Show metric values and trends.
Alert: Notify via phone, email, WeChat, or SMS, with escalation.
Handle: Determine the fault level and assign responders for rapid remediation.
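As a rough sketch of how the store, analyze, and alert stages fit together, the example below keeps samples in a table and alerts on the average of the most recent values. It uses Python's built-in sqlite3 purely as a stand-in for MySQL or PostgreSQL; the table layout, the function names, and the "average of the last N samples" rule are all assumptions made for illustration.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open a sample store (SQLite here, standing in for MySQL/PostgreSQL)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "ts REAL NOT NULL, metric TEXT NOT NULL, value REAL NOT NULL)"
    )
    return conn


def store_sample(conn, metric, value, ts=None):
    """Store one collected value; ts defaults to the current time."""
    conn.execute(
        "INSERT INTO samples (ts, metric, value) VALUES (?, ?, ?)",
        (time.time() if ts is None else ts, metric, value),
    )
    conn.commit()


def recent_average(conn, metric, limit=5):
    """Average of the newest `limit` samples, or None if there are none."""
    rows = conn.execute(
        "SELECT value FROM samples WHERE metric = ? ORDER BY ts DESC LIMIT ?",
        (metric, limit),
    ).fetchall()
    if not rows:
        return None
    return sum(row[0] for row in rows) / len(rows)


def should_alert(conn, metric, threshold, limit=5):
    """Alert only when the recent average exceeds the threshold."""
    avg = recent_average(conn, metric, limit)
    return avg is not None and avg > threshold
```

Alerting on a short average rather than a single sample is one simple way to avoid paging on a momentary spike; the window size is a trade-off between noise and detection delay.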
Monitoring Metrics
Hardware
CPU temperature, physical/virtual disks, motherboard temperature, RAID status (e.g., via MegaCli).
System
Host availability, CPU/memory/disk usage, inode, load, network bandwidth, TCP connections, disk I/O.
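Disk and inode usage from the list above can both be read from a single os.statvfs call on Unix-like systems. The following is a minimal sketch; the helper names usage_percent and filesystem_usage are hypothetical, and an agent such as the ones listed under Monitoring Tools would normally collect these values for you.

```python
import os


def usage_percent(used: int, total: int) -> float:
    """Return used/total as a percentage rounded to one decimal place."""
    if total <= 0:
        # Some filesystems report 0 total inodes; treat that as 0% used.
        return 0.0
    return round(100.0 * used / total, 1)


def filesystem_usage(path="/"):
    """Report disk-space and inode usage for the filesystem holding `path`."""
    st = os.statvfs(path)  # Unix only
    total_bytes = st.f_blocks * st.f_frsize
    free_bytes = st.f_bfree * st.f_frsize
    return {
        "disk_used_pct": usage_percent(total_bytes - free_bytes, total_bytes),
        "inode_used_pct": usage_percent(st.f_files - st.f_ffree, st.f_files),
    }


if __name__ == "__main__":
    print(filesystem_usage("/"))
```

Inodes are worth tracking separately because a filesystem full of tiny files can run out of inodes while still showing free space.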
Application
MySQL
Service availability, memory usage, disk usage, replication lag, backup status, connection count.
Redis / Redis Cluster
Load, memory usage, connection count, QPS.
Nginx
Status codes, connection info.
Other services: RabbitMQ, PHP‑FPM, OpenLDAP (IP, call count), Zimbra, OpenVPN (version, online users, IPs, traffic), ELK, Graylog, GitLab, Jenkins, MongoDB, HAProxy.
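For the Nginx status-code metric mentioned above, one simple approach is to count response classes from the access log. The sketch below assumes the log uses Nginx's default "combined" format, where the status code follows the quoted request line; the function names and the 5xx-based error rate are illustrative choices, and a custom log_format would need a different pattern.

```python
import re
from collections import Counter

# In the "combined" format the status code follows the quoted request line:
#   ... "GET / HTTP/1.1" 200 612 "-" "curl/8.0"
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def count_status_classes(lines):
    """Count responses per status class (2xx, 3xx, 4xx, 5xx)."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)[0] + "xx"] += 1
    return counts


def error_rate(counts):
    """Fraction of counted responses that were server errors (5xx)."""
    total = sum(counts.values())
    return counts["5xx"] / total if total else 0.0


if __name__ == "__main__":
    sample = [
        '127.0.0.1 - - [10/Oct/2023:13:55:36 +0000] "GET / HTTP/1.1" 200 612 "-" "curl/8.0"',
        '127.0.0.1 - - [10/Oct/2023:13:55:37 +0000] "GET /api HTTP/1.1" 502 157 "-" "curl/8.0"',
    ]
    counts = count_status_classes(sample)
    print(dict(counts), error_rate(counts))
```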
Network
Network quality, internet egress, dedicated line bandwidth, network devices.
Traffic Analysis
(Content omitted for brevity)
Log Monitoring
Use ELK, Graylog for anomaly and error keyword detection.
Security Monitoring
URL/API monitoring, custom solutions, Alibaba Cloud options.
Performance Monitoring (APM)
Pinpoint, Zipkin, SkyWalking, CAT, and Jaeger provide distributed tracing for services written in Java, PHP, Go, and Node.js.
Business Monitoring
Example for e‑commerce: orders per minute, registrations per minute, active users, promotion activities, traffic and profit generated.
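A per-minute business metric such as orders per minute can be derived by bucketing event timestamps. The sketch below is a minimal illustration under stated assumptions: the timestamps are taken to come from your own order records, and the names per_minute_counts and minutes_below are hypothetical.

```python
from collections import Counter
from datetime import datetime


def per_minute_counts(timestamps):
    """Count events per minute by truncating each timestamp to the minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


def minutes_below(counts, minimum):
    """Return the observed minutes whose event count fell below `minimum`.

    Minutes with no events at all never appear in `counts`, so a complete
    check would also walk the expected time range and treat gaps as zero.
    """
    return sorted(minute for minute, n in counts.items() if n < minimum)


if __name__ == "__main__":
    orders = [
        datetime(2024, 1, 1, 12, 0, 5),
        datetime(2024, 1, 1, 12, 0, 59),
        datetime(2024, 1, 1, 12, 1, 0),
    ]
    counts = per_minute_counts(orders)
    print(minutes_below(counts, minimum=2))
```

The same bucketing applies to registrations per minute; only the source of the timestamps changes.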
Other
SSL certificate status.
Process liveness, port listening, log rotation.
Health metrics such as MQ backlog.
API success rate, latency, QPS.
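Two of the checks above, SSL certificate status and port listening, need only the Python standard library. This is a sketch under assumptions rather than a complete probe: the function names are hypothetical, fetch_not_after requires network access to the target host, and how many days before expiry to alert is left to the caller.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Whole days until a certificate's notAfter time (negative if expired).

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  1 00:00:00 2030 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)


def fetch_not_after(host, port=443, timeout=5.0):
    """Connect over TLS and return the server certificate's notAfter string."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

A check might call days_until_expiry(fetch_not_after("example.com")) and raise an alert when the result drops below a chosen number of days.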
Alerting
Email, SMS, DingTalk/WeChat/Enterprise WeChat, phone.
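Escalation across these channels can be expressed as a small table that maps how long an alert has gone unacknowledged to the channels that should have fired by then. The delays and the ordering below are purely hypothetical values chosen for illustration; the article does not prescribe a specific policy.

```python
# (minutes unacknowledged, channels added at that point); values are examples.
ESCALATION_STEPS = [
    (0, ["email", "wechat"]),
    (10, ["sms"]),
    (30, ["phone"]),
]


def channels_for(minutes_unacknowledged, steps=ESCALATION_STEPS):
    """Return every channel that should have fired after the given delay."""
    channels = []
    for after_minutes, step_channels in steps:
        if minutes_unacknowledged >= after_minutes:
            channels.extend(step_channels)
    return channels


if __name__ == "__main__":
    for minutes in (0, 15, 45):
        print(minutes, channels_for(minutes))
```

Keeping the policy as data rather than code makes it easy to give different fault levels different escalation tables.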
Alarm Handling
Self‑healing mechanisms, such as automatically restarting a failed service with Supervisor, systemd, or custom scripts.
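For the custom-script option, a restart loop might look like the sketch below. It is a simplified illustration only: Supervisor and systemd already provide restart policies and should usually be preferred, and the function name supervise, the restart limit, and the linear backoff are assumptions made for this example.

```python
import subprocess
import sys
import time


def supervise(cmd, max_restarts=3, backoff_seconds=1.0):
    """Run `cmd`, restarting it whenever it exits with a non-zero status.

    Returns the number of restarts performed once the command exits cleanly.
    Raises RuntimeError when the restart limit is exhausted, so a human can
    be alerted instead of restarting forever.
    """
    restarts = 0
    while True:
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return restarts
        if restarts >= max_restarts:
            raise RuntimeError(
                f"{cmd!r} still failing after {max_restarts} restarts"
            )
        restarts += 1
        time.sleep(backoff_seconds * restarts)  # wait longer after each failure


if __name__ == "__main__":
    # Hypothetical example: supervise a trivial Python command.
    print(supervise([sys.executable, "-c", "print('service ran')"]))
```

Capping the number of restarts matters: a service that crashes on every start usually needs a person to look at it, not another restart.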
Comprehensive Monitoring
Effective monitoring requires deep business understanding; software tools are merely enablers.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.