How to Build an Effective Monitoring System for Reliable Operations
This article outlines the goals, methods, core steps, tools, metrics, and alert handling strategies essential for designing a comprehensive monitoring system that ensures system reliability and continuous business operation.
Monitoring Objectives
Understand the importance of monitoring and define business goals such as real‑time visibility of system health, ensuring reliability, and enabling rapid incident response.
Real‑time monitoring of target systems
Feedback on current status of hardware, software, and services
Ensure reliability by reporting issues promptly so that operations staff can address them
Monitoring Methods
Understand the monitored object (e.g., how the CPU works)
Define performance baseline metrics (CPU usage, load, user/kernel time, context switches)
Set alarm thresholds (e.g., what CPU load is considered high)
Design efficient fault‑handling processes
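As a minimal sketch of how a baseline metric and an alarm threshold fit together, the snippet below classifies the 1‑minute load average relative to the number of CPU cores. The function names and the 0.7 / 1.0 per‑core ratios are illustrative assumptions, not values from this article; real thresholds should come from your own baseline.

```python
import os


def classify_load(load_1m: float, cpu_count: int,
                  warn_ratio: float = 0.7, crit_ratio: float = 1.0) -> str:
    """Classify a 1-minute load average relative to the number of CPU cores.

    The default ratios are assumptions chosen for illustration only.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    per_core = load_1m / cpu_count
    if per_core >= crit_ratio:
        return "critical"
    if per_core >= warn_ratio:
        return "warning"
    return "ok"


def current_load_status() -> str:
    """Read the live load average (Unix only) and classify it."""
    load_1m, _, _ = os.getloadavg()
    return classify_load(load_1m, os.cpu_count() or 1)
```

Keeping the classification separate from the measurement makes it easy to tune the thresholds against recorded data before enabling alerts.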
Monitoring Core Steps
Problem discovery
Problem localization
Problem resolution
Post‑mortem analysis to prevent recurrence
Monitoring Tools
Traditional: Cacti, Nagios, Smokeping
Popular: Zabbix, OpenFalcon, Prometheus + Grafana, Nightingale, smartping (network), LEPUS (database), custom solutions
Third‑party: Jiankongbao, Tingyun, New Relic
Monitoring Process
Collect: Gather data via SNMP, agents, ICMP, SSH, IPMI, etc.
Store: Persist data in databases such as MySQL or PostgreSQL
Analyze: Generate graphs and timelines to aid fault location
Display: Show metric values and trends
Alert: Notify via phone, email, WeChat, SMS, with escalation mechanisms
Handle: Classify incident severity and assign responders for rapid remediation
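The collect, store, analyze, and alert stages above can be sketched end to end in a few lines. This is only an illustration under stated assumptions: an in‑memory SQLite database stands in for MySQL or PostgreSQL, and the table layout, function names, and threshold are hypothetical.

```python
import sqlite3
from typing import Optional


def open_store() -> sqlite3.Connection:
    """Create an in-memory store (a stand-in for MySQL/PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (host TEXT, metric TEXT, ts INTEGER, value REAL)"
    )
    return conn


def store(conn: sqlite3.Connection, host: str, metric: str,
          ts: int, value: float) -> None:
    """Store stage: persist one collected sample."""
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                 (host, metric, ts, value))


def analyze(conn: sqlite3.Connection, host: str, metric: str,
            since_ts: int) -> Optional[float]:
    """Analyze stage: average of a metric since a timestamp (None if no data)."""
    row = conn.execute(
        "SELECT AVG(value) FROM samples WHERE host = ? AND metric = ? AND ts >= ?",
        (host, metric, since_ts),
    ).fetchone()
    return row[0]


def alert_message(host: str, metric: str, avg: Optional[float],
                  threshold: float) -> Optional[str]:
    """Alert stage: return a message when the average reaches the threshold."""
    if avg is None or avg < threshold:
        return None
    return f"{host}: {metric} averaged {avg:.1f} (threshold {threshold:.1f})"
```

In a real deployment the collect stage would feed `store` from an agent or SNMP poller, and the alert message would be handed to a notification channel.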
Monitoring Metrics
Hardware
CPU temperature, physical/virtual disks, motherboard temperature, RAID status (via MegaCli, IPMI)
System
Host availability, CPU/memory/disk usage, inode usage, load, network bandwidth, TCP connections, disk I/O
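Two of the system metrics listed above, disk usage and inode usage, can be read with the Python standard library alone. The sketch below is one possible approach on Unix‑like hosts; the helper names are hypothetical, and a production agent would usually report these values rather than compute them ad hoc.

```python
import os
import shutil


def percent_used(used: int, total: int) -> float:
    """Return usage as a percentage, rounded to one decimal place."""
    return 0.0 if total == 0 else round(100.0 * used / total, 1)


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return percent_used(usage.used, usage.total)


def inode_usage_percent(path: str = "/") -> float:
    """Percentage of inodes used on the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    return percent_used(st.f_files - st.f_ffree, st.f_files)
```

Inode usage is worth tracking separately because a filesystem full of small files can run out of inodes while still reporting free space.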
Application
MySQL
Service availability, memory usage, disk usage, replication lag, backup status, connection count
Redis / Redis Cluster
Load, memory usage, connection count, QPS
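Memory usage, connection count, and an approximate QPS are all exposed in the text returned by the Redis INFO command, which uses `key:value` lines grouped under `# Section` headers. The parser below is a small sketch that assumes you already have that text (for example from `redis-cli INFO`); the function names are hypothetical.

```python
def parse_redis_info(text: str) -> dict:
    """Parse the key:value lines of Redis INFO output into a dict of strings."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers such as "# Memory"
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def redis_summary(info: dict) -> dict:
    """Extract the fields that map to the metrics listed above."""
    return {
        "used_memory_bytes": int(info["used_memory"]),
        "connected_clients": int(info["connected_clients"]),
        "ops_per_sec": int(info["instantaneous_ops_per_sec"]),
    }
```

`instantaneous_ops_per_sec` is a short‑window rate computed by the server, so it serves as a QPS reading without keeping counters on the monitoring side.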
Nginx
Status codes, connection info
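For connection info, Nginx's stub_status module returns a short plain‑text report. The sketch below parses that report into numbers; it assumes the text has already been fetched from whatever URL you configured, and the function name and the derived `dropped` field are illustrative choices.

```python
import re


def parse_stub_status(text: str) -> dict:
    """Parse Nginx stub_status output, which looks like:

        Active connections: 291
        server accepts handled requests
         16630948 16630948 31070465
        Reading: 6 Writing: 179 Waiting: 106
    """
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = (int(v) for v in totals.groups())
    reading, writing, waiting = (int(v) for v in states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        "dropped": accepts - handled,  # accepted but never handled
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }
```

Status code counts are not part of this report; they are typically derived from the access log instead.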
Other services: RabbitMQ, PHP‑FPM, OpenLDAP (IP, call count), Zimbra, OpenVPN (version, online users, traffic), ELK, Graylog, GitLab, Jenkins, MongoDB, HAProxy
Network
Network quality, public egress, dedicated line bandwidth, network devices
Traffic Analysis
Log Monitoring
Security Monitoring
URL/API monitoring, custom solutions, Alibaba Cloud options
Performance Monitoring (APM)
PinPoint, Zipkin, SkyWalking, CAT, Jaeger
Business Monitoring (e.g., e‑commerce)
Orders per minute, registrations per minute, active users per minute, promotional activity counts, traffic, and profit generated by campaigns
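Per‑minute business metrics such as orders or registrations reduce to counting events in one‑minute buckets. The following sketch assumes each event carries a Unix timestamp in seconds; the function name and the UTC bucket label format are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Dict, Iterable


def events_per_minute(timestamps: Iterable[int]) -> Dict[str, int]:
    """Count events per UTC minute from Unix timestamps given in seconds."""
    counts = Counter()
    for ts in timestamps:
        minute_start = ts - ts % 60  # truncate to the start of the minute
        label = datetime.fromtimestamp(minute_start, tz=timezone.utc)
        counts[label.strftime("%Y-%m-%d %H:%M")] += 1
    return dict(counts)
```

A sudden drop in such a series is often an earlier signal of trouble than any system‑level metric, which is why business metrics belong on the same dashboards.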
Other
SSL certificate status
Process liveness, port listening, log rotation
Health metrics such as MQ backlog
API success rate, latency, QPS
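Of the items above, SSL certificate status is easy to check with the standard library. The sketch below computes the days remaining from a certificate's `notAfter` field; `fetch_not_after` shows one way to obtain that field from a live host. Both function names are hypothetical, and the warning threshold is left to the caller.

```python
import socket
import ssl


def days_until_expiry(not_after: str, now: float) -> int:
    """Whole days between `now` (Unix seconds) and the certificate's notAfter.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT". The result is negative once expired.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    return int((expires_at - now) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port over TLS and return the certificate's notAfter."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A typical use would call `days_until_expiry(fetch_not_after(host), time.time())` on a schedule and alert when the result falls below a chosen number of days.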
Alert Channels
SMS
Instant messaging (DingTalk, WeChat, Enterprise WeChat)
Phone calls
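These channels are usually combined through escalation: an alert starts on the least intrusive channel and moves to more intrusive ones while it stays unacknowledged. The policy below is a hypothetical example; the delays and channel order are assumptions, not recommendations from this article.

```python
from typing import List, Tuple

# (minutes unacknowledged, channel) pairs; values are illustrative assumptions.
ESCALATION_STEPS: List[Tuple[int, str]] = [
    (0, "instant_message"),
    (5, "sms"),
    (15, "phone_call"),
]


def channels_to_notify(minutes_unacknowledged: float) -> List[str]:
    """Return every channel that should have fired by this point."""
    return [
        channel
        for delay, channel in ESCALATION_STEPS
        if minutes_unacknowledged >= delay
    ]
```

Keeping the policy as data rather than code makes it straightforward to vary the steps per severity level or per team.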
Alert Handling
Self‑healing mechanisms (e.g., automatically restarting a failed service) using Supervisor, systemd, or custom scripts.
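A custom self‑healing script can be as small as "check, then restart". The sketch below uses `systemctl is-active --quiet` and `systemctl restart`; the function names and return values are hypothetical, and the command runner is injectable so the logic can be exercised without touching a real service. Note that systemd can also restart a unit on its own through the `Restart=` setting, which is often preferable to an external script.

```python
import subprocess
from typing import Callable, Sequence


def run_command(args: Sequence[str]) -> int:
    """Run a command and return its exit code."""
    return subprocess.run(list(args), check=False).returncode


def ensure_service_running(
    unit: str, run: Callable[[Sequence[str]], int] = run_command
) -> str:
    """Restart a systemd unit if it is not active.

    Returns "healthy", "restarted", or "restart_failed".
    """
    # `systemctl is-active --quiet` exits with 0 only when the unit is active.
    if run(["systemctl", "is-active", "--quiet", unit]) == 0:
        return "healthy"
    if run(["systemctl", "restart", unit]) == 0:
        return "restarted"
    return "restart_failed"
```

Any automatic restart should still raise an alert or be logged, otherwise a service that crashes repeatedly can go unnoticed.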
Comprehensive Monitoring
Hardware
Use SNMP for routers/switches; IPMI for other hardware. In public clouds, this layer may be omitted.
System
Standard OS metrics and custom data collection.
Service
Built‑in service metrics (e.g., Nginx status module, PHP‑FPM status)
Custom queries (e.g., MySQL SHOW GLOBAL STATUS, Redis INFO)
Network monitoring in hybrid clouds (Smokeping, smartping)
Security monitoring via cloud security groups, iptables, hardware firewalls, or Nginx+Lua web firewalls
Log monitoring with ELK or Graylog for error keyword detection
Business‑specific metrics tailored to each application
Traffic analysis using Baidu/Tencent analytics or self‑hosted Piwik (now Matomo)
Visualization dashboards
Automated monitoring via APIs for batch operations
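The error keyword detection mentioned for log monitoring can be illustrated without any log platform. The sketch below counts keyword hits in a batch of log lines; it is a stand‑in for what a rule in ELK or Graylog would do, not their API, and the function name and default keyword list are assumptions.

```python
import re
from collections import Counter
from typing import Dict, Iterable, Sequence


def count_error_keywords(
    lines: Iterable[str],
    keywords: Sequence[str] = ("ERROR", "FATAL", "Exception"),
) -> Dict[str, int]:
    """Count case-sensitive occurrences of each keyword across log lines."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords))
    counts = Counter()
    for line in lines:
        for match in pattern.findall(line):
            counts[match] += 1
    return dict(counts)
```

Alerting on the rate of matches per interval, rather than on each match, avoids a flood of notifications when one fault produces many log lines.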
Monitoring Summary
A complete monitoring system requires deep business understanding; software tools are merely enablers.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.