
How to Build an Effective Monitoring System for Reliable Operations

This article outlines the goals, methods, core steps, tools, metrics, and alert handling strategies essential for designing a comprehensive monitoring system that ensures system reliability and continuous business operation.

Efficient Ops

Monitoring Objectives

Understand the importance of monitoring and define business goals such as real‑time visibility of system health, ensuring reliability, and enabling rapid incident response.

Real‑time monitoring of target systems

Feedback on current status of hardware, software, and services

Guarantee reliability so that issues are reported instantly for operations staff to address

Monitoring Methods

Understand the monitored object (e.g., how a CPU works) before deciding what to measure

Define performance baseline metrics (CPU usage, load, user/kernel time, context switches)

Set alarm thresholds (e.g., what CPU load is considered high)

Design efficient fault‑handling processes
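The baseline and threshold steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the "mean plus three standard deviations" rule, and the sample load values are invented for the example and are not a method the article prescribes.

```python
import statistics

def build_baseline(samples):
    """Summarize historical samples of one metric (e.g. 1-minute CPU load)."""
    return {
        "mean": statistics.mean(samples),
        "stdev": statistics.pstdev(samples),  # population standard deviation
    }

def alarm_threshold(baseline, k=3.0):
    """Assumed rule: alarm when a value exceeds mean + k * stdev."""
    return baseline["mean"] + k * baseline["stdev"]

def is_abnormal(value, baseline, k=3.0):
    """True when the new value is above the alarm threshold."""
    return value > alarm_threshold(baseline, k)

# Hypothetical load history collected during normal operation.
history = [1.0, 1.2, 0.8, 1.1, 0.9]
baseline = build_baseline(history)
```

In practice the threshold is often a fixed value agreed with the service owners; a statistical rule like this one is just one way to derive a starting point from the baseline.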

Monitoring Core Steps

Problem discovery

Problem localization

Problem resolution

Post‑mortem analysis to prevent recurrence

Monitoring Tools

Traditional: Cacti, Nagios, Smokeping

Popular: Zabbix, OpenFalcon, Prometheus + Grafana, Nightingale, smartping (network), LEPUS (database), custom solutions

Third‑party: Jiankongbao, Tingyun, New Relic

Monitoring Process

Collect: Gather data via SNMP, agents, ICMP, SSH, IPMI, etc.

Store: Persist data in databases such as MySQL or PostgreSQL

Analyze: Generate graphs and timelines to aid fault location

Display: Show metric values and trends

Alert: Notify via phone, email, WeChat, SMS, with escalation mechanisms

Handle: Classify incident severity and assign responders for rapid remediation
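To make the flow concrete, here is a minimal sketch of the store, analyze, and alert stages. It uses an in-memory SQLite database purely as a stand-in for MySQL or PostgreSQL so the example is self-contained; the table layout, function names, host name, and the 80% threshold are all assumptions made for illustration.

```python
import sqlite3
import time

def open_store():
    """Create an in-memory table of samples (stand-in for MySQL/PostgreSQL)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE samples (ts REAL, host TEXT, metric TEXT, value REAL)"
    )
    return conn

def store(conn, host, metric, value, ts=None):
    """Persist one collected sample."""
    when = time.time() if ts is None else ts
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                 (when, host, metric, value))

def analyze(conn, host, metric, last_n=3):
    """Average of the most recent last_n samples, or None if there are none."""
    rows = conn.execute(
        "SELECT value FROM samples WHERE host = ? AND metric = ? "
        "ORDER BY ts DESC LIMIT ?",
        (host, metric, last_n),
    ).fetchall()
    return sum(r[0] for r in rows) / len(rows) if rows else None

def alert(host, metric, avg, threshold):
    """Return an alert message, or None when the average is within bounds."""
    if avg is not None and avg > threshold:
        return f"ALERT {host} {metric} avg={avg:.1f} > {threshold}"
    return None

# Hypothetical CPU usage samples (percent) for one host.
conn = open_store()
for ts, value in enumerate([40.0, 85.0, 90.0, 95.0]):
    store(conn, "web-1", "cpu_usage", value, ts=float(ts))
avg = analyze(conn, "web-1", "cpu_usage")
message = alert("web-1", "cpu_usage", avg, threshold=80.0)
```

Averaging the last few samples instead of alerting on a single reading is a common way to avoid paging on short spikes; delivery of the message and the handling stage are left out of the sketch.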

Monitoring Metrics

Hardware

CPU temperature, physical/virtual disks, motherboard temperature, RAID status (via MegaCli, IPMI)

System

Host availability, CPU/memory/disk usage, inode usage, load, network bandwidth, TCP connections, disk I/O
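Several of these system metrics can be read with the Python standard library alone. The sketch below assumes a Unix-like host and covers load, disk usage, and inode usage; the function name and the returned dictionary keys are made up for the example, and a real agent would collect far more.

```python
import os
import shutil

def collect_system_metrics(path="/"):
    """Collect a few OS-level metrics for the filesystem mounted at path."""
    try:
        load1 = os.getloadavg()[0]  # 1-minute load average
    except OSError:
        load1 = None  # load average is not obtainable on this platform
    cores = os.cpu_count() or 1

    disk = shutil.disk_usage(path)
    disk_used_pct = 100.0 * disk.used / disk.total if disk.total else 0.0

    fs = os.statvfs(path)
    # Some filesystems report 0 total inodes; treat that as 0% used.
    if fs.f_files:
        inode_used_pct = 100.0 * (fs.f_files - fs.f_ffree) / fs.f_files
    else:
        inode_used_pct = 0.0

    return {
        "load1": load1,
        "load_per_core": None if load1 is None else load1 / cores,
        "disk_used_pct": disk_used_pct,
        "inode_used_pct": inode_used_pct,
    }
```

Dividing load by the core count makes one threshold usable across hosts of different sizes, and tracking inode usage separately catches the case where a disk has free space but can no longer create files.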

Application

MySQL

Service availability, memory usage, disk usage, replication lag, backup status, connection count

Redis / Redis Cluster

Load, memory usage, connection count, QPS

Nginx

Status codes, connection info

Other services: RabbitMQ, PHP‑FPM, OpenLDAP (IP, call count), Zimbra, OpenVPN (version, online users, traffic), ELK, Graylog, GitLab, Jenkins, MongoDB, HAProxy

Network

Network quality, public egress, dedicated line bandwidth, network devices

Traffic Analysis

Log Monitoring

Security Monitoring

URL/API monitoring, custom solutions, Alibaba Cloud options

Performance Monitoring (APM)

PinPoint, Zipkin, SkyWalking, CAT, Jaeger

Business Monitoring (e.g., e‑commerce)

Orders per minute, registrations per minute, active users per minute, promotional activity counts, traffic, and profit generated by campaigns
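Per-minute business metrics such as these usually come down to bucketing event timestamps by minute. A minimal sketch, assuming order timestamps are already available as datetime objects (the function name and sample data are hypothetical):

```python
from collections import Counter
from datetime import datetime

def per_minute_counts(timestamps):
    """Count events per calendar minute by truncating seconds."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)

# Hypothetical order creation times.
orders = [
    datetime(2024, 1, 1, 10, 0, 5),
    datetime(2024, 1, 1, 10, 0, 40),
    datetime(2024, 1, 1, 10, 1, 10),
]
counts = per_minute_counts(orders)
```

The same function works for registrations or active users; alerting typically compares the latest bucket against the same minute on a previous day rather than a fixed number.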

Other

SSL certificate status

Process liveness, port listening, log rotation

Health metrics such as MQ backlog

API success rate, latency, QPS
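Two of the checks in this list, SSL certificate status and port listening, can be sketched with the standard library. The function names and timeouts below are assumptions for illustration; `fetch_not_after` opens a real TLS connection and is therefore not called in the example itself.

```python
import socket
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days left before a certificate expires.

    not_after uses the certificate date format returned by getpeercert(),
    e.g. 'Jan  1 00:00:00 2030 GMT'.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return int((expires - current) // 86400)

def fetch_not_after(host, port=443, timeout=5.0):
    """Read the notAfter field of the certificate served by host:port."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]

def port_is_listening(host, port, timeout=2.0):
    """True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A typical rule would warn when `days_until_expiry(fetch_not_after(host))` drops below some agreed number of days, leaving enough time to renew.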

Alert Channels

Email

SMS

Instant messaging (DingTalk, WeChat, Enterprise WeChat)

Phone calls
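These channels are often combined into an escalation policy, where an alert that stays unacknowledged moves to progressively more intrusive channels. The sketch below shows one possible shape; the time limits and channel identifiers are invented for the example and should be tuned to each team.

```python
# Assumed policy: (minutes unacknowledged, channel to add at that point).
ESCALATION_POLICY = [
    (0, "email"),
    (5, "instant_message"),
    (15, "sms"),
    (30, "phone"),
]

def channels_for(minutes_unacknowledged, policy=ESCALATION_POLICY):
    """Return every channel that should have fired by this point."""
    return [channel for after, channel in policy
            if minutes_unacknowledged >= after]
```

For example, `channels_for(20)` returns email, instant message, and SMS, while the phone call is held back until the 30-minute mark.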

Alert Handling

Self‑healing mechanisms (e.g., automatically restarting a failed service) using Supervisor, systemd, or custom scripts.
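A custom self-healing script can be as small as "check, restart, check again". The sketch below assumes a systemd-managed service and shells out to `systemctl`; the function names, the retry count, and the three result strings are assumptions for the example. The check and restart functions are passed in as parameters so the logic can be exercised without touching a real service.

```python
import subprocess

def service_is_active(unit):
    """True if `systemctl is-active --quiet <unit>` exits with status 0."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", unit])
    return result.returncode == 0

def restart_service(unit):
    """Ask systemd to restart the unit; True if the command succeeded."""
    return subprocess.run(["systemctl", "restart", unit]).returncode == 0

def heal(unit, is_active=service_is_active, restart=restart_service,
         max_attempts=3):
    """Return 'healthy', 'restarted', or 'failed' for the given unit."""
    if is_active(unit):
        return "healthy"
    for _ in range(max_attempts):
        restart(unit)
        if is_active(unit):
            return "restarted"
    return "failed"
```

A "failed" result is the point at which the problem should be escalated to a person, since repeated automatic restarts can hide the underlying fault.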

Comprehensive Monitoring

Hardware

Use SNMP for routers/switches; IPMI for other hardware. In public clouds the provider manages the hardware, so this layer can usually be omitted.

System

Standard OS metrics and custom data collection.

Service

Built‑in service metrics (e.g., Nginx status module, PHP‑FPM status)

Custom queries (e.g., MySQL SHOW GLOBAL STATUS, Redis INFO)
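As an illustration of a custom query, the text returned by Redis INFO is a set of `key:value` lines grouped under `# Section` headers, which makes it easy to turn into metrics. The parser below is a minimal sketch; the sample text is hand-written with invented values, and the way the output is fetched from the server is left out.

```python
def parse_redis_info(text):
    """Parse Redis INFO output into a dict of string values."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, _, value = line.partition(":")
        info[key] = value
    return info

# Hand-written sample in the INFO format; the values are invented.
SAMPLE = (
    "# Clients\r\nconnected_clients:12\r\n\r\n"
    "# Memory\r\nused_memory:1048576\r\n\r\n"
    "# Stats\r\ninstantaneous_ops_per_sec:350\r\n"
)
info = parse_redis_info(SAMPLE)
connection_count = int(info["connected_clients"])
memory_bytes = int(info["used_memory"])
qps = int(info["instantaneous_ops_per_sec"])
```

These three fields map to the connection count, memory usage, and QPS metrics; MySQL's SHOW GLOBAL STATUS can be handled in the same spirit, as rows of name and value pairs.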

Network monitoring in hybrid clouds (Smokeping, smartping)

Security monitoring via cloud security groups, iptables, hardware firewalls, or Nginx+Lua web firewalls

Log monitoring with ELK or Graylog for error keyword detection

Business‑specific metrics tailored to each application

Traffic analysis using Baidu/Tencent analytics or self‑hosted Piwik (now Matomo)

Visualization dashboards

Automated monitoring via APIs for batch operations
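The error keyword detection mentioned for log monitoring can be illustrated without a full ELK or Graylog stack. The sketch below scans lines for a small set of keywords; the keyword list, function name, and sample log lines are assumptions for the example, and a real deployment would express the same idea as a query or alert rule in the log platform.

```python
import re

# Assumed keyword list; adjust per application.
ERROR_PATTERN = re.compile(r"error|fatal|exception|traceback", re.IGNORECASE)

def scan_log_lines(lines, pattern=ERROR_PATTERN):
    """Return (line_number, line) pairs for lines that match the pattern."""
    return [(number, line)
            for number, line in enumerate(lines, start=1)
            if pattern.search(line)]

# Hypothetical log lines.
sample_log = [
    "INFO service started",
    "ERROR database timeout after 30s",
    "WARN slow query detected",
    "java.lang.NullPointerException at Foo.bar",
]
hits = scan_log_lines(sample_log)
```

Plain keyword matching is noisy (a line saying "0 errors" would match too), so counts over a time window usually make a better alert signal than individual hits.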

Monitoring Summary

A complete monitoring system requires deep business understanding; software tools are merely enablers.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
