
Essential Guide to Effective Monitoring in Operations: Goals, Methods, and Tools

This article outlines the essential components of operational monitoring, covering monitoring objectives, methods, core processes, key tools, metrics for hardware, system, application, network, and business layers, as well as alerting, handling, and best practices for building a comprehensive, reliable monitoring solution.

Efficient Ops

Monitoring Objectives

Understanding the importance of monitoring and the business goals it should achieve.

Real‑time monitoring of target systems.

Feedback on current status of hardware, software, and services.

Ensure reliability so issues are reported instantly for rapid response.

Monitoring Methods

Understand the object being monitored (e.g., how the CPU works).

Define performance baseline metrics such as CPU usage, load, user/kernel time, context switches.

Set alarm thresholds (e.g., what CPU load is considered high).

Establish fault‑handling procedures for efficient resolution.
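As a minimal sketch of the baseline and threshold steps above, the snippet below classifies the 1-minute load average relative to the number of CPU cores. The threshold values (0.7 and 1.0 per core) and the function name are illustrative assumptions, not values from this article; a real baseline should come from observing the system over time.

```python
import os

# Hypothetical thresholds (assumptions): load per core at or above these
# values is treated as warning / critical.
WARN_PER_CORE = 0.7
CRIT_PER_CORE = 1.0


def classify_load(load_1m: float, cores: int,
                  warn: float = WARN_PER_CORE,
                  crit: float = CRIT_PER_CORE) -> str:
    """Return 'ok', 'warning', or 'critical' for a 1-minute load average."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    per_core = load_1m / cores  # normalize so thresholds work on any host size
    if per_core >= crit:
        return "critical"
    if per_core >= warn:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load_1m, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load={load_1m:.2f} cores={cores} -> {classify_load(load_1m, cores)}")
```

Normalizing by core count is one possible design choice: a load of 4.0 means something very different on a 4-core host than on a 64-core host.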

Monitoring Core

Problem discovery.

Problem localization.

Problem resolution.

Summarize causes and preventive measures to avoid recurrence.

Monitoring Tools

Traditional tools: Cacti, Nagios, SmokePing.

Popular tools: Zabbix, Open-Falcon, Prometheus + Grafana, Nightingale, SmartPing, Lepus, custom solutions.

Third‑party services: Jiankongbao, Tingyun, New Relic.

Monitoring Process

Collect: Gather data via SNMP, agents, ICMP, SSH, IPMI, etc.

Store: Persist the collected data in a database such as MySQL or PostgreSQL.

Analyze: Provide graphs and timelines that help locate faults.

Display: Show metric values and trends.

Alert: Notify via phone, email, WeChat, or SMS, with escalation.

Handle: Determine the fault level and assign responders for rapid remediation.
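The collect, store, analyze, and alert stages above can be sketched in a few functions. This is only an illustration under stated assumptions: SQLite (from the Python standard library) stands in for MySQL or PostgreSQL, and the table name, column names, host name, and threshold are all hypothetical.

```python
import sqlite3
import time


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Store stage: open a database and make sure the samples table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(ts REAL, host TEXT, metric TEXT, value REAL)"
    )
    return conn


def store(conn, host, metric, value, ts=None):
    """Persist one collected sample (the collector itself is out of scope)."""
    ts = time.time() if ts is None else ts
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                 (ts, host, metric, value))
    conn.commit()


def recent_average(conn, host, metric, since_ts):
    """Analyze stage: average of a metric since a timestamp, or None."""
    row = conn.execute(
        "SELECT AVG(value) FROM samples "
        "WHERE host = ? AND metric = ? AND ts >= ?",
        (host, metric, since_ts),
    ).fetchone()
    return row[0]


def check(conn, host, metric, since_ts, threshold):
    """Alert stage: return an alert message, or None when within threshold."""
    avg = recent_average(conn, host, metric, since_ts)
    if avg is not None and avg > threshold:
        return f"ALERT {host} {metric} avg={avg:.1f} > {threshold}"
    return None


if __name__ == "__main__":
    conn = open_store()
    for value in (90, 95):
        store(conn, "web-1", "cpu_pct", value)
    print(check(conn, "web-1", "cpu_pct", time.time() - 300, 80))
```

Alerting on an average over a window rather than on a single sample is one way to avoid paging on short spikes; the window length is a tuning decision.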

Monitoring Metrics

Hardware

CPU temperature, physical/virtual disks, motherboard temperature, RAID status (e.g., via MegaCli).

System

Host availability, CPU/memory/disk usage, inode, load, network bandwidth, TCP connections, disk I/O.
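Disk usage and inode usage are tracked separately because a filesystem can run out of inodes while still having free space. The following sketch reads both using only the standard library; it assumes a Unix-like host, and the function names are illustrative.

```python
import os
import shutil


def disk_usage_pct(path: str = "/") -> float:
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    # Note: this can differ slightly from `df`, which accounts for
    # blocks reserved for the superuser.
    return 100.0 * usage.used / usage.total


def inode_usage_pct(path: str = "/") -> float:
    """Percentage of inodes used on the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    if st.f_files == 0:
        return 0.0  # some filesystems do not report inode counts
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files


if __name__ == "__main__":
    print(f"disk:  {disk_usage_pct('/'):.1f}%")
    print(f"inode: {inode_usage_pct('/'):.1f}%")
```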

Application

MySQL

Service availability, memory usage, disk usage, replication lag, backup status, connection count.

Redis / Redis Cluster

Load, memory usage, connection count, QPS.
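One way to obtain these Redis values is to parse the text returned by the `INFO` command. The sketch below works on the raw text so it needs no client library; the sample text is made up for illustration, and how the text is fetched (for example with `redis-cli INFO`) is left as an assumption.

```python
# Made-up sample of INFO output, for illustration only.
SAMPLE_INFO = (
    "# Clients\r\n"
    "connected_clients:12\r\n"
    "# Memory\r\n"
    "used_memory:524288\r\n"
    "maxmemory:1048576\r\n"
    "# Stats\r\n"
    "instantaneous_ops_per_sec:340\r\n"
)


def parse_info(text: str) -> dict:
    """Turn 'key:value' lines into a dict, skipping blanks and '# Section' lines."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


def redis_health(info: dict) -> dict:
    """Extract connection count, QPS, and memory usage from parsed INFO data."""
    used = int(info["used_memory"])
    maxmem = int(info.get("maxmemory", "0"))
    return {
        "connected_clients": int(info["connected_clients"]),
        "ops_per_sec": int(info["instantaneous_ops_per_sec"]),
        # maxmemory of 0 means no limit is configured, so no percentage exists.
        "memory_used_pct": 100.0 * used / maxmem if maxmem else None,
    }


if __name__ == "__main__":
    print(redis_health(parse_info(SAMPLE_INFO)))
```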

Nginx

Status codes, connection info.
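For connection information, Nginx can expose a small plain-text status page through its stub_status module. The sketch below parses that text; it assumes the module is enabled and that the page content has already been fetched, and the sample values are invented for illustration.

```python
import re

# Invented sample in the stub_status text layout.
SAMPLE_STATUS = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict:
    """Parse stub_status output into integer counters."""
    active = re.search(r"Active connections:\s*(\d+)", text)
    totals = re.search(r"^\s*(\d+)\s+(\d+)\s+(\d+)\s*$", text, re.MULTILINE)
    states = re.search(
        r"Reading:\s*(\d+)\s+Writing:\s*(\d+)\s+Waiting:\s*(\d+)", text
    )
    if not (active and totals and states):
        raise ValueError("unrecognized stub_status output")
    accepts, handled, requests = map(int, totals.groups())
    reading, writing, waiting = map(int, states.groups())
    return {
        "active": int(active.group(1)),
        "accepts": accepts,
        "handled": handled,
        "requests": requests,
        # Accepted but not handled connections were dropped.
        "dropped": accepts - handled,
        "reading": reading,
        "writing": writing,
        "waiting": waiting,
    }


if __name__ == "__main__":
    print(parse_stub_status(SAMPLE_STATUS))
```

The `accepts`, `handled`, and `requests` values are cumulative counters, so a collector would typically store the difference between two scrapes rather than the raw numbers.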

Other services: RabbitMQ, PHP‑FPM, OpenLDAP (IP, call count), Zimbra, OpenVPN (version, online users, IPs, traffic), ELK, Graylog, GitLab, Jenkins, MongoDB, HAProxy.

Network

Network quality, internet egress, dedicated line bandwidth, network devices.

Traffic Analysis

(Content omitted for brevity)

Log Monitoring

Use ELK, Graylog for anomaly and error keyword detection.
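The idea behind error keyword detection can be illustrated without a full log platform. The sketch below is a plain-Python stand-in, not how ELK or Graylog implement it, and the keyword list is a hypothetical example.

```python
import re
from collections import Counter

# Hypothetical keyword list (assumption); adjust to the log formats in use.
ERROR_PATTERN = re.compile(r"\b(ERROR|FATAL|Exception|Traceback)\b")


def scan_lines(lines, pattern=ERROR_PATTERN) -> Counter:
    """Count how often each error keyword appears across the given log lines."""
    counts = Counter()
    for line in lines:
        for keyword in pattern.findall(line):
            counts[keyword] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2024-01-01 12:00:00 INFO service started",
        "2024-01-01 12:00:05 ERROR database timeout",
        "Exception in thread main",
    ]
    for keyword, n in scan_lines(sample).items():
        print(keyword, n)
```

Because the pattern uses word boundaries, a keyword embedded in a longer identifier (such as a class name ending in "Exception") is not counted; whether that is desirable depends on the logs being scanned.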

Security Monitoring

URL/API monitoring, custom solutions, Alibaba Cloud options.

Performance Monitoring (APM)

Pinpoint, Zipkin, SkyWalking, CAT, and Jaeger for distributed tracing across Java, PHP, Go, and Node.js services.

Business Monitoring

Example for e‑commerce: orders per minute, registrations per minute, active users, promotion activities, traffic and profit generated.
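A per-minute business metric such as orders per minute can be derived by bucketing event timestamps. The following is a minimal sketch; the function names and the idea of a minimum expected order count are illustrative assumptions.

```python
from collections import Counter
from datetime import datetime


def per_minute(timestamps) -> dict:
    """Count events per calendar minute, sorted by time."""
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(buckets.items()))


def minutes_below(counts: dict, floor: int) -> list:
    """Return the minutes whose event count is below the expected floor.

    Limitation: minutes with zero events never appear in counts, so a
    complete outage needs a separate check against the expected time range.
    """
    return [minute for minute, n in counts.items() if n < floor]


if __name__ == "__main__":
    orders = [
        datetime(2024, 1, 1, 12, 0, 5),
        datetime(2024, 1, 1, 12, 0, 40),
        datetime(2024, 1, 1, 12, 1, 10),
    ]
    counts = per_minute(orders)
    print(counts)
    print(minutes_below(counts, floor=2))
```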

Other

SSL certificate status.

Process liveness, port listening, log rotation.

Health metrics such as MQ backlog.

API success rate, latency, QPS.
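The API metrics in the last item can be computed from a list of request records. In this sketch, the record fields, the definition of success (any status below 500), and the choice of a nearest-rank 95th percentile are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ts: float          # seconds since epoch
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def summarize(requests, window_s: float) -> dict:
    """Compute success rate, p95 latency, and QPS for one time window."""
    if not requests or window_s <= 0:
        raise ValueError("need at least one request and a positive window")
    # Assumption: only 5xx responses count as failures.
    ok = sum(1 for r in requests if r.status < 500)
    latencies = sorted(r.latency_ms for r in requests)
    rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
    return {
        "success_rate": ok / len(requests),
        "p95_ms": latencies[rank - 1],
        "qps": len(requests) / window_s,
    }


if __name__ == "__main__":
    sample = [Request(float(i), 200, 10.0 * (i + 1)) for i in range(9)]
    sample.append(Request(9.0, 500, 1000.0))
    print(summarize(sample, window_s=10))
```

A percentile is used instead of the mean because a small number of very slow requests can hide behind a healthy-looking average.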

Alerting

Email, SMS, DingTalk/WeChat/Enterprise WeChat, phone.
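Escalation across these channels can be expressed as a simple policy table: the longer an alert stays unacknowledged, the more intrusive the channel. The delays and channel order below are hypothetical values, not a recommendation from this article.

```python
# Hypothetical policy (assumption): (minutes unacknowledged, channel).
ESCALATION = [
    (0, "email"),
    (5, "wechat"),
    (15, "sms"),
    (30, "phone"),
]


def channels_due(minutes_unacked: float, policy=ESCALATION) -> list:
    """Return every channel that should have been notified by now."""
    return [channel for after, channel in policy if minutes_unacked >= after]


if __name__ == "__main__":
    for minutes in (0, 16, 45):
        print(minutes, channels_due(minutes))
```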

Alarm Handling

Self‑healing mechanisms, such as automatically restarting a failed service with Supervisor, systemd, or custom scripts.
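A custom self-healing script can be as small as a restart loop with a retry limit, which is conceptually similar to Supervisor's `autorestart` setting or systemd's `Restart=on-failure`. The sketch below is an illustration only; the function name, retry limit, and backoff values are assumptions.

```python
import subprocess
import sys
import time


def run_with_restarts(cmd, max_restarts: int = 3, backoff_s: float = 1.0):
    """Run cmd and restart it after a non-zero exit, up to max_restarts times.

    Returns (restarts_performed, last_exit_code).
    """
    restarts = 0
    while True:
        code = subprocess.run(cmd).returncode
        if code == 0 or restarts >= max_restarts:
            return restarts, code
        restarts += 1
        # Wait a little longer after each failure to avoid a tight crash loop.
        time.sleep(backoff_s * restarts)


if __name__ == "__main__":
    # Demo command that always fails with exit code 3.
    failing = [sys.executable, "-c", "raise SystemExit(3)"]
    print(run_with_restarts(failing, max_restarts=2, backoff_s=0.1))
```

Capping the number of restarts matters: a service that keeps crashing should escalate to a human instead of restarting forever.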

Comprehensive Monitoring

Effective monitoring requires deep business understanding; software tools are merely enablers.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
