How to Build a Practical Monitoring System for Small and Medium Enterprises
An in‑depth guide walks readers through building a comprehensive monitoring system for small‑to‑medium enterprises, covering hardware, system, application, network, security, traffic analysis, business metrics, log aggregation, automation, visualization, and practical integration with tools like Zabbix, IPMI, ELK, and Smokeping.
This article, compiled from the "Efficient Operations" WeChat group talks, explains how to construct a relatively complete monitoring system for small and medium enterprises.
Interview Begins
During interviews, candidates are often asked how their previous companies handled monitoring. The author uses a fictional newcomer, Xiao Wang, to illustrate the process.
1. Define Goals and Align Mindset
The ultimate goal of monitoring is to ensure continuous and stable business operation. Before implementing any tool, one must understand the monitoring objects, their metrics, and alarm thresholds.
2. The Story Begins
Xiao Wang, a fresh graduate, is tasked with setting up monitoring for an e‑commerce startup.
2.1 Hardware Monitoring
Basic hardware monitoring includes regular physical rack checks and using IPMI to read sensor data such as temperatures and disk health. Xiao Wang wrote a simple script that queries IPMI (for example with ipmitool) and sends an email when any temperature reading exceeds 50°C.
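Xiao Wang's script is not shown in the original, so the following is only a minimal sketch of the same idea. It assumes ipmitool is installed and that `ipmitool sdr type Temperature` prints pipe-separated rows ending in a reading like `23 degrees C` (the exact layout varies by vendor). The SMTP host and addresses are placeholders.

```python
import re
import smtplib
import socket
import subprocess
from email.message import EmailMessage

# Threshold from the article: alert when a temperature sensor exceeds 50°C.
TEMP_LIMIT_C = 50.0

# Matches readings such as "23 degrees C" in the last column of
# `ipmitool sdr type Temperature` output (assumed format; varies by vendor).
READING_RE = re.compile(r"(-?\d+(?:\.\d+)?)\s+degrees C")


def parse_temperatures(sdr_output: str) -> dict[str, float]:
    """Return {sensor_name: celsius} for every row that has a numeric reading."""
    temps = {}
    for line in sdr_output.splitlines():
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 5:
            continue  # not a sensor row
        match = READING_RE.search(fields[-1])
        if match:  # rows such as "No Reading" or "Disabled" are skipped
            temps[fields[0]] = float(match.group(1))
    return temps


def over_limit(temps: dict[str, float], limit: float = TEMP_LIMIT_C) -> dict[str, float]:
    """Keep only the sensors strictly above the limit."""
    return {name: t for name, t in temps.items() if t > limit}


def send_alert(hot: dict[str, float], smtp_host: str, to_addr: str) -> None:
    """Email the hot sensors; SMTP host and addresses are placeholders."""
    host = socket.gethostname()
    msg = EmailMessage()
    msg["Subject"] = f"[IPMI] {host}: temperature above {TEMP_LIMIT_C:g}°C"
    msg["From"] = f"monitor@{host}"
    msg["To"] = to_addr
    msg.set_content("\n".join(f"{name}: {t:g}°C" for name, t in sorted(hot.items())))
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def main(smtp_host: str = "localhost", to_addr: str = "ops@example.com") -> None:
    """One check run, e.g. from cron every few minutes (requires ipmitool)."""
    out = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    ).stdout
    hot = over_limit(parse_temperatures(out))
    if hot:
        send_alert(hot, smtp_host, to_addr)
```

Keeping parsing separate from sending makes the threshold logic easy to test without real hardware.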
2.2 System Monitoring
Key system metrics cover CPU, memory, and I/O. For CPU, monitor utilization, context switches, and run-queue length; a common rule of thumb is a run queue of no more than 3 runnable threads per core and a user/system time split of roughly 70/30. Common tools are top, vmstat, and mpstat. Memory monitoring covers usage, swap activity, and detecting memory leaks. I/O monitoring covers disk usage and iowait (iostat, iotop) as well as network traffic (iftop).
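The CPU figures above can also be computed directly from two snapshots of `/proc/stat`, which is roughly what vmstat and mpstat do. This is a Linux-only sketch; the function names are illustrative, and treating `procs_running` as the run queue is an assumption that matches vmstat's `r` column on recent kernels.

```python
import os
import time

# First eight counters on the aggregate "cpu" line of /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def parse_proc_stat(text: str) -> dict:
    """Extract aggregate CPU jiffies, total context switches and runnable tasks."""
    stat = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0] == "cpu":  # aggregate line only; per-core lines are "cpu0", "cpu1", ...
            stat["cpu"] = dict(zip(FIELDS, map(int, parts[1:9])))
        elif parts[0] == "ctxt":
            stat["ctxt"] = int(parts[1])
        elif parts[0] == "procs_running":
            stat["procs_running"] = int(parts[1])
    return stat


def cpu_report(prev: dict, cur: dict, interval_s: float, ncpu: int) -> dict:
    """Percentages over the interval, context switches/s and run queue per core."""
    delta = {k: cur["cpu"][k] - prev["cpu"][k] for k in FIELDS}
    total = sum(delta.values()) or 1
    user = delta["user"] + delta["nice"]
    system = delta["system"]
    # "Busy" excludes idle, iowait and steal time.
    busy = user + system + delta["irq"] + delta["softirq"]
    return {
        "util_pct": 100.0 * busy / total,
        "user_pct": 100.0 * user / total,
        "system_pct": 100.0 * system / total,
        "iowait_pct": 100.0 * delta["iowait"] / total,
        # Share of user time within user+system, compared against the ~70/30 rule.
        "user_share_pct": 100.0 * user / (user + system) if user + system else 0.0,
        "ctxt_per_s": (cur["ctxt"] - prev["ctxt"]) / interval_s,
        "runq_per_core": cur["procs_running"] / ncpu,
    }


def sample(interval_s: float = 1.0) -> dict:
    """Take two snapshots interval_s apart and flag a run queue above 3 per core."""
    with open("/proc/stat") as f:
        prev = parse_proc_stat(f.read())
    time.sleep(interval_s)
    with open("/proc/stat") as f:
        cur = parse_proc_stat(f.read())
    report = cpu_report(prev, cur, interval_s, os.cpu_count() or 1)
    report["runq_alert"] = report["runq_per_core"] > 3
    return report
```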
2.3 Application Service Monitoring
Monitor services such as Apache (mod_status), Nginx (stub_status), Memcached (the stats command), Redis (INFO), and the JVM (JMX). Simple scripts built on grep, awk, or netcat pull this status data, and API endpoints are checked with curl.
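As one concrete case, Nginx's stub_status page is small enough to parse in a few lines instead of chaining grep and awk. This sketch assumes stub_status is exposed at a local URL; the `/nginx_status` path is a placeholder.

```python
import re
import urllib.request

# Page order: "Active connections: N", then accepts/handled/requests,
# then Reading/Writing/Waiting.
STUB_KEYS = ("active", "accepts", "handled", "requests", "reading", "writing", "waiting")


def parse_stub_status(body: str) -> dict[str, int]:
    """Map the seven numbers in a stub_status page to names, in page order."""
    numbers = [int(n) for n in re.findall(r"\d+", body)]
    if len(numbers) != len(STUB_KEYS):
        raise ValueError(f"unexpected stub_status body: {body!r}")
    stats = dict(zip(STUB_KEYS, numbers))
    # accepts - handled > 0 means connections were dropped (e.g. worker_connections limit).
    stats["dropped"] = stats["accepts"] - stats["handled"]
    return stats


def fetch_stub_status(url: str = "http://127.0.0.1/nginx_status",
                      timeout: float = 3.0) -> dict[str, int]:
    """Fetch and parse; the URL is a placeholder for wherever stub_status is enabled."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_stub_status(resp.read().decode())
```

Note that `accepts`, `handled` and `requests` are cumulative counters, so a monitoring system should store them as rates (deltas per second).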
2.4 Introducing Zabbix
To avoid a proliferation of ad-hoc scripts, Xiao Wang adopts Zabbix and consolidates every monitoring domain into it:
Hardware: Zabbix IPMI interface
System: Zabbix agent
Java: Zabbix Java gateway (JMX)
Network devices: Zabbix SNMP
Application services: Zabbix UserParameter
MySQL: percona-monitoring-plugins
URL: Zabbix web monitoring
Zabbix also provides auto‑discovery, proxy‑based distributed monitoring, and flexible alarm routing (email, WeChat, SMS, DingTalk).
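A UserParameter makes the Zabbix agent run a command and use whatever it prints as the item value, so the earlier ad-hoc scripts can be folded in almost unchanged. The sketch below does this for Memcached's `stats` command. The item key `memcached.stat[*]` and the script path are hypothetical, and Memcached is assumed to listen on 127.0.0.1:11211.

```python
#!/usr/bin/env python3
# Hypothetical agent config line (key name and path are illustrative):
#   UserParameter=memcached.stat[*],/usr/local/bin/memcached_stat.py $1
import socket
import sys


def parse_stats(raw: str) -> dict[str, str]:
    """Turn 'STAT name value' lines from memcached's text protocol into a dict."""
    stats = {}
    for line in raw.splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str = "127.0.0.1", port: int = 11211, timeout: float = 2.0) -> str:
    """Send 'stats' and read until the terminating END line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return buf.decode()


def main(argv: list[str]) -> int:
    """Print one stat for the Zabbix agent; print nothing and fail if it is missing."""
    if len(argv) != 2:
        print("usage: memcached_stat.py <stat_name>", file=sys.stderr)
        return 1
    value = parse_stats(fetch_stats()).get(argv[1])
    if value is None:
        return 1
    print(value)
    return 0
```

When installing it as the agent command, end the file with `sys.exit(main(sys.argv))`; an item such as `memcached.stat[curr_connections]` would then return that single number.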
2.5 Traffic Analysis
Beyond basic logs, Xiao Wang evaluates traffic with Google Analytics, Baidu Tongji, and the open-source Piwik (since renamed Matomo) to obtain detailed visitor and conversion data.
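A self-hosted Matomo instance exposes its reports over an HTTP Reporting API, which lets visitor numbers flow into the same dashboards as system metrics. This is a sketch assuming a Matomo server at a placeholder URL and a valid `token_auth`; passing the token in the query string keeps the example short, but newer Matomo versions may prefer it in a POST body.

```python
import json
import urllib.request
from urllib.parse import urlencode


def matomo_report_url(base_url: str, site_id: int, method: str = "VisitsSummary.get",
                      period: str = "day", date: str = "yesterday",
                      token: str = "anonymous") -> str:
    """Build a Matomo Reporting API URL that returns JSON."""
    params = {
        "module": "API",
        "method": method,        # e.g. VisitsSummary.get for visits and actions
        "idSite": site_id,
        "period": period,        # day, week, month, year or range
        "date": date,            # today, yesterday or YYYY-MM-DD
        "format": "JSON",
        "token_auth": token,     # placeholder; use a real API token
    }
    return f"{base_url.rstrip('/')}/index.php?{urlencode(params)}"


def fetch_report(url: str, timeout: float = 10.0):
    """Download and decode one report."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```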
2.6 Network Monitoring
Because the e-commerce service has users nationwide, network quality is tracked with Smokeping (written in Perl, storing latency and packet-loss history in RRDtool), while commercial monitoring services cover CDN status.
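Smokeping's core idea is to send a batch of pings to each target every polling step and record the median round-trip time together with the packet loss. The sketch below reproduces only that summary step for one batch. Collecting the RTTs (Smokeping itself typically uses fping) is left out, and the alert thresholds are hypothetical.

```python
from statistics import median
from typing import Optional


def summarize_round(rtts_ms: list[Optional[float]]) -> dict:
    """Smokeping-style summary of one probe round; None marks a lost ping."""
    if not rtts_ms:
        raise ValueError("empty probe round")
    answered = sorted(r for r in rtts_ms if r is not None)
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    return {
        "median_ms": median(answered) if answered else None,
        "min_ms": answered[0] if answered else None,
        "max_ms": answered[-1] if answered else None,
        "loss_pct": loss_pct,
    }


def needs_alert(summary: dict, max_median_ms: float = 100.0,
                max_loss_pct: float = 5.0) -> bool:
    """Hypothetical thresholds: alert on high median latency or noticeable loss."""
    if summary["median_ms"] is None:
        return True  # every ping in the round was lost
    return summary["median_ms"] > max_median_ms or summary["loss_pct"] > max_loss_pct
```

Using the median rather than the mean keeps a single slow reply from masking an otherwise healthy link, while the min/max spread still shows jitter.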
2.7 Security Monitoring
Layer‑7 protection is added via an Nginx+Lua WAF, with logs sent to Elasticsearch and visualized in Kibana. A Python crawler periodically scans GitHub for sensitive keywords.
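The GitHub keyword scan can be built on GitHub's REST code-search endpoint, which requires an authenticated token and is rate-limited. This is a sketch, not the author's crawler. The keywords are up to you (company domains or internal hostnames are typical), and remembering already-reported results in a `seen` set is an assumption made so that each leak is only alerted once.

```python
import json
import urllib.request
from urllib.parse import urlencode

SEARCH_URL = "https://api.github.com/search/code"


def build_search_request(keyword: str, token: str, per_page: int = 50) -> urllib.request.Request:
    """Authenticated GitHub code-search request for one keyword."""
    query = urlencode({"q": keyword, "per_page": per_page})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Accept": "application/vnd.github+json",
                 "Authorization": f"Bearer {token}"},
    )


def new_hits(items: list[dict], seen: set[str]) -> list[str]:
    """Return result URLs not reported before and remember them in `seen`."""
    fresh = []
    for item in items:
        url = item["html_url"]
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


def scan(keywords: list[str], token: str, seen: set[str]) -> list[str]:
    """Run one scan over all keywords; persisting `seen` between runs is up to the caller."""
    found = []
    for kw in keywords:
        with urllib.request.urlopen(build_search_request(kw, token), timeout=15) as resp:
            found += new_hits(json.load(resp).get("items", []), seen)
    return found
```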
2.8 Business Monitoring
Business‑level KPIs such as orders per minute, registrations, DAU, and SMS usage are added to Zabbix, with appropriate thresholds and alerts.
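Business KPIs usually come from the application or database rather than from an agent check, so one common pattern is to push them into Zabbix "trapper" items, which is what the `zabbix_sender` utility does. The sketch below encodes the same sender protocol by hand: a `ZBXD` header, a flags byte, and a little-endian length, followed by JSON. The host name and item key `biz.orders_per_min` are hypothetical and must match a trapper item configured in Zabbix.

```python
import json
import socket
import struct


def build_sender_packet(host: str, metrics: dict[str, float]) -> bytes:
    """Encode trapper values in the Zabbix sender protocol (header + JSON body)."""
    payload = json.dumps({
        "request": "sender data",
        "data": [{"host": host, "key": k, "value": str(v)} for k, v in metrics.items()],
    }).encode()
    # "ZBXD", flags 0x01, then 4-byte data length + 4 reserved bytes,
    # i.e. one little-endian 8-byte integer for payloads under 4 GiB.
    return b"ZBXD\x01" + struct.pack("<Q", len(payload)) + payload


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or fail if the server closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("connection closed early")
        buf += chunk
    return buf


def send(packet: bytes, server: str, port: int = 10051, timeout: float = 5.0) -> dict:
    """Send one packet to a Zabbix server or proxy and decode its JSON reply."""
    with socket.create_connection((server, port), timeout=timeout) as sock:
        sock.sendall(packet)
        header = _recv_exact(sock, 13)
        (length,) = struct.unpack("<Q", header[5:13])
        return json.loads(_recv_exact(sock, length))
```

A cron job could count the last minute's orders and call `send(build_sender_packet("shop-web01", {"biz.orders_per_min": n}), "zabbix.example.com")`; the server's reply reports how many values it processed.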
2.9 Log Monitoring
System, application, and service logs are centralized using the ELK stack (Logstash → Elasticsearch → Kibana). Errors trigger Zabbix alerts for rapid response.
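One way to turn centralized logs into a Zabbix alert is to ask Elasticsearch how many error events arrived recently and feed that count to a Zabbix item (via a UserParameter or a trapper). This sketch uses Elasticsearch's `_count` API; the `logstash-*` index pattern and the `level` field are assumptions about how Logstash was configured, and no authentication is shown.

```python
import json
import urllib.request


def error_count_query(minutes: int = 5, level_field: str = "level",
                      level: str = "ERROR") -> dict:
    """Query body: events at the given level within the last `minutes` minutes."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"match": {level_field: level}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        }
    }


def count_errors(es_url: str, index: str = "logstash-*", minutes: int = 5) -> int:
    """POST the query to <index>/_count and return the number of matching events."""
    req = urllib.request.Request(
        f"{es_url.rstrip('/')}/{index}/_count",
        data=json.dumps(error_count_query(minutes)).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["count"]
```

Putting both conditions in a `filter` clause skips relevance scoring, which is unnecessary when only the count matters.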
2.10 Automation
Automation is achieved via Zabbix auto‑discovery (active) and Zabbix API calls (passive) tied to a CMDB, enabling automatic template assignment when new services appear.
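The CMDB-driven path can be sketched as a Zabbix JSON-RPC `host.create` call whose linked templates are chosen from the roles recorded in the CMDB. The role-to-template mapping, template IDs, group ID, and record fields below are all hypothetical. The `Authorization: Bearer` header applies to recent Zabbix versions (6.4 and later); older versions pass an `auth` token inside the JSON body instead.

```python
import json
import urllib.request

# Hypothetical mapping from CMDB roles to Zabbix template IDs
# (real IDs would come from a template.get call).
ROLE_TEMPLATES = {"nginx": ["10001"], "mysql": ["10002"], "redis": ["10003"]}


def host_create_payload(cmdb_record: dict, group_id: str) -> dict:
    """JSON-RPC body registering one CMDB host with role-based templates."""
    template_ids = []
    for role in cmdb_record["roles"]:
        for tid in ROLE_TEMPLATES.get(role, []):  # unknown roles get no template
            if tid not in template_ids:
                template_ids.append(tid)
    return {
        "jsonrpc": "2.0",
        "method": "host.create",
        "params": {
            "host": cmdb_record["hostname"],
            # type 1 = Zabbix agent interface, reached by IP on the default port.
            "interfaces": [{"type": 1, "main": 1, "useip": 1,
                            "ip": cmdb_record["ip"], "dns": "", "port": "10050"}],
            "groups": [{"groupid": group_id}],
            "templates": [{"templateid": tid} for tid in template_ids],
        },
        "id": 1,
    }


def call_api(api_url: str, token: str, payload: dict):
    """POST to api_jsonrpc.php and return the result, raising on API errors."""
    req = urllib.request.Request(
        api_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json-rpc",
                 "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        reply = json.load(resp)
    if "error" in reply:
        raise RuntimeError(reply["error"])
    return reply["result"]
```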
2.11 Visualization
Effective dashboards combine traditional monitoring data with business analytics to quickly pinpoint the root cause of anomalies such as sudden drops in order volume.
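The root-cause workflow described here can be sketched in two steps: detect that a business metric fell sharply below its recent baseline, then list the infrastructure events recorded around the same time. The 50% drop threshold, the 10-point window, and the `(timestamp, description)` event format are hypothetical choices for illustration.

```python
from statistics import mean


def sudden_drop(history: list[float], current: float,
                window: int = 10, max_drop: float = 0.5) -> bool:
    """True if current is more than max_drop (a fraction) below the recent mean."""
    recent = history[-window:]
    if not recent:
        return False
    baseline = mean(recent)
    return baseline > 0 and current < baseline * (1 - max_drop)


def events_near(ts: int, events: list[tuple[int, str]],
                window_s: int = 300) -> list[tuple[int, str]]:
    """Infrastructure events within +/- window_s seconds of ts, oldest first."""
    return sorted(e for e in events if abs(e[0] - ts) <= window_s)
```

Placing the two results side by side on a dashboard, for example an order-volume drop next to a database or CDN alert from the same five minutes, is what makes the cause quick to spot.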
Interview Ends
The author acknowledges that monitoring is an ongoing effort, with many additional topics like front‑end performance, code monitoring, and even public opinion monitoring.
FAQ
Is automatic fault remediation possible? Approaches vary; the author cites Tencent BlueKing as an example.
Should operations understand business? Yes, business awareness helps ops deliver value and troubleshoot user‑facing issues.
Is a CMDB necessary? A CMDB provides essential asset information for automated, reliable operations.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.