From Midnight Alerts to Peaceful Sleep: My Practical Enterprise‑Level Zabbix Monitoring System
I. Monitoring Pitfalls We’ve Encountered
At 3 am the phone rang: the production database was down. The old cron‑based check only noticed the failure at 2:50 am, and the alert email was delayed until 3:15 am, causing a 30‑minute transaction loss worth over ¥200,000.
The monitoring system is essentially the "nervous system" of IT infrastructure; it must be sensitive enough to detect problems, but not so noisy that it ends up crying wolf and the team learns to ignore it.
II. Why Zabbix? Not a Promotional Piece
Many tools exist—Prometheus, Nagios, Grafana, cloud monitoring—but Zabbix offers the best cost‑performance for small‑to‑medium enterprises.
Prometheus + Grafana : flexible but requires extra Alertmanager and high expertise.
Cloud monitoring services : charged per metric, can exceed ¥80,000 per year.
Nagios : cumbersome configuration, limited extensibility.
Zabbix : built‑in web UI, alerts, visualization, rich templates, ready‑to‑use.
Using Zabbix, the entire monitoring system was built in two weeks at the cost of a single server.
III. The Three‑Layer Defense Model for Monitoring
Rather than piling up checks at random, I use a three-layer design, much like defense in depth in security.
Layer 1: Infrastructure Monitoring (Must Not Fail)
Hardware : CPU, memory, disk, network.
System : processes, ports, log files.
Network : switches, routers, firewalls.
Practical tip: Do not trigger on a raw CPU 80% threshold; use a 10‑minute average to filter normal spikes.
# Zabbix trigger expression (5.4+ function syntax)
avg(/Linux-Server/system.cpu.util[,user],10m) > 80
# Alert only when the 10-minute average exceeds 80%, filtering transient spikes
Layer 2: Service Monitoring (Is It Usable?)
Application : web response time, API availability.
Database : connection count, slow queries, lock waits.
Middleware : Redis/MQ queue depth, Nginx connections.
Lesson learned: During a Double‑11 sale, all server metrics looked normal, but the database connection pool was full, causing order failures that were missed by a simple “service up” check.
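That incident is why I now watch connection-pool utilization directly instead of inferring it from host metrics. Below is a minimal sketch of a custom agent check (wired in as a UserParameter) that reports pool usage as a percentage; the pymysql dependency, the credentials, and the item key mysql.conn_usage are my own placeholders, not part of any stock template.
# check_mysql_conn_usage.py -- illustrative UserParameter script
# zabbix_agentd.conf:  UserParameter=mysql.conn_usage,/usr/bin/python3 /etc/zabbix/check_mysql_conn_usage.py
import pymysql  # assumed driver; any MySQL client library works

conn = pymysql.connect(host="127.0.0.1", user="zbx_monitor", password="***", database="mysql")
with conn.cursor() as cur:
    cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
    used = int(cur.fetchone()[1])
    cur.execute("SHOW VARIABLES LIKE 'max_connections'")
    limit = int(cur.fetchone()[1])
conn.close()

# Zabbix reads whatever the script prints; a trigger such as last(/host/mysql.conn_usage) > 85 fires before the pool is exhausted
print(round(used / limit * 100, 1))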
Standard web‑scenario configuration:
# Zabbix built‑in web scenario monitoring
Scenario name: User login full flow
Step 1: GET home page (expect HTTP 200)
Step 2: POST /login (expect token)
Step 3: GET /api/userinfo (expect JSON with user_id)
# Execute every minute; any step failure triggers an alert
Layer 3: Business Metric Monitoring (Revenue Impact)
Orders per minute (drop to 0 may indicate system failure).
Payment success rate (anything below 98% is losing money).
Online user count (sudden surge may indicate attack).
Real case: An e‑commerce client saw order volume fall from 500‑800/min to 50/min at night; the alert revealed upstream rate‑limiting on the payment API, preventing a potential million‑yuan loss.
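Business metrics like these rarely exist as ready-made agent items, so we push them in ourselves. A minimal sketch, assuming a Zabbix trapper item with key orders.per_minute on host shop-app-01 and an orders table to count from (all names are mine; adjust to your schema):
# push_orders_per_minute.py -- run from cron every minute
import subprocess
import pymysql  # assumed driver

conn = pymysql.connect(host="db.internal", user="report", password="***", database="shop")
with conn.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders WHERE created_at >= NOW() - INTERVAL 1 MINUTE")
    orders = cur.fetchone()[0]
conn.close()

# zabbix_sender ships with Zabbix; the target item must be of type "Zabbix trapper"
subprocess.run(
    ["zabbix_sender", "-z", "zabbix-server", "-s", "shop-app-01", "-k", "orders.per_minute", "-o", str(orders)],
    check=True,
)
A trigger on that item (for example, the 5-minute maximum falling below your normal nighttime floor) then covers the "orders suddenly dropped" case directly.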
IV. The Art of Alert Design: Less Is More
Common mistake: Over‑alerting creates noise; the team ends up silencing alerts.
My Alert Severity Levels
P0 (Immediate response, repeated phone calls)
Core business system down.
Primary database unavailable.
Payment interface completely failing.
// Zabbix action configuration
Trigger condition: severity = disaster
Actions:
1. Send SMS instantly to on‑call staff (rotate 3 people)
2. Phone notification every 5 min until acknowledged
3. Push to DingTalk/WeChat Work
P1 (Handled within 1 hour, SMS)
Backup system failure.
Disk usage > 85 %.
Sudden surge in DB slow queries.
P2 (Work‑hour handling, email)
Non‑core service anomaly.
Minor performance degradation.
Backup job failure.
Key technique: alert deduplication
# Use Zabbix Trigger Dependencies
Master trigger: host unreachable
Dependent triggers: all other alerts from that host
# Logic: if the host is down, suppress CPU high alerts
V. Best Practices Summarized from Pitfalls
1. Monitor Zabbix Itself (Avoid Single Point of Failure)
Deploy a separate server to monitor the Zabbix server, or use a cloud monitoring service as a fallback, and regularly check Zabbix database size.
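As a cheap external fallback, I like a tiny watchdog on a box the Zabbix server does not manage: it polls the API and shouts through a separate channel if it gets no answer. A rough sketch; the URLs are placeholders.
# zabbix_watchdog.py -- run from cron on a host NOT monitored by this Zabbix server
import requests

ZABBIX_API = "http://zabbix-server/api_jsonrpc.php"                          # placeholder
FALLBACK_WEBHOOK = "https://oapi.dingtalk.com/robot/send?access_token=XXX"   # placeholder DingTalk robot

def zabbix_alive() -> bool:
    try:
        # apiinfo.version requires no authentication, so it works as a clean liveness probe
        r = requests.post(
            ZABBIX_API,
            json={"jsonrpc": "2.0", "method": "apiinfo.version", "params": {}, "id": 1},
            timeout=5,
        )
        return r.ok and "result" in r.json()
    except requests.RequestException:
        return False

if not zabbix_alive():
    requests.post(
        FALLBACK_WEBHOOK,
        json={"msgtype": "text", "text": {"content": "Zabbix server is not responding"}},
        timeout=5,
    )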
2. Historical Data Retention Strategy
Zabbix can consume disk space quickly; set sensible retention periods.
-- Retention policy
Trend data (1‑hour average): keep 2 years
History data (raw): keep 90 days
Event data (alerts): keep 1 year
-- Set per-item retention (the Zabbix housekeeper then purges expired data)
UPDATE items SET history='90d', trends='730d';
3. Template-Based Management
Do not configure each host individually; create templates per service type and assign them.
Template hierarchy:
├── Template_OS_Linux_Base (common Linux)
├── Template_App_Nginx (web server)
├── Template_App_MySQL (database server)
└── Template_Business_OrderService (business system)
Host association: link a web host with "OS_Linux" + "Nginx" + "OrderService" templates
4. Automated Discovery vs Manual Registration
Small scale (<100 hosts): manual registration for fine-grained control.
Medium scale (100-500 hosts): network discovery + auto-registration.
Large scale (>500 hosts): integrate with a CMDB for dynamic sync.
# Host registration via the Zabbix API (for active agents, the built-in
# auto-registration action is simpler; this is the API route)
# Requires a valid API token in $ZBX_TOKEN
curl -s -X POST http://zabbix-server/api_jsonrpc.php \
  -H "Content-Type: application/json-rpc" \
  -d '{
    "jsonrpc": "2.0",
    "method": "host.create",
    "params": {
      "host": "'"$(hostname)"'",
      "groups": [{"groupid": "2"}],
      "templates": [{"templateid": "10001"}],
      "tags": [{"tag": "env", "value": "'"$ENV"'"}, {"tag": "region", "value": "'"$REGION"'"}]
    },
    "auth": "'"$ZBX_TOKEN"'",
    "id": 1
  }'
VI. Advanced Techniques: Making Zabbix Smarter
Intelligent Baseline Alerts
Traditional alert: CPU > 80 %. Intelligent alert: CPU exceeds the 7‑day same‑time average by 30 %.
// Alert if free disk space is predicted to reach 0 within the next hour
timeleft(/Linux-Server/vfs.fs.size[/,free],1h,0,"linear") < 3600
Integration with ITSM Systems
When a P0 incident occurs, automatically create a ticket in Jira or ZenTao.
# Zabbix Media Type script (Python) -- the endpoint and on-call lookup are illustrative
import requests

def get_oncall_engineer():
    # Placeholder: resolve the current on-call from your rota (PagerDuty, DB, etc.)
    return "oncall"

def create_ticket(alert_message):
    jira_api = "https://jira.company.com/api/ticket"  # internal ticketing endpoint
    ticket_data = {
        "title": f"[P0] {alert_message}",
        "assignee": get_oncall_engineer(),
        "labels": ["production", "urgent"]
    }
    requests.post(jira_api, json=ticket_data, timeout=10)
Capacity Planning
Use historical trends to forecast resource needs, e.g., disk growth predicts expansion in 45 days, DB connection growth signals pool scaling before the next promotion.
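To make that concrete, one approach is to pull hourly trend data through the API and fit a straight line to disk usage. A rough sketch using trend.get; the item ID, token, volume size, and 30-day window are all placeholders (the timeleft() trigger function shown earlier is the in-product way to do the same thing).
# disk_forecast.py -- capacity forecast from Zabbix trend data (Python 3.10+ for statistics.linear_regression)
import time
import statistics
import requests

API = "http://zabbix-server/api_jsonrpc.php"   # placeholder
TOKEN = "xxxxxxxx"                             # API token, placeholder
ITEMID = "45678"                               # itemid of vfs.fs.size[/,used], placeholder

trends = requests.post(API, json={
    "jsonrpc": "2.0",
    "method": "trend.get",
    "params": {
        "itemids": [ITEMID],
        "time_from": int(time.time()) - 30 * 86400,   # last 30 days of hourly trends
        "output": ["clock", "value_avg"],
    },
    "auth": TOKEN,
    "id": 1,
}, timeout=10).json()["result"]

xs = [int(p["clock"]) for p in trends]
ys = [float(p["value_avg"]) for p in trends]
slope, _ = statistics.linear_regression(xs, ys)   # growth in bytes per second

DISK_TOTAL = 500 * 1024 ** 3                      # 500 GB volume, placeholder
if slope > 0:
    days_left = (DISK_TOTAL - ys[-1]) / slope / 86400
    print(f"Disk projected to fill in {days_left:.0f} days")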
VII. The Future of Monitoring: AIOps Direction
Automated anomaly detection : machine‑learning models reduce false positives.
Root‑cause analysis : automatically trace dependency chains during failures.
Self‑healing : combine Ansible/SaltStack to auto‑remediate issues.
Cloud‑native integration : monitor Kubernetes, service mesh, serverless functions.
Example: when the Nginx process dies, an Ansible playbook restarts it; a human is notified only if the restart fails.
# Zabbix + Ansible linkage (run as a Zabbix remote command or via a webhook)
- name: Auto Restart Nginx
  hosts: webservers
  become: true
  tasks:
    - name: Check Nginx status
      ansible.builtin.service_facts:

    - name: Restart if down
      # service_facts keys are usually 'nginx.service' on systemd hosts
      when: ansible_facts.services['nginx.service'].state != 'running'
      ansible.builtin.service:
        name: nginx
        state: restarted
VIII. Final Thoughts
Building a monitoring system is not the goal; enabling the team to sleep peacefully is.
After two years, the system averages fewer than two P0 alerts per month, keeps false‑positive rate under 5 %, reduces MTTD from 30 minutes to 2 minutes, and MTTR from 1 hour to 15 minutes.
Actionable advice:
Start small—focus on core business metrics before expanding.
Continuously improve—after each incident, ask whether monitoring could have detected it earlier.
Document—maintain runbooks for each alert so newcomers can respond quickly.