
From Midnight Alerts to Peaceful Sleep: Building a Zabbix Monitoring System

After a costly midnight outage, the author shares how he designed a three‑layer Zabbix monitoring architecture—covering infrastructure, service, and business metrics—optimizing alert thresholds, automating discovery, and integrating with ITSM, ultimately reducing MTTR to minutes and enabling teams to sleep peacefully.

Ops Community

From Midnight Alerts to Peaceful Sleep: My Practical Enterprise‑Level Zabbix Monitoring System

I. Monitoring Pitfalls We’ve Encountered

At 3 am the phone rang: the production database was down. The old cron‑based check only noticed the failure at 2:50 am, and the alert email was delayed until 3:15 am, causing a 30‑minute transaction loss worth over ¥200,000.

The monitoring system is essentially the "nervous system" of IT infrastructure: it must be sensitive enough to detect problems, but not so noisy that it becomes the boy who cried wolf.

II. Why Zabbix? Not a Promotional Piece

Many tools exist—Prometheus, Nagios, Grafana, cloud monitoring—but Zabbix offers the best cost‑performance for small‑to‑medium enterprises.

Prometheus + Grafana : flexible but requires extra Alertmanager and high expertise.

Cloud monitoring services : charged per metric, can exceed ¥80,000 per year.

Nagios : cumbersome configuration, limited extensibility.

Zabbix : built‑in web UI, alerts, visualization, rich templates, ready‑to‑use.

Using Zabbix, the entire monitoring system was built in two weeks at the cost of a single server.

III. The Three‑Layer Defense Model for Monitoring

Rather than piling up checks haphazardly, I use a three‑layer design, similar to defense in depth in security.

Layer 1: Infrastructure Monitoring (Must Not Fail)

Hardware : CPU, memory, disk, network.

System : processes, ports, log files.

Network : switches, routers, firewalls.

Practical tip: Do not trigger on a raw CPU 80% threshold; use a 10‑minute average to filter normal spikes.

# Zabbix trigger expression (new syntax, Zabbix 5.4+)
avg(/Linux-Server/system.cpu.util[,user],10m) > 80
# Fires only when the 10-minute average exceeds 80%, filtering transient spikes

Layer 2: Service Monitoring (Is It Usable?)

Application : web response time, API availability.

Database : connection count, slow queries, lock waits.

Middleware : Redis/MQ queue depth, Nginx connections.

Lesson learned: During a Double‑11 sale, all server metrics looked normal, but the database connection pool was full, causing order failures that were missed by a simple “service up” check.

Standard web‑scenario configuration:

# Zabbix built‑in web scenario monitoring
Scenario name: User login full flow
Step 1: GET home page (expect HTTP 200)
Step 2: POST /login (expect token)
Step 3: GET /api/userinfo (expect JSON with user_id)
# Execute every minute; any step failure triggers an alert
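The same scenario can also be created programmatically through the Zabbix API's `httptest.create` method. A minimal sketch of the request payload, assuming illustrative URLs; the host ID and API token are placeholders:

```python
# Build a JSON-RPC payload for Zabbix's httptest.create API.
# URLs, hostid, and the auth token are placeholders for illustration.

def build_login_scenario(hostid: str) -> dict:
    steps = [
        {"name": "GET home page", "url": "https://shop.example.com/",
         "status_codes": "200", "no": 1},
        {"name": "POST /login", "url": "https://shop.example.com/login",
         "posts": "user=monitor&pass=***", "required": "token", "no": 2},
        {"name": "GET /api/userinfo", "url": "https://shop.example.com/api/userinfo",
         "required": "user_id", "no": 3},
    ]
    return {
        "jsonrpc": "2.0",
        "method": "httptest.create",
        "params": {
            "name": "User login full flow",
            "hostid": hostid,
            "delay": "1m",        # execute every minute
            "steps": steps,
        },
        "auth": "YOUR_API_TOKEN",
        "id": 1,
    }

payload = build_login_scenario("10084")
print(payload["method"], len(payload["params"]["steps"]))
```

POSTing this payload to `api_jsonrpc.php` creates the scenario; keeping it in code makes the check reviewable and reproducible across environments.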

Layer 3: Business Metric Monitoring (Revenue Impact)

Orders per minute (drop to 0 may indicate system failure).

Payment success rate (every point below 98 % is lost revenue).

Online user count (sudden surge may indicate attack).

Real case: An e‑commerce client saw order volume fall from 500‑800/min to 50/min at night; the alert revealed upstream rate‑limiting on the payment API, preventing a potential million‑yuan loss.
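A business‑metric check like the one in that case can be sketched as a small classifier fed by a database query; the thresholds below mirror the 500‑800/min normal band from the example and are illustrative, and the result would typically be pushed into Zabbix via zabbix_sender or a UserParameter:

```python
# Classify orders-per-minute against a normal daytime band
# (500-800/min from the case above); thresholds are illustrative.

def classify_order_rate(orders_per_min: int, low: int = 500, high: int = 800) -> str:
    """Return an alert level for the current order rate."""
    if orders_per_min == 0:
        return "P0"        # total stop: likely a system failure
    if orders_per_min < low * 0.2:
        return "P1"        # e.g. the 500 -> 50/min drop in the real case
    if orders_per_min < low or orders_per_min > high * 2:
        return "P2"        # unusual but not critical; check during work hours
    return "OK"

print(classify_order_rate(650))  # normal traffic -> OK
print(classify_order_rate(50))   # the nighttime incident -> P1
```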

IV. The Art of Alert Design: Less Is More

Common mistake: Over‑alerting creates noise; the team ends up silencing alerts.

My Alert Severity Levels

P0 (Immediate response, repeated phone calls)

Core business system down.

Primary database unavailable.

Payment interface completely failing.

// Zabbix action configuration
Trigger condition: severity = disaster
Actions:
1. Send SMS instantly to on‑call staff (rotate 3 people)
2. Phone notification every 5 min until acknowledged
3. Push to DingTalk/WeChat Work

P1 (Handled within 1 hour, SMS)

Backup system failure.

Disk usage > 85 %.

Sudden surge in DB slow queries.

P2 (Work‑hour handling, email)

Non‑core service anomaly.

Minor performance degradation.

Backup job failure.
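The three tiers above boil down to a small routing table; the channel names and deadlines here are illustrative stand‑ins for whatever media types you configure in Zabbix:

```python
# Map alert severity to notification channels and response deadlines,
# mirroring the P0/P1/P2 scheme above (names are illustrative).

ROUTING = {
    "P0": {"channels": ["sms", "phone", "dingtalk"], "deadline_min": 0},
    "P1": {"channels": ["sms"], "deadline_min": 60},
    "P2": {"channels": ["email"], "deadline_min": None},  # work hours only
}

def route_alert(severity: str) -> list:
    """Return the notification channels for a given severity level."""
    return ROUTING[severity]["channels"]

print(route_alert("P0"))  # ['sms', 'phone', 'dingtalk']
```

Keeping the table in one place makes it easy to audit who gets woken up and why.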

Key technique: alert deduplication

# Use Zabbix Trigger Dependencies
Master trigger: host unreachable
Dependent triggers: all other alerts from that host
# Logic: if the host is down, suppress CPU high alerts
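The suppression logic behind trigger dependencies can be sketched in a few lines: keep the master "host unreachable" alert and drop every other alert coming from a host that is already known to be down:

```python
# Suppress dependent alerts when the host-unreachable master trigger fires,
# mimicking Zabbix trigger dependencies.

def deduplicate(alerts, down_hosts):
    """Keep host-unreachable alerts; drop all other alerts from down hosts."""
    return [
        a for a in alerts
        if a["name"] == "host unreachable" or a["host"] not in down_hosts
    ]

alerts = [
    {"host": "db01", "name": "host unreachable"},
    {"host": "db01", "name": "CPU high"},    # suppressed: db01 is down
    {"host": "web01", "name": "CPU high"},   # kept: web01 is up
]
print(deduplicate(alerts, down_hosts={"db01"}))
```

On a night when a whole rack loses power, this is the difference between 3 pages and 300.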

V. Best Practices Summarized from Pitfalls

1. Monitor Zabbix Itself (Avoid Single Point of Failure)

Deploy a separate server to monitor the Zabbix server, or use a cloud monitoring service as a fallback, and regularly check Zabbix database size.

2. Historical Data Retention Strategy

Zabbix can consume disk space quickly; set sensible retention periods.

-- Retention policy
Trend data (1-hour averages): keep 2 years
History data (raw values):    keep 90 days
Event data (alerts):          keep 1 year

-- Apply to all items (or set globally under Administration -> Housekeeping)
UPDATE items SET history='90d', trends='730d';

3. Template‑Based Management

Do not configure each host individually; create templates per service type and assign them.

Template hierarchy:
├── Template_OS_Linux_Base (common Linux)
├── Template_App_Nginx (web server)
├── Template_App_MySQL (database server)
└── Template_Business_OrderService (business system)

Host association: link a web host with "OS_Linux" + "Nginx" + "OrderService" templates
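That association can also be scripted via the API's `host.massadd` method, which links several templates to a host in one call. A sketch of the payload, with placeholder host and template IDs:

```python
# JSON-RPC payload linking one web host to three templates at once
# via host.massadd (host and template IDs are placeholders).

def link_templates(hostid, templateids):
    return {
        "jsonrpc": "2.0",
        "method": "host.massadd",
        "params": {
            "hosts": [{"hostid": hostid}],
            "templates": [{"templateid": t} for t in templateids],
        },
        "auth": "YOUR_API_TOKEN",
        "id": 1,
    }

# OS_Linux + Nginx + OrderService, as in the example above
payload = link_templates("10084", ["10001", "10047", "10099"])
print(len(payload["params"]["templates"]))
```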

4. Automated Discovery vs Manual Registration

Small scale (<100 hosts): manual registration for fine‑grained control.

Medium (100‑500 hosts): network discovery + auto‑registration.

Large (>500 hosts): integrate with CMDB for dynamic sync.

# Host creation via the Zabbix API (true auto-registration instead uses
# HostMetadata in zabbix_agentd.conf plus an auto-registration action)
curl -X POST http://zabbix-server/api_jsonrpc.php \
  -H "Content-Type: application/json-rpc" \
  -d '{
    "jsonrpc": "2.0",
    "method": "host.create",
    "params": {
      "host": "'"$(hostname)"'",
      "groups": [{"groupid": "2"}],
      "templates": [{"templateid": "10001"}],
      "tags": [
        {"tag": "env", "value": "'"$ENV"'"},
        {"tag": "region", "value": "'"$REGION"'"}
      ]
    },
    "auth": "YOUR_API_TOKEN",
    "id": 1
  }'

VI. Advanced Techniques: Making Zabbix Smarter

Intelligent Baseline Alerts

Traditional alert: CPU > 80 %. Intelligent alert: CPU exceeds the 7‑day same‑time average by 30 %.

// Predict that free disk space will reach zero within 1 hour
timeleft(/Linux-Server/vfs.fs.size[/,free],1h,0) < 3600
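The same-time baseline comparison can be sketched outside Zabbix as well: average the same hour over the past 7 days and alert only when the current value deviates by more than 30 %. The sample numbers are illustrative:

```python
from statistics import mean

# Alert when the current value exceeds the 7-day same-hour average by 30%.

def baseline_exceeded(current, same_hour_history, tolerance=0.30):
    """same_hour_history: one sample per day for this hour, last 7 days."""
    baseline = mean(same_hour_history)
    return current > baseline * (1 + tolerance)

history = [40, 42, 38, 41, 39, 40, 40]   # ~40% CPU at this hour, last week
print(baseline_exceeded(55, history))    # True: 55 > 40 * 1.3 = 52
print(baseline_exceeded(50, history))    # False: within the tolerance band
```

A static 80 % threshold would have ignored both cases; the baseline catches the genuinely abnormal one.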

Integration with ITSM Systems

When a P0 incident occurs, automatically create a ticket in Jira or ZenTao.

# Zabbix Media Type script (Python)
import requests

def get_oncall_engineer():
    # Placeholder: look up the current on-call engineer from your rota
    return "oncall@company.com"

def create_ticket(alert_message):
    jira_api = "https://jira.company.com/api/ticket"
    ticket_data = {
        "title": f"[P0] {alert_message}",
        "assignee": get_oncall_engineer(),
        "labels": ["production", "urgent"]
    }
    resp = requests.post(jira_api, json=ticket_data, timeout=10)
    resp.raise_for_status()  # surface API failures instead of losing tickets

Capacity Planning

Use historical trends to forecast resource needs, e.g., disk growth predicts expansion in 45 days, DB connection growth signals pool scaling before the next promotion.
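The forecast behind "expansion in 45 days" is just a linear fit over daily usage samples. A minimal sketch with pure-stdlib least squares; the usage numbers are illustrative:

```python
# Fit a line through daily disk-usage samples and estimate days until 100%.
# Pure-stdlib least squares; the usage numbers are illustrative.

def days_until_full(daily_usage_pct, capacity=100.0):
    n = len(daily_usage_pct)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(daily_usage_pct) / n
    slope = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, daily_usage_pct))
             / sum((x - x_mean) ** 2 for x in xs))
    if slope <= 0:
        return float("inf")          # usage is flat or shrinking
    return (capacity - daily_usage_pct[-1]) / slope

usage = [60.0, 60.5, 61.0, 61.5, 62.0]   # growing 0.5% per day
print(round(days_until_full(usage)))     # 76 days at this growth rate
```

Fed with a longer window of Zabbix trend data, the same arithmetic drives "buy disks before the next promotion" decisions.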

VII. The Future of Monitoring: AIOps Direction

Automated anomaly detection : machine‑learning models reduce false positives.

Root‑cause analysis : automatically trace dependency chains during failures.

Self‑healing : combine Ansible/SaltStack to auto‑remediate issues.

Cloud‑native integration : monitor Kubernetes, service mesh, serverless functions.

Example: when Nginx process dies, an Ansible playbook restarts it; only if the restart fails is a human notified.

# Zabbix + Ansible linkage
- name: Auto Restart Nginx
  hosts: webservers
  tasks:
    - name: Check Nginx status
      service_facts:
    - name: Restart if down
      # On systemd hosts the key is 'nginx.service'; on sysvinit it may be 'nginx'
      when: ansible_facts.services['nginx.service'].state != 'running'
      service:
        name: nginx
        state: restarted

VIII. Final Thoughts

Building a monitoring system is not the goal; enabling the team to sleep peacefully is.

After two years, the system averages fewer than two P0 alerts per month, keeps false‑positive rate under 5 %, reduces MTTD from 30 minutes to 2 minutes, and MTTR from 1 hour to 15 minutes.

Actionable advice:

Start small—focus on core business metrics before expanding.

Continuously improve—after each incident, ask whether monitoring could have detected it earlier.

Document—maintain runbooks for each alert so newcomers can respond quickly.

Tags: automation, alerting, AIOps, ITSM, Zabbix
Written by Ops Community

A leading IT operations community where professionals share and grow together.
