Build a Real-Time Linux Performance Alert System with Prometheus & Grafana
This guide walks you through designing a layered Linux monitoring architecture, selecting a Prometheus‑Grafana stack, defining key CPU, memory and disk metrics, crafting smart alert rules, visualizing dashboards, and adding automation and AI‑driven predictive techniques for reliable, business‑focused operations.
Why Traditional Monitoring Fails
Alert storm – a single incident generates hundreds of alerts, making prioritisation impossible.
False alarms – high CPU usage triggers alerts even when the business is healthy.
Monitoring lag – alerts arrive after users have already complained.
Single‑metric focus – only CPU and memory are watched, ignoring network I/O, disk latency and other critical indicators.
The root cause is a lack of systematic, business‑driven monitoring thinking.
Core Idea: Layered Monitoring Architecture
Three‑Layer Model
┌─────────────────┐
│ Business Layer │ <- response time, error rate, throughput
├─────────────────┤
│ Application │ <- JVM, DB connection pool, cache hit rate
├─────────────────┤
│ System │ <- CPU, memory, disk, network
└─────────────────┘
Key principle: derive monitoring metrics from user‑experience goals rather than piling up raw system numbers.
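For example, a business‑layer rule alerts on what users actually experience rather than on raw CPU. A minimal sketch, assuming the application exposes the common http_requests_total counter with a status label (the metric name and threshold are illustrative, not from this setup):
groups:
  - name: business-alerts
    rules:
      - alert: HighErrorRate
        # more than 5% of requests failing over the last 5 minutes
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "User-facing error rate above 5%"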
Practical Steps
Step 1 – Choose the Monitoring Stack
Recommended combination: Prometheus, Grafana, Alertmanager, and node_exporter.
Prometheus uses a pull model, which fits cloud‑native environments.
It includes a built‑in time‑series database with high‑performance queries.
PromQL provides a powerful expression language for flexible alert definitions.
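To make the pull model concrete, here is a minimal prometheus.yml sketch that scrapes node_exporter on two hosts (the target addresses are placeholders):
global:
  scrape_interval: 15s            # how often Prometheus pulls metrics

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - "host1:9100"          # node_exporter's default port
          - "host2:9100"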
Step 2 – Define Critical Metrics
Advanced CPU Monitoring
# Examine each CPU state instead of total usage
# user – user‑mode CPU
# system – kernel‑mode CPU
# iowait – time spent waiting for I/O (often a better bottleneck indicator)
# steal – CPU time stolen by other VMs (relevant in virtualised environments)
# PromQL alert rule (trigger when non‑idle CPU > 80%)
# aggregate per host so the alert carries the instance label
(100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
Expert tip: a high iowait value is usually more concerning than overall CPU utilisation because it signals I/O pressure.
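Acting on that tip, iowait can be alerted on directly. A minimal sketch of such a rule (the 30% threshold is an assumption to tune per workload):
groups:
  - name: io-alerts
    rules:
      - alert: HighIOWait
        expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 30
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }} spends {{ $value }}% of CPU time waiting on I/O"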
Memory Monitoring Pitfalls
# Incorrect metric: used / total
# Correct metric: (total - available) / total
# Linux uses free memory for cache, so a high "used" value can be misleading.
# "available" reflects truly free memory.
# PromQL rule (alert when actual usage > 90%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
Disk Monitoring Core Metrics
# Basic disk usage percentage
(1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 85
# Advanced: disk utilisation (fraction of each second the device spends on I/O)
rate(node_disk_io_time_seconds_total[5m]) > 0.5
# Expert‑level queue depth (weighted I/O time)
rate(node_disk_io_time_weighted_seconds_total[5m]) > 0.1
Practical insight: disk space exhaustion often causes more severe outages than CPU or memory issues and takes longer to recover.
Step 3 – Smart Alert Rule Design
Use tiered severity to differentiate business impact.
groups:
  - name: system-alerts
    rules:
      - alert: HighCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage too high"
          description: "Host {{ $labels.instance }} CPU usage {{ $value }}% exceeds threshold"
      - alert: CriticalCPUUsage
        expr: (100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 95
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage critically high"
          description: "Host {{ $labels.instance }} CPU usage {{ $value }}% exceeds critical threshold"
Severity levels:
Warning – issue may exist but does not yet impact business.
Critical – immediate business impact, requires urgent handling.
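Tiered severity only pays off when each level reaches a different channel. A minimal Alertmanager routing sketch (receiver names and webhook URLs are placeholders, not part of the original setup):
route:
  receiver: default
  routes:
    - match:
        severity: critical
      receiver: oncall-pager        # page someone immediately
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: team-chat           # lower-urgency channel
      repeat_interval: 4h

receivers:
  - name: default
    webhook_configs:
      - url: "http://hooks.example.internal/default"
  - name: oncall-pager
    webhook_configs:
      - url: "http://hooks.example.internal/pager"
  - name: team-chat
    webhook_configs:
      - url: "http://hooks.example.internal/chat"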
Step 4 – Build a Visualization Dashboard
Dashboard design principles:
5‑Second Rule: key information must be visible within five seconds.
Color coding: green = normal, yellow = warning, red = critical.
Data density: display only essential metrics on a single screen; avoid pagination.
{
  "dashboard": {
    "title": "Linux System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [{"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"}],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 70},
          {"color": "red", "value": 90}
        ]
      }
    ]
  }
}
Advanced Techniques – Predictive Monitoring
Trend Analysis & Capacity Planning
# Predict when disk space will run out (7‑day horizon)
predict_linear(node_filesystem_avail_bytes[1h], 7*24*3600) < 0
# Predict memory availability trend (4‑hour horizon)
predict_linear(node_memory_MemAvailable_bytes[2h], 4*3600) < 1024*1024*1024
Practical value: detect capacity problems 1‑2 days early, allowing graceful scaling.
Anomaly Detection Algorithms
# Stddev‑based anomaly detection for network traffic
# (the [1h:] subquery syntax is required to apply *_over_time to a rate)
abs(rate(node_network_receive_bytes_total[5m]) -
    avg_over_time(rate(node_network_receive_bytes_total[5m])[1h:])) >
  2 * stddev_over_time(rate(node_network_receive_bytes_total[5m])[1h:])
Automation Integration
Alert Webhook Script
#!/bin/bash
# Alertmanager webhooks POST a JSON payload; this sketch assumes a wrapper
# has already extracted the alert name, e.g.:
#   ALERT_NAME=$(jq -r '.alerts[0].labels.alertname' payload.json)
case "$ALERT_NAME" in
  "HighDiskUsage")
    # Auto-clean old logs
    find /var/log -name "*.log" -mtime +7 -delete
    ;;
  "HighMemoryUsage")
    # Restart the memory-leak-prone service
    systemctl restart high-memory-service
    ;;
esac
CI/CD Pipeline Integration (GitLab example)
deploy_production:
  script:
    - deploy.sh
  after_script:
    - |
      # Ship the updated alert rules and reload Prometheus.
      # Prometheus has no write API for rules (/api/v1/rules is read-only):
      # rules travel as files and are picked up via the lifecycle endpoint,
      # which requires the server to run with --web.enable-lifecycle.
      scp monitoring-rules.yml prometheus-host:/etc/prometheus/rules/
      curl -X POST "http://prometheus:9090/-/reload"
Performance Optimisation Secrets
Monitoring System Optimisations
Data retention policy (set via command-line flags at server start-up, not in prometheus.yml):
# keep 15 days of data; cap total TSDB size to fit the disk
prometheus \
  --storage.tsdb.retention.time=15d \
  --storage.tsdb.retention.size=100GB
Query optimisation – use recording rules to pre‑compute heavy queries:
- record: instance:cpu_utilization:rate5m
  expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
Large‑Scale Environment Adaptation
Federated architecture: deploy multiple Prometheus instances per cluster and aggregate them with a top‑level Prometheus (see the sketch after this list).
Service discovery: integrate with Consul, Kubernetes, etc., for dynamic target discovery.
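A minimal sketch of the federation job on the top‑level Prometheus, which scrapes each per‑cluster instance's standard /federate endpoint. The hostnames and the match[] selector are placeholders; for dynamic discovery, static_configs would be swapped for e.g. kubernetes_sd_configs or consul_sd_configs:
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      "match[]":
        - '{__name__=~"instance:.*"}'   # pull only pre-aggregated recording-rule series
    static_configs:
      - targets:
          - "prometheus-cluster-a:9090"
          - "prometheus-cluster-b:9090"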
Failure Cases
Case 1 – Sudden CPU Spike
Symptoms: CPU jumps from 20% to 90%.
top shows no obvious high‑CPU process.
iowait spikes – root cause traced to a disk fault.
iotop pinpoints the offending process.
Lesson: a single metric can be misleading; correlate across subsystems (a sketch of a correlated rule follows).
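One way to encode that lesson is a composite rule that fires only when high CPU and high iowait coincide on the same host; a minimal sketch with assumed thresholds:
groups:
  - name: correlated-alerts
    rules:
      - alert: CPUPressureWithIOWait
        expr: |
          (100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80)
          and
          (avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host {{ $labels.instance }}: high CPU coinciding with high iowait"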
Case 2 – Hidden Memory Leak
Symptoms: system slows after several days; a reboot restores performance.
Solution: monitor per‑process memory trends.
# Alert when a process's resident memory grows by more than 100 MiB in an hour.
# process_resident_memory_bytes is a gauge (exposed by Prometheus client libraries
# in instrumented services), so use delta() rather than increase(), which is for counters.
delta(process_resident_memory_bytes[1h]) > 100*1024*1024
Best‑Practice Summary
The "Three No" Principles
Don’t over‑monitor: too many metrics equal no monitoring.
Don’t ignore basics: start with fundamental system metrics.
Don’t detach from business: metrics must serve business objectives.
Ops Team Collaboration Roles
System Engineer: infrastructure monitoring.
Application Engineer: application‑level monitoring.
DBA: database‑specific monitoring.
Network Engineer: network device monitoring.
Resources
Git repositories:
https://github.com/raymond999999
https://gitee.com/raymond9
Official Prometheus documentation and CNCF project pages provide further reference material.