
Build a Real-Time Linux Performance Alert System with Prometheus & Grafana

This guide walks you through designing a layered Linux monitoring architecture, selecting a Prometheus‑Grafana stack, defining key CPU, memory and disk metrics, crafting smart alert rules, visualizing dashboards, and adding automation and AI‑driven predictive techniques for reliable, business‑focused operations.

Raymond Ops

Why Traditional Monitoring Fails

Alert storm – a single incident generates hundreds of alerts, making prioritisation impossible.

False alarms – high CPU usage triggers alerts even when the business is healthy.

Monitoring lag – alerts arrive after users have already complained.

Single‑metric focus – only CPU and memory are watched, ignoring network I/O, disk latency and other critical indicators.

The root cause is a lack of systematic, business‑driven monitoring thinking.

Core Idea: Layered Monitoring Architecture

Three‑Layer Model

┌─────────────────┐
│  Business Layer │ <- response time, error rate, throughput
├─────────────────┤
│   Application   │ <- JVM, DB connection pool, cache hit rate
├─────────────────┤
│    System       │ <- CPU, memory, disk, network
└─────────────────┘

Key principle: derive monitoring metrics from user-experience goals rather than piling up raw system numbers.
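
For instance, a business-layer alert watches what users actually feel. A minimal sketch, assuming the application exposes a conventional http_requests_total counter (the metric name, labels and 1% threshold are illustrative, not from this article):

groups:
- name: business-alerts
  rules:
  - alert: HighErrorRate
    # Share of 5xx responses over all requests in the last 5 minutes
    expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.01
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "More than 1% of requests are failing"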

Practical Steps

Step 1 – Choose the Monitoring Stack

Recommended combination: Prometheus, Grafana, Alertmanager and node_exporter.

Prometheus uses a pull model, which fits cloud‑native environments.

It includes a built‑in time‑series database with high‑performance queries.

PromQL provides a powerful expression language for flexible alert definitions.
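
A minimal scrape configuration illustrates the pull model. This is a sketch assuming node_exporter runs on its default port 9100; the hostnames are placeholders:

# prometheus.yml (fragment)
scrape_configs:
  - job_name: node
    scrape_interval: 15s
    static_configs:
      - targets:
          - web-01:9100   # node_exporter default port
          - web-02:9100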

Step 2 – Define Critical Metrics

Advanced CPU Monitoring

# Examine each CPU state instead of total usage
# user   – user‑mode CPU
# system – kernel‑mode CPU
# iowait – time spent waiting for I/O (often a better bottleneck indicator)
# steal  – CPU time stolen by other VMs (relevant in virtualised environments)

# PromQL alert rule (trigger when non‑idle CPU > 80%)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80

Expert tip: a high iowait value is usually more concerning than overall CPU utilisation because it signals I/O pressure.
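
To act on that tip, you can alert on iowait directly. A sketch in the same rule format used below; the 30% threshold is an illustrative starting point, not a value from this article:

- alert: HighIOWait
  # Sustained iowait share per host signals storage pressure
  expr: avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 30
  for: 10m
  labels:
    severity: warning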

Memory Monitoring Pitfalls

# Incorrect metric: used / total
# Correct metric: (total - available) / total
# Linux uses free memory for cache, so a high "used" value can be misleading.
# "available" reflects truly free memory.

# PromQL rule (alert when actual usage > 90%)
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90

Disk Monitoring Core Metrics

# Basic disk usage percentage (pseudo-filesystems filtered out to avoid noise)
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100 > 85

# Disk saturation: fraction of each second the device spent doing I/O
# (this is utilisation, not per-operation latency)
rate(node_disk_io_time_seconds_total[5m]) > 0.5

# Average I/O queue depth (weighted time spent queued per second)
rate(node_disk_io_time_weighted_seconds_total[5m]) > 0.1

Practical insight: disk space exhaustion often causes more severe outages than CPU or memory issues, and recovery takes longer.

Step 3 – Smart Alert Rule Design

Use tiered severity to differentiate business impact.

groups:
- name: system-alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage too high"
      description: "Host {{ $labels.instance }} CPU usage {{ $value }}% exceeds threshold"
  - alert: CriticalCPUUsage
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "CPU usage critically high"
      description: "Host {{ $labels.instance }} CPU usage {{ $value }}% exceeds critical threshold"

Severity levels:

Warning – issue may exist but does not yet impact business.

Critical – immediate business impact, requires urgent handling.
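
Alertmanager can then route on the severity label so each level reaches the right channel. A minimal sketch; the receiver names and their notification settings are placeholders to fill in:

# alertmanager.yml (fragment)
route:
  receiver: team-chat          # default for everything else
  routes:
    - match:
        severity: critical
      receiver: oncall-pager   # page a human immediately
receivers:
  - name: team-chat
  - name: oncall-pager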

Step 4 – Build a Visualization Dashboard

Dashboard design principles:

5-Second Rule: key information must be visible within five seconds.

Color coding: green = normal, yellow = warning, red = critical.

Data density: display only essential metrics on a single screen; avoid pagination.

{
  "dashboard": {
    "title": "Linux System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [{"expr": "100 - (avg(irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"}],
        "thresholds": [
          {"color": "green", "value": 0},
          {"color": "yellow", "value": 70},
          {"color": "red", "value": 90}
        ]
      }
    ]
  }
}

Advanced Techniques – Predictive Monitoring

Trend Analysis & Capacity Planning

# Predict when disk space will run out (7‑day horizon)
predict_linear(node_filesystem_avail_bytes[1h], 7*24*3600) < 0

# Predict memory availability trend (4‑hour horizon)
predict_linear(node_memory_MemAvailable_bytes[2h], 4*3600) < 1024*1024*1024

Practical value: detect capacity problems 1–2 days early, allowing graceful scaling.
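
Because linear extrapolation is noisy, it helps to wrap these expressions in an alert rule with a generous for: duration so a brief dip does not page anyone. A sketch based on the disk prediction above; the 6-hour sample window and the fstype filter are adjustments of mine for a steadier trend estimate:

- alert: DiskWillFillIn7Days
  expr: predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 7*24*3600) < 0
  for: 1h            # require the trend to persist before alerting
  labels:
    severity: warning
  annotations:
    summary: "Filesystem on {{ $labels.instance }} projected to fill within 7 days"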

Anomaly Detection Algorithms

# Stddev-based anomaly detection for network traffic
# (ranging over a function result requires subquery syntax, e.g. [1h:5m])
abs(rate(node_network_receive_bytes_total[5m]) -
    avg_over_time(rate(node_network_receive_bytes_total[5m])[1h:5m])) >
    2 * stddev_over_time(rate(node_network_receive_bytes_total[5m])[1h:5m])
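
Nested subqueries like these are expensive to evaluate on every rule cycle. One option is to pre-record the inner rate as a recording rule and run the statistics over the stored series; the rule name follows Prometheus naming conventions but is my own choice:

groups:
- name: network-recording
  rules:
  - record: instance:network_receive_bytes:rate5m
    expr: rate(node_network_receive_bytes_total[5m])

# The anomaly check then needs no subqueries:
abs(instance:network_receive_bytes:rate5m
    - avg_over_time(instance:network_receive_bytes:rate5m[1h])) >
    2 * stddev_over_time(instance:network_receive_bytes:rate5m[1h])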

Automation Integration

Alert Webhook Script

#!/bin/bash
# Simplified handler: assumes a webhook receiver has already parsed the
# Alertmanager JSON payload and exported the alert name as $ALERT_NAME
case "$ALERT_NAME" in
  "HighDiskUsage")
    # Auto-clean logs older than 7 days
    find /var/log -name "*.log" -mtime +7 -delete
    ;;
  "HighMemoryUsage")
    # Restart the memory-leak-prone service
    systemctl restart high-memory-service
    ;;
esac
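
One caveat: Alertmanager does not set $ALERT_NAME itself. It POSTs a JSON payload to a webhook URL, and a small receiver service must parse .alerts[].labels.alertname before invoking the script above. The Alertmanager side is just a receiver definition; the URL is a placeholder:

# alertmanager.yml (fragment)
receivers:
  - name: auto-remediation
    webhook_configs:
      - url: "http://remediation-host:5001/hook"   # your webhook listener
        send_resolved: false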

CI/CD Pipeline Integration (GitLab example)

deploy_production:
  script:
    - deploy.sh
  after_script:
    - |
        # Ship the new rule file, then ask Prometheus to reload its config.
        # (The /-/reload endpoint requires --web.enable-lifecycle; Prometheus
        # has no API for POSTing rules directly.)
        scp monitoring-rules.yml prometheus:/etc/prometheus/rules/
        curl -X POST "http://prometheus:9090/-/reload"

Performance Optimisation Secrets

Monitoring System Optimisations

Data retention policy – retention is configured with Prometheus launch flags, not in prometheus.yml:

--storage.tsdb.retention.time=15d    # keep 15 days of data
--storage.tsdb.retention.size=100GB  # adjust according to disk capacity

Query optimisation – use recording rules to pre‑compute heavy queries:

groups:
- name: recording-rules
  rules:
  - record: instance:cpu_utilization:rate5m
    expr: 100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

Large‑Scale Environment Adaptation

Federated architecture: deploy per-cluster Prometheus instances and aggregate them with a top-level Prometheus (see the federation sketch after this list).

Service discovery: integrate with Consul, Kubernetes, etc., for dynamic target discovery.
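
A federation scrape job on the top-level Prometheus might look like the sketch below. The match[] selector pulls only pre-aggregated recording-rule series, and the cluster targets are placeholders:

# prometheus.yml on the global instance (fragment)
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{__name__=~"instance:.*"}'   # only pre-aggregated series
    static_configs:
      - targets:
          - prometheus-cluster-a:9090
          - prometheus-cluster-b:9090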

Failure Cases

Case 1 – Sudden CPU Spike

Symptoms: CPU jumps from 20% to 90%, yet top shows no obvious high-CPU process.

Investigation: iowait spikes; iotop pinpoints the offending process, and the root cause is traced to a disk fault.

Lesson: a single metric can be misleading; correlate across subsystems (see the rule sketch below).
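
PromQL's and operator makes that correlation explicit. The sketch below fires only when high CPU and high iowait coincide on the same host; the thresholds are illustrative:

- alert: CPUSpikeWithIOPressure
  expr: |
    (100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 80)
    and
    (avg by (instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20)
  for: 5m
  labels:
    severity: warning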

Case 2 – Hidden Memory Leak

Symptoms: the system slows after several days of uptime; a reboot restores performance.

Solution: monitor per-process memory trends.

# Alert when a process’s resident memory grows by more than 100 MiB in an hour
# (delta() rather than increase(), because resident memory is a gauge, not a counter)
delta(process_resident_memory_bytes[1h]) > 100*1024*1024

Best‑Practice Summary

The "Three No" Principles

Don’t over-monitor: too many metrics amount to no monitoring at all.

Don’t ignore basics: start with fundamental system metrics.

Don’t detach from business: metrics must serve business objectives.

Ops Team Collaboration Roles

System Engineer: infrastructure monitoring.

Application Engineer: application-level monitoring.

DBA: database-specific monitoring.

Network Engineer: network device monitoring.

Resources

Git repositories:

https://github.com/raymond999999

https://gitee.com/raymond9

Official Prometheus documentation and CNCF project pages provide further reference material.
