Master Linux Host Monitoring: Prometheus, Node Exporter, Thresholds & Scripts
This comprehensive guide walks you through building a robust Linux host monitoring system with Prometheus and node_exporter, covering CPU, memory, disk, and network metrics, practical threshold formulas, ready‑to‑run Bash scripts, Alertmanager rules, Grafana dashboards, and best‑practice recommendations for reliable operations.
Background
Monitoring is the eyes of operations. Without it, engineers react only after failures occur. Effective monitoring requires selecting the right metrics, setting sensible thresholds, and avoiding alert fatigue caused by noisy or missing data.
Prerequisites
Familiarity with Linux commands (top, free, df, iostat)
Understanding of the /proc filesystem
Basic knowledge of systemd
Monitoring Architecture
[Monitored Host]                 [Collection Layer]          [Storage/Visualization Layer]
Base host metrics
(CPU/Memory/Disk/Network)   →    node_exporter (9100)    →   Prometheus
Container metrics           →    cAdvisor (8080)         →   Alertmanager
Kubernetes object state     →    kube-state-metrics      →   Grafana
Database metrics            →    MySQL exporter (9104)
Cache metrics               →    Redis exporter (9121)
The core exporter for Linux hosts is node_exporter, which gathers CPU, memory, disk, network, filesystem, and load metrics.
1. Install and Configure node_exporter
#!/bin/bash
EXPORTER_VERSION="1.8.2"
DOWNLOAD_URL="https://github.com/prometheus/node_exporter/releases/download/v${EXPORTER_VERSION}/node_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz"
cd /tmp
curl -LO "$DOWNLOAD_URL"
tar xzf node_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/node_exporter
sudo useradd -rs /bin/false node_exporter || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
--collector.cpu \
--collector.meminfo \
--collector.diskstats \
--collector.filesystem \
--collector.netdev \
--collector.loadavg \
--collector.pressure \
--web.listen-address=:9100 \
--web.disable-exporter-metrics
Restart=always
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
echo "node_exporter started, listening on port 9100"1.1 Quick Metric Reference
# View exported metrics locally
curl -s http://localhost:9100/metrics | grep -E "^(node_cpu|node_memory|node_disk|node_network|node_filesystem)" | head -50
# Common metric names
# node_cpu_seconds_total # CPU time
# node_memory_MemTotal_bytes # Total memory
# node_memory_MemAvailable_bytes # Available memory
# node_disk_read_bytes_total # Disk read bytes
# node_network_receive_bytes_total # Network receive bytes
# node_load1 # 1‑minute load average
2. CPU Monitoring
2.1 CPU Utilization
CPU usage is the most direct performance indicator.
# Average CPU usage across all cores (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Single‑core usage
100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)
Shell helpers:
# top command (quick view)
top -bn1 | head -20
# mpstat (requires sysstat)
mpstat -P ALL 1 5
# Simple Bash script to compute usage
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}' | sed 's/%id,//')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc -l)
echo "CPU usage: $CPU_USED%"2.2 Load Average
Load average reflects the number of runnable or running processes. Compare it with the number of CPU cores.
# Load values
node_load1 # 1‑minute
node_load5 # 5‑minute
node_load15 # 15‑minute
Load < core count → CPU has headroom
Load ≈ core count → CPU is saturated
Load > core count → CPU overload, processes are waiting
# Normalized load (percentage of cores)
node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) * 100
Threshold formulas (example for a host with N cores; a shell sketch applying them follows):
# Reasonable upper bound = N * 0.7 # Daily operation
# Warning threshold = N * 1.0 # Performance pressure
# Critical threshold = N * 1.5 # Immediate action required
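A small shell sketch that applies these formulas on the host itself (the warning/critical multipliers and output format are illustrative):
# Derive load thresholds from the core count (illustrative multipliers)
CORES=$(nproc)
LOAD1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$LOAD1" -v c="$CORES" 'BEGIN{
    warn = c * 1.0; crit = c * 1.5
    if (l > crit)      printf "[CRITICAL] load %.2f exceeds %.1f (1.5 x %d cores)\n", l, crit, c
    else if (l > warn) printf "[WARNING] load %.2f exceeds %.1f (%d cores)\n", l, warn, c
    else               printf "[OK] load %.2f within %d cores\n", l, c
}'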
2.3 CPU Context Switches
High context‑switch rates consume CPU cycles and reduce efficiency.
# View context switches per second
vmstat 1 5
# Detailed stats
cat /proc/stat | grep -E "^(ctxt|intr)"
# PromQL rate
rate(node_context_switches_total[5m])
Common causes of a high context‑switch rate:
Too many processes/threads
I/O‑bound workloads causing frequent blocking
Lock contention
Improper scheduler parameters
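If the rate looks suspicious, pidstat (from the sysstat package) can attribute switches to individual processes:
# Per-process context switches (requires sysstat)
# cswch/s   – voluntary switches (process blocked on I/O or locks)
# nvcswch/s – involuntary switches (preempted; points at CPU contention)
pidstat -w 1 5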
2.4 CPU Alert Rules
# alerts_cpu.yml
groups:
- name: "CPU Alerts"
rules:
- alert: HostHighCpuLoad
expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Host CPU usage high"
description: "Instance {{ $labels.instance }} CPU usage {{ $value | printf \"%.2f\" }}% for >10m"
- alert: HostCriticalCpuLoad
expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Host CPU usage critical"
description: "Instance {{ $labels.instance }} CPU usage {{ $value | printf \"%.2f\" }}%"3. Memory Monitoring
3.1 Memory Utilization
Linux counts most of the page cache and buffers as reclaimable, so "available" memory is roughly free + cache + buffers. Use MemAvailable, the kernel's own estimate, for a realistic pressure metric.
# free -m example output
              total        used        free      shared  buff/cache   available
Mem:          32044       25631        6413          56        8454       12234
Swap:          8191         127        8064
# Usage formulas
memory_usage = (total - available) / total * 100
# Alternative (less accurate)
memory_usage = used / total * 100
# PromQL for usage based on available
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
# Detailed calculation using multiple counters
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
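A shell equivalent of the MemAvailable formula is handy for quick checks; a small sketch (field 7 of free is the available column on recent procps versions):
# Memory usage based on MemAvailable (mirrors the PromQL above)
free | awk '/^Mem:/{printf "Memory usage: %.1f%%\n", ($2-$7)/$2*100}'
# Or read /proc/meminfo directly
awk '/MemTotal/{t=$2} /MemAvailable/{a=$2} END{printf "Memory usage: %.1f%%\n", (t-a)/t*100}' /proc/meminfo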
3.2 Swap Usage
# Show swap usage
free -m
swapon -s
vmstat 1 5 # si/so columns show swap in/out per second
# Identify processes using swap
for f in /proc/*/status; do awk '/VmSwap/{s=$2}/Name/{n=$2}END{if(s>0)print n,s}' $f 2>/dev/null; done | sort -k2 -rn | head -10
# PromQL for swap usage percentage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
# Alert when free swap < 1 GB
node_memory_SwapFree_bytes < 1024*1024*1024
3.3 Memory Alert Rules
# alerts_memory.yml
groups:
- name: "Memory Alerts"
rules:
- alert: HostHighMemoryUsage
expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
for: 15m
labels:
severity: warning
annotations:
summary: "Host memory usage high"
description: "Instance {{ $labels.instance }} memory usage {{ $value | printf \"%.2f\" }}%"
- alert: HostCriticalMemoryUsage
expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Host memory usage critical"
description: "Instance {{ $labels.instance }} memory usage {{ $value | printf \"%.2f\" }}%"
- alert: HostHighSwapUsage
expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
for: 10m
labels:
severity: warning
annotations:
summary: "Host swap usage high"
description: "Instance {{ $labels.instance }} swap usage {{ $value | printf \"%.2f\" }}%"
- alert: HostSwapSpaceExhausted
expr: node_memory_SwapFree_bytes < 104857600 # 100 MB
for: 1m
labels:
severity: critical
annotations:
summary: "Host swap space exhausted"
description: "Instance {{ $labels.instance }} free swap < 100 MB"3.4 Memory Leak Detection Script
#!/bin/bash
# detect_memory_leak.sh
INTERVAL=60 # seconds between samples
SAMPLES=10 # number of samples
OUTPUT="/tmp/memory_leak_detection.txt"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
get_mem_usage(){ free | awk '/^Mem:/{print $3/$2 * 100}' ; }
log "Starting memory leak detection..."
echo "timestamp,usage%" > "$OUTPUT"
for i in $(seq 1 $SAMPLES); do
TS=$(date '+%Y-%m-%d %H:%M:%S')
USAGE=$(get_mem_usage)
echo "$TS,$USAGE" >> "$OUTPUT"
log "Sample $i/$SAMPLES: $USAGE%"
sleep $INTERVAL
done
FIRST=$(tail -n +2 "$OUTPUT" | head -1 | cut -d',' -f2)
LAST=$(tail -1 "$OUTPUT" | cut -d',' -f2)
GROWTH=$(echo "$LAST - $FIRST" | bc -l)
log "Memory usage change: $FIRST% → $LAST% (growth $GROWTH%)"
if (( $(echo "$GROWTH > 10" | bc -l) )); then
log "[WARNING] Memory usage continuously increasing, possible leak"
else
log "Memory usage change within normal range"
fi
log "Detailed data saved to $OUTPUT"4. Disk Monitoring
4. Disk Monitoring
4.1 Disk Capacity
# Show disk usage
df -h
# Show inode usage (often overlooked)
df -i
# Filter out temporary filesystems
df -h --output=source,size,used,avail,pcent,target -x tmpfs -x devtmpfs -x overlay -x shm
Sample Bash script to warn/critical based on usage thresholds:
#!/bin/bash
WARN_THRESHOLD=80
CRIT_THRESHOLD=90
while read LINE; do
USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo "$LINE" | awk '{print $6}')
DEVICE=$(echo "$LINE" | awk '{print $1}')
if [ "$USAGE" -ge $CRIT_THRESHOLD ]; then
echo "[CRITICAL] $DEVICE ($MOUNT) usage $USAGE%"
elif [ "$USAGE" -ge $WARN_THRESHOLD ]; then
echo "[WARNING] $DEVICE ($MOUNT) usage $USAGE%"
else
echo "[OK] $DEVICE ($MOUNT) usage $USAGE%"
fi
done < <(df -h | grep -vE "tmpfs|devtmpfs|overlay|shm" | tail -n +2)
Inode exhaustion can block file creation even when space is free.
# PromQL for low inode availability (exclude pseudo filesystems)
(node_filesystem_files_free{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_files{fstype!~"tmpfs|fuse.lxcfs"}) * 100 < 10
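When the inode alert fires, the next step is finding where the files are; a rough sketch that ranks top-level directories by entry count (can take a while on large filesystems):
# Count entries per top-level directory on the root filesystem (slow on big trees)
for d in /*; do
    [ -d "$d" ] || continue
    printf "%s\t%s\n" "$(find "$d" -xdev 2>/dev/null | wc -l)" "$d"
done | sort -rn | head -10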
4.2 Disk I/O Monitoring
# iostat (requires sysstat)
iostat -x 1 5
# Columns of interest
# %util – device utilization (higher = busier)
# r/s, w/s – reads/writes per second
# rKB/s, wKB/s – throughput
# await – average I/O latency (ms)
# avgqu-sz – average queue length
# Per‑process I/O
iotop -ao
pidstat -d 1 5
# Low‑level stats
cat /proc/diskstats
# PromQL examples for disk I/O
rate(node_disk_io_time_seconds_total[5m]) * 100 # I/O utilization %
rate(node_disk_io_time_weighted_seconds_total[5m]) # Avg queue size (aqu-sz)
rate(node_disk_reads_completed_total[5m]) # Reads per second
rate(node_disk_writes_completed_total[5m]) # Writes per second
rate(node_disk_read_bytes_total[5m]) / 1024 # Read KB/s
rate(node_disk_written_bytes_total[5m]) / 1024 # Write KB/s
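The weighted I/O time above approximates queue depth rather than latency; per-operation latency can be derived from the time and completion counters, for example:
# Average read latency in ms (returns NaN when there were no reads in the window)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000
# Average write latency in ms
rate(node_disk_write_time_seconds_total[5m]) / rate(node_disk_writes_completed_total[5m]) * 1000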
4.3 Disk I/O Script
#!/bin/bash
INTERVAL=5
COUNT=6
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
log "===== Disk I/O Monitoring ====="
log "Sampling $INTERVAL sec × $COUNT times"
iostat -x $INTERVAL $COUNT | tail -n +7
log "Top devices by %util"
iostat -x $INTERVAL $COUNT | grep -E "^(Device|%util)" | paste - - | awk '{print $2, $3}' | sort -k2 -rn | head -5
log "Throughput ranking"
for dev in $(lsblk -nd --output NAME | grep -E "sd|nvme|vd"); do
READ=$(cat /sys/block/$dev/stat | awk '{print $6}')
WRITE=$(cat /sys/block/$dev/stat | awk '{print $10}')
echo "$dev: read sectors=$READ, write sectors=$WRITE"
done | sort -k3 -rn | head -54.4 Disk Alert Rules
# alerts_disk.yml
groups:
- name: "Disk Alerts"
rules:
- alert: HostDiskSpaceWarning
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Disk usage high"
description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
- alert: HostDiskSpaceCritical
expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Disk space exhausted"
description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
- alert: HostDiskInodesWarning
expr: (1 - node_filesystem_files_free{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_files{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 80
for: 30m
labels:
severity: warning
annotations:
summary: "Inode usage high"
description: "Mount {{ $labels.mountpoint }} inode usage {{ $value | printf \"%.1f\" }}%"
- alert: HostHighDiskIO
expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 70
for: 10m
labels:
severity: warning
annotations:
summary: "Disk I/O busy"
description: "Device {{ $labels.device }} utilization {{ $value | printf \"%.1f\" }}%"
- alert: HostHighIOwait
expr: (rate(node_cpu_seconds_total{mode="iowait"}[5m]) * 100) > 30
for: 15m
labels:
severity: warning
annotations:
summary: "CPU I/O wait high"
description: "Instance {{ $labels.instance }} I/O wait {{ $value | printf \"%.1f\" }}%"5. Network Monitoring
5.1 Bandwidth Utilization
# Show NIC speed (requires ethtool)
ethtool eth0 # Speed: 1000Mb/s, Duplex: Full
# Real‑time throughput
watch -n 1 "cat /proc/net/dev | grep -E 'eth0|ens33'"
# Or use ifstat
ifstat 1 5
# PromQL for bandwidth usage (assume 1 Gbps = 125 000 000 B/s)
rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000 * 100
rate(node_network_transmit_bytes_total{device!="lo"}[5m]) / 125000000 * 100
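Hard-coding 1 Gbps breaks on mixed fleets; where the netclass collector is enabled, node_network_speed_bytes reports the negotiated link speed and can replace the constant (virtual NICs may report a bogus or missing speed):
# Receive utilization relative to the NIC's reported link speed
rate(node_network_receive_bytes_total{device!="lo"}[5m]) / node_network_speed_bytes{device!="lo"} * 100
# Transmit utilization
rate(node_network_transmit_bytes_total{device!="lo"}[5m]) / node_network_speed_bytes{device!="lo"} * 100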
5.2 Packet Loss and Errors
# NIC error counters
ip -s link show eth0
# TCP retransmission stats
netstat -s | grep -i retransmit
sar -n TCP 1 5
# SNMP counters for drops
cat /proc/net/snmp | grep -E "Ip:|Tcp:|Udp:"
# PromQL for receive error rate
rate(node_network_receive_errs_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m]) * 100
# Transmit error rate
rate(node_network_transmit_errs_total{device!="lo"}[5m]) / rate(node_network_transmit_packets_total{device!="lo"}[5m]) * 100
Typical packet‑loss severity levels:
Normal : < 0.1 % – usual jitter
Acceptable : 0.1 % – 1 % – minor congestion
Warning : 1 % – 5 % – potential performance impact
Critical : > 5 % – severe network issue
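The counters above only show NIC-level drops; end-to-end loss to a specific peer is easier to check with ping. A minimal sketch (target and threshold are examples):
#!/bin/bash
# check_packet_loss.sh – measure loss to a target host (example target and threshold)
TARGET=${1:-192.168.1.1}
COUNT=20
LOSS=$(ping -c "$COUNT" -q "$TARGET" | awk -F',' '/packet loss/ {
    for (i = 1; i <= NF; i++) if ($i ~ /packet loss/) { gsub(/[^0-9.]/, "", $i); print $i }
}')
echo "Packet loss to $TARGET: ${LOSS}%"
if awk -v l="$LOSS" 'BEGIN{exit !(l > 1)}'; then
    echo "[WARNING] packet loss above 1%"
fi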
5.3 TCP Connection State
# Distribution of TCP states (ss is preferred)
ss -antH | awk '{print $1}' | sort | uniq -c | sort -rn
# Meaning of common states
# LISTEN – listening sockets
# ESTABLISHED – active connections
# SYN_SENT – client sent SYN, waiting for ACK
# SYN_RECV – server received SYN, waiting for ACK
# FIN_WAIT1/2 – closing handshake
# TIME_WAIT – waiting for delayed packets to expire
# CLOSE – closed
# CLOSE_WAIT – received FIN, waiting for app close
# LAST_ACK – final ACK pending
5.4 Network Alert Rules
# alerts_network.yml
groups:
- name: "Network Alerts"
rules:
- alert: HostHighNetworkBandwidthUsage
expr: (rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000) * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Network bandwidth usage high"
description: "Device {{ $labels.device }} receive utilization {{ $value | printf \"%.1f\" }}%"
- alert: HostCriticalNetworkBandwidthUsage
expr: (rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000) * 100 > 95
for: 5m
labels:
severity: critical
annotations:
summary: "Network bandwidth approaching saturation"
description: "Device {{ $labels.device }} receive utilization {{ $value | printf \"%.1f\" }}%"
- alert: HostHighNetworkPacketLoss
expr: (rate(node_network_receive_drop_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m])) * 100 > 1
for: 5m
labels:
severity: warning
annotations:
summary: "Network packet loss high"
description: "Device {{ $labels.device }} loss rate {{ $value | printf \"%.2f\" }}%"
- alert: HostHighNetworkErrorRate
expr: (rate(node_network_receive_errs_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m])) * 100 > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "Network error rate high"
description: "Device {{ $labels.device }} error rate {{ $value | printf \"%.3f\" }}%"
- alert: HostHighTimeWaitConnections
expr: node_sockstat_TCP_tw > 50000
for: 10m
labels:
severity: warning
annotations:
summary: "Excessive TIME_WAIT connections"
description: "Instance {{ $labels.instance }} TIME_WAIT count {{ $value }} may exhaust ports"
- alert: HostConntrackTableNearLimit
expr: node_nf_conntrack_entries / node_nf_conntrack_entries_limit * 100 > 80
for: 10m
labels:
severity: warning
annotations:
summary: "Connection tracking table near limit"
description: "Instance {{ $labels.instance }} conntrack usage {{ $value | printf \"%.1f\" }}%"
6. Comprehensive Health‑Check Script
#!/bin/bash
# host_health_check.sh – unified health report
set -euo pipefail
ALERT_MODE=${1:-summary}
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_ok(){ echo -e "${GREEN}[OK]${NC} $1"; }
log_warn(){ echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error(){ echo -e "${RED}[ERROR]${NC} $1"; }
check_cpu(){
echo "=== CPU Check ==="
CPU_IDLE=$(vmstat 1 2 | tail -1 | awk '{print $15}')
CPU_USAGE=$((100 - CPU_IDLE))
if [ $CPU_USAGE -gt 95 ]; then log_error "CPU usage $CPU_USAGE% (critical)";
elif [ $CPU_USAGE -gt 80 ]; then log_warn "CPU usage $CPU_USAGE% (warning)";
else log_ok "CPU usage $CPU_USAGE% (normal)"; fi
LOAD=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
LOAD_PCT=$(awk -v l="$LOAD" -v c="$CORES" 'BEGIN{printf "%d", (l*100)/c}')  # bash arithmetic cannot handle the float load value
if [ $LOAD_PCT -gt 150 ]; then log_error "Load $LOAD (>1.5× cores)";
elif [ $LOAD_PCT -gt 100 ]; then log_warn "Load $LOAD (> cores)";
else log_ok "Load $LOAD (normal)"; fi
}
check_memory(){
echo "=== Memory Check ==="
TOTAL=$(free -m | awk '/^Mem:/{print $2}')
AVAILABLE=$(free -m | awk '/^Mem:/{print $7}')
USAGE_PCT=$(( (TOTAL - AVAILABLE) * 100 / TOTAL ))
if [ $USAGE_PCT -gt 95 ]; then log_error "Memory usage $USAGE_PCT% (critical)";
elif [ $USAGE_PCT -gt 85 ]; then log_warn "Memory usage $USAGE_PCT% (warning)";
else log_ok "Memory usage $USAGE_PCT% (normal)"; fi
SWAP_TOTAL=$(free -m | awk '/^Swap:/{print $2}')
SWAP_USED=$(free -m | awk '/^Swap:/{print $3}')
if [ $SWAP_TOTAL -gt 0 ]; then
SWAP_PCT=$((SWAP_USED * 100 / SWAP_TOTAL))
if [ $SWAP_PCT -gt 50 ]; then log_warn "Swap usage $SWAP_PCT% (pressure)"; fi
fi
}
check_disk(){
echo "=== Disk Check ==="
df -h | grep -vE "tmpfs|devtmpfs|overlay|shm" | tail -n +2 | while read LINE; do
USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo "$LINE" | awk '{print $6}')
if [ $USAGE -ge 95 ]; then log_error "$MOUNT: $USAGE% (critical)";
elif [ $USAGE -ge 85 ]; then log_warn "$MOUNT: $USAGE% (warning)";
else log_ok "$MOUNT: $USAGE% (normal)"; fi
done
# inode check
df -i | grep -vE "tmpfs|devtmpfs" | tail -n +2 | while read LINE; do
USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
MOUNT=$(echo "$LINE" | awk '{print $6}')
if [ $USAGE -ge 90 ]; then log_warn "$MOUNT inode usage $USAGE% (warning)"; fi
done
}
check_network(){
echo "=== Network Check ==="
ERRORS=$(cat /sys/class/net/*/statistics/rx_errors /sys/class/net/*/statistics/tx_errors 2>/dev/null | paste -sd+ - | bc 2>/dev/null || echo 0)
if [ $ERRORS -gt 100 ]; then log_warn "Network error packets $ERRORS (possible issue)"; else log_ok "Network error packets $ERRORS (normal)"; fi
TW_COUNT=$(ss -antH state time-wait | wc -l)
CW_COUNT=$(ss -antH state close-wait | wc -l)
SYN_COUNT=$(ss -antH state syn-recv | wc -l)
[ $TW_COUNT -gt 50000 ] && log_warn "TIME_WAIT $TW_COUNT (excessive)" || true
[ $CW_COUNT -gt 100 ] && log_warn "CLOSE_WAIT $CW_COUNT (excessive)" || true
[ $SYN_COUNT -gt 1000 ] && log_error "SYN_RECV $SYN_COUNT (possible SYN flood)" || true
}
check_processes(){
echo "=== Process Check ==="
ZOMBIE_COUNT=$(ps aux | awk '$8 ~ /Z/{print}' | wc -l)
[ $ZOMBIE_COUNT -gt 0 ] && log_warn "Zombie processes $ZOMBIE_COUNT" || log_ok "Zombie processes: 0"
OOM_COUNT=$(dmesg 2>/dev/null | grep -ci "invoked oom-killer" || true)
[ $OOM_COUNT -gt 0 ] && log_error "OOM killer triggered $OOM_COUNT times" || true
}
check_services(){
echo "=== Service Check ==="
for svc in mysql nginx redis php-fpm docker; do
if systemctl is-active --quiet $svc 2>/dev/null; then log_ok "$svc: running";
else STATUS=$(systemctl is-active $svc 2>/dev/null || echo unknown); log_warn "$svc: $STATUS"; fi
done
}
main(){
echo "========================================"
echo "Host Health Check Report"
echo "Hostname: $(hostname)"
echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
echo "========================================"
check_cpu
check_memory
check_disk
check_network
check_processes
check_services
echo "========================================"
echo "Check complete"
echo "========================================"
}
main "$@"7. Prometheus Auto‑Discovery Configuration
# prometheus-sd.yml – static scrape config for node_exporter
scrape_configs:
- job_name: "node_exporter"
static_configs:
- targets: ["localhost:9100"]
relabel_configs:
- source_labels: [__address__]
target_label: instance
regex: "(.*):.*"
replacement: "${1}"
# Example file_sd targets file for multiple hosts (saved as /etc/prometheus/targets/linux-hosts.yml)
- targets: ["192.168.1.10:9100", "192.168.1.11:9100"]
labels:
env: production
role: web
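To have Prometheus pick the targets file up automatically, a file_sd-based job can reference it; a minimal sketch matching the path above:
# prometheus.yml fragment using file-based service discovery
scrape_configs:
  - job_name: "linux-hosts"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/linux-hosts.yml
        refresh_interval: 5m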
8. Threshold Design Principles & Quick Reference
8.1 Design Principles
Business‑driven – thresholds must reflect real workload patterns, not generic defaults.
Safety buffers – alert before resources are exhausted (e.g., disk warning at 85 % instead of 95 %).
Multi‑level alerts – warning → critical → emergency to guide response priority.
Periodic review – revisit thresholds each quarter as traffic and architecture evolve.
8.2 Quick Reference
CPU usage : warning > 80 %, critical > 95 % (formula = 100 – idle %).
Load average : warning > 1 × cores, critical > 1.5 × cores (normalized load = load / cores).
Memory usage : warning > 85 %, critical > 95 % (formula = (total – available) / total).
Swap usage : warning > 50 %, critical > 80 % (used / total).
Disk usage : warning > 80 %, critical > 95 % (used / total).
Inode usage : warning > 80 %, critical > 90 % (used / total).
Disk I/O utilization : warning > 70 %, critical > 90 % (iostat %util).
I/O wait : warning > 20 %, critical > 50 % (iowait / total CPU).
Network bandwidth : warning > 70 %, critical > 90 % (throughput / link capacity).
Packet loss : warning > 1 %, critical > 5 % (dropped / total).
TCP TIME_WAIT : warning > 50 000, critical > 100 000 (absolute count).
TCP SYN_RECV : warning > 1 000, critical > 5 000 (absolute count).
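Several of these formulas are reused across alerts and dashboards; recording rules keep them consistent in one place. A sketch (file and rule names are placeholders):
# rules_host.yml (hypothetical file name)
groups:
  - name: "host-recording-rules"
    rules:
      - record: instance:node_cpu_usage:percent
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100
      - record: instance:node_memory_usage:percent
        expr: 100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)
      - record: instance:node_filesystem_usage:percent
        expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs"}) * 100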
8.3 Common Pitfalls
Monitoring only CPU and ignoring I/O – high CPU may be caused by I/O wait.
Treating high memory usage as a problem – Linux caches aggressively; use MemAvailable instead of MemFree.
Overlooking inode exhaustion – many small files can fill inodes before disk space runs out.
Ignoring packet loss when bandwidth looks fine – loss degrades application performance.
Setting static thresholds and never revisiting them – leads to alert fatigue during traffic spikes.
9. Monitoring Best Practices
9.1 Solving Alert Fatigue
Group similar alerts, silence maintenance windows, and adopt SLO‑based alerts.
# Alertmanager grouping example
route:
group_by: ['alertname','instance']
group_wait: 30s
group_interval: 5m
repeat_interval: 4h
receiver: 'team-notifications'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
group_wait: 10s
repeat_interval: 1h
# Silence during scheduled maintenance (example YAML)
- matchers:
- name: instance
value: "db-master-.*"
starts_at: "2026-04-15T02:00:00Z"
ends_at: "2026-04-15T06:00:00Z"
created_by: "admin"
comment: "Database maintenance window" # SLO‑driven error‑rate alert (instead of raw CPU)
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.001
for: 5m
labels:
severity: warning
annotations:
summary: "Service error rate high"
description: "Service {{ $labels.service }} error rate {{ $value | humanizePercentage }} exceeds 0.1 % SLO"9.2 Automated Response Workflow
Use Alertmanager webhooks to trigger custom actions (e.g., log cleanup, connection pool restart).
# Minimal Go webhook receiver (alert-receiver/main.go)
package main
import (
"encoding/json"
"io/ioutil"
"log"
"net/http"
)
type Alert struct {
Labels map[string]string `json:"labels"`
Annotations map[string]string `json:"annotations"`
Status string `json:"status"`
}
type Payload struct { Alerts []Alert `json:"alerts"` }
func handleAlert(w http.ResponseWriter, r *http.Request) {
body, _ := ioutil.ReadAll(r.Body)
defer r.Body.Close()
var p Payload
json.Unmarshal(body, &p)
for _, a := range p.Alerts {
if a.Status == "firing" {
log.Printf("Alert: %s – %s", a.Labels["alertname"], a.Annotations["summary"])
switch a.Labels["alertname"] {
case "HostHighDiskSpaceUsage":
go cleanupOldLogs(a.Labels["instance"])
case "MySQLConnectionRefused":
go restartConnectionPool(a.Labels["service"])
}
}
}
}
func cleanupOldLogs(instance string) { log.Printf("Cleaning old logs on %s", instance) }
func restartConnectionPool(service string) { log.Printf("Restarting connection pool for %s", service) }
func main() {
http.HandleFunc("/alerts", handleAlert)
log.Fatal(http.ListenAndServe(":8080", nil))
}
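For Alertmanager to call this receiver, it has to be registered as a webhook; a minimal config fragment (the URL assumes the Go service above listening on port 8080):
# alertmanager.yml fragment routing critical alerts to the webhook receiver
receivers:
  - name: "auto-remediation"
    webhook_configs:
      - url: "http://localhost:8080/alerts"
        send_resolved: true
route:
  routes:
    - match:
        severity: critical
      receiver: "auto-remediation"
      continue: true   # keep notifying the normal receivers as well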
9.3 Monitoring as Code
Store all Prometheus rules, scrape configs, Grafana dashboards, and scripts in a Git repository. Example layout:
monitoring-config/
├── prometheus/
│ ├── rules/
│ │ ├── cpu_alerts.yml
│ │ ├── memory_alerts.yml
│ │ ├── disk_alerts.yml
│ │ └── network_alerts.yml
│ ├── scrape/
│ │ └── node_exporter.yml
│ └── prometheus.yml
├── alertmanager/
│ └── alertmanager.yml
├── grafana/
│ └── dashboards/
│ ├── node-overview.json
│ └── service-health.json
└── Makefile

# Makefile
# Validate configs
validate-configs:
promtool check config prometheus/prometheus.yml
promtool check rules prometheus/rules/*.yml
amtool check-config alertmanager/alertmanager.yml
# Apply to Kubernetes
apply-configs:
kubectl apply -f prometheus/rules/
kubectl apply -f prometheus/scrape/
kubectl apply -f alertmanager/
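To catch broken rules before they are applied, the validate target can also run as a Git pre-commit hook; one possible sketch (assumes promtool and amtool are installed locally):
#!/bin/sh
# .git/hooks/pre-commit – block commits that fail config validation
set -e
make validate-configs
echo "Monitoring configs validated"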
9.4 Dashboard Design Principles
Place critical health status at the top, use color‑coded gauges, and keep charts focused on trends.
{
"dashboard": {
"panels": [
{
"title": "Service health (top)",
"type": "stat",
"targets": [{"expr": "up", "legendFormat": "{{instance}}"}],
"fieldConfig": {"defaults": {"mappings": [{"type": "value", "options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}}]}}
},
{
"title": "CPU usage",
"type": "gauge",
"targets": [{"expr": "100 - avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100"}],
"fieldConfig": {"defaults": {"thresholds": {"mode": "absolute", "steps": [{"value": 0, "color": "green"}, {"value": 70, "color": "yellow"}, {"value": 85, "color": "red"}]}}}
},
{
"title": "Memory usage",
"type": "gauge",
"targets": [{"expr": "100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)"}]
},
{
"title": "Network bandwidth",
"type": "graph",
"targets": [
{"expr": "rate(node_network_receive_bytes_total{device!='lo'}[5m]) / 1024", "legendFormat": "RX KB/s"},
{"expr": "rate(node_network_transmit_bytes_total{device!='lo'}[5m]) / 1024", "legendFormat": "TX KB/s"}
]
}
]
}
}
9.5 Production Inspection Script
#!/bin/bash
# prometheus_health_check.sh – verify Prometheus & Alertmanager health
set -euo pipefail
PROM_URL="http://localhost:9090"
AM_URL="http://localhost:9093"
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_ok(){ echo -e "${GREEN}[OK]${NC} $1"; }
log_warn(){ echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error(){ echo -e "${RED}[ERROR]${NC} $1"; }
echo "=== Prometheus Health Check ==="
# 1. Prometheus service
if curl -s "$PROM_URL/-/healthy" > /dev/null; then log_ok "Prometheus is healthy"; else log_error "Prometheus not responding"; fi
# 2. Target status
TARGET_DOWN=$(curl -s "$PROM_URL/api/v1/targets" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len([t for t in d['data']['activeTargets'] if t['health']!='up']))" 2>/dev/null || echo "?")
if [ "$TARGET_DOWN" = "0" ]; then log_ok "All targets up"; elif [ $TARGET_DOWN -lt 3 ]; then log_warn "$TARGET_DOWN targets down"; else log_error "$TARGET_DOWN targets down"; fi
# 3. Alertmanager service
if curl -s "$AM_URL/-/healthy" > /dev/null; then log_ok "Alertmanager is healthy"; else log_error "Alertmanager not responding"; fi
# 4. Alert statistics
ALERT_COUNT=$(curl -s "$PROM_URL/api/v1/alerts" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d['data']['alerts']))" 2>/dev/null || echo "?")
FIRING_COUNT=$(curl -s "$PROM_URL/api/v1/alerts" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len([a for a in d['data']['alerts'] if a['state']=='firing']))" 2>/dev/null || echo "?")
echo "Total alerts: $ALERT_COUNT"
echo "Firing alerts: $FIRING_COUNT"
if [ "$FIRING_COUNT" != "?" ] && [ "$FIRING_COUNT" -gt 0 ]; then log_warn "Active alerts present – investigate"; fi
# 5. TSDB size (series count)
TSDB_SIZE=$(curl -s "$PROM_URL/api/v1/status/tsdb" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data']['headStats']['numSeries'])" 2>/dev/null || echo "?")
echo "Current series count: $TSDB_SIZE"
echo "=== Check complete ==="10. Conclusion
Designing monitoring thresholds is an iterative process. Operations engineers should adopt a "monitor → analyze → tune → verify" loop, keep thresholds version‑controlled, and bind each alert to a concrete runbook. This approach reduces alert fatigue, ensures alerts are meaningful, and enables rapid response to real incidents.