
Master Linux Host Monitoring: Prometheus, Node Exporter, Thresholds & Scripts

This comprehensive guide walks you through building a robust Linux host monitoring system with Prometheus and node_exporter, covering CPU, memory, disk, and network metrics, practical threshold formulas, ready‑to‑run Bash scripts, Alertmanager rules, Grafana dashboards, and best‑practice recommendations for reliable operations.


Background

Monitoring is the eyes of operations. Without it, engineers react only after failures occur. Effective monitoring requires selecting the right metrics, setting sensible thresholds, and avoiding alert fatigue caused by noisy or missing data.

Prerequisites

Familiarity with Linux commands (top, free, df, iostat)

Understanding of the /proc filesystem

Basic knowledge of systemd

Monitoring Architecture

[Monitored Host]             [Collection Layer]           [Storage/Visualization Layer]

CPU/Memory/Disk/Network  →   node_exporter (9100)     →   Prometheus
Base host metrics            cAdvisor (8080)              Alertmanager
                             kube-state-metrics           Grafana
                             MySQL exporter (9104)
                             Redis exporter (9121)

The core exporter for Linux hosts is node_exporter, which gathers CPU, memory, disk, network, filesystem, and load metrics.

1. Install and Configure node_exporter

#!/bin/bash
EXPORTER_VERSION="1.8.2"
DOWNLOAD_URL="https://github.com/prometheus/node_exporter/releases/download/v${EXPORTER_VERSION}/node_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz"
cd /tmp
curl -LO "$DOWNLOAD_URL"
tar xzf node_exporter-${EXPORTER_VERSION}.linux-amd64.tar.gz
sudo mv node_exporter-${EXPORTER_VERSION}.linux-amd64/node_exporter /usr/local/bin/
sudo chmod +x /usr/local/bin/node_exporter
sudo useradd -rs /bin/false node_exporter || true
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Node Exporter
After=network.target

[Service]
Type=simple
User=node_exporter
Group=node_exporter
ExecStart=/usr/local/bin/node_exporter \
    --collector.cpu \
    --collector.meminfo \
    --collector.diskstats \
    --collector.filesystem \
    --collector.netdev \
    --collector.loadavg \
    --collector.pressure \
    --web.listen-address=:9100 \
    --web.disable-exporter-metrics
Restart=always

[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now node_exporter
echo "node_exporter started, listening on port 9100"

1.1 Quick Metric Reference

# View exported metrics locally
curl -s http://localhost:9100/metrics | grep -E "^(node_cpu|node_memory|node_disk|node_network|node_filesystem)" | head -50

# Common metric names
# node_cpu_seconds_total          # CPU time
# node_memory_MemTotal_bytes       # Total memory
# node_memory_MemAvailable_bytes   # Available memory
# node_disk_read_bytes_total       # Disk read bytes
# node_network_receive_bytes_total # Network receive bytes
# node_load1                       # 1‑minute load average

2. CPU Monitoring

2.1 CPU Utilization

CPU usage is the most direct performance indicator.

# Average CPU usage across all cores (percentage)
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Single‑core usage
100 - (rate(node_cpu_seconds_total{mode="idle"}[5m]) * 100)
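To sanity‑check these expressions from a shell, you can query the Prometheus HTTP API directly. A minimal sketch, assuming Prometheus listens on localhost:9090:

# Query the average CPU usage expression via the HTTP API
EXPR='100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'
curl -s "http://localhost:9090/api/v1/query" --data-urlencode "query=${EXPR}" \
  | python3 -c 'import sys,json
for r in json.load(sys.stdin)["data"]["result"]:
    print(r["metric"].get("instance", "?"), r["value"][1])'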

Shell helpers:

# top command (quick view)
top -bn1 | head -20

# mpstat (requires sysstat)
mpstat -P ALL 1 5

# Simple Bash script to compute usage (top's field layout can vary by version/locale)
CPU_IDLE=$(top -bn1 | grep "Cpu(s)" | awk '{print $8}')
CPU_USED=$(echo "100 - $CPU_IDLE" | bc -l)
echo "CPU usage: $CPU_USED%"

2.2 Load Average

Load average reflects the number of runnable processes plus those in uninterruptible (usually disk I/O) sleep. Compare it with the number of CPU cores.

# Load values
node_load1   # 1‑minute
node_load5   # 5‑minute
node_load15  # 15‑minute

Load < core count → CPU has headroom

Load ≈ core count → CPU is saturated

Load > core count → CPU overload, processes are waiting

# Normalized load (percentage of cores)
# Filter to a single mode so each core is counted exactly once
node_load1 / on (instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) * 100
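The same normalization is easy to reproduce on the host itself; a small sketch using /proc/loadavg and nproc:

# Normalized 1-minute load as a percentage of available cores
LOAD1=$(awk '{print $1}' /proc/loadavg)
CORES=$(nproc)
awk -v l="$LOAD1" -v c="$CORES" 'BEGIN{printf "load1=%s cores=%d normalized=%.1f%%\n", l, c, l*100/c}'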

Threshold formulas (example for a host with N cores):

# Reasonable upper bound = N * 0.7   # Daily operation
# Warning threshold   = N * 1.0   # Performance pressure
# Critical threshold  = N * 1.5   # Immediate action required

2.3 CPU Context Switches

High context‑switch rates consume CPU cycles and reduce efficiency.

# View context switches per second
vmstat 1 5
# Detailed stats
cat /proc/stat | grep -E "^(ctxt|intr)"
# PromQL rate
rate(node_context_switches_total[5m])

Common causes of a high context‑switch rate (see the pidstat sketch after this list):

Too many processes/threads

I/O‑bound workloads causing frequent blocking

Lock contention

Improper scheduler parameters
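To attribute the churn to specific processes, pidstat from sysstat reports per‑process switch rates; a sketch:

# Per-process context switches: cswch/s = voluntary (I/O or lock waits),
# nvcswch/s = involuntary (CPU contention); top 10 by voluntary rate
pidstat -w 1 5 | awk '/Average:/ && $NF != "Command"' | sort -k4 -rn | head -10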

2.4 CPU Alert Rules

# alerts_cpu.yml
groups:
- name: "CPU Alerts"
  rules:
  - alert: HostHighCpuLoad
    expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Host CPU usage high"
      description: "Instance {{ $labels.instance }} CPU usage {{ $value | printf \"%.2f\" }}% for >10m"
  - alert: HostCriticalCpuLoad
    expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Host CPU usage critical"
      description: "Instance {{ $labels.instance }} CPU usage {{ $value | printf \"%.2f\" }}%"

3. Memory Monitoring

3.1 Memory Utilization

Much of Linux's "used" memory is reclaimable page cache, so free alone understates what is actually usable. MemAvailable is the kernel's estimate of memory available to new workloads; use it for a realistic pressure metric.

# free -m example output
#         total    used    free  shared  buff/cache  available
Mem:      32044   17177    6413      56        8454      12234
Swap:      8191     127    8064

# Usage formulas
memory_usage = (total - available) / total * 100
# Alternative (less accurate)
memory_usage = used / total * 100
# PromQL for usage based on available
100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)
# Detailed calculation using multiple counters
(node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Cached_bytes - node_memory_Buffers_bytes) / node_memory_MemTotal_bytes * 100
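The same MemAvailable‑based formula can be computed locally from /proc/meminfo, which is exactly where node_exporter reads it:

# Shell equivalent of the PromQL usage formula above
awk '/^MemTotal/{t=$2} /^MemAvailable/{a=$2} END{printf "memory usage: %.1f%%\n", (t-a)/t*100}' /proc/meminfo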

3.2 Swap Usage

# Show swap usage
free -m
swapon -s
vmstat 1 5   # si/so columns show swap in/out per second

# Identify processes using swap
for f in /proc/*/status; do awk '/VmSwap/{s=$2}/Name/{n=$2}END{if(s>0)print n,s}' $f 2>/dev/null; done | sort -k2 -rn | head -10
# PromQL for swap usage percentage
(node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100
# Alert when free swap < 1 GB
node_memory_SwapFree_bytes < 1024*1024*1024

3.3 Memory Alert Rules

# alerts_memory.yml
groups:
- name: "Memory Alerts"
  rules:
  - alert: HostHighMemoryUsage
    expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 85
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "Host memory usage high"
      description: "Instance {{ $labels.instance }} memory usage {{ $value | printf \"%.2f\" }}%"
  - alert: HostCriticalMemoryUsage
    expr: 100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Host memory usage critical"
      description: "Instance {{ $labels.instance }} memory usage {{ $value | printf \"%.2f\" }}%"
  - alert: HostHighSwapUsage
    expr: (node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes) / node_memory_SwapTotal_bytes * 100 > 50
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Host swap usage high"
      description: "Instance {{ $labels.instance }} swap usage {{ $value | printf \"%.2f\" }}%"
  - alert: HostSwapSpaceExhausted
    expr: node_memory_SwapFree_bytes < 104857600   # 100 MB
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Host swap space exhausted"
      description: "Instance {{ $labels.instance }} free swap < 100 MB"

3.4 Memory Leak Detection Script

#!/bin/bash
# detect_memory_leak.sh
INTERVAL=60   # seconds between samples
SAMPLES=10    # number of samples
OUTPUT="/tmp/memory_leak_detection.txt"
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
get_mem_usage(){ free | awk '/^Mem:/{printf "%.2f", ($2-$7)/$2*100}' ; }   # (total - available) / total
log "Starting memory leak detection..."
echo "timestamp,usage%" > "$OUTPUT"
for i in $(seq 1 $SAMPLES); do
  TS=$(date '+%Y-%m-%d %H:%M:%S')
  USAGE=$(get_mem_usage)
  echo "$TS,$USAGE" >> "$OUTPUT"
  log "Sample $i/$SAMPLES: $USAGE%"
  sleep $INTERVAL
done
FIRST=$(tail -n +2 "$OUTPUT" | head -1 | cut -d',' -f2)
LAST=$(tail -1 "$OUTPUT" | cut -d',' -f2)
GROWTH=$(echo "$LAST - $FIRST" | bc -l)
log "Memory usage change: $FIRST% → $LAST% (growth $GROWTH%)"
if (( $(echo "$GROWTH > 10" | bc -l) )); then
  log "[WARNING] Memory usage continuously increasing, possible leak"
else
  log "Memory usage change within normal range"
fi
log "Detailed data saved to $OUTPUT"

4. Disk Monitoring

4.1 Disk Capacity

# Show disk usage
df -h
# Show inode usage (often overlooked)
df -i
# Filter out temporary filesystems
df -h --output=source,size,used,avail,pcent,target -x tmpfs -x devtmpfs -x overlay -x shm

Sample Bash script that flags warning/critical usage levels:

#!/bin/bash
WARN_THRESHOLD=80
CRIT_THRESHOLD=90
while read LINE; do
  USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
  MOUNT=$(echo "$LINE" | awk '{print $6}')
  DEVICE=$(echo "$LINE" | awk '{print $1}')
  if [ "$USAGE" -ge $CRIT_THRESHOLD ]; then
    echo "[CRITICAL] $DEVICE ($MOUNT) usage $USAGE%"
  elif [ "$USAGE" -ge $WARN_THRESHOLD ]; then
    echo "[WARNING] $DEVICE ($MOUNT) usage $USAGE%"
  else
    echo "[OK] $DEVICE ($MOUNT) usage $USAGE%"
  fi
done < <(df -hP | grep -vE "tmpfs|devtmpfs|overlay|shm" | tail -n +2)   # -P keeps each filesystem on one line

Inode exhaustion can block file creation even when space is free.

# PromQL for low inode availability (exclude pseudo filesystems)
(node_filesystem_files_free{fstype!~"tmpfs|fuse.lxcfs"} / node_filesystem_files{fstype!~"tmpfs|fuse.lxcfs"}) * 100 < 10
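When that inode alert fires, the follow‑up is finding where the files pile up. A rough sketch that counts entries per directory, assuming /var is the suspect tree:

# Count filesystem entries per top-level directory (stay on one filesystem with -xdev)
for d in /var/*; do
  [ -d "$d" ] && printf "%s %s\n" "$(find "$d" -xdev 2>/dev/null | wc -l)" "$d"
done | sort -rn | head -10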

4.2 Disk I/O Monitoring

# iostat (requires sysstat)
iostat -x 1 5
# Columns of interest
# %util – device utilization (higher = busier)
# r/s, w/s – reads/writes per second
# rKB/s, wKB/s – throughput
# await – average I/O latency (ms)
# avgqu-sz – average queue length
# Per‑process I/O
iotop -ao
pidstat -d 1 5
# Low‑level stats
cat /proc/diskstats
# PromQL examples for disk I/O
rate(node_disk_io_time_seconds_total[5m]) * 100              # I/O utilization % (≈ iostat %util)
rate(node_disk_io_time_weighted_seconds_total[5m])           # Avg queue length (≈ avgqu-sz)
rate(node_disk_read_time_seconds_total[5m]) / rate(node_disk_reads_completed_total[5m]) * 1000   # Avg read latency (ms)
rate(node_disk_reads_completed_total[5m])                    # Reads per second
rate(node_disk_writes_completed_total[5m])                   # Writes per second
rate(node_disk_read_bytes_total[5m]) / 1024                  # Read KB/s
rate(node_disk_written_bytes_total[5m]) / 1024               # Write KB/s

4.3 Disk I/O Script

#!/bin/bash
INTERVAL=5
COUNT=6
log(){ echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1"; }
log "===== Disk I/O Monitoring ====="
log "Sampling $INTERVAL sec × $COUNT times"
iostat -x $INTERVAL $COUNT | tail -n +7
log "Top devices by %util"
iostat -x $INTERVAL $COUNT | grep -E "^(Device|%util)" | paste - - | awk '{print $2, $3}' | sort -k2 -rn | head -5
log "Throughput ranking"
for dev in $(lsblk -nd --output NAME | grep -E "sd|nvme|vd"); do
  READ=$(cat /sys/block/$dev/stat | awk '{print $6}')
  WRITE=$(cat /sys/block/$dev/stat | awk '{print $10}')
  echo "$dev: read sectors=$READ, write sectors=$WRITE"
done | sort -k3 -rn | head -5

4.4 Disk Alert Rules

# alerts_disk.yml
groups:
- name: "Disk Alerts"
  rules:
  - alert: HostDiskSpaceWarning
    expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Disk usage high"
      description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
  - alert: HostDiskSpaceCritical
    expr: (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_size_bytes{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Disk space exhausted"
      description: "Mount {{ $labels.mountpoint }} usage {{ $value | printf \"%.1f\" }}%"
  - alert: HostDiskInodesWarning
    expr: (1 - node_filesystem_files_free{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"} / node_filesystem_files{fstype!~"tmpfs|fuse.lxcfs|fuse.cgroup"}) * 100 > 80
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Inode usage high"
      description: "Mount {{ $labels.mountpoint }} inode usage {{ $value | printf \"%.1f\" }}%"
  - alert: HostHighDiskIO
    expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 70
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Disk I/O busy"
      description: "Device {{ $labels.device }} utilization {{ $value | printf \"%.1f\" }}%"
  - alert: HostHighIOwait
    expr: (rate(node_cpu_seconds_total{mode="iowait"}[5m]) * 100) > 30
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: "CPU I/O wait high"
      description: "Instance {{ $labels.instance }} I/O wait {{ $value | printf \"%.1f\" }}%"

5. Network Monitoring

5.1 Bandwidth Utilization

# Show NIC speed (requires ethtool)
ethtool eth0   # Speed: 1000Mb/s, Duplex: Full
# Real‑time throughput
watch -n 1 "cat /proc/net/dev | grep -E 'eth0|ens33'"
# Or use ifstat
ifstat 1 5
# PromQL for bandwidth usage (assume 1 Gbps = 125 000 000 B/s)
rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000 * 100
rate(node_network_transmit_bytes_total{device!="lo"}[5m]) / 125000000 * 100
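The PromQL rates above can be cross‑checked on the host by sampling /proc/net/dev twice; a sketch, where the interface name is an assumption:

# Per-NIC throughput from two /proc/net/dev samples one second apart
IFACE=eth0
read RX1 TX1 < <(awk -v i="$IFACE:" '$1==i{print $2, $10}' /proc/net/dev)
sleep 1
read RX2 TX2 < <(awk -v i="$IFACE:" '$1==i{print $2, $10}' /proc/net/dev)
echo "RX: $(( (RX2 - RX1) / 1024 )) KB/s  TX: $(( (TX2 - TX1) / 1024 )) KB/s"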

5.2 Packet Loss and Errors

# NIC error counters
ip -s link show eth0
# TCP retransmission stats
netstat -s | grep -i retransmit
sar -n TCP 1 5
# SNMP counters for drops
cat /proc/net/snmp | grep -E "Ip:|Tcp:|Udp:"
# PromQL for receive error rate
rate(node_network_receive_errs_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m]) * 100
# Transmit error rate
rate(node_network_transmit_errs_total{device!="lo"}[5m]) / rate(node_network_transmit_packets_total{device!="lo"}[5m]) * 100

Typical packet‑loss severity levels:

Normal : < 0.1 % – usual jitter

Acceptable : 0.1 % – 1 % – minor congestion

Warning : 1 % – 5 % – potential performance impact

Critical : > 5 % – severe network issue

5.3 TCP Connection State

# Distribution of TCP states (ss is preferred)
ss -ant | tail -n +2 | awk '{print $1}' | sort | uniq -c | sort -rn
# Meaning of common states
# LISTEN – listening sockets
# ESTABLISHED – active connections
# SYN_SENT – client sent SYN, waiting for SYN-ACK
# SYN_RECV – server received SYN, waiting for ACK
# FIN_WAIT1/2 – closing handshake
# TIME_WAIT – waiting for delayed packets to expire
# CLOSE – closed
# CLOSE_WAIT – received FIN, waiting for app close
# LAST_ACK – final ACK pending
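TIME_WAIT pressure matters relative to the ephemeral port range, so it helps to see both numbers together; a quick sketch:

# TIME_WAIT count versus the local ephemeral port range
TW=$(ss -ant state time-wait | tail -n +2 | wc -l)
read LOW HIGH < <(sysctl -n net.ipv4.ip_local_port_range)
echo "TIME_WAIT=$TW of roughly $((HIGH - LOW)) usable ephemeral ports per destination"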

5.4 Network Alert Rules

# alerts_network.yml
groups:
- name: "Network Alerts"
  rules:
  - alert: HostHighNetworkBandwidthUsage
    expr: (rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000) * 100 > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Network bandwidth usage high"
      description: "Device {{ $labels.device }} receive utilization {{ $value | printf \"%.1f\" }}%"
  - alert: HostCriticalNetworkBandwidthUsage
    expr: (rate(node_network_receive_bytes_total{device!="lo"}[5m]) / 125000000) * 100 > 95
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Network bandwidth approaching saturation"
      description: "Device {{ $labels.device }} receive utilization {{ $value | printf \"%.1f\" }}%"
  - alert: HostHighNetworkPacketLoss
    expr: (rate(node_network_receive_drop_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m])) * 100 > 1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network packet loss high"
      description: "Device {{ $labels.device }} loss rate {{ $value | printf \"%.2f\" }}%"
  - alert: HostHighNetworkErrorRate
    expr: (rate(node_network_receive_errs_total{device!="lo"}[5m]) / rate(node_network_receive_packets_total{device!="lo"}[5m])) * 100 > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Network error rate high"
      description: "Device {{ $labels.device }} error rate {{ $value | printf \"%.3f\" }}%"
  - alert: HostHighTimeWaitConnections
    expr: node_sockstat_TCP_tw > 50000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Excessive TIME_WAIT connections"
      description: "Instance {{ $labels.instance }} TIME_WAIT count {{ $value }} may exhaust ports"
  - alert: HostHighTcpConnectionCount
    expr: node_sockstat_TCP_alloc > 50000
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "TCP connection count high"
      description: "Instance {{ $labels.instance }} has {{ $value }} allocated TCP sockets"

6. Comprehensive Health‑Check Script

#!/bin/bash
# host_health_check.sh – unified health report
set -euo pipefail
ALERT_MODE=${1:-summary}
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_ok(){ echo -e "${GREEN}[OK]${NC} $1"; }
log_warn(){ echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error(){ echo -e "${RED}[ERROR]${NC} $1"; }

check_cpu(){
  echo "=== CPU Check ==="
  CPU_IDLE=$(vmstat 1 2 | tail -1 | awk '{print $15}')
  CPU_USAGE=$((100 - CPU_IDLE))
  if [ $CPU_USAGE -gt 95 ]; then log_error "CPU usage $CPU_USAGE% (critical)"; 
  elif [ $CPU_USAGE -gt 80 ]; then log_warn "CPU usage $CPU_USAGE% (warning)"; 
  else log_ok "CPU usage $CPU_USAGE% (normal)"; fi
  LOAD=$(awk '{print $1}' /proc/loadavg)
  CORES=$(nproc)
  # Load is a float; Bash arithmetic is integer-only, so use awk for the ratio
  LOAD_PCT=$(awk -v l="$LOAD" -v c="$CORES" 'BEGIN{printf "%d", l*100/c}')
  if [ $LOAD_PCT -gt 150 ]; then log_error "Load $LOAD (>1.5× cores)"; 
  elif [ $LOAD_PCT -gt 100 ]; then log_warn "Load $LOAD (> cores)"; 
  else log_ok "Load $LOAD (normal)"; fi
}

check_memory(){
  echo "=== Memory Check ==="
  TOTAL=$(free -m | awk '/^Mem:/{print $2}')
  AVAILABLE=$(free -m | awk '/^Mem:/{print $7}')
  USAGE_PCT=$(( (TOTAL - AVAILABLE) * 100 / TOTAL ))
  if [ $USAGE_PCT -gt 95 ]; then log_error "Memory usage $USAGE_PCT% (critical)"; 
  elif [ $USAGE_PCT -gt 85 ]; then log_warn "Memory usage $USAGE_PCT% (warning)"; 
  else log_ok "Memory usage $USAGE_PCT% (normal)"; fi
  SWAP_TOTAL=$(free -m | awk '/^Swap:/{print $2}')
  SWAP_USED=$(free -m | awk '/^Swap:/{print $3}')
  if [ $SWAP_TOTAL -gt 0 ]; then
    SWAP_PCT=$((SWAP_USED * 100 / SWAP_TOTAL))
    if [ $SWAP_PCT -gt 50 ]; then log_warn "Swap usage $SWAP_PCT% (pressure)"; fi
  fi
}

check_disk(){
  echo "=== Disk Check ==="
  df -hP | grep -vE "tmpfs|devtmpfs|overlay|shm" | tail -n +2 | while read LINE; do
    USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
    MOUNT=$(echo "$LINE" | awk '{print $6}')
    if [ $USAGE -ge 95 ]; then log_error "$MOUNT: $USAGE% (critical)"; 
    elif [ $USAGE -ge 85 ]; then log_warn "$MOUNT: $USAGE% (warning)"; 
    else log_ok "$MOUNT: $USAGE% (normal)"; fi
  done
  # inode check
  df -iP | grep -vE "tmpfs|devtmpfs" | tail -n +2 | while read LINE; do
    USAGE=$(echo "$LINE" | awk '{print $5}' | sed 's/%//')
    MOUNT=$(echo "$LINE" | awk '{print $6}')
    if [ $USAGE -ge 90 ]; then log_warn "$MOUNT inode usage $USAGE% (warning)"; fi
  done
}

check_network(){
  echo "=== Network Check ==="
  # Sum the errors column from the data line that follows each RX:/TX: header
  ERRORS=$(ip -s link show | awk '/RX:|TX:/{getline; sum+=$3} END{print sum+0}')
  if [ "$ERRORS" -gt 100 ]; then log_warn "Network error packets $ERRORS (possible issue)"; else log_ok "Network error packets $ERRORS (normal)"; fi
  TW_COUNT=$(ss -ant state time-wait | tail -n +2 | wc -l)
  CW_COUNT=$(ss -ant state close-wait | tail -n +2 | wc -l)
  SYN_COUNT=$(ss -ant state syn-recv | tail -n +2 | wc -l)
  # `|| true` keeps set -e from aborting when a threshold is simply not exceeded
  [ $TW_COUNT -gt 50000 ] && log_warn "TIME_WAIT $TW_COUNT (excessive)" || true
  [ $CW_COUNT -gt 100 ] && log_warn "CLOSE_WAIT $CW_COUNT (excessive)" || true
  [ $SYN_COUNT -gt 1000 ] && log_error "SYN_RECV $SYN_COUNT (possible SYN flood)" || true
}

check_processes(){
  echo "=== Process Check ==="
  ZOMBIE_COUNT=$(ps aux | awk '$8 ~ /Z/{print}' | wc -l)
  [ $ZOMBIE_COUNT -gt 0 ] && log_warn "Zombie processes $ZOMBIE_COUNT" || log_ok "Zombie processes: 0"
  # grep -c exits non-zero on no match; `|| true` keeps set -e/pipefail happy
  OOM_COUNT=$(dmesg 2>/dev/null | grep -ci "invoked oom-killer" || true)
  [ $OOM_COUNT -gt 0 ] && log_error "OOM killer triggered $OOM_COUNT times" || log_ok "No OOM killer events"
}

check_services(){
  echo "=== Service Check ==="
  for svc in mysql nginx redis php-fpm docker; do
    if systemctl is-active --quiet $svc 2>/dev/null; then log_ok "$svc: running"; 
    else STATUS=$(systemctl is-active $svc 2>/dev/null || echo unknown); log_warn "$svc: $STATUS"; fi
  done
}

main(){
  echo "========================================"
  echo "Host Health Check Report"
  echo "Hostname: $(hostname)"
  echo "Time: $(date '+%Y-%m-%d %H:%M:%S')"
  echo "========================================"
  check_cpu
  check_memory
  check_disk
  check_network
  check_processes
  check_services
  echo "========================================"
  echo "Check complete"
  echo "========================================"
}

main "$@"

7. Prometheus Auto‑Discovery Configuration

# prometheus.yml – node_exporter scrape job using file-based discovery
scrape_configs:
- job_name: "node_exporter"
  file_sd_configs:
  - files: ["/etc/prometheus/targets/linux-hosts.yml"]
    refresh_interval: 1m
  relabel_configs:
  - source_labels: [__address__]
    target_label: instance
    regex: "(.*):.*"
    replacement: "${1}"

# /etc/prometheus/targets/linux-hosts.yml – edits to the target list are picked up automatically
- targets: ["localhost:9100", "192.168.1.10:9100", "192.168.1.11:9100"]
  labels:
    env: production
    role: web

8. Threshold Design Principles & Quick Reference

8.1 Design Principles

Business‑driven – thresholds must reflect real workload patterns, not generic defaults.

Safety buffers – alert before resources are exhausted (e.g., disk warning at 85 % instead of 95 %).

Multi‑level alerts – warning → critical → emergency to guide response priority.

Periodic review – revisit thresholds each quarter as traffic and architecture evolve.

8.2 Quick Reference

Metric                  Warning          Critical         Formula
CPU usage               > 80 %           > 95 %           100 - idle %
Load average            > 1 × cores      > 1.5 × cores    load / cores
Memory usage            > 85 %           > 95 %           (total - available) / total
Swap usage              > 50 %           > 80 %           used / total
Disk usage              > 80 %           > 95 %           used / total
Inode usage             > 80 %           > 90 %           used / total
Disk I/O utilization    > 70 %           > 90 %           iostat %util
I/O wait                > 20 %           > 50 %           iowait / total CPU
Network bandwidth       > 70 %           > 90 %           throughput / link capacity
Packet loss             > 1 %            > 5 %            dropped / total
TCP TIME_WAIT           > 50 000         > 100 000        absolute count
TCP SYN_RECV            > 1 000          > 5 000          absolute count

8.3 Common Pitfalls

Monitoring only CPU and ignoring I/O – high CPU may be caused by I/O wait.

Treating high memory usage as a problem – Linux caches aggressively; use MemAvailable instead of MemFree.

Overlooking inode exhaustion – many small files can fill inodes before disk space runs out.

Ignoring packet loss when bandwidth looks fine – loss degrades application performance.

Setting static thresholds and never revisiting them – leads to alert fatigue during traffic spikes.

9. Monitoring Best Practices

9.1 Solving Alert Fatigue

Group similar alerts, silence maintenance windows, and adopt SLO‑based alerts.

# Alertmanager grouping example
route:
  group_by: ['alertname','instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'team-notifications'
  routes:
  - match:
      severity: critical
    receiver: 'critical-alerts'
    group_wait: 10s
    repeat_interval: 1h
# Silence payload for the Alertmanager API (POST /api/v2/silences, JSON body shown as YAML)
matchers:
- name: instance
  value: "db-master-.*"
  isRegex: true
startsAt: "2026-04-15T02:00:00Z"
endsAt: "2026-04-15T06:00:00Z"
createdBy: "admin"
comment: "Database maintenance window"
# SLO‑driven error‑rate alert (instead of raw CPU)
- alert: HighErrorRate
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[5m])) /
    sum(rate(http_requests_total[5m])) > 0.001
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Service error rate high"
    description: "Service {{ $labels.service }} error rate {{ $value | humanizePercentage }} exceeds 0.1 % SLO"

9.2 Automated Response Workflow

Use Alertmanager webhooks to trigger custom actions (e.g., log cleanup, connection pool restart).

# Minimal Go webhook receiver (alert-receiver/main.go)
package main
import (
  "encoding/json"
  "io/ioutil"
  "log"
  "net/http"
)

type Alert struct {
  Labels      map[string]string `json:"labels"`
  Annotations map[string]string `json:"annotations"`
  Status      string            `json:"status"`
}

type Payload struct { Alerts []Alert `json:"alerts"` }

func handleAlert(w http.ResponseWriter, r *http.Request) {
  body, _ := io.ReadAll(r.Body)
  defer r.Body.Close()
  var p Payload
  json.Unmarshal(body, &p)
  for _, a := range p.Alerts {
    if a.Status == "firing" {
      log.Printf("Alert: %s – %s", a.Labels["alertname"], a.Annotations["summary"])
      switch a.Labels["alertname"] {
      case "HostHighDiskSpaceUsage":
        go cleanupOldLogs(a.Labels["instance"])
      case "MySQLConnectionRefused":
        go restartConnectionPool(a.Labels["service"])
      }
    }
  }
}

func cleanupOldLogs(instance string) { log.Printf("Cleaning old logs on %s", instance) }
func restartConnectionPool(service string) { log.Printf("Restarting connection pool for %s", service) }

func main() {
  http.HandleFunc("/alerts", handleAlert)
  log.Fatal(http.ListenAndServe(":8080", nil))
}
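For the receiver to fire, Alertmanager needs a webhook pointing at it. A minimal sketch of the wiring, with the receiver name and config path chosen here for illustration:

# Append a webhook receiver to the Alertmanager config (path is an assumption)
cat >> /etc/alertmanager/alertmanager.yml <<'EOF'
receivers:
- name: auto-remediation
  webhook_configs:
  - url: http://localhost:8080/alerts
EOF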

9.3 Monitoring as Code

Store all Prometheus rules, scrape configs, Grafana dashboards, and scripts in a Git repository. Example layout:

monitoring-config/
├── prometheus/
│   ├── rules/
│   │   ├── cpu_alerts.yml
│   │   ├── memory_alerts.yml
│   │   ├── disk_alerts.yml
│   │   └── network_alerts.yml
│   ├── scrape/
│   │   └── node_exporter.yml
│   └── prometheus.yml
├── alertmanager/
│   └── alertmanager.yml
├── grafana/
│   └── dashboards/
│       ├── node-overview.json
│       └── service-health.json
└── Makefile
# Validate configs
validate-configs:
	promtool check config prometheus/prometheus.yml
	promtool check rules prometheus/rules/*.yml
	amtool check-config alertmanager/alertmanager.yml

# Apply to Kubernetes
apply-configs:
	kubectl apply -f prometheus/rules/
	kubectl apply -f prometheus/scrape/
	kubectl apply -f alertmanager/

9.4 Dashboard Design Principles

Place critical health status at the top, use color‑coded gauges, and keep charts focused on trends.

{
  "dashboard": {
    "panels": [
      {
        "title": "Service health (top)",
        "type": "stat",
        "targets": [{"expr": "up", "legendFormat": "{{instance}}"}],
        "fieldConfig": {"defaults": {"mappings": [{"type": "value", "options": {"0": {"text": "Down", "color": "red"}, "1": {"text": "Up", "color": "green"}}}]}}
      },
      {
        "title": "CPU usage",
        "type": "gauge",
        "targets": [{"expr": "100 - avg(rate(node_cpu_seconds_total{mode='idle'}[5m])) * 100"}],
        "fieldConfig": {"defaults": {"thresholds": {"mode": "absolute", "steps": [{"value": 0, "color": "green"}, {"value": 70, "color": "yellow"}, {"value": 85, "color": "red"}]}}}
      },
      {
        "title": "Memory usage",
        "type": "gauge",
        "targets": [{"expr": "100 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100)"}]
      },
      {
        "title": "Network bandwidth",
        "type": "graph",
        "targets": [
          {"expr": "rate(node_network_receive_bytes_total{device!='lo'}[5m]) / 1024", "legendFormat": "RX KB/s"},
          {"expr": "rate(node_network_transmit_bytes_total{device!='lo'}[5m]) / 1024", "legendFormat": "TX KB/s"}
        ]
      }
    ]
  }
}

9.5 Production Inspection Script

#!/bin/bash
# prometheus_health_check.sh – verify Prometheus & Alertmanager health
set -euo pipefail
PROM_URL="http://localhost:9090"
AM_URL="http://localhost:9093"
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m'
log_ok(){ echo -e "${GREEN}[OK]${NC} $1"; }
log_warn(){ echo -e "${YELLOW}[WARN]${NC} $1"; }
log_error(){ echo -e "${RED}[ERROR]${NC} $1"; }

echo "=== Prometheus Health Check ==="
# 1. Prometheus service
if curl -s "$PROM_URL/-/healthy" > /dev/null; then log_ok "Prometheus is healthy"; else log_error "Prometheus not responding"; fi
# 2. Target status
TARGET_DOWN=$(curl -s "$PROM_URL/api/v1/targets" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len([t for t in d['data']['activeTargets'] if t['health']!='up']))" 2>/dev/null || echo "?")
if [ "$TARGET_DOWN" = "0" ]; then log_ok "All targets up"; elif [ $TARGET_DOWN -lt 3 ]; then log_warn "$TARGET_DOWN targets down"; else log_error "$TARGET_DOWN targets down"; fi
# 3. Alertmanager service
if curl -s "$AM_URL/-/healthy" > /dev/null; then log_ok "Alertmanager is healthy"; else log_error "Alertmanager not responding"; fi
# 4. Alert statistics
ALERT_COUNT=$(curl -s "$PROM_URL/api/v1/alerts" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len(d['data']['alerts']))" 2>/dev/null || echo "?")
FIRING_COUNT=$(curl -s "$PROM_URL/api/v1/alerts" | python3 -c "import sys,json; d=json.load(sys.stdin); print(len([a for a in d['data']['alerts'] if a['state']=='firing']))" 2>/dev/null || echo "?")
echo "Total alerts: $ALERT_COUNT"
echo "Firing alerts: $FIRING_COUNT"
if [ "$FIRING_COUNT" != "?" ] && [ "$FIRING_COUNT" -gt 0 ]; then log_warn "Active alerts present – investigate"; fi
# 5. TSDB size (series count)
TSDB_SIZE=$(curl -s "$PROM_URL/api/v1/status/tsdb" | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['data']['headStats']['numSeries'])" 2>/dev/null || echo "?")
echo "Current series count: $TSDB_SIZE"

echo "=== Check complete ==="

10. Conclusion

Designing monitoring thresholds is an iterative process. Operations engineers should adopt a "monitor → analyze → tune → verify" loop, keep thresholds version‑controlled, and bind each alert to a concrete runbook. This approach reduces alert fatigue, ensures alerts are meaningful, and enables rapid response to real incidents.

Tags: Prometheus, Grafana, shell scripting, Alertmanager, Linux monitoring, node_exporter, performance thresholds
Written by

Ops Community

A leading IT operations community where professionals share and grow together.
