
How to Build a Robust Linux Alert System with Automated Responses

This guide walks ops engineers through designing Linux monitoring metrics, configuring Prometheus alerts, implementing automated response scripts, integrating webhook handlers, visualizing data with Grafana, and applying performance tuning and fault‑recovery best practices to achieve reliable, self‑healing infrastructure.

MaGe Linux Operations

Jul 10, 2025

Linux System Alert and Automated Response Configuration

Introduction

In modern IT operations, system monitoring and automated response are essential for service stability. Linux servers dominate enterprise environments, and their alert mechanisms and automation directly affect business continuity. This article explores configuration methods for Linux alerts and automated responses, providing practical solutions for ops engineers.

Monitoring Metric System

System Core Metrics

CPU Monitoring

CPU usage (overall and per core)

CPU load averages (1m, 5m, 15m)

CPU context switches

CPU interrupt handling count

Memory Monitoring

Memory usage and free memory

Swap usage

Memory fragmentation

Cache and buffer usage
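
As a rough illustration of how the memory metrics above can be derived, the sketch below parses `/proc/meminfo`-style text and computes the used percentage as MemTotal minus MemAvailable, the same basis the HighMemoryUsage rule later in this article uses. The function names and the sample values are illustrative assumptions, not part of the original article.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> value in kB."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Used memory as a percentage, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) * 100.0 / total


# Made-up sample values in the format the kernel uses
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print(f"memory used: {memory_used_percent(info):.1f}%")
    print(f"swap used: {info['SwapTotal'] - info['SwapFree']} kB")
```

On a live host the same functions could be fed the contents of `/proc/meminfo` instead of the sample string.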

Disk Monitoring

Disk space usage

Disk I/O read/write rates

Disk queue length

Filesystem inode usage

Network Monitoring

Network interface traffic

Number of network connections

Network error packet statistics

Network latency and packet loss

Application Layer Metrics

Process Monitoring

Key process alive status

Process CPU and memory consumption

Process file descriptor usage

Process port listening status

Service Monitoring

Service response time

Service availability checks

Service error rate statistics

Service connection pool status

Alert System Architecture Design

Monitoring Data Collection Layer

System-level Monitoring Tool

Use node_exporter to collect system metrics:

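# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# Create the unprivileged account the unit below runs as (skip if it already exists)
sudo useradd --system --no-create-home --shell /usr/sbin/nologin prometheus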

# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
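
Once the service is running, a quick way to confirm that it exposes data is to read its metrics output. The sketch below parses a small sample of the Prometheus text format; it is a simplified parser written for illustration (it ignores timestamps and escaped label values), and the sample values are made up.

```python
def parse_metrics(text):
    """Parse simple Prometheus text-format lines into {series: value}.

    Comment lines (# HELP / # TYPE) are skipped. This is a sketch,
    not a complete parser for the exposition format.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        try:
            samples[series] = float(value)
        except ValueError:
            continue
    return samples


# Made-up sample in the format node_exporter serves
SAMPLE = """# HELP node_load1 1m load average.
# TYPE node_load1 gauge
node_load1 0.42
node_memory_MemTotal_bytes 1.6384e+10
node_filesystem_avail_bytes{device="/dev/sda1",fstype="ext4",mountpoint="/"} 5e+09
"""

if __name__ == "__main__":
    for series, value in parse_metrics(SAMPLE).items():
        print(series, value)
```

Against a live exporter, the text could be fetched with `urllib.request.urlopen` from `http://localhost:9100/metrics`, assuming node_exporter listens on its default port 9100.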

Custom Monitoring Scripts

Create a system health check script:

#!/bin/bash
# system_health_check.sh
CONFIG_FILE="/etc/monitoring/health_check.conf"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10

if [ -f "$CONFIG_FILE" ]; then
    source "$CONFIG_FILE"
fi

check_cpu() {
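    # Overall usage = 100 - idle; reading only the second field would report user time alone
    local cpu_usage=$(LC_ALL=C top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)[% ]*id.*/\1/' | awk '{print 100 - $1}')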
    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: CPU usage is ${cpu_usage}%"
        return 2
    elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
        echo "WARNING: CPU usage is ${cpu_usage}%"
        return 1
    fi
    return 0
}
# Similar functions for memory, disk, and load omitted for brevity
main() {
    local exit_code=0
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] Starting system health check..."
    check_cpu
    local cpu_result=$?
    # Call other checks...
    if [ $cpu_result -eq 2 ]; then
        exit_code=2
    elif [ $cpu_result -eq 1 ]; then
        exit_code=1
    fi
    echo "[$timestamp] Health check completed with exit code: $exit_code"
    exit $exit_code
}
main "$@"
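
The script leaves the memory, disk, and load checks out. As a hedged sketch of how they could share one threshold rule, the Python below mirrors the logic of `check_cpu` (critical above the threshold, warning within 10 points below it) and applies it to disk usage. The helper names are assumptions made for this example.

```python
import shutil

OK, WARNING, CRITICAL = 0, 1, 2


def classify(value, threshold, warn_margin=10):
    """Return 2 above the threshold, 1 within warn_margin below it, else 0."""
    if value > threshold:
        return CRITICAL
    if value > threshold - warn_margin:
        return WARNING
    return OK


def disk_used_percent(path="/"):
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used * 100.0 / usage.total


def worst(results):
    """Overall exit code is the most severe individual result."""
    return max(results, default=OK)


if __name__ == "__main__":
    results = [
        classify(disk_used_percent("/"), 90),  # DISK_THRESHOLD from the script
        classify(72.5, 80),                    # a made-up CPU reading
    ]
    print("exit code:", worst(results))
```

Taking the maximum of the individual results keeps the exit code convention of the shell script (0 OK, 1 warning, 2 critical) as more checks are added.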

Alert Rule Configuration

Prometheus Alert Rules

# /etc/prometheus/rules/system_alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 10% on {{ $labels.instance }}"
      - alert: SystemLoadHigh
        expr: node_load1 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load"
          description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
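
The `for:` clause in these rules means an alert fires only after its expression has stayed true for the whole duration. The sketch below is a simplified model of that behaviour, not Prometheus's actual implementation: it assumes evenly spaced evaluations and a single series, and the function name is made up for this example.

```python
def alert_states(samples, threshold, for_seconds):
    """Model the inactive -> pending -> firing transitions of one alert.

    samples: list of (timestamp_seconds, value) pairs in time order.
    Returns one state string per sample.
    """
    states = []
    active_since = None
    for ts, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = ts  # condition just became true
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # One CPU reading per minute, evaluated against "> 80 for 5m"
    values = [70, 85, 90, 88, 92, 95, 91, 60, 85]
    samples = [(i * 60, v) for i, v in enumerate(values)]
    for (ts, v), state in zip(samples, alert_states(samples, 80, 300)):
        print(f"t={ts:>3}s cpu={v}% -> {state}")
```

The reset on a single false evaluation is why a short dip below the threshold restarts the waiting period.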

Automated Response Mechanism

Response Strategy Classification

Preventive Response

Resource pre-allocation

Load balancing adjustment

Cache warming

Connection pool expansion

Corrective Response

Service restart

Process cleanup

Temporary file cleanup

Log rotation

Scalable Response

Auto scaling

Resource migration

Load shedding

Backup activation

Automation Script Implementation

Service Auto-restart Script

#!/bin/bash
# auto_restart_service.sh
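SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60
# The recipient address was obfuscated in the source; supply it through the environment
ALERT_EMAIL="${ALERT_EMAIL:-}"

if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service-name>" >&2
    exit 2
fi

check_service_status() {
    systemctl is-active --quiet "$SERVICE_NAME"
}
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}
send_notification() {
    local message="$1"
    local severity="$2"
    if [ -n "$ALERT_EMAIL" ]; then
        echo "$message" | mail -s "[$severity] Service Alert: $SERVICE_NAME" "$ALERT_EMAIL"
    fi
    # DingTalk robot webhooks normally need an access_token query parameter; the source omits it
    curl -s -X POST https://oapi.dingtalk.com/robot/send -H 'Content-Type: application/json' -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$message\"}}"
}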
main() {
    local restart_count=0
    while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
        if check_service_status; then
            log_message "Service $SERVICE_NAME is running normally"
            exit 0
        else
            restart_count=$((restart_count + 1))
            log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"
            systemctl restart "$SERVICE_NAME"
            sleep $RESTART_INTERVAL
            if check_service_status; then
                log_message "Successfully restarted $SERVICE_NAME"
                send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
                exit 0
            fi
        fi
    done
    log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
    send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
    exit 1
}
main "$@"
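
The core of the script is a bounded retry loop: check, restart, wait, check again, and give up after a fixed number of attempts. A minimal Python sketch of the same pattern follows. The check and restart steps are passed in as functions so the loop can be exercised without systemd; the `systemctl_*` helpers show one possible wiring and are assumptions for this example.

```python
import subprocess
import time


def restart_with_retries(is_active, restart, max_attempts=3,
                         wait_seconds=60, sleep=time.sleep):
    """Restart until is_active() is true or max_attempts is reached.

    Returns (recovered, attempts_used).
    """
    if is_active():
        return True, 0
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        restart()
        sleep(wait_seconds)  # give the service time to come up
        if is_active():
            return True, attempts
    return False, attempts


def systemctl_is_active(name):
    """True when `systemctl is-active --quiet NAME` exits with 0."""
    return subprocess.run(["systemctl", "is-active", "--quiet", name]).returncode == 0


def systemctl_restart(name):
    subprocess.run(["systemctl", "restart", name])


if __name__ == "__main__":
    # Simulated service that comes back after the second restart
    state = {"restarts": 0}

    def fake_restart():
        state["restarts"] += 1

    result = restart_with_retries(lambda: state["restarts"] >= 2,
                                  fake_restart, sleep=lambda s: None)
    print(result)
```

Injecting `sleep` keeps the loop testable without waiting a full minute between attempts.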

Disk Space Auto-cleanup Script

#!/bin/bash
# disk_cleanup.sh
CLEANUP_PATHS=("/var/log" "/tmp" "/var/tmp" "/var/cache")
LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7

cleanup_logs() {
    local log_path="$1"
    find "$log_path" -name "*.log" -type f -mtime +$LOG_RETENTION_DAYS -delete
    find "$log_path" -name "*.log.*" -type f -mtime +$LOG_RETENTION_DAYS -delete
}
cleanup_temp() {
    local temp_path="$1"
    find "$temp_path" -type f -mtime +$TEMP_FILE_AGE -delete
    # -mindepth 1 keeps find from deleting the temp directory itself when it is empty
    find "$temp_path" -mindepth 1 -type d -empty -delete
}
cleanup_cache() {
    if command -v apt-get &>/dev/null; then
        apt-get clean
    elif command -v yum &>/dev/null; then
        yum clean all
    fi
    # Dropping the page cache frees RAM, not disk space, so it is left out of the disk cleanup
}
main() {
    echo "Starting disk cleanup process..."
    for path in "${CLEANUP_PATHS[@]}"; do
        if [ -d "$path" ]; then
            echo "Cleaning up $path..."
            case "$path" in
                "/var/log") cleanup_logs "$path" ;;
                "/tmp"|"/var/tmp") cleanup_temp "$path" ;;
                "/var/cache") cleanup_cache ;;
            esac
        fi
    done
    echo "Disk cleanup completed"
}
main "$@"
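
Because the cleanup deletes files outright, it can help to preview what would be removed first. The sketch below lists files older than a given age without deleting anything, roughly matching the `find ... -mtime +N` selection used in the script. The function name and the demo files are assumptions for this example.

```python
import os
import tempfile
import time
from pathlib import Path


def find_old_files(root, max_age_days, patterns=("*",), now=None):
    """Return regular files under root last modified more than max_age_days ago."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    old = set()
    for pattern in patterns:
        for path in Path(root).rglob(pattern):
            if path.is_symlink() or not path.is_file():
                continue
            if path.stat().st_mtime < cutoff:
                old.add(path)
    return sorted(old)


if __name__ == "__main__":
    # Demo in a throwaway directory: one fresh log and one aged 10 days
    with tempfile.TemporaryDirectory() as demo:
        fresh = Path(demo) / "app.log"
        stale = Path(demo) / "app.log.1"
        fresh.write_text("new")
        stale.write_text("old")
        ten_days_ago = time.time() - 10 * 86400
        os.utime(stale, (ten_days_ago, ten_days_ago))
        for path in find_old_files(demo, 7, ("*.log", "*.log.*")):
            print("would delete:", path.name)
```

Reviewing the list before enabling deletion is one way to keep the manual check that the best practices section recommends.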

Webhook Handler

Python Flask webhook to trigger automation based on Alertmanager alerts:

#!/usr/bin/env python3
# webhook_handler.py
from flask import Flask, request, jsonify
import logging
import subprocess

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

AUTOMATION_MAPPING = {
    'HighCPUUsage': 'handle_high_cpu',
    'HighMemoryUsage': 'handle_high_memory',
    'DiskSpaceLow': 'handle_disk_space_low',
    'ServiceDown': 'handle_service_down'
}

def handle_high_cpu(alert_data):
    """Handle high CPU usage alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])
    return {"status": "success", "action": "cpu_optimization"}

def handle_high_memory(alert_data):
    """Handle high memory usage alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])
    return {"status": "success", "action": "memory_cleanup"}

def handle_disk_space_low(alert_data):
    """Handle low disk space alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])
    return {"status": "success", "action": "disk_cleanup"}

def handle_service_down(alert_data):
    """Handle service down alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])
    return {"status": "success", "action": "service_restart"}

@app.route('/webhook', methods=['POST'])
def webhook():
    """Process Alertmanager webhook"""
    try:
        # get_json(silent=True) returns None instead of raising on an empty or invalid body
        data = request.get_json(silent=True) or {}
        alerts = data.get('alerts', [])
        responses = []
        for alert in alerts:
            # Alertmanager also posts resolved alerts; only firing ones should trigger actions
            if alert.get('status') != 'firing':
                continue
            alert_name = alert.get('labels', {}).get('alertname', '')
            if alert_name in AUTOMATION_MAPPING:
                handler_func = globals()[AUTOMATION_MAPPING[alert_name]]
                response = handler_func(alert)
                responses.append(response)
            else:
                logging.warning(f"No handler found for alert: {alert_name}")
        return jsonify({"responses": responses})
    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500

if __name__ == '__main__':
    # 9093 is Alertmanager's own default port; 5001 is an arbitrary free port chosen for this example
    app.run(host='0.0.0.0', port=5001)
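
To make the handler's input concrete, the sketch below shows the general shape of an Alertmanager webhook payload and picks out the alerts that are still firing. The payload is a trimmed, made-up sample (real payloads carry more fields, such as `groupKey` and timestamps), and the helper name is an assumption.

```python
SAMPLE_PAYLOAD = {
    "version": "4",
    "status": "firing",
    "receiver": "automation-webhook",  # hypothetical receiver name
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "DiskSpaceLow",
                "instance": "web1:9100",
                "severity": "critical",
            },
            "annotations": {"summary": "Low disk space"},
        },
        {
            "status": "resolved",
            "labels": {
                "alertname": "HighCPUUsage",
                "instance": "web1:9100",
                "severity": "warning",
            },
            "annotations": {"summary": "High CPU usage detected"},
        },
    ],
}


def firing_alert_names(payload):
    """Names of alerts in the payload whose status is 'firing'."""
    names = []
    for alert in payload.get("alerts", []):
        name = alert.get("labels", {}).get("alertname")
        if alert.get("status") == "firing" and name:
            names.append(name)
    return names


if __name__ == "__main__":
    print(firing_alert_names(SAMPLE_PAYLOAD))
```

Filtering on the per-alert `status` matters because a single notification can mix firing and resolved alerts, and a cleanup or restart should not run when an alert clears.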

Monitoring Data Visualization

Grafana Dashboard Configuration

JSON definition for a Linux monitoring dashboard (CPU, memory, disk usage):

{
  "dashboard": {
    "title": "Linux System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [
          { "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          { "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100" }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "stat",
        "targets": [
          { "expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100" }
        ]
      }
    ]
  }
}
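
Dashboard JSON like this is repetitive, so it can be convenient to generate it. The sketch below builds the same structure from a list of title and expression pairs. It reproduces only the fields shown above; a dashboard that Grafana accepts in practice may need further fields, and the helper names are assumptions.

```python
import json


def stat_panel(title, expr):
    """One 'stat' panel with a single PromQL target."""
    return {"title": title, "type": "stat", "targets": [{"expr": expr}]}


def build_dashboard(title, panels):
    """panels: iterable of (panel_title, promql_expression) pairs."""
    return {
        "dashboard": {
            "title": title,
            "panels": [stat_panel(t, e) for t, e in panels],
        }
    }


PANELS = [
    ("CPU Usage",
     '100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'),
    ("Memory Usage",
     "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)"
     " / node_memory_MemTotal_bytes * 100"),
]

if __name__ == "__main__":
    print(json.dumps(build_dashboard("Linux System Monitoring", PANELS), indent=2))
```

Letting `json.dumps` handle the escaping avoids hand-writing the `\"` sequences inside the PromQL label matchers.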

Performance Optimization and Tuning

Monitoring Performance Optimization

Data Collection Optimization

Adjust collection intervals to balance accuracy and load

Use data compression to reduce storage

Implement data retention policies

Optimize query performance

Alert Optimization

Set reasonable thresholds to avoid false alarms

Implement alert suppression

Configure alert aggregation rules

Regularly review and adjust alert policies
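
Alert suppression, in its simplest form, means not repeating the same notification within a cooldown window. The class below is a minimal sketch of that idea for custom scripts; its name and interface are assumptions. Where Alertmanager is already in use, its own grouping, `repeat_interval`, and inhibition settings cover this more completely.

```python
class AlertSuppressor:
    """Suppress repeat notifications for the same key within a cooldown."""

    def __init__(self, cooldown_seconds):
        self.cooldown = cooldown_seconds
        self._last_sent = {}

    def should_send(self, key, now):
        """Return True and record the send time unless key is still cooling down."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True


if __name__ == "__main__":
    suppressor = AlertSuppressor(cooldown_seconds=300)
    key = ("HighCPUUsage", "web1")
    for t in (0, 100, 300, 350):
        print(t, suppressor.should_send(key, now=t))
```

Keying on the alert name plus instance keeps one noisy host from silencing the same alert on other hosts.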

System Resource Optimization

Memory Management

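#!/bin/bash
# memory_optimization.sh (run as root; these settings do not persist across reboots)
# Drop the page cache (frees clean cached pages only; the cache refills on later reads)
sync && echo 1 > /proc/sys/vm/drop_caches
# Lower the kernel's tendency to swap
echo 10 > /proc/sys/vm/swappiness
# Always allow memory overcommit (mode 1 disables the kernel's heuristic check, so use with care)
echo 1 > /proc/sys/vm/overcommit_memory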

Disk I/O Optimization

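#!/bin/bash
# disk_io_optimization.sh (run as root; sda is an example device)
# Set I/O scheduler: multi-queue kernels offer "none" in place of the legacy "noop"
if grep -qw none /sys/block/sda/queue/scheduler; then
    echo none > /sys/block/sda/queue/scheduler
else
    echo noop > /sys/block/sda/queue/scheduler
fi
# Optimize filesystem mount options (noatime already implies nodiratime)
mount -o remount,noatime /
# Adjust the request queue size
echo 32 > /sys/block/sda/queue/nr_requests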
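
Which I/O schedulers are available depends on the kernel, so it is worth reading the scheduler file before writing to it. The file lists the available names with the active one in square brackets. The sketch below parses that format and picks a passthrough scheduler if one is offered; the function names and sample strings are assumptions for this example.

```python
def parse_scheduler(text):
    """Parse e.g. '[mq-deadline] kyber bfq none' into (active, available)."""
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def pick_scheduler(available, preferred=("none", "noop")):
    """First preferred scheduler that the kernel actually offers, else None."""
    for name in preferred:
        if name in available:
            return name
    return None


if __name__ == "__main__":
    for sample in ("[mq-deadline] kyber bfq none", "noop deadline [cfq]"):
        active, available = parse_scheduler(sample)
        print(f"active={active} choice={pick_scheduler(available)}")
```

On a real host the input would be the contents of `/sys/block/<device>/queue/scheduler`.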

Fault Handling and Recovery

Fault Classification

Hardware Faults

Automatic disk failover

Network fault auto-recovery

Memory fault isolation

Software Faults

Process crash auto-restart

Service dependency checks

Configuration file auto-restore

Network Faults

Connection retry mechanisms

Load balancer failover

DNS resolution handling

Recovery Strategy Implementation

#!/bin/bash
# disaster_recovery.sh
# Stop at the first failed step instead of continuing with a partial restore
set -euo pipefail
BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="${BACKUP_DIR}/configs"
DATA_BACKUP="${BACKUP_DIR}/data"

restore_configs() {
    echo "Restoring configuration files..."
    cp -r "$CONFIG_BACKUP"/* /etc/
    systemctl daemon-reload
    systemctl restart nginx
    systemctl restart mysql
    systemctl restart redis
}

restore_data() {
    echo "Restoring data..."
    # -p prompts for the password; for unattended runs, keep credentials in a root-only ~/.my.cnf
    mysql -u root -p < "${DATA_BACKUP}/mysql_backup.sql"
    rsync -av "${DATA_BACKUP}/files/" /var/www/html/
}

health_check() {
    echo "Performing health check..."
    local failed=0
    for svc in nginx mysql redis; do
        systemctl is-active --quiet "$svc" || { echo "$svc is not active"; failed=1; }
    done
    # ss replaces the deprecated netstat
    for port in 80 3306 6379; do
        ss -tuln | grep -q ":${port} " || { echo "Nothing is listening on port $port"; failed=1; }
    done
    return $failed
}

main() {
    echo "Starting disaster recovery process..."
    restore_configs
    restore_data
    health_check
    echo "Disaster recovery completed"
}
main "$@"
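
A listening socket shows only that a process has bound the port. A slightly stronger post-recovery check is to open a TCP connection to each service. The sketch below does that with the standard library; the service-to-port mapping repeats the ports from the script, and the function names are assumptions for this example.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services, host="127.0.0.1"):
    """services: dict of name -> port. Returns dict of name -> reachable."""
    return {name: port_is_open(host, port) for name, port in services.items()}


if __name__ == "__main__":
    results = check_services({"nginx": 80, "mysql": 3306, "redis": 6379})
    for name, ok in results.items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```

A successful connection still does not prove the service answers requests correctly, so an application-level probe (an HTTP request, a `PING` to Redis) would be the next step.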

Best Practices Summary

Monitoring Strategy

Layered Monitoring

Infrastructure layer: hardware, OS, network

Application layer: services, processes, business metrics

User experience layer: response time, availability, error rate

Alert Strategy

Set appropriate thresholds

Implement escalation mechanisms

Configure silencing and suppression

Periodically evaluate alert effectiveness

Automation Principles

Progressive Automation

Start with simple tasks

Gradually extend to complex scenarios

Maintain manual intervention capability

Establish rollback mechanisms

Security Considerations

Principle of least privilege

Audit all operations

Require manual confirmation for critical actions

Regular security assessments
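
Two of the points above, auditing every operation and requiring confirmation for critical actions, can be combined in a small gate in front of the automation. The sketch below is one possible shape under stated assumptions: the function name, the `critical` flag, and the in-memory audit list are all illustrative, and a real system would write the audit trail to durable storage.

```python
def run_action(name, func, audit_log, critical=False, confirmed=False):
    """Run func unless it is critical and has not been confirmed.

    Every call, executed or blocked, is appended to audit_log.
    Returns True if the action ran.
    """
    if critical and not confirmed:
        audit_log.append((name, "blocked: needs manual confirmation"))
        return False
    func()
    audit_log.append((name, "executed"))
    return True


if __name__ == "__main__":
    log = []
    run_action("restart_nginx", lambda: print("restarting nginx"), log)
    run_action("restore_data", lambda: print("restoring data"), log, critical=True)
    run_action("restore_data", lambda: print("restoring data"), log,
               critical=True, confirmed=True)
    for entry in log:
        print(entry)
```

Routine actions pass straight through, while a destructive step such as a data restore waits until an operator sets the confirmation flag.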

Conclusion

Linux system alert and automated response configuration is a core skill for modern operations. By designing proper monitoring metrics, robust alert mechanisms, intelligent automation, and effective recovery strategies, engineers can greatly improve system stability and availability.

Ops engineers should select appropriate tools and policies based on business needs, gradually build a comprehensive automation framework, and foster team collaboration to handle challenges efficiently.
