How to Build a Robust Linux Alert System with Automated Responses
This guide walks ops engineers through designing Linux monitoring metrics, configuring Prometheus alerts, implementing automated response scripts, integrating webhook handlers, visualizing data with Grafana, and applying performance tuning and fault‑recovery best practices to achieve reliable, self‑healing infrastructure.
Linux System Alert and Automated Response Configuration
Introduction
In modern IT operations, system monitoring and automated response are essential for service stability. Linux servers dominate enterprise environments, and their alert mechanisms and automation directly affect business continuity. This article explores configuration methods for Linux alerts and automated responses, providing practical solutions for ops engineers.
Monitoring Metric System
System Core Metrics
CPU Monitoring
CPU usage (overall and per core)
CPU load averages (1m, 5m, 15m)
CPU context switches
CPU interrupt handling count
Memory Monitoring
Memory usage and free memory
Swap usage
Memory fragmentation
Cache and buffer usage
Disk Monitoring
Disk space usage
Disk I/O read/write rates
Disk queue length
Filesystem inode usage
Network Monitoring
Network interface traffic
Number of network connections
Network error packet statistics
Network latency and packet loss
Application Layer Metrics
Process Monitoring
Key process alive status
Process CPU and memory consumption
Process file descriptor usage
Process port listening status
Service Monitoring
Service response time
Service availability checks
Service error rate statistics
Service connection pool status
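Several of the system metrics listed above can be read directly from the kernel's /proc files. The following is a minimal Python sketch, not part of the original tooling: the function names are illustrative, and it assumes the standard text layout of /proc/meminfo and /proc/loadavg.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} mapping (values in kB)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Used memory as a percentage, based on MemTotal and MemAvailable."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return (total - available) * 100.0 / total


def parse_loadavg(text):
    """Return the 1, 5, and 15 minute load averages from /proc/loadavg-style text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)
```

On a Linux host, passing the contents of /proc/meminfo to parse_meminfo and then to memory_used_percent yields the same kind of used-memory percentage that the alert rules in this article derive from node_memory_MemTotal_bytes and node_memory_MemAvailable_bytes.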
Alert System Architecture Design
Monitoring Data Collection Layer
System-level Monitoring Tool
Use node_exporter to collect system metrics:
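# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/
# Create the unprivileged account that User=/Group= in the unit file below refer to
sudo useradd --system --user-group --no-create-home --shell /bin/false prometheus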
# Create systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter
Custom Monitoring Scripts
Create a system health check script:
#!/bin/bash
# system_health_check.sh
CONFIG_FILE="/etc/monitoring/health_check.conf"

# Default thresholds; values set in CONFIG_FILE override them
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10

if [ -f "$CONFIG_FILE" ]; then
    source "$CONFIG_FILE"
fi

# Returns 0 (OK), 1 (WARNING), or 2 (CRITICAL)
check_cpu() {
    # Note: this reads the user-space ("us") figure from top's summary line,
    # not total utilization, and the field layout differs between top versions.
    local cpu_usage
    cpu_usage=$(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')
    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: CPU usage is ${cpu_usage}%"
        return 2
    elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
        # The warning band starts 10 points below the critical threshold
        echo "WARNING: CPU usage is ${cpu_usage}%"
        return 1
    fi
    return 0
}

# Similar functions for memory, disk, and load omitted for brevity

main() {
    local exit_code=0
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] Starting system health check..."
    check_cpu
    local cpu_result=$?
    # Call other checks...
    if [ $cpu_result -eq 2 ]; then
        exit_code=2
    elif [ $cpu_result -eq 1 ]; then
        exit_code=1
    fi
    echo "[$timestamp] Health check completed with exit code: $exit_code"
    exit $exit_code
}

main "$@"
Alert Rule Configuration
Prometheus Alert Rules
# /etc/prometheus/rules/system_alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) * 100 / node_memory_MemTotal_bytes > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 10% on {{ $labels.instance }}"
      - alert: SystemLoadHigh
        expr: node_load1 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load"
          description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"
      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
Automated Response Mechanism
Response Strategy Classification
Preventive Response
Resource pre-allocation
Load balancing adjustment
Cache warming
Connection pool expansion
Corrective Response
Service restart
Process cleanup
Temporary file cleanup
Log rotation
Scalable Response
Auto scaling
Resource migration
Load shedding
Backup activation
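One way to put the three response classes to work is to choose a class from how far a metric has moved past its thresholds. The sketch below is an illustrative assumption, not a rule stated in this article: it treats the warning band as the trigger for preventive actions, the critical band as the trigger for corrective actions, and repeated failed corrections as the trigger for scaling. The names ResponseClass and choose_response are hypothetical.

```python
from enum import Enum


class ResponseClass(Enum):
    NONE = "none"
    PREVENTIVE = "preventive"   # e.g. cache warming, connection pool expansion
    CORRECTIVE = "corrective"   # e.g. service restart, temporary file cleanup
    SCALABLE = "scalable"       # e.g. auto scaling, backup activation


def choose_response(value, warning, critical, failed_corrections=0, max_corrections=3):
    """Pick a response class for a metric value (assumed policy, see lead-in)."""
    if value < warning:
        return ResponseClass.NONE
    if value < critical:
        return ResponseClass.PREVENTIVE
    if failed_corrections < max_corrections:
        return ResponseClass.CORRECTIVE
    # Corrective actions have been tried max_corrections times without success
    return ResponseClass.SCALABLE
```

With a warning level of 70 and a critical level of 80, a reading of 75 maps to a preventive response, while a reading of 90 maps to a corrective one until three corrections have failed.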
Automation Script Implementation
Service Auto-restart Script
#!/bin/bash
# auto_restart_service.sh
SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60
# Placeholder recipient: the address in the source was redacted
ALERT_EMAIL="ops@example.com"

if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service-name>" >&2
    exit 1
fi

check_service_status() {
    systemctl is-active --quiet "$SERVICE_NAME"
}

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

send_notification() {
    local message="$1"
    local severity="$2"
    echo "$message" | mail -s "Service Alert ($severity): $SERVICE_NAME" "$ALERT_EMAIL"
    # A real DingTalk robot URL also carries an access_token query parameter, omitted here
    curl -X POST https://oapi.dingtalk.com/robot/send -H 'Content-Type: application/json' -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"$message\"}}"
}

main() {
    local restart_count=0
    while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
        if check_service_status; then
            log_message "Service $SERVICE_NAME is running normally"
            exit 0
        else
            restart_count=$((restart_count + 1))
            log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"
            systemctl restart "$SERVICE_NAME"
            # Give the service time to come up before checking again
            sleep $RESTART_INTERVAL
            if check_service_status; then
                log_message "Successfully restarted $SERVICE_NAME"
                send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
                exit 0
            fi
        fi
    done
    log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
    send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
    exit 1
}

main "$@"
Disk Space Auto-cleanup Script
#!/bin/bash
# disk_cleanup.sh
CLEANUP_PATHS=("/var/log" "/tmp" "/var/tmp" "/var/cache")
LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7

# Delete plain and rotated log files older than the retention period
cleanup_logs() {
    local log_path="$1"
    find "$log_path" -name "*.log" -type f -mtime +"$LOG_RETENTION_DAYS" -delete
    find "$log_path" -name "*.log.*" -type f -mtime +"$LOG_RETENTION_DAYS" -delete
}

# Delete old temporary files, then any directories left empty
cleanup_temp() {
    local temp_path="$1"
    find "$temp_path" -type f -mtime +"$TEMP_FILE_AGE" -delete
    # -mindepth 1 keeps find from removing the top-level directory itself
    find "$temp_path" -mindepth 1 -type d -empty -delete
}

cleanup_cache() {
    if command -v apt-get &>/dev/null; then
        apt-get clean
    elif command -v yum &>/dev/null; then
        yum clean all
    fi
    # Drops the kernel page cache; this frees memory, not disk space (requires root)
    sync && echo 3 > /proc/sys/vm/drop_caches
}

main() {
    echo "Starting disk cleanup process..."
    for path in "${CLEANUP_PATHS[@]}"; do
        if [ -d "$path" ]; then
            echo "Cleaning up $path..."
            case "$path" in
                "/var/log") cleanup_logs "$path" ;;
                "/tmp"|"/var/tmp") cleanup_temp "$path" ;;
                "/var/cache") cleanup_cache ;;
            esac
        fi
    done
    echo "Disk cleanup completed"
}

main "$@"
Webhook Handler
Python Flask webhook to trigger automation based on Alertmanager alerts:
#!/usr/bin/env python3
# webhook_handler.py
import logging
import subprocess

from flask import Flask, request, jsonify

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)


def handle_high_cpu(alert_data):
    """Handle high CPU usage alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])
    return {"status": "success", "action": "cpu_optimization"}


def handle_high_memory(alert_data):
    """Handle high memory usage alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])
    return {"status": "success", "action": "memory_cleanup"}


def handle_disk_space_low(alert_data):
    """Handle low disk space alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])
    return {"status": "success", "action": "disk_cleanup"}


def handle_service_down(alert_data):
    """Handle service down alert"""
    instance = alert_data.get('labels', {}).get('instance', '')
    # The job label is the Prometheus scrape job; this assumes it matches the systemd unit name
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])
    return {"status": "success", "action": "service_restart"}


# Map alert names to handler functions (defined above)
AUTOMATION_MAPPING = {
    'HighCPUUsage': handle_high_cpu,
    'HighMemoryUsage': handle_high_memory,
    'DiskSpaceLow': handle_disk_space_low,
    'ServiceDown': handle_service_down,
}


@app.route('/webhook', methods=['POST'])
def webhook():
    """Process Alertmanager webhook"""
    try:
        # get_json(silent=True) returns None instead of raising on a missing or invalid body
        data = request.get_json(silent=True) or {}
        responses = []
        for alert in data.get('alerts', []):
            # Alertmanager can also post resolved alerts; act only on firing ones
            if alert.get('status', 'firing') != 'firing':
                continue
            alert_name = alert.get('labels', {}).get('alertname', '')
            handler_func = AUTOMATION_MAPPING.get(alert_name)
            if handler_func is not None:
                responses.append(handler_func(alert))
            else:
                logging.warning(f"No handler found for alert: {alert_name}")
        return jsonify({"responses": responses})
    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500


if __name__ == '__main__':
    # 9093 is also Alertmanager's default port; pick another port if both run on the same host
    app.run(host='0.0.0.0', port=9093)
Monitoring Data Visualization
Grafana Dashboard Configuration
JSON definition for a Linux monitoring dashboard (CPU, memory, disk usage):
{
  "dashboard": {
    "title": "Linux System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [
          { "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          { "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100" }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "stat",
        "targets": [
          { "expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100" }
        ]
      }
    ]
  }
}
Performance Optimization and Tuning
Monitoring Performance Optimization
Data Collection Optimization
Adjust collection intervals to balance accuracy and load
Use data compression to reduce storage
Implement data retention policies
Optimize query performance
Alert Optimization
Set reasonable thresholds to avoid false alarms
Implement alert suppression
Configure alert aggregation rules
Regularly review and adjust alert policies
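Alert suppression can be as simple as refusing to re-send the same alert for the same host within a cooldown window. The following is a minimal sketch of that idea, assuming alerts are identified by an (alert name, instance) pair; the class name AlertSuppressor and its parameters are hypothetical. Alertmanager provides its own grouping, inhibition, and silencing features, which are usually the first choice in a Prometheus setup; this sketch only illustrates the principle for custom notification scripts.

```python
import time


class AlertSuppressor:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown_seconds, clock=time.monotonic):
        self.cooldown = cooldown_seconds
        self.clock = clock          # injectable so the logic can be tested without waiting
        self._last_sent = {}        # (alertname, instance) -> time of last notification

    def should_send(self, alertname, instance):
        key = (alertname, instance)
        now = self.clock()
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False            # still inside the cooldown: suppress
        self._last_sent[key] = now
        return True
```

A notification script would call should_send before sending mail or a chat message, so a flapping HighCPUUsage alert on one host produces at most one message per cooldown period while alerts from other hosts still get through.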
System Resource Optimization
Memory Management
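# memory_optimization.sh
# Run as root; values written under /proc/sys do not persist across reboots
# Clean page cache
sync && echo 1 > /proc/sys/vm/drop_caches
# Adjust swap usage
echo 10 > /proc/sys/vm/swappiness
# Set memory overcommit mode 1 ("always overcommit"); use with care
echo 1 > /proc/sys/vm/overcommit_memory
Disk I/O Optimization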
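# disk_io_optimization.sh
# Set I/O scheduler. Multi-queue kernels (the only block layer since Linux 5.0)
# name the no-op scheduler "none"; older kernels call it "noop".
# List the schedulers available for a device with: cat /sys/block/sda/queue/scheduler
echo none > /sys/block/sda/queue/scheduler
# Optimize filesystem mount options
mount -o remount,noatime,nodiratime /
# Adjust disk queue depth
echo 32 > /sys/block/sda/queue/nr_requests
Fault Handling and Recovery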
Fault Classification
Hardware Faults
Automatic disk failover
Network fault auto-recovery
Memory fault isolation
Software Faults
Process crash auto-restart
Service dependency checks
Configuration file auto-restore
Network Faults
Connection retry mechanisms
Load balancer failover
DNS resolution handling
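A connection retry mechanism is commonly implemented as retry with exponential backoff: wait a little after the first failure, then double the wait after each further failure up to a cap. The sketch below shows one way to write it; the function name retry_with_backoff and its default values are illustrative assumptions, and it retries only on OSError (which includes ConnectionError and timeouts raised by the socket layer).

```python
import time


def retry_with_backoff(operation, attempts=4, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are used up.

    The delay doubles after every failure and is capped at max_delay.
    The last failure is re-raised so the caller can escalate.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            if attempt == attempts:
                raise
            sleep(min(delay, max_delay))
            delay *= 2
```

Passing the sleep function as a parameter keeps the helper easy to test and lets callers substitute a jittered sleep if many clients might otherwise retry in lockstep.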
Recovery Strategy Implementation
#!/bin/bash
# disaster_recovery.sh
BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="${BACKUP_DIR}/configs"
DATA_BACKUP="${BACKUP_DIR}/data"

restore_configs() {
    echo "Restoring configuration files..."
    # -a preserves ownership, permissions, and timestamps of the backed-up files
    cp -a "$CONFIG_BACKUP"/* /etc/
    systemctl daemon-reload
    systemctl restart nginx
    systemctl restart mysql
    systemctl restart redis
}

restore_data() {
    echo "Restoring data..."
    # -p prompts for the password, so this step is interactive as written
    mysql -u root -p < "${DATA_BACKUP}/mysql_backup.sql"
    rsync -av "${DATA_BACKUP}/files/" /var/www/html/
}

health_check() {
    echo "Performing health check..."
    systemctl status nginx
    systemctl status mysql
    systemctl status redis
    # Confirm that the web, MySQL, and Redis ports are listening
    netstat -tuln | grep :80
    netstat -tuln | grep :3306
    netstat -tuln | grep :6379
}

main() {
    echo "Starting disaster recovery process..."
    restore_configs
    restore_data
    health_check
    echo "Disaster recovery completed"
}

main "$@"
Best Practices Summary
Monitoring Strategy
Layered Monitoring
Infrastructure layer: hardware, OS, network
Application layer: services, processes, business metrics
User experience layer: response time, availability, error rate
Alert Strategy
Set appropriate thresholds
Implement escalation mechanisms
Configure silencing and suppression
Periodically evaluate alert effectiveness
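An escalation mechanism can be expressed as a table of time thresholds: the longer an alert stays unacknowledged, the further up the chain it goes. The sketch below is a hypothetical policy for illustration only; the roles and the 15 minute and 1 hour thresholds are assumptions, not values from this article.

```python
# (seconds the alert has been firing, who gets notified) in ascending order
ESCALATION_STEPS = [
    (0, "on-call engineer"),
    (900, "team lead"),
    (3600, "operations manager"),
]


def escalation_target(firing_seconds, acknowledged=False, steps=ESCALATION_STEPS):
    """Return who should be notified now, or None once the alert is acknowledged."""
    if acknowledged:
        return None
    target = None
    for threshold, who in steps:
        if firing_seconds >= threshold:
            target = who   # keep the highest step whose threshold has been reached
    return target
```

Acknowledging the alert stops the escalation, which gives responders a reason to acknowledge promptly and keeps managers from being paged for incidents that are already being handled.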
Automation Principles
Progressive Automation
Start with simple tasks
Gradually extend to complex scenarios
Maintain manual intervention capability
Establish rollback mechanisms
Security Considerations
Principle of least privilege
Audit all operations
Require manual confirmation for critical actions
Regular security assessments
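The security principles above can be enforced in code by putting a small gate in front of every automated action. The sketch below is one possible shape, under stated assumptions: the action names in ACTION_POLICY and the function authorize are hypothetical, the allowlist stands in for least privilege, the logger stands in for an audit trail, and actions marked True require a named human to confirm them.

```python
import logging
from typing import Optional

audit_log = logging.getLogger("automation.audit")

# Hypothetical allowlist: action name -> whether a human must confirm it first
ACTION_POLICY = {
    "restart_service": False,
    "disk_cleanup": False,
    "restore_configs": True,
    "restore_data": True,
}


def authorize(action: str, confirmed_by: Optional[str] = None) -> bool:
    """Decide whether an automated action may run, and record the decision."""
    if action not in ACTION_POLICY:
        audit_log.warning("denied: %s is not on the allowlist", action)
        return False
    if ACTION_POLICY[action] and not confirmed_by:
        audit_log.warning("denied: %s requires manual confirmation", action)
        return False
    audit_log.info("authorized: %s (confirmed_by=%s)", action, confirmed_by)
    return True
```

An automation entry point would call authorize before running any script, so routine actions proceed on their own while restores wait for an operator and every decision leaves a log entry.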
Conclusion
Linux system alert and automated response configuration is a core skill for modern operations. By designing proper monitoring metrics, robust alert mechanisms, intelligent automation, and effective recovery strategies, engineers can greatly improve system stability and availability.
Ops engineers should select appropriate tools and policies based on business needs, gradually build a comprehensive automation framework, and foster team collaboration to handle challenges efficiently.
