Master Enterprise Linux Disk Maintenance: From Monitoring to Recovery
This comprehensive guide walks operations engineers through enterprise‑level Linux disk maintenance, covering health monitoring, SMART implementation, automated scripts, preventive cleanup, fault grading, diagnosis procedures, data recovery techniques, performance tuning, and automation with Ansible and Prometheus, enabling proactive prevention and rapid response to storage issues.
Introduction
In enterprise Linux environments, disk failures are a leading cause of outages; by some estimates, more than 70% of server failures are storage‑related. Operations engineers therefore need an end‑to‑end disk maintenance process that prevents failures where possible and recovers systems quickly when they occur.
This guide covers monitoring, alerting, fault handling, data recovery, performance tuning, and automation.
Chapter 1: Disk Health Monitoring System
1.1 Basic Monitoring Metrics
Key hardware metrics: Disk temperature (<55°C), read/write error rate, reallocated sector count, uncorrectable sector count, disk utilization.
Performance metrics: IOPS, response time, queue depth, bandwidth utilization.
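The hardware metrics above can be read from SMART attribute output. The following is a minimal sketch, assuming an ATA‑style `smartctl -A` table in which the raw value is the tenth column. The function name `check_smart_attrs` is hypothetical, attribute names vary by vendor, and only the 55°C limit comes from the list above; treating any non‑zero sector count as a warning is an assumption.

```shell
#!/bin/bash
# Sketch: flag SMART attributes that cross simple thresholds.
# Reads `smartctl -A` text on stdin and prints one line per finding.
# Assumptions: ATA-style attribute table (raw value in column 10);
# attribute names differ between vendors; sector thresholds are illustrative.
check_smart_attrs() {
    awk -v max_temp=55 '
        $2 == "Temperature_Celsius"   && ($10+0) >= max_temp { print "WARN temperature " ($10+0) "C" }
        $2 == "Reallocated_Sector_Ct" && ($10+0) > 0         { print "WARN reallocated_sectors " ($10+0) }
        $2 == "Offline_Uncorrectable" && ($10+0) > 0         { print "WARN uncorrectable_sectors " ($10+0) }
    '
}
```

A typical call would be `smartctl -A /dev/sda | check_smart_attrs`; empty output means none of the three checks fired.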
1.2 SMART Monitoring Implementation
# Install smartmontools
yum install smartmontools -y
# Check SMART status
smartctl -a /dev/sda
# Enable SMART support on the device
smartctl -s on /dev/sda
# Run short self‑test
smartctl -t short /dev/sda
# View self‑test results
smartctl -l selftest /dev/sda
1.3 Automated Monitoring Script
#!/bin/bash
# disk_monitor.sh - Disk monitoring script
DISK_LIST="/dev/sda /dev/sdb /dev/sdc"
LOG_FILE="/var/log/disk_monitor.log"
ALERT_THRESHOLD=90
ALERT_EMAIL="[email protected]"

# Check usage of each mounted filesystem
# (df on a raw disk device reports the /dev filesystem, not the disk itself)
df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $5, $6}' | while read -r pcent mount; do
    usage=${pcent%\%}
    if [ "$usage" -gt "$ALERT_THRESHOLD" ]; then
        echo "$(date): WARNING - $mount usage is ${usage}%" >> "$LOG_FILE"
        # Send alert email
        echo "Filesystem $mount usage reached ${usage}%" | mail -s "Disk Alert" "$ALERT_EMAIL"
    fi
done

# Check SMART health of each physical disk
for disk in $DISK_LIST; do
    smart_status=$(smartctl -H "$disk" | grep "SMART overall-health")
    if [[ $smart_status != *"PASSED"* ]]; then
        echo "$(date): CRITICAL - $disk SMART check failed" >> "$LOG_FILE"
        echo "Disk $disk SMART check failed, please handle immediately" | mail -s "Disk Critical Alert" "$ALERT_EMAIL"
    fi
done
Chapter 2: Preventive Maintenance Process
2.1 Regular Cleanup Strategy
Log Cleanup
# Clean system logs older than 30 days
find /var/log -name "*.log" -mtime +30 -exec rm {} \;
# Clean temporary files older than 7 days
find /tmp -type f -mtime +7 -exec rm {} \;
# Clean cache files older than 30 days
find /var/cache -type f -mtime +30 -exec rm {} \;
Database Maintenance
# MySQL log table cleanup (applies when log_output=TABLE)
# Log tables do not accept DELETE; TRUNCATE TABLE is the supported way to empty them.
# To keep recent entries, copy them to another table before truncating.
mysql -u root -p <<EOF
TRUNCATE TABLE mysql.slow_log;
TRUNCATE TABLE mysql.general_log;
EOF
2.2 Disk Defragmentation
# For ext4 filesystem
e4defrag /dev/sda1
# Check fragmentation ratio
e4defrag -c /dev/sda1
2.3 Bad Block Detection and Repair
# Detect bad blocks (read‑only)
badblocks -v /dev/sda
# Scan with badblocks and mark bad blocks so the filesystem avoids them (ext2/3/4, unmounted)
fsck -c /dev/sda1
Chapter 3: Fault Handling Process
3.1 Fault Grading and Response
P0 – System Down
Response time: within 5 minutes
Recovery time: restore basic services within 30 minutes
Owner: architect + senior ops
P1 – Service Degradation
Response time: within 15 minutes
Recovery time: resolve within 2 hours
Owner: ops team
P2 – Performance Issue
Response time: within 1 hour
Recovery time: resolve within 24 hours
Owner: on‑call ops
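Alerting and ticketing scripts can carry these targets along with each incident. The lookup below is a minimal sketch that copies the values from the grading above; the function name `fault_sla` and the `key=value` output format are hypothetical.

```shell
#!/bin/bash
# Sketch: look up response and recovery targets for a fault grade.
# Values mirror the P0/P1/P2 grading; name and output format are illustrative.
fault_sla() {
    case "$1" in
        P0) echo "respond=5m recover=30m owner=architect+senior-ops" ;;
        P1) echo "respond=15m recover=2h owner=ops-team" ;;
        P2) echo "respond=1h recover=24h owner=on-call-ops" ;;
        *)  echo "unknown grade: $1" >&2; return 1 ;;
    esac
}
```

For example, `fault_sla P0` prints the five‑minute response target, and an unrecognized grade returns a non‑zero status so a calling script can stop.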
3.2 Disk Fault Diagnosis Procedure
# Step 1: Quick diagnosis
dmesg | grep -i "error\|fail\|bad"
grep -i "disk\|sda" /var/log/messages
# Step 2: Detailed check
iostat -x 1 5
iotop -o -d 1
# Step 3: Hardware check
smartctl -a /dev/sda
hdparm -I /dev/sda
3.3 Emergency Response Plan
Urgent Data Backup
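# Create disk image (continue past read errors, padding unreadable blocks with zeros)
dd if=/dev/sda of=/backup/sda_backup.img bs=1M conv=noerror,sync status=progress
Service Degradation Handling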
# Stop non‑critical services
systemctl stop httpd
systemctl stop mysqld
# Switch to read‑only mode
mount -o remount,ro /data
Chapter 4: Data Recovery Techniques
4.1 Filesystem Recovery
ext4 Recovery
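# Check with e2fsck (unmount the filesystem first; -f forces a check even if it is marked clean)
e2fsck -f -v /dev/sda1
# Repair non-interactively (-y answers yes to every prompt)
e2fsck -f -y /dev/sda1
XFS Recovery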
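# Check XFS without modifying it (xfs_check is deprecated; xfs_repair -n replaces it)
xfs_repair -n /dev/sda1
# Repair XFS (filesystem must be unmounted)
xfs_repair /dev/sda1
4.2 Data Recovery Tools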
TestDisk Partition Recovery
# Install TestDisk
yum install testdisk -y
# Run TestDisk
testdisk /dev/sda
PhotoRec File Recovery
# Recover deleted files
photorec /dev/sda
4.3 LVM Snapshot Recovery
# Create LVM snapshot
lvcreate -L 10G -s -n backup_snap /dev/vg0/lv_data
# Merge snapshot to roll the origin volume back to the snapshot state
lvconvert --merge /dev/vg0/backup_snap
Chapter 5: Performance Optimization Strategies
5.1 I/O Scheduler Tuning
# View current scheduler
cat /sys/block/sda/queue/scheduler
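# Set scheduler (example; pick a name that the command above lists)
echo noop > /sys/block/sda/queue/scheduler
# Scheduler recommendations
# SSD: noop or deadline
# HDD: cfq or bfq
# Database: deadline
# Kernels 5.0+ offer only multi-queue schedulers: none, mq-deadline, bfq, kyber
5.2 Disk Parameter Optimization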
# Set read‑ahead
blockdev --setra 4096 /dev/sda
# Set request queue size
echo 32 > /sys/block/sda/queue/nr_requests
# Disable Advanced Power Management (if the drive supports it)
hdparm -B 255 /dev/sda
5.3 Filesystem Tuning
# ext4 tuning (writeback journaling is faster but can expose stale data after a crash)
tune2fs -o journal_data_writeback /dev/sda1
# XFS tuning
mount -o noatime,nodiratime,largeio,inode64 /dev/sda1 /data
Chapter 6: Automation Tools
6.1 Ansible Disk Maintenance Playbook
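# disk_maintenance.yml
- name: Disk maintenance tasks
  hosts: all
  tasks:
    - name: Check disk usage
      shell: df -h | grep -v tmpfs
      register: disk_usage
      changed_when: false

    # Clean up only when some filesystem reports 90% usage or more
    - name: Run disk cleanup
      shell: |
        find /var/log -name "*.log" -mtime +30 -delete
        find /tmp -type f -mtime +7 -delete
      when: disk_usage.stdout is search('(9[0-9]|100)%')

    # smartctl exits non-zero on unhealthy disks, so do not fail the play here
    - name: Check SMART status
      shell: smartctl -H {{ item }}
      loop:
        - /dev/sda
        - /dev/sdb
      register: smart_status
      changed_when: false
      failed_when: false

    # A looped register stores per-disk output under .results
    - name: Send alert
      community.general.mail:
        to: "[email protected]"
        subject: "Disk status alert"
        body: "{{ item.item }}: {{ item.stdout }}"
      loop: "{{ smart_status.results }}"
      when: "'PASSED' not in item.stdout"
6.2 Monitoring and Alert Integration (Prometheus)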
# prometheus.yml snippet
rule_files:
  - disk_alerts.yml
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']
# disk_alerts.yml
groups:
  - name: disk.rules
    rules:
      - alert: DiskSpaceUsage
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage is above 90%"
Chapter 7: Best Practices and Experience Sharing
7.1 Enterprise Deployment Recommendations
RAID Strategy
System disk: RAID1 (mirroring)
Data disks: RAID10 (performance + redundancy)
Log disk: RAID5 (cost‑balanced)
Backup Strategy
3‑2‑1 backup rule
Regular backup verification
Off‑site disaster recovery
Monitoring & Alerting
Multi‑level alert mechanism
Automated handling
24/7 monitoring
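Regular backup verification can start with a checksum recorded when the backup is written and rechecked on a schedule. The following is a minimal sketch, assuming GNU coreutils `sha256sum`; the function names and the `.sha256` sidecar file convention are hypothetical. A matching checksum only shows that the file has not changed since it was written, so periodic test restores are still needed.

```shell
#!/bin/bash
# Sketch: record a checksum when a backup is written, verify it later.
# Assumes GNU coreutils sha256sum; the .sha256 sidecar naming is illustrative.
record_checksum() {
    # Store the digest next to the backup file, using a relative file name
    # so the pair can be moved together.
    ( cd "$(dirname "$1")" && sha256sum "$(basename "$1")" > "$(basename "$1").sha256" )
}

verify_backup() {
    # Returns 0 if the file still matches its recorded digest, non-zero otherwise.
    ( cd "$(dirname "$1")" && sha256sum --check --quiet "$(basename "$1").sha256" )
}
```

For instance, `record_checksum /backup/sda_backup.img` could run right after an image is created, with `verify_backup /backup/sda_backup.img` scheduled from cron and its exit status feeding the alerting pipeline.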
MaGe Linux Operations
