
Master Enterprise Linux Disk Maintenance: From Monitoring to Recovery

This comprehensive guide walks operations engineers through enterprise‑level Linux disk maintenance, covering health monitoring, SMART implementation, automated scripts, preventive cleanup, fault grading, diagnosis procedures, data recovery techniques, performance tuning, and automation with Ansible and Prometheus, enabling proactive prevention and rapid response to storage issues.

MaGe Linux Operations

Jul 18, 2025

Introduction

In enterprise Linux environments, disk failures are a leading cause of outages; by this guide's estimate, over 70% of server failures are storage-related. Operations engineers therefore need an end-to-end disk maintenance process that both prevents failures and recovers systems when they happen.

This guide covers health monitoring and alerting, preventive maintenance, fault handling, data recovery, performance tuning, and automation.

Chapter 1: Disk Health Monitoring System

1.1 Basic Monitoring Metrics

Key hardware metrics: Disk temperature (<55°C), read/write error rate, reallocated sector count, uncorrectable sector count, disk utilization.

Performance metrics: IOPS, response time, queue depth, bandwidth utilization.
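As a rough illustration of where an IOPS figure comes from, the sketch below sums completed reads and writes for one device from /proc/diskstats-formatted text and turns two samples into an average rate. The function names disk_ops_total and iops_between are made up for this example; the field positions follow the kernel's diskstats layout (completed reads are the fourth column, completed writes the eighth).

```shell
# Sum completed reads and completed writes for one device.
# Reads /proc/diskstats-formatted text on stdin; prints nothing if the device is absent.
disk_ops_total() {
    local dev=$1 major minor name reads rmerged rsect rms writes rest
    while read -r major minor name reads rmerged rsect rms writes rest; do
        if [ "$name" = "$dev" ]; then
            echo $(( reads + writes ))
        fi
    done
}

# Average IOPS between two cumulative counters taken interval_s seconds apart.
iops_between() {
    local first=$1 second=$2 interval_s=$3
    echo $(( (second - first) / interval_s ))
}

# Example use on a live system (device name and interval are illustrative):
#   a=$(disk_ops_total sda < /proc/diskstats); sleep 5
#   b=$(disk_ops_total sda < /proc/diskstats)
#   iops_between "$a" "$b" 5
```

Tools such as iostat report the same counters with more detail; the point here is only to show how the metric is derived.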

1.2 SMART Monitoring Implementation

# Install smartmontools
yum install smartmontools -y

# Check SMART status
smartctl -a /dev/sda

# Enable SMART support on the drive
smartctl -s on /dev/sda

# Run short self‑test
smartctl -t short /dev/sda

# View self‑test results
smartctl -l selftest /dev/sda
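To watch a single indicator such as the reallocated sector count, the attribute table printed by smartctl -A can be filtered. The sketch below assumes the ATA attribute-table layout, where the raw value is the tenth column; NVMe and SAS drives print a different format, so treat this as illustrative. The function name smart_attr_raw is hypothetical.

```shell
# Print the RAW_VALUE column (10th field) of one attribute from
# `smartctl -A` output supplied on stdin (ATA attribute-table layout).
smart_attr_raw() {
    local attr=$1
    awk -v attr="$attr" '$2 == attr { print $10 }'
}

# Example use on a live system:
#   smartctl -A /dev/sda | smart_attr_raw Reallocated_Sector_Ct
```

A raw value that keeps growing between checks is usually a stronger warning sign than any single reading.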

1.3 Automated Monitoring Script

#!/bin/bash
# disk_monitor.sh - Disk monitoring script
DISK_LIST="/dev/sda /dev/sdb /dev/sdc"
LOG_FILE="/var/log/disk_monitor.log"
ALERT_THRESHOLD=90
# Placeholder recipient; the address in the source page was obscured
ALERT_EMAIL="ops@example.com"

# Check usage of each mounted local filesystem.
# (Running df on a whole-disk device such as /dev/sda reports devtmpfs, not the disk's data.)
df -P -l -x tmpfs -x devtmpfs | awk 'NR > 1 {print $5, $6}' | while read -r pct mount; do
    usage=${pct%\%}
    if [ "$usage" -gt "$ALERT_THRESHOLD" ]; then
        echo "$(date): WARNING - $mount usage is ${usage}%" >> "$LOG_FILE"
        # Send alert email
        echo "Filesystem $mount usage reached ${usage}%" | mail -s "Disk Alert" "$ALERT_EMAIL"
    fi
done

for disk in $DISK_LIST; do
    # Check SMART status
    smart_status=$(smartctl -H "$disk" | grep "SMART overall-health")
    if [[ $smart_status != *"PASSED"* ]]; then
        echo "$(date): CRITICAL - $disk SMART check failed" >> "$LOG_FILE"
        echo "Disk $disk SMART check failed, please handle immediately" | mail -s "Disk Critical Alert" "$ALERT_EMAIL"
    fi
done

Chapter 2: Preventive Maintenance Process

2.1 Regular Cleanup Strategy

Log Cleanup

# Clean system logs not modified in more than 30 days
find /var/log -type f -name "*.log" -mtime +30 -delete

# Clean temporary files not modified in more than 7 days
find /tmp -type f -mtime +7 -delete

# Clean cache files not modified in more than 30 days
find /var/cache -type f -mtime +30 -delete
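Because these commands delete files unattended, it can help to review what would be removed first. The following is a minimal dry-run sketch; list_stale_files is a hypothetical helper name, and it only prints matches.

```shell
# List regular files under a directory that were last modified more than
# N days ago, without deleting anything. -xdev keeps find on one filesystem.
list_stale_files() {
    local dir=$1 days=$2
    find "$dir" -xdev -type f -mtime +"$days" -print
}

# Example: review the list before running the real cleanup
#   list_stale_files /var/log 30
```

Running the listing before the first scheduled cleanup is a cheap way to catch a pattern that matches more than intended.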

Database Maintenance

# MySQL log table cleanup (relevant only when log_output includes TABLE)
# MySQL does not permit DELETE on its log tables, so rows cannot be trimmed by age;
# TRUNCATE TABLE is the supported way to empty them.
mysql -u root -p <<EOF
TRUNCATE TABLE mysql.slow_log;
TRUNCATE TABLE mysql.general_log;
EOF

2.2 Disk Defragmentation

# Check the fragmentation score first (ext4)
e4defrag -c /dev/sda1

# Defragment only if the score calls for it; ext4 rarely needs this, and it adds wear on SSDs
e4defrag /dev/sda1

2.3 Bad Block Detection and Repair

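# Detect bad blocks (read-only scan)
badblocks -v /dev/sda

# Scan with badblocks and record the results so the filesystem avoids those blocks
# (ext2/3/4 only; unmount the filesystem first)
e2fsck -c /dev/sda1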

Chapter 3: Fault Handling Process

3.1 Fault Grading and Response

P0 – System Down

Response time: within 5 minutes

Recovery time: restore basic services within 30 minutes

Owner: architect + senior ops

P1 – Service Degradation

Response time: within 15 minutes

Recovery time: resolve within 2 hours

Owner: ops team

P2 – Performance Issue

Response time: within 1 hour

Recovery time: resolve within 24 hours

Owner: on‑call ops
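If alerting scripts need to attach these targets to a notification, the grading table can be encoded as a small lookup. This is only a sketch of that idea; fault_sla is a hypothetical function name, and the values are copied from the grades listed above.

```shell
# Print the response and recovery targets for a fault grade.
fault_sla() {
    case "$1" in
        P0) echo "respond within 5 minutes; restore basic services within 30 minutes; owner: architect + senior ops" ;;
        P1) echo "respond within 15 minutes; resolve within 2 hours; owner: ops team" ;;
        P2) echo "respond within 1 hour; resolve within 24 hours; owner: on-call ops" ;;
        *)  echo "unknown grade: $1" >&2; return 1 ;;
    esac
}

# Example: fault_sla P0
```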

3.2 Disk Fault Diagnosis Procedure

# Step 1: Quick diagnosis
dmesg | grep -i "error\|fail\|bad"
grep -i "disk\|sda" /var/log/messages

# Step 2: Detailed check
iostat -x 1 5
iotop -o -d 1

# Step 3: Hardware check
smartctl -a /dev/sda
hdparm -I /dev/sda
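For the quick-diagnosis step, a per-device tally of kernel I/O errors is often easier to read than raw grep output. The sketch below assumes messages of the form "I/O error, dev sda, sector ...", which is how many kernels log block-layer errors; the exact wording varies between kernel versions. io_error_devices is a hypothetical name.

```shell
# Count kernel "I/O error" lines per device in log text read from stdin.
# Prints "<device> <count>" per line, sorted by device name.
io_error_devices() {
    grep -i 'I/O error' \
        | grep -oE 'dev [[:alnum:]]+' \
        | awk '{ print $2 }' \
        | sort \
        | uniq -c \
        | awk '{ print $2, $1 }'
}

# Example use: dmesg | io_error_devices
```

A device that dominates the tally is the natural first candidate for the hardware check in Step 3.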

3.3 Emergency Response Plan

Urgent Data Backup

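# Create disk image (write it to a different physical disk)
# conv=noerror,sync continues past read errors and pads unreadable blocks with zeros;
# GNU ddrescue, if available, copes better with a failing disk
dd if=/dev/sda of=/backup/sda_backup.img bs=1M conv=noerror,sync status=progress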

Service Degradation Handling

# Stop non‑critical services
systemctl stop httpd
systemctl stop mysqld

# Switch to read‑only mode
mount -o remount,ro /data

Chapter 4: Data Recovery Techniques

4.1 Filesystem Recovery

ext4 Recovery

# Check and repair interactively (unmount the filesystem first)
e2fsck -f -v /dev/sda1

# Repair non-interactively, answering "yes" to every prompt
e2fsck -f -y /dev/sda1

XFS Recovery

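# Check XFS without modifying it (xfs_check has been removed from current xfsprogs)
xfs_repair -n /dev/sda1

# Repair XFS (unmount the filesystem first)
xfs_repair /dev/sda1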

4.2 Data Recovery Tools

TestDisk Partition Recovery

# Install TestDisk
yum install testdisk -y

# Run TestDisk
testdisk /dev/sda

PhotoRec File Recovery

# Recover deleted files
photorec /dev/sda

4.3 LVM Snapshot Recovery

# Create LVM snapshot
lvcreate -L 10G -s -n backup_snap /dev/vg0/lv_data

# Merge snapshot to restore
lvconvert --merge /dev/vg0/backup_snap
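A classic LVM snapshot becomes invalid if its copy-on-write space fills up, so it is worth checking the fill level while a snapshot exists. The sketch below filters lvs output for snapshots at or above a threshold; snapshots_over is a hypothetical name, and the 80% threshold in the example is an arbitrary choice.

```shell
# Read "<lv_name> <data_percent>" lines on stdin and print the names of
# volumes whose fill level is at or above the given threshold.
# Lines without a percentage (ordinary volumes) are skipped.
snapshots_over() {
    local threshold=$1
    awk -v t="$threshold" 'NF >= 2 && $2 + 0 >= t + 0 { print $1 }'
}

# Example use:
#   lvs --noheadings -o lv_name,data_percent vg0 | snapshots_over 80
```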

Chapter 5: Performance Optimization Strategies

5.1 I/O Scheduler Tuning

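# View current scheduler (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler

# Set scheduler (example)
# Legacy single-queue kernels (e.g. RHEL/CentOS 7): noop, deadline, cfq
# Multi-queue kernels (Linux 5.0+, RHEL 8+): none, mq-deadline, bfq, kyber
echo noop > /sys/block/sda/queue/scheduler    # legacy kernels
echo none > /sys/block/sda/queue/scheduler    # multi-queue kernels

# Scheduler recommendations
# SSD: noop or deadline (none or mq-deadline on multi-queue kernels)
# HDD: cfq or bfq (bfq or mq-deadline on multi-queue kernels)
# Database: deadline (mq-deadline on multi-queue kernels)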
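When auditing many hosts it helps to extract just the active scheduler, which the kernel marks with square brackets in the scheduler file. A minimal sketch, with active_scheduler as a hypothetical helper name:

```shell
# Extract the active scheduler (the bracketed name) from the contents of
# /sys/block/<dev>/queue/scheduler, read from stdin.
active_scheduler() {
    sed -n 's/.*\[\([^]]*\)].*/\1/p'
}

# Example use: active_scheduler < /sys/block/sda/queue/scheduler
```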

5.2 Disk Parameter Optimization

# Set read‑ahead
blockdev --setra 4096 /dev/sda

# Set the block-layer request queue size
# (the device's own queue depth is /sys/block/sda/device/queue_depth)
echo 32 > /sys/block/sda/queue/nr_requests

# Disable power‑saving mode
hdparm -B 255 /dev/sda
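The read-ahead value passed to blockdev --setra is counted in 512-byte sectors, while /sys/block/<dev>/queue/read_ahead_kb reports KiB, which is easy to mix up. The conversion can be sketched as follows (ra_sectors_to_kib is a hypothetical name):

```shell
# Convert a read-ahead value in 512-byte sectors (the unit used by
# `blockdev --setra`) to KiB (the unit of read_ahead_kb in sysfs).
ra_sectors_to_kib() {
    echo $(( $1 * 512 / 1024 ))
}

# Example: ra_sectors_to_kib 4096
```

So the 4096 used above corresponds to 2048 KiB, that is, 2 MiB of read-ahead.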

5.3 Filesystem Tuning

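# ext4 tuning: make data=writeback the default journaling mode
# (faster, but files written shortly before a crash may contain stale data)
tune2fs -o journal_data_writeback /dev/sda1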

# XFS tuning
mount -o noatime,nodiratime,largeio,inode64 /dev/sda1 /data

Chapter 6: Automation Tools

6.1 Ansible Disk Maintenance Playbook

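# disk_maintenance.yml
- name: Disk maintenance tasks
  hosts: all
  become: true
  vars:
    # Placeholder recipient; the address in the source page was obscured
    alert_email: "ops@example.com"
  tasks:
    - name: Check highest filesystem usage
      shell: |
        df -P -x tmpfs -x devtmpfs | awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > max) max = $5 + 0 } END { print max + 0 }'
      register: disk_usage
      changed_when: false

    - name: Run disk cleanup
      shell: |
        find /var/log -type f -name "*.log" -mtime +30 -delete
        find /tmp -type f -mtime +7 -delete
      when: disk_usage.stdout | int >= 90

    - name: Check SMART status
      command: smartctl -H {{ item }}
      loop:
        - /dev/sda
        - /dev/sdb
      register: smart_status
      changed_when: false
      # smartctl exits non-zero for unhealthy disks; keep going so the alert can fire
      failed_when: false

    - name: Send alert
      mail:
        to: "{{ alert_email }}"
        subject: "Disk status alert"
        body: "{{ item.item }}: {{ item.stdout }}"
      # A looped task registers its per-item output under .results
      loop: "{{ smart_status.results }}"
      when: "'PASSED' not in item.stdout"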

6.2 Monitoring and Alert Integration (Prometheus)

# prometheus.yml snippet
scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

# disk_alerts.yml
groups:
  - name: disk.rules
    rules:
      - alert: DiskSpaceUsage
        expr: (1 - node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 > 90
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "Disk usage is above 90%"
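The alert expression is plain arithmetic: the used fraction is one minus available over total, scaled to a percentage. The sketch below reproduces it outside Prometheus so that a threshold can be sanity-checked by hand; fs_used_percent is a hypothetical helper, and the /data mount point in the example is illustrative.

```shell
# Compute (1 - avail / size) * 100, the same formula as the alert rule,
# from byte counts given as arguments. Prints one decimal place.
fs_used_percent() {
    local avail=$1 size=$2
    awk -v a="$avail" -v s="$size" 'BEGIN { printf "%.1f\n", (1 - a / s) * 100 }'
}

# Example use with GNU df (byte units):
#   read -r avail size < <(df -B1 --output=avail,size /data | tail -1)
#   fs_used_percent "$avail" "$size"
```

Since the formula is based on space available to unprivileged users, the result can read slightly higher than a figure computed from raw free space on filesystems that reserve blocks for root.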

Chapter 7: Best Practices and Experience Sharing

7.1 Enterprise Deployment Recommendations

RAID Strategy

System disk: RAID1 (mirroring)

Data disks: RAID10 (performance + redundancy)

Log disk: RAID5 (cost‑balanced)

Backup Strategy

3‑2‑1 backup rule

Regular backup verification

Off‑site disaster recovery
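Regular backup verification can start with something as simple as recording a checksum when the backup is written and re-checking it later. This is a minimal sketch using sha256sum from GNU coreutils; the function names and the .sha256 suffix are assumptions for illustration, and a checksum match does not replace a periodic test restore.

```shell
# Record a SHA-256 checksum next to a backup file.
record_checksum() {
    local file=$1
    sha256sum "$file" > "$file.sha256"
}

# Re-check the file against its recorded checksum.
# Exits non-zero if the content has changed or the file is missing.
verify_checksum() {
    local file=$1
    sha256sum --check --status "$file.sha256"
}

# Example use (absolute paths keep the recorded file name valid from any directory):
#   record_checksum /backup/sda_backup.img
#   verify_checksum /backup/sda_backup.img || echo "backup verification failed"
```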

Monitoring & Alerting

Multi‑level alert mechanism

Automated handling

24/7 monitoring
