
How to Diagnose and Fix Full Disk Issues on Linux Servers – A Proven Checklist

This guide walks you through a complete Linux disk‑space troubleshooting workflow, from quickly checking usage and inode status, to locating large files with du, ncdu or dust, handling deleted‑but‑still‑open files, cleaning logs, Docker images, temporary data, adjusting reserved space, and setting up monitoring and alerts.


Overview

Disk‑space exhaustion is one of the most common production incidents on Linux servers. It accounts for roughly 15‑20% of all alerts and can cause cascading failures when databases, applications, or caches cannot write data.

1.1 Background

Typical causes include continuously growing log files, core‑dump files from memory leaks, uncleaned temporary files, accumulated Docker images, and runaway database binlog or WAL files.

1.2 Technical Characteristics

Storage vs. inode – A filesystem has two independent limits: total block space and the number of inodes. Even with free blocks, a full inode table prevents new files from being created.

Deleted but still open files – Files removed from the directory tree remain on disk as long as a process holds an open file descriptor.

Reserved space – ext4 reserves 5 % of blocks for the root user; this can be reduced in emergencies.

1.3 Applicable Scenarios

Production alerts for high disk usage

Applications reporting “No space left on device”

Suspicious I/O latency

Regular health‑check inspections

Planning disk‑cleanup policies

1.4 Environment Requirements

OS: Ubuntu 22.04 LTS / CentOS 7.9 / Rocky Linux 9
Filesystem: ext4 / xfs
Kernel: 5.15+
Container runtime: Docker 24.x / containerd 1.7.x
Monitoring: Prometheus 2.47+ / Grafana 10.x

Step‑by‑Step Procedure

2.1 Quick Disk Status

Run df -h to view block usage and df -i to view inode usage.

df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1       100G   95G   5G   95% /
/dev/sdc1       200G  200G   0   100% /var/log
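This check can be scripted for alerting. The sketch below parses df-style output with awk and flags any filesystem above a usage threshold; the sample output is inlined for illustration, and in practice you would pipe `df -P` straight into the awk:

```shell
# Flag filesystems above a usage threshold (sample df -P output inlined).
df_sample='Filesystem Size Used Avail Use% Mounted
/dev/sda1 100G 95G 5G 95% /
/dev/sdb1 200G 50G 150G 25% /data'
over=$(echo "$df_sample" | awk -v t=90 'NR>1 { gsub(/%/,"",$5); if ($5+0 > t) print $6, $5"%" }')
echo "$over"   # → / 95%
```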

2.2 Locate Large Files and Directories

Three tools are recommended:

du – basic, slow, no UI.

ncdu – interactive ncurses UI, can delete files.

dust – Rust‑based, fast, tree‑style output.

Example du usage:

# Top‑level directories
du -sh /* 2>/dev/null | sort -rh | head -20
# Drill into /var
du -sh /var/* 2>/dev/null | sort -rh | head -10

Example ncdu usage:

ncdu /var/log

Example dust usage:

dust -d 2 /var
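du aggregates by directory; to flag individual oversized files, find with a size predicate is the usual complement. The sandbox below uses temporary paths and a scaled-down cutoff to demonstrate the pattern; on a real host the equivalent would be something like find / -xdev -type f -size +500M:

```shell
# Sandbox demo: flag files over a size cutoff with find (paths are temporary).
tmp=$(mktemp -d)
dd if=/dev/zero of="$tmp/big.bin" bs=1M count=2 2>/dev/null   # 2 MiB file
: > "$tmp/small.txt"                                          # empty file
big_files=$(find "$tmp" -type f -size +1M)                    # only big.bin matches
echo "$big_files"
rm -rf "$tmp"
```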

2.3 Check Deleted but Still Open Files

Use lsof +L1 to list such files. Because a deleted file no longer has a path, either restart the owning process or truncate the file through its /proc file descriptor, using the PID and FD columns from the lsof output:

# Restart the owning process (recommended)
systemctl restart app
# Truncate without a restart – <PID> and <FD> come from lsof +L1
: > /proc/<PID>/fd/<FD>
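The mechanism is easy to reproduce in a shell (Linux-specific, since it relies on /proc): open a file, unlink it, and note that the data stays readable, and reclaimable, through the holding process's file-descriptor link:

```shell
# Reproduce a deleted-but-still-open file with the shell's own fd 3.
f=$(mktemp)
exec 3>"$f"                     # hold the file open on fd 3
echo "still allocated" >&3
rm "$f"                         # unlinked – but the inode lives on
content=$(cat /proc/$$/fd/3)    # data is still readable via /proc
: > /proc/$$/fd/3               # truncating through /proc releases the space
remaining=$(cat /proc/$$/fd/3)
exec 3>&-                       # close fd 3; the inode is now fully freed
echo "before: $content, after: [$remaining]"
```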

2.4 Handle Inode Exhaustion

When df -i shows >90 % inode usage, count files per directory to find the culprit:

for dir in /*; do echo -n "$dir: "; find "$dir" -xdev -type f | wc -l; done | sort -rn | head -10

Common inode killers: mail queues, PHP session files, cache directories, Docker layer files. Clean them with appropriate commands (e.g., postsuper -d ALL for Postfix, find /var/lib/php/sessions -mtime +7 -delete for PHP sessions).
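When deleting by age, preview with -print before trusting -delete. The sandbox below backdates a file with GNU touch -d to show the -mtime predicate selecting only stale files; paths are temporary, and the PHP session directory above is the real-world analogue:

```shell
# Age-based cleanup demo: -print first, -delete once the selection looks right.
tmp=$(mktemp -d)
touch -d '10 days ago' "$tmp/stale.sess"   # backdated file (GNU touch)
touch "$tmp/fresh.sess"                    # current file
stale=$(find "$tmp" -type f -mtime +7)     # selects only the stale file
find "$tmp" -type f -mtime +7 -delete
remaining_files=$(ls "$tmp")
echo "matched: $stale, left: $remaining_files"
rm -rf "$tmp"
```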

2.5 Release Reserved Space (Emergency)

Check current reservation and reduce it if needed:

# Show reservation
tune2fs -l /dev/sda1 | grep "Reserved block count"
# Reduce to 1 %
sudo tune2fs -m 1 /dev/sda1
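To gauge what this buys you: the reservation is a percentage of total blocks, so on a 100 GiB partition dropping it from 5% to 1% returns about 4 GiB:

```shell
# Space returned by lowering the ext4 reservation from 5% to 1%.
size_gib=100
freed_gib=$(( size_gib * (5 - 1) / 100 ))
echo "${freed_gib} GiB"   # → 4 GiB
```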

Example Scripts and Configurations

3.1 Disk‑Cleanup Script (bash)

#!/bin/bash
set -euo pipefail
LOG_RETAIN_DAYS=30
TMP_RETAIN_DAYS=7
DOCKER_IMAGE_AGE="720h"
MIN_FREE_PERCENT=10
DRY_RUN=${DRY_RUN:-false}

log_info(){ echo -e "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1"; }
log_warn(){ echo -e "[WARN] $(date '+%Y-%m-%d %H:%M:%S') $1"; }
log_error(){ echo -e "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1"; }

get_free_space(){ df -BG "${1:-/}" | awk 'NR==2{gsub(/G/,"",$4); print $4}'; }  # defaults to / when no mount point is given

show_cleanup_result(){ local before=$1 after=$2 item=$3; local freed=$((after-before)); if [ $freed -gt 0 ]; then log_info "$item: freed ${freed}GB"; else log_info "$item: no space reclaimed"; fi; }

cleanup_system_logs(){ log_info "Cleaning system logs..."; local before=$(get_free_space); if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] would delete logs older than $LOG_RETAIN_DAYS days"; else
  find /var/log -type f -name "*.log" -mtime +$LOG_RETAIN_DAYS -delete
  find /var/log -type f -name "*.gz" -mtime +$LOG_RETAIN_DAYS -delete
  command -v journalctl >/dev/null && journalctl --vacuum-time=${LOG_RETAIN_DAYS}d
fi
local after=$(get_free_space); show_cleanup_result $before $after "System logs"; }

cleanup_temp_files(){ log_info "Cleaning temporary files..."; local before=$(get_free_space); if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] would delete /tmp files older than $TMP_RETAIN_DAYS days"; else
  find /tmp -type f -atime +$TMP_RETAIN_DAYS -delete
  find /var/tmp -type f -atime +$TMP_RETAIN_DAYS -delete
  find /tmp -type d -empty -delete
  find /var/tmp -type d -empty -delete
fi
local after=$(get_free_space); show_cleanup_result $before $after "Temp files"; }

cleanup_package_cache(){ log_info "Cleaning package manager caches..."; local before=$(get_free_space);
if [ "$DRY_RUN" = "true" ]; then log_info "[DRY-RUN] would run apt/yum/dnf clean"; else
  command -v apt >/dev/null && apt clean && apt autoclean
  command -v yum >/dev/null && yum clean all
  command -v dnf >/dev/null && dnf clean all
fi
local after=$(get_free_space); show_cleanup_result $before $after "Package cache"; }

cleanup_docker(){ command -v docker >/dev/null || { log_info "Docker not installed, skipping"; return; }
log_info "Cleaning Docker resources..."; local before=$(get_free_space);
if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] docker system df"; else
  docker container prune -f
  docker image prune -a -f --filter "until=$DOCKER_IMAGE_AGE"
  docker builder prune -f
  # volume prune is deliberately omitted: "unused" volumes may still hold data
fi
local after=$(get_free_space); show_cleanup_result $before $after "Docker"; }

check_deleted_files(){ log_info "Checking for deleted but still allocated files..."; local size_gb=$(lsof +L1 2>/dev/null | awk '{sum+=$7} END {print int(sum/1024/1024/1024)}'); if [ "${size_gb:-0}" -gt 0 ]; then log_warn "Found $size_gb GB of deleted files:"; lsof +L1 2>/dev/null | awk '$7>104857600{print $1,$2,$7/1024/1024/1024"GB",$NF}'; else log_info "No such files found"; fi; }

main(){ log_info "========== Disk cleanup start =========="; log_info "Current usage: $(df -h | awk 'NR==2{print $5}')"; log_info "Free space: $(get_free_space)GB";
if [ "$DRY_RUN" = "true" ]; then log_warn "Running in DRY‑RUN mode – no files will be deleted"; fi;
cleanup_system_logs
cleanup_temp_files
cleanup_package_cache
cleanup_docker
check_deleted_files
log_info "========== Disk cleanup complete =========="; log_info "Post‑cleanup usage: $(df -h | awk 'NR==2{print $5}')"; log_info "Free space: $(get_free_space)GB"; }

show_help(){ echo "Usage: $0 [options]"; echo "Options:"; echo "  -d, --dry-run   Simulate actions without deleting"; echo "  -h, --help      Show this help"; }

while [[ $# -gt 0 ]]; do case $1 in -d|--dry-run) DRY_RUN=true; shift;; -h|--help) show_help; exit 0;; *) log_error "Unknown option: $1"; show_help; exit 1;; esac; done
if [ $EUID -ne 0 ]; then log_error "Please run as root"; exit 1; fi
main
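To run the script unattended, schedule it with cron or a systemd timer; the entry below is illustrative, and both the script path and the log location are assumptions:

```
# /etc/cron.d/disk-cleanup – nightly run at 02:30 (illustrative paths)
30 2 * * * root /usr/local/sbin/disk_cleanup.sh >> /var/log/disk_cleanup.log 2>&1
```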

3.2 Logrotate Configuration

# /etc/logrotate.conf – global settings
weekly
rotate 4
create
dateext
compress
delaycompress
notifempty
include /etc/logrotate.d

# /etc/logrotate.d/nginx – per‑application example
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 "$(cat /var/run/nginx.pid)"
    endscript
}

3.3 Security Hardening

Enable user quotas on shared partitions to prevent a single user from exhausting space:

# /etc/fstab – add usrquota,grpquota
/dev/sda1 /home ext4 defaults,usrquota,grpquota 0 2
mount -o remount /home
quotacheck -cum /home
quotaon /home
setquota -u alice 10485760 12582912 0 0 /home   # 10 GB soft, 12 GB hard
repquota -a
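setquota limits are expressed in 1 KiB blocks; the arithmetic below confirms the soft limit used above is 10 GiB:

```shell
# Quota block units are 1 KiB: 10485760 blocks = 10 GiB soft limit.
soft_blocks=10485760
soft_gib=$(( soft_blocks / 1024 / 1024 ))
echo "${soft_gib} GiB"   # → 10 GiB
```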

Use tmpfs or systemd‑tmpfiles to limit temporary directory size.
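Two illustrative ways to do that (sizes and retention ages are assumptions; tune them per host):

```
# /etc/tmpfiles.d/tmp.conf – let systemd-tmpfiles purge /tmp entries older than 10 days
q /tmp 1777 root root 10d

# /etc/fstab – cap /tmp with a 2 GiB tmpfs mount
tmpfs /tmp tmpfs size=2G,mode=1777 0 0
```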

Monitoring and Alerting

5.1 Log Inspection

# System logs
grep -i "disk\|space\|full\|no space" /var/log/syslog /var/log/messages
# Kernel messages
dmesg | grep -i "error\|fail\|disk"
# systemd journal
journalctl -p err -b | grep -i disk

5.2 Real‑time Metrics

iostat – per‑device I/O statistics:

iostat -x 2

iotop – per‑process I/O usage:

iotop -ao

5.3 Prometheus Node Exporter + Alert Rules

Deploy node_exporter as a systemd service on each host, then load the following alert rules into Prometheus (saved as disk_alerts.yml):

groups:
- name: disk_alerts
  rules:
  - alert: DiskSpaceWarning
    expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Disk space warning on {{ $labels.instance }}"
      description: "{{ $labels.mountpoint }} free space below 20 % ({{ $value | printf \"%.1f\" }}%)."
  - alert: DiskSpaceCritical
    expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 < 10
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "Disk space critical on {{ $labels.instance }}"
      description: "{{ $labels.mountpoint }} free space below 10 % ({{ $value | printf \"%.1f\" }}%)."
  - alert: InodeWarning
    expr: (node_filesystem_files_free{fstype=~"ext4|xfs"} / node_filesystem_files{fstype=~"ext4|xfs"}) * 100 < 20
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inode usage warning on {{ $labels.instance }}"
      description: "{{ $labels.mountpoint }} inode free below 20 % ({{ $value | printf \"%.1f\" }}%)."
  - alert: DiskIOHigh
    expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "High disk I/O on {{ $labels.instance }}"
      description: "Device {{ $labels.device }} I/O utilization >80 % ({{ $value | printf \"%.1f\" }}%)."

5.4 Alertmanager Routing

global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'

route:
  group_by: ['alertname','instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-receiver
  routes:
  - match:
      severity: emergency
    receiver: emergency-receiver
    group_wait: 10s
    repeat_interval: 30m
  - match:
      severity: critical
    receiver: critical-receiver
    group_wait: 30s
    repeat_interval: 1h

receivers:
- name: default-receiver
  email_configs:
  - to: '[email protected]'
  webhook_configs:
  - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
- name: critical-receiver
  email_configs:
  - to: '[email protected],[email protected]'
  webhook_configs:
  - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
- name: emergency-receiver
  email_configs:
  - to: '[email protected],[email protected],[email protected]'
  webhook_configs:
  - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'

5.5 Grafana Dashboard (JSON snippet)

{
  "panels": [
    {
      "title": "Disk Usage",
      "type": "gauge",
      "targets": [{"expr": "(1 - node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}) * 100", "legendFormat": "{{ mountpoint }}"}]
    },
    {
      "title": "Disk Space Trend (7d)",
      "type": "timeseries",
      "targets": [{"expr": "node_filesystem_avail_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024", "legendFormat": "Free (GB)"}, {"expr": "node_filesystem_size_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024", "legendFormat": "Total (GB)"}]
    }
  ]
}

5.6 Automated Alert Response (Webhook Service)

A minimal Flask webhook can automatically run safe cleanup commands when a warning‑level alert fires. Only non‑critical actions (journal vacuum, apt clean, Docker prune, /tmp cleanup) are executed; critical alerts require manual intervention.

from flask import Flask, request
import subprocess, logging
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
SAFE_CMDS = {
    'journal': 'journalctl --vacuum-time=3d',
    'apt_cache': 'apt clean',
    'docker_prune': 'docker container prune -f',
    'tmp_old': 'find /tmp -type f -atime +3 -delete'
}
@app.route('/webhook', methods=['POST'])
def handle():
    data = request.get_json(silent=True) or {}
    for alert in data.get('alerts', []):
        if alert['status'] != 'firing':
            continue
        severity = alert['labels'].get('severity', 'warning')
        if severity == 'warning':
            for name, cmd in SAFE_CMDS.items():
                logging.info(f"Executing {name}")
                subprocess.run(cmd, shell=True, timeout=300)
    return 'OK', 200
if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
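Once the service is running, it can be exercised with a hand-built Alertmanager-style payload; the snippet below shows the minimal JSON shape the handler inspects (the endpoint and port come from the sketch above):

```shell
# Minimal Alertmanager-style payload for testing the webhook handler.
payload='{"alerts":[{"status":"firing","labels":{"severity":"warning"}}]}'
# Send it with:
#   curl -X POST http://localhost:5000/webhook \
#     -H 'Content-Type: application/json' -d "$payload"
matches=$(echo "$payload" | grep -c '"status":"firing"')
echo "$matches"   # → 1
```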

Conclusion

Key take‑aways:

Always check both block usage (df -h) and inode usage (df -i).

Use ncdu or dust for fast interactive discovery.

Detect deleted‑but‑still‑open files with lsof +L1 and either restart the process or truncate the file.

When inode exhaustion is the root cause, locate directories with massive file counts and clean mail queues, PHP sessions, or other small‑file spammers.

Adjust ext4 reserved space with tune2fs -m only as an emergency measure.

Automate regular cleanup via the provided bash script and schedule it with cron or a systemd timer.

Implement proactive monitoring (node_exporter, Prometheus alerts, Grafana dashboards) to catch usage before it reaches critical thresholds.

Adopt good partition planning, proper log‑rotation, per‑user quotas, and continuous monitoring to keep disk‑related outages at bay.

Further Reading

Linux Filesystem Hierarchy Standard – https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html

Prometheus Node Exporter – https://github.com/prometheus/node_exporter

ncdu Documentation – https://dev.yorhel.nl/ncdu

logrotate Manual – man logrotate

systemd‑tmpfiles Manual – man tmpfiles.d

Written by

MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
