How to Diagnose and Fix Full Disk Issues on Linux Servers – A Proven Checklist
This guide walks you through a complete Linux disk‑space troubleshooting workflow, from quickly checking usage and inode status, to locating large files with du, ncdu or dust, handling deleted‑but‑still‑open files, cleaning logs, Docker images, temporary data, adjusting reserved space, and setting up monitoring and alerts.
Overview
Disk‑space exhaustion is one of the most common production incidents on Linux servers. In many environments it accounts for roughly 15‑20 % of all alerts, and it can cause cascading failures when databases, applications, or caches cannot write data.
1.1 Background
Typical causes include continuously growing log files, core‑dump files from memory leaks, uncleaned temporary files, accumulated Docker images, and runaway database binlog or WAL files.
1.2 Technical Characteristics
Storage vs. inode – A filesystem has two independent limits: total block space and the number of inodes. Even with free blocks, a full inode table prevents new files from being created.
Deleted but still open files – Files removed from the directory tree remain on disk as long as a process holds an open file descriptor; a short shell demo follows this list.
Reserved space – By default, ext4 reserves 5 % of blocks for the root user; this can be reduced in emergencies.
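A quick shell demo makes the second point concrete (the /tmp path is arbitrary):
# Open fd 3 on a file, then unlink it
exec 3> /tmp/unlink-demo.log
echo "some data" >&3
rm /tmp/unlink-demo.log
lsof +L1 | grep unlink-demo    # the inode and its blocks are still allocated
exec 3>&-                      # closing the descriptor finally frees the space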
1.3 Applicable Scenarios
Production alerts for high disk usage
Applications reporting “No space left on device”
Suspicious I/O latency
Regular health‑check inspections
Planning disk‑cleanup policies
1.4 Environment Requirements
OS: Ubuntu 22.04 LTS / CentOS 7.9 / Rocky Linux 9
Filesystem: ext4 / xfs
Kernel: 5.15+
Container runtime: Docker 24.x / containerd 1.7.x
Monitoring: Prometheus 2.47+ / Grafana 10.x
Step‑by‑Step Procedure
2.1 Quick Disk Status
Run df -h to view block usage and df -i to view inode usage.
df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 100G 95G 5G 95% /
/dev/sdc1 200G 200G 0 100% /var/log
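To turn this check into quick triage, GNU df's --output flag can print only the problem filesystems; a minimal sketch (the 90 % threshold is an assumption, tune it to your alerting policy):
# Print filesystems whose block or inode usage exceeds 90 %
df --output=pcent,target | awk 'NR>1 && int($1) > 90 {print $2, "blocks at", $1}'
df --output=ipcent,target | awk 'NR>1 && int($1) > 90 {print $2, "inodes at", $1}'
2.2 Locate Large Files and Directories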
Three tools are recommended:
du – basic, slow, no UI.
ncdu – interactive ncurses UI, can delete files.
dust – Rust‑based, fast, tree‑style output.
Example du usage:
# Top‑level directories
du -sh /* 2>/dev/null | sort -rh | head -20
# Drill into /var
du -sh /var/* 2>/dev/null | sort -rh | head -10
Example ncdu usage:
ncdu /var/log
Example dust usage:
dust -d 2 /var
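When neither ncdu nor dust is available, plain find can surface individual large files directly; a sketch (the 1 GB floor is an assumption):
# Files over 1 GB on this filesystem only (-xdev avoids crossing mount points)
find / -xdev -type f -size +1G -exec ls -lh {} + 2>/dev/null | sort -k5 -rh | head -20
2.3 Check Deleted but Still Open Files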
Use lsof +L1 to list such files, then either restart the owning process or truncate the data away. If the file still exists on disk, truncating its path works; once it has been deleted, the path is gone, so truncate it through the holding process's /proc file‑descriptor entry instead:
# Restart process (recommended)
systemctl restart app
# Truncate a still-existing oversized log
truncate -s 0 /var/log/app/debug.log
# Free a deleted-but-open file without a restart (PID and FD from the lsof output)
: > /proc/<PID>/fd/<FD>
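To decide which file to chase first, the lsof output can be sorted by size; a sketch (column 7 is lsof's SIZE/OFF field):
# Largest deleted-but-open files first: SIZE COMMAND PID NAME
lsof +L1 2>/dev/null | awk 'NR>1 && $7 ~ /^[0-9]+$/ {print $7, $1, $2, $NF}' | sort -rn | head
2.4 Handle Inode Exhaustion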
When df -i shows >90 % inode usage, count files per directory to find the culprit:
for dir in /*; do echo -n "$dir: "; find "$dir" -xdev -type f 2>/dev/null | wc -l; done | sort -t: -k2 -rn | head -10
Common inode killers: mail queues, PHP session files, cache directories, Docker layer files. Clean them with appropriate commands (e.g., postsuper -d ALL for Postfix, find /var/lib/php/sessions -type f -mtime +7 -delete for PHP sessions).
2.5 Release Reserved Space (Emergency)
Check current reservation and reduce it if needed:
# Show reservation
tune2fs -l /dev/sda1 | grep "Reserved block count"
# Reduce to 1 %
sudo tune2fs -m 1 /dev/sda1
Restore the default once the incident is over (sudo tune2fs -m 5 /dev/sda1). This knob is specific to the ext family; xfs has no comparable user‑adjustable reservation.
Example Scripts and Configurations
3.1 Disk‑Cleanup Script (bash)
#!/bin/bash
set -euo pipefail
LOG_RETAIN_DAYS=30
TMP_RETAIN_DAYS=7
DOCKER_IMAGE_AGE="720h"
MIN_FREE_PERCENT=10
DRY_RUN=${DRY_RUN:-false}
log_info(){ echo -e "[INFO] $(date '+%Y-%m-%d %H:%M:%S') $1"; }
log_warn(){ echo -e "[WARN] $(date '+%Y-%m-%d %H:%M:%S') $1"; }
log_error(){ echo -e "[ERROR] $(date '+%Y-%m-%d %H:%M:%S') $1"; }
get_free_space(){ df -BG "${1:-/}" | awk 'NR==2{gsub(/G/,"",$4); print $4}'; }  # defaults to / so callers may omit the argument
show_cleanup_result(){ local before=$1 after=$2 item=$3; local freed=$((after-before)); if [ $freed -gt 0 ]; then log_info "$item: freed ${freed}GB"; else log_info "$item: no space reclaimed"; fi; }
cleanup_system_logs(){ log_info "Cleaning system logs..."; local before=$(get_free_space); if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] would delete logs older than $LOG_RETAIN_DAYS days"; else
find /var/log -type f -name "*.log" -mtime +$LOG_RETAIN_DAYS -delete
find /var/log -type f -name "*.gz" -mtime +$LOG_RETAIN_DAYS -delete
command -v journalctl >/dev/null && journalctl --vacuum-time=${LOG_RETAIN_DAYS}d || true  # || true keeps set -e from aborting when journalctl is absent
fi
local after=$(get_free_space); show_cleanup_result $before $after "System logs"; }
cleanup_temp_files(){ log_info "Cleaning temporary files..."; local before=$(get_free_space); if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] would delete /tmp files older than $TMP_RETAIN_DAYS days"; else
find /tmp -type f -atime +$TMP_RETAIN_DAYS -delete
find /var/tmp -type f -atime +$TMP_RETAIN_DAYS -delete
find /tmp -type d -empty -delete
find /var/tmp -type d -empty -delete
fi
local after=$(get_free_space); show_cleanup_result $before $after "Temp files"; }
cleanup_package_cache(){ log_info "Cleaning package manager caches..."; local before=$(get_free_space);
if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] would run apt/ yum clean"; else
# "|| true" prevents set -e from aborting on hosts missing a given tool
command -v apt >/dev/null && apt clean && apt autoclean || true
command -v yum >/dev/null && yum clean all || true
command -v dnf >/dev/null && dnf clean all || true
fi
local after=$(get_free_space); show_cleanup_result $before $after "Package cache"; }
cleanup_docker(){ command -v docker >/dev/null || { log_info "Docker not installed, skipping"; return; }
log_info "Cleaning Docker resources..."; local before=$(get_free_space);
if [ "$DRY_RUN" = "true" ]; then log_info "[DRY‑RUN] docker system df"; else
docker container prune -f
# Only remove images unused for longer than $DOCKER_IMAGE_AGE; an unfiltered
# "docker image prune -a" or "docker system prune -a --volumes" would also
# delete recent images and every unused volume, which is rarely safe here
docker image prune -a -f --filter "until=$DOCKER_IMAGE_AGE"
docker volume prune -f
fi
local after=$(get_free_space); show_cleanup_result $before $after "Docker"; }
check_deleted_files(){ log_info "Checking for deleted but still allocated files..."; local size_gb=$(lsof +L1 2>/dev/null | awk '{sum+=$7} END {print int(sum/1024/1024/1024)}'); if [ $size_gb -gt 0 ]; then log_warn "Found $size_gb GB of deleted files:"; lsof +L1 2>/dev/null | awk '$7>104857600{print $1,$2,$7/1024/1024/1024"GB",$NF}'; else log_info "No such files found"; fi; }
main(){ log_info "========== Disk cleanup start =========="; log_info "Current usage: $(df -h / | awk 'NR==2{print $5}')"; log_info "Free space: $(get_free_space)GB";
if [ "$DRY_RUN" = "true" ]; then log_warn "Running in DRY‑RUN mode – no files will be deleted"; fi;
cleanup_system_logs
cleanup_temp_files
cleanup_package_cache
cleanup_docker
check_deleted_files
log_info "========== Disk cleanup complete =========="; log_info "Post‑cleanup usage: $(df -h / | awk 'NR==2{print $5}')"; log_info "Free space: $(get_free_space)GB"; }
show_help(){ echo "Usage: $0 [options]"; echo "Options:"; echo " -d, --dry-run Simulate actions without deleting"; echo " -h, --help Show this help"; }
while [[ $# -gt 0 ]]; do case $1 in -d|--dry-run) DRY_RUN=true; shift;; -h|--help) show_help; exit 0;; *) log_error "Unknown option: $1"; show_help; exit 1;; esac; done
if [ $EUID -ne 0 ]; then log_error "Please run as root"; exit 1; fi
main
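To schedule the script, a minimal systemd timer sketch, assuming it is installed as /usr/local/sbin/disk-cleanup.sh (the path is an assumption):
# /etc/systemd/system/disk-cleanup.service
[Unit]
Description=Scheduled disk cleanup
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/disk-cleanup.sh
# /etc/systemd/system/disk-cleanup.timer
[Unit]
Description=Run disk cleanup daily
[Timer]
OnCalendar=daily
Persistent=true
[Install]
WantedBy=timers.target
Enable it with systemctl enable --now disk-cleanup.timer.
3.2 Logrotate Configuration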
# /etc/logrotate.conf – global settings
weekly
rotate 4
create
dateext
compress
delaycompress
notifempty
include /etc/logrotate.d
# /etc/logrotate.d/nginx – per‑application example
/var/log/nginx/*.log {
    daily
    missingok
    rotate 14
    compress
    delaycompress
    notifempty
    create 0640 www-data adm
    sharedscripts
    postrotate
        [ -f /var/run/nginx.pid ] && kill -USR1 `cat /var/run/nginx.pid`
    endscript
}
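Before relying on a new rule, dry‑run it; logrotate's -d flag reports intended actions without touching any files, and -f forces one rotation to validate the postrotate hook:
# Dry run: parse the config and show what would happen
logrotate -d /etc/logrotate.d/nginx
# Force a rotation once to confirm nginx reopens its logs
logrotate -f /etc/logrotate.d/nginx
3.3 Security Hardening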
Enable user quotas on shared partitions to prevent a single user from exhausting space:
# /etc/fstab – add usrquota,grpquota
/dev/sda1 /home ext4 defaults,usrquota,grpquota 0 2
mount -o remount /home
quotacheck -cugm /home   # -g also creates the group quota file, since grpquota is enabled
quotaon /home
setquota -u alice 10485760 12582912 0 0 /home # 10 GB soft, 12 GB hard
repquota -a
Use tmpfs or systemd‑tmpfiles to limit temporary directory size, as sketched below.
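A sketch of both approaches (the 2 GB cap and 7‑day age are assumptions):
# /etc/fstab – mount /tmp as tmpfs with a hard size cap
tmpfs /tmp tmpfs defaults,size=2G,mode=1777 0 0
# /etc/tmpfiles.d/tmp.conf – systemd-tmpfiles purges entries older than 7 days
D /tmp 1777 root root 7d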
Monitoring and Alerting
5.1 Log Inspection
# System logs
grep -i "disk\|space\|full\|no space" /var/log/syslog /var/log/messages
# Kernel messages
dmesg | grep -i "error\|fail\|disk"
# systemd journal
journalctl -p err -b | grep -i disk
5.2 Real‑time Metrics
iostat – per‑device I/O statistics: iostat -x 2
iotop – per‑process I/O usage: iotop -ao
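If iotop is not installed, pidstat from the sysstat package reports comparable per‑process I/O:
# Per-process read/write throughput, sampled every 2 seconds
pidstat -d 2
5.3 Prometheus Node Exporter + Alert Rules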
Deploy node_exporter as a systemd service on each host so Prometheus can scrape filesystem metrics.
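A minimal unit‑file sketch, assuming the binary sits at /usr/local/bin/node_exporter and runs as a dedicated node_exporter user (both are assumptions):
# /etc/systemd/system/node_exporter.service
[Unit]
Description=Prometheus Node Exporter
After=network-online.target
[Service]
User=node_exporter
ExecStart=/usr/local/bin/node_exporter
Restart=on-failure
[Install]
WantedBy=multi-user.target
Once Prometheus scrapes the exporter, add the following alert rules (saved as disk_alerts.yml):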
groups:
  - name: disk_alerts
    rules:
      - alert: DiskSpaceWarning
        expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Disk space warning on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} free space below 20 % ({{ $value | printf \"%.1f\" }}%)."
      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{fstype=~"ext4|xfs|btrfs"} / node_filesystem_size_bytes{fstype=~"ext4|xfs|btrfs"}) * 100 < 10
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Disk space critical on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} free space below 10 % ({{ $value | printf \"%.1f\" }}%)."
      - alert: InodeWarning
        expr: (node_filesystem_files_free{fstype=~"ext4|xfs"} / node_filesystem_files{fstype=~"ext4|xfs"}) * 100 < 20
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Inode usage warning on {{ $labels.instance }}"
          description: "{{ $labels.mountpoint }} inode free below 20 % ({{ $value | printf \"%.1f\" }}%)."
      - alert: DiskIOHigh
        expr: rate(node_disk_io_time_seconds_total[5m]) * 100 > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High disk I/O on {{ $labels.instance }}"
          description: "Device {{ $labels.device }} I/O utilization >80 % ({{ $value | printf \"%.1f\" }}%)."
5.4 Alertmanager Routing
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: '[email protected]'
  smtp_auth_username: '[email protected]'
  smtp_auth_password: 'password'
route:
  group_by: ['alertname', 'instance']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: default-receiver
  routes:
    - match:
        severity: emergency
      receiver: emergency-receiver
      group_wait: 10s
      repeat_interval: 30m
    - match:
        severity: critical
      receiver: critical-receiver
      group_wait: 30s
      repeat_interval: 1h
receivers:
  - name: default-receiver
    email_configs:
      - to: '[email protected]'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
  - name: critical-receiver
    email_configs:
      - to: '[email protected],[email protected]'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
  - name: emergency-receiver
    email_configs:
      - to: '[email protected],[email protected],[email protected]'
    webhook_configs:
      - url: 'https://hooks.slack.com/services/xxx/yyy/zzz'
5.5 Grafana Dashboard (JSON snippet)
{
  "panels": [
    {
      "title": "Disk Usage",
      "type": "gauge",
      "targets": [
        {
          "expr": "(1 - node_filesystem_avail_bytes{mountpoint=\"/\"} / node_filesystem_size_bytes{mountpoint=\"/\"}) * 100",
          "legendFormat": "{{ mountpoint }}"
        }
      ]
    },
    {
      "title": "Disk Space Trend (7d)",
      "type": "timeseries",
      "targets": [
        { "expr": "node_filesystem_avail_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024", "legendFormat": "Free (GB)" },
        { "expr": "node_filesystem_size_bytes{mountpoint=\"/\"} / 1024 / 1024 / 1024", "legendFormat": "Total (GB)" }
      ]
    }
  ]
}
5.6 Automated Alert Response (Webhook Service)
A minimal Flask webhook can automatically run safe cleanup commands when a warning‑level alert fires. Only non‑critical actions (journal vacuum, apt clean, Docker prune, /tmp cleanup) are executed; critical alerts require manual intervention.
from flask import Flask, request
import subprocess, logging
app = Flask(__name__)
logging.basicConfig(level=logging.INFO)
SAFE_CMDS = {
    'journal': 'journalctl --vacuum-time=3d',
    'apt_cache': 'apt clean',
    'docker_prune': 'docker container prune -f',
    'tmp_old': 'find /tmp -type f -atime +3 -delete'
}

@app.route('/webhook', methods=['POST'])
def handle():
    # Tolerate malformed or empty payloads
    data = request.get_json(silent=True) or {}
    for alert in data.get('alerts', []):
        if alert['status'] != 'firing':
            continue
        severity = alert['labels'].get('severity', 'warning')
        if severity == 'warning':
            for name, cmd in SAFE_CMDS.items():
                logging.info(f"Executing {name}")
                subprocess.run(cmd, shell=True, timeout=300)
    return 'OK', 200

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
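To deliver alerts to the service, register it as a webhook receiver in Alertmanager (the localhost URL assumes the webhook runs next to Alertmanager):
webhook_configs:
  - url: 'http://localhost:5000/webhook'
Conclusion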
Key take‑aways:
Always check both block usage (df -h) and inode usage (df -i).
Use ncdu or dust for fast interactive discovery.
Detect deleted‑but‑still‑open files with lsof +L1 and either restart the process or truncate the data via its /proc/<PID>/fd entry.
When inode exhaustion is the root cause, locate directories with massive file counts and clean mail queues, PHP sessions, or other small‑file spammers.
Adjust ext4 reserved space with tune2fs -m only as an emergency measure.
Automate regular cleanup via the provided bash script and schedule it with cron or a systemd timer.
Implement proactive monitoring (node_exporter, Prometheus alerts, Grafana dashboards) to catch usage before it reaches critical thresholds.
Adopt good partition planning, proper log‑rotation, per‑user quotas, and continuous monitoring to keep disk‑related outages at bay.
Further Reading
Linux Filesystem Hierarchy Standard – https://refspecs.linuxfoundation.org/FHS_3.0/fhs/index.html
Prometheus Node Exporter – https://github.com/prometheus/node_exporter
ncdu Documentation – https://dev.yorhel.nl/ncdu
logrotate Manual – man logrotate
systemd‑tmpfiles Manual – man tmpfiles.d