
How to Quickly Diagnose and Resolve Disk Space Exhaustion in Production

This guide walks through a step‑by‑step process for identifying the partitions and files that fill a disk, applying temporary fixes to bring usage below critical levels, and implementing long‑term measures to prevent future disk‑full incidents in production environments.


Introduction

Recently I encountered a production issue where a server's disk was almost full. This article shares the troubleshooting steps I followed, from diagnosis through long‑term prevention.

[Figure: Disk usage overview]

1. Check Disk Usage and Identify the Culprit

1.1 View Disk Usage

Use df -h to see usage percentages of mounted partitions.

# View usage of all mounted partitions (focus on Use% column)

df -h

Interpret the output, paying attention to the Mounted on and Use% columns. For example, a root partition showing 92% usage is a priority.
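On a host with many mounts, eyeballing the Use% column is error‑prone. A small helper can filter the `df -h` output down to just the partitions above a threshold. This is a sketch; the `flag_full` name and the 80% cutoff are my own choices, not from the incident:

```shell
# flag_full: print mount points whose Use% is at or above a threshold.
# Reads `df -h`-style output on stdin; the threshold defaults to 80.
flag_full() {
  awk -v limit="${1:-80}" 'NR > 1 && $5+0 >= limit { print $6, $5 }'
}

# Typical usage: list only the partitions that need attention.
df -h | flag_full 80
```

The `$5+0` trick makes awk convert "92%" to the number 92, dropping the percent sign. Note that `df` can wrap lines for very long device names; the helper assumes the usual one‑line, six‑column layout.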

1.2 Locate Largest Directories

Navigate to the overloaded partition and run du -h --max-depth=1 to list the size of top‑level directories.

# Change to the root of the overloaded partition
cd /
# Show the size of each top-level directory (human-readable, depth 1)
du -h --max-depth=1

Example output:

18G    /var
10G    /usr
5G     /home
1.5G   /opt

Drill down further into the biggest directory, e.g., /var:

# Enter /var and list second‑level directories
cd /var
du -h --max-depth=1
# If /var/log is biggest, continue:
cd /var/log
du -h --max-depth=1
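The repeated cd‑then‑du loop above can be compressed into a single helper that ranks a directory's children by size, largest first. A minimal sketch; `top_dirs` is a made‑up name, not a standard tool:

```shell
# top_dirs: list the largest entries directly under a directory, biggest first.
#   $1 = directory to inspect, $2 = number of lines to show (default 10)
top_dirs() {
  du -h --max-depth=1 "$1" 2>/dev/null | sort -rh | head -n "${2:-10}"
}

# Drill down in one step instead of repeated cd + du:
top_dirs /var
top_dirs /var/log 5
```

`sort -rh` understands human‑readable sizes (1.5G, 18G), so the ordering is correct across units; the directory's own total always sorts to the top.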

1.3 Find Large Files

Search for files larger than 100 MB, especially Java logs or caches.

# Find files >100M under /var, sort by size
find /var -type f -size +100M -exec ls -lh {} \; | sort -rh -k5

2. Analyse Root Causes

2.1 Uncontrolled Log Growth

Logs written without size limits are a common cause. Typical reasons include:

Log level set to DEBUG.

Repeated errors or exception loops causing massive output.

Third‑party component logs (Tomcat, Nginx, MySQL) not being cleaned.

2.2 Java Temporary Files and Caches

Java processes may generate temporary files, cache data, or heap dump files that grow unchecked.

Temporary files in tmp/ (e.g., tomcat‑tmp, jdk‑tmp).

Application cache files such as Redis persistence, Elasticsearch data, local caches.

JVM heap dump files (*.hprof) created after OOM, each can be several gigabytes.

Root causes are missing cleanup code, unattended heap dumps, and poor cache eviction policies.

3. Temporary Mitigation

After identifying the main space consumer, bring usage back below 80% to keep the system stable.

If large log files are the culprit, back up if needed, then truncate or delete them.

# Delete app.log larger than 500M
find /var/log/app/ -name "app.log" -size +500M -exec rm -f {} \;
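One caveat worth knowing before reaching for `rm`: deleting a log that a process still has open does not free the space until that process closes the file or restarts (this is exactly the situation the `lsof | grep deleted` check detects). Truncating the file in place sidesteps the problem entirely. A minimal sketch, with the helper name and backup path as my own assumptions:

```shell
# truncate_log: back up a log, then empty it in place so any process that
# still has the file open keeps writing to the same (now empty) inode.
#   $1 = log file, $2 = backup directory (defaults to /tmp)
truncate_log() {
  local log="$1" backup_dir="${2:-/tmp}"
  cp "$log" "$backup_dir/$(basename "$log").$(date +%F)" && : > "$log"
}

# e.g. instead of rm -f:
# truncate_log /var/log/app/app.log /backup
```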

If deleted files are still held open (e.g., Java processes), list them with lsof | grep deleted:

# List deleted but still open files
lsof | grep deleted | awk '{print $2, $7, $9}' | sort -rh -k2

Remove old temporary files, for example files older than 7 days in /tmp:

# Remove tmp files older than 7 days
find /tmp/ -name "*-tmp-*" -mtime +7 -delete

If nothing can be removed quickly, consider attaching an extra disk or expanding the volume.

4. Root‑Cause Optimisation to Prevent Recurrence

4.1 Log Management

Adjust log level to INFO or WARN.

Deploy log‑rotation and cleanup scripts.
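For the rotation point above, logrotate is the usual tool on Linux. A sketch of a policy for the app log from earlier; the path and limits are illustrative, not taken from the incident:

```conf
# /etc/logrotate.d/app — illustrative rotation policy
/var/log/app/*.log {
    daily               # rotate once per day
    rotate 7            # keep 7 rotated copies
    maxsize 500M        # rotate early if a file exceeds 500M
    compress            # gzip rotated logs
    missingok
    notifempty
    copytruncate        # truncate in place; safe for apps that keep the file open
}
```

`copytruncate` avoids having to signal the Java process to reopen its log file, at the cost of possibly losing a few lines written during the copy window.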

4.2 Code and System Practices

Fix bugs that cause infinite loops or excessive logging.

Ensure temporary files are deleted in finally blocks.

Configure JVM to write heap dumps to a dedicated large partition and limit generation (e.g., -XX:HeapDumpPath=/data/heapdump/).

Monitor critical directories and trigger alerts when size thresholds are exceeded.
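The monitoring point above can start as a simple cron‑driven size check. A sketch under my own assumptions (function name and thresholds are illustrative):

```shell
# check_dir_size: emit an alert line when a directory exceeds a size limit.
#   $1 = directory, $2 = limit in MB
check_dir_size() {
  local used_mb
  used_mb=$(du -sm "$1" 2>/dev/null | awk '{print $1}')
  if [ "${used_mb:-0}" -ge "$2" ]; then
    echo "ALERT: $1 uses ${used_mb}MB (limit: $2MB)"
  fi
}

# e.g. run from cron and route ALERT lines to your alerting channel:
# check_dir_size /var/log 10240
# check_dir_size /data/heapdump 20480
```

Printing nothing when the directory is within limits makes the helper easy to wire into cron's default mail‑on‑output behavior or a grep‑based alert pipeline.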

5. Post‑mortem and Knowledge‑Sharing

Record the root cause of the disk‑full incident.

Verify that all Java applications have log rotation enabled.

Add monitoring metrics for log file size, temporary directory size, and heap‑dump generation.

Document the troubleshooting steps and incorporate them into the team’s incident‑response handbook.

In the reported case, a forgotten debug log caused the disk to fill; after backing up and deleting the log and fixing the code, the issue was resolved.

[Figure: Final diagram]

Tags: Linux, Troubleshooting, System Administration, log management, disk space, temporary files
Written by

Java Tech Enthusiast

Sharing computer programming language knowledge, focusing on Java fundamentals, data structures, related tools, Spring Cloud, IntelliJ IDEA... Book giveaways, red‑packet rewards and other perks await!
