
How a Cloud Ops Engineer Rescued a Critical Service from Disk‑Full Disaster

A senior cloud operations engineer receives a P1 alert for a web‑gateway server nearing 100% /var disk usage, then systematically logs in, diagnoses the log‑file bloat with df, du, and tail, truncates the offending debug log, and implements post‑mortem fixes to prevent recurrence.

Open Source Linux

Nov 17, 2025


At 6:30 am, Chen Lin, a cloud‑computing operations engineer, is awakened by his internal "ops intuition" and begins his day monitoring a large online service cluster. By 9 am, after a morning stand‑up, a critical alert fires:

Alert: [P1-Urgent] Service: web-gateway-03 Instance: i-0a1b2c3d4e5f6
Metric: /var disk usage > 95%
Status: CRITICAL

The alert indicates that the /var partition on the web‑gateway server is almost full, threatening service collapse.

Step 1 – Login and Verify (SSH & df)

Chen SSHs into the host:

# SSH login
ssh [email protected]

He runs df -h to confirm the disk state. The output shows /dev/sda2 mounted on /var with 100G total, 96G used, 4G available, and 96% in use.
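The same check can be scripted so that it yields a number instead of a table to read by eye. The following is a minimal sketch, assuming a POSIX df; the helper names usage_pct and over_threshold are hypothetical, and the 95 in the usage line simply mirrors the alert rule.

```shell
#!/bin/sh
# usage_pct PATH: print the use percentage (integer, no % sign) of the
# filesystem that holds PATH. usage_pct is a hypothetical helper name.
usage_pct() {
    # -P forces the POSIX one-line-per-filesystem format, so column 5 is
    # always the capacity column, even for long device names.
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# over_threshold PATH LIMIT: succeed if usage is strictly above LIMIT percent.
over_threshold() {
    [ "$(usage_pct "$1")" -gt "$2" ]
}
```

A call such as over_threshold /var 95 && echo "/var is above 95%" would reproduce the alert condition on the host itself.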

Step 2 – Locate the Culprit (du)

Using sudo du -sh /var/*, he discovers that /var/log consumes 84 GB, while other directories are small.

4.0K /var/cache
12G  /var/lib
84G  /var/log
4.0K /var/mail
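With many subdirectories, sorting the du output makes the largest consumer stand out immediately. A small sketch of that idea, where largest_dirs is a hypothetical helper name and the default of five entries is an arbitrary choice:

```shell
#!/bin/sh
# largest_dirs DIR [N]: list the N largest entries directly under DIR,
# biggest first, with sizes in KiB. largest_dirs is a hypothetical name.
largest_dirs() {
    dir=$1
    n=${2:-5}
    # -k reports sizes in 1024-byte units so a plain numeric sort works;
    # errors from unreadable paths are discarded.
    du -sk "$dir"/* 2>/dev/null | sort -rn | head -n "$n"
}
```

On hosts with GNU coreutils, sudo du -sh /var/* | sort -rh | head should give the same ranking in human-readable units.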

Step 3 – Drill Down (du & ls)

Inside /var/log, sudo du -sh * shows an 82 GB app/ directory. He lists its contents and finds a massive gateway-debug.log (82 GB) while other logs are tiny.

-rw-r--r-- 1 app_user app_group 82G gateway-debug.log
-... 1.5M gateway-info.log
-... 2.2M gateway-error.log
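When the directory tree is deeper than one level, find can jump straight to the oversized files instead of descending by hand. A sketch under that assumption; big_files is a hypothetical helper name, and the size limit is passed in KiB:

```shell
#!/bin/sh
# big_files DIR MIN_KB: print regular files under DIR larger than MIN_KB
# kibibytes, staying on one filesystem. big_files is a hypothetical name.
big_files() {
    # -xdev keeps find from crossing into other mounted filesystems;
    # -size +Nk matches files bigger than N KiB.
    find "$1" -xdev -type f -size +"$2"k 2>/dev/null
}
```

For example, big_files /var/log 1048576 would list every file above 1 GiB, which in this incident should return only gateway-debug.log.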

Step 4 – Inspect the Log (tail & grep)

Because the file is too large to cat, he uses tail -n 100 gateway-debug.log to view the last lines, revealing repetitive DEBUG heartbeat messages from the payment service, confirming that DEBUG logging was left enabled in production.
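Reading 100 lines by eye works, but a quick count per log level makes the imbalance explicit. The sketch below assumes each line carries its level as a bare word (DEBUG, INFO, WARN, or ERROR); the real gateway log format is not shown here, and level_counts is a hypothetical helper name.

```shell
#!/bin/sh
# level_counts FILE [N]: count log-level keywords in the last N lines of FILE.
# Assumes the level appears as a standalone word on each line.
level_counts() {
    file=$1
    n=${2:-100}
    # tail reads only the end of the file, so this stays fast on a huge log.
    tail -n "$n" "$file" |
        grep -o -w -E 'DEBUG|INFO|WARN|ERROR' |
        sort | uniq -c | sort -rn
}
```

Running level_counts gateway-debug.log 1000 on a log dominated by heartbeat messages would be expected to put DEBUG at the top of the list by a wide margin.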

Step 5 – Emergency Mitigation (truncate)

To free space without stopping the process, he runs:

# truncate the file to zero length
sudo truncate -s 0 gateway-debug.log

After truncation, df -h shows /var usage dropping from 96% to 14% (82G of the 96G in use belonged to this one file), and the alert clears.
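Truncating rather than deleting matters here: removing a file that a running process still holds open only removes its name, and the blocks stay allocated until the descriptor is closed. The sketch below wraps the truncation so that it reports what was cleared; truncate_in_place is a hypothetical helper name.

```shell
#!/bin/sh
# truncate_in_place FILE: empty FILE without removing it and print how many
# bytes it held beforehand. truncate_in_place is a hypothetical helper name.
truncate_in_place() {
    file=$1
    before=$(wc -c < "$file") || return 1
    # The inode and any open file descriptors stay valid; only the length changes.
    if command -v truncate >/dev/null 2>&1; then
        truncate -s 0 "$file"
    else
        : > "$file"   # portable fallback where truncate(1) is not installed
    fi
    echo "cleared $((before)) bytes from $file"
}
```

One caveat worth checking afterwards: if the application did not open the log in append mode, it keeps writing at its old offset, so ls may again show a large (sparse) file even though the disk blocks were freed. In that case df and du, not ls, reflect the real usage.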

Step 6 – Post‑mortem and Hotfix

Chen asks a teammate to revert the log level from DEBUG to INFO in the production config and redeploy the web‑gateway cluster. He also adds a logrotate rule for gateway-debug.log that compresses rotated logs and keeps only three days of them.
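The exact rule is not shown, so the following is only a sketch of what such a logrotate entry might look like, written out by a small shell function. The directive choices are assumptions: daily with rotate 3 approximates "three days", and copytruncate is chosen because the gateway keeps the file open. The function name write_gateway_rule is hypothetical.

```shell
#!/bin/sh
# write_gateway_rule DEST: write a logrotate rule for the debug log to DEST.
# The directives below are assumptions, not the rule actually deployed.
write_gateway_rule() {
    cat > "$1" <<'EOF'
# Rotate the gateway debug log daily and keep three compressed copies.
/var/log/app/gateway-debug.log {
    daily
    rotate 3
    compress
    missingok
    notifempty
    copytruncate
}
EOF
}
```

Written to a file under /etc/logrotate.d/ (the file name is up to the operator), the rule can be dry-run with logrotate -d before it is relied on.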

After‑Action Review

Event: P1 alert on web-gateway-03, /var disk usage at 96%.

Root Cause: A developer mistakenly set the production log level to DEBUG, which flooded gateway-debug.log with heartbeat messages.

Resolution Steps: ssh → df → du → tail → truncate.

Improvements:

Enforce code‑review and CI/CD checks to prevent DEBUG logs in production.

Apply logrotate policies for all applications to avoid disk exhaustion.
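The first improvement can be approximated with a simple pre-deploy check that fails the pipeline when a production config sets a DEBUG level. This is a sketch only: it assumes "key = value" or "key: value" lines whose key contains the word level, since the gateway's real config format is not shown, and check_no_debug is a hypothetical name.

```shell
#!/bin/sh
# check_no_debug FILE...: fail if any config file sets a log level to DEBUG.
# Assumes "key = value" or "key: value" lines whose key contains "level".
check_no_debug() {
    status=0
    for f in "$@"; do
        # -i: case-insensitive, -n: show line numbers, -E: extended regex.
        if grep -inE 'level[^=:#]*[=:][[:space:]]*"?debug' "$f"; then
            echo "FAIL: DEBUG log level found in $f" >&2
            status=1
        fi
    done
    return "$status"
}
```

A CI step could call check_no_debug on the production config files and stop the deploy on a non-zero exit status. The pattern is deliberately broad and also flags commented-out settings, which may or may not be desirable.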

The incident is resolved within 15 minutes, and Chen returns to his regular ticket queue, having documented the event for future reference.
