How a Cloud Ops Engineer Rescued a Critical Service from Disk‑Full Disaster
A senior cloud operations engineer receives a P1 alert for a web‑gateway server nearing 100% /var disk usage, then systematically logs in, diagnoses the log‑file bloat with df, du, and tail, truncates the offending debug log, and implements post‑mortem fixes to prevent recurrence.
At 6:30 am, Chen Lin, a cloud‑computing operations engineer, is awakened by his internal "ops intuition" and begins his day monitoring a large online service cluster. By 9 am, after a morning stand‑up, a critical alert fires:
告警:[P1-紧急] 服务:web-gateway-03 实例:i-0a1b2c3d4e5f6
指标:/var 磁盘使用率 > 95%
状态:CRITICALThe alert indicates that the /var partition on the web‑gateway server is almost full, threatening service collapse.
Step 1 – Login and Verify (SSH & df)
Chen SSHs into the host:
# SSH login
ssh [email protected]He runs df -h to confirm the disk state, seeing /dev/sda2 /var 100G 96G 4G 96%.
Step 2 – Locate the Culprit (du)
Using sudo du -sh /var/*, he discovers that /var/log consumes 84 GB, while other directories are small.
4.0K /var/cache
12G /var/lib
84G /var/log
4.0K /var/mailStep 3 – Drill Down (du & ls)
Inside /var/log, sudo du -sh * shows an 82 GB app/ directory. He lists its contents and finds a massive gateway-debug.log (82 GB) while other logs are tiny.
-rw-r--r-- 1 app_user app_group 82G gateway-debug.log
-... 1.5M gateway-info.log
-... 2.2M gateway-error.logStep 4 – Inspect the Log (tail & grep)
Because the file is too large to cat, he uses tail -n 100 gateway-debug.log to view the last lines, revealing repetitive DEBUG heartbeat messages from the payment service, confirming that DEBUG logging was left enabled in production.
Step 5 – Emergency Mitigation (truncate)
To free space without stopping the process, he runs:
# truncate the file to zero length
sudo truncate -s 0 gateway-debug.logAfter truncation, df -h shows /var usage dropping from 96 % to 14 % and the alert clears.
Step 6 – Post‑mortem and Hotfix
Chen instructs the teammate to revert the log level from DEBUG to INFO in the production config and redeploy the web‑gateway cluster. He also adds a logrotate rule for gateway-debug.log to compress and retain logs for only three days.
After‑Action Review
Event: web-gateway-03 disk‑96% alert.
Root Cause: Developer mistakenly set production log level to DEBUG, causing massive heartbeat logs.
Resolution Steps: ssh → df → du → tail → truncate.
Improvements:
Enforce code‑review and CI/CD checks to prevent DEBUG logs in production.
Apply logrotate policies for all applications to avoid disk exhaustion.
The incident is resolved within 15 minutes, and Chen returns to his regular ticket queue, having documented the event for future reference.
Open Source Linux
Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
