Essential Linux Ops: Proven Troubleshooting Steps for Common Failures
This guide outlines a systematic Linux operations troubleshooting framework—emphasizing error messages, log analysis, root‑cause isolation, and step‑by‑step solutions for six real‑world scenarios ranging from filesystem corruption to inode exhaustion and read‑only file‑system errors.
As a professional Linux operations engineer, having a clear troubleshooting framework is essential for quickly pinpointing and resolving issues.
First, pay close attention to error messages, as they often indicate where the problem lies.
Next, examine logs—both system logs (e.g., files under /var/log) and application‑specific logs—to gain deeper insight.
Then analyze and locate the root cause by correlating error messages, log entries, and contextual information.
Finally, address the core issue once its cause is identified.
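The four steps above can be sketched as a quick triage sequence. This is a minimal illustration, not a prescription: the `httpd` service name is an assumption, and log paths vary by distribution (Debian/Ubuntu use /var/log/syslog instead of /var/log/messages).

```shell
# 1. Capture the exact error message from the failing service.
systemctl status httpd 2>/dev/null | tail -n 5

# 2. Check kernel and system logs for related entries.
dmesg 2>/dev/null | tail -n 20
tail -n 50 /var/log/messages 2>/dev/null

# 3. Correlate: search the logs for the error keyword near the failure time.
grep -i "error" /var/log/messages 2>/dev/null | tail -n 10
```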
Scenario 1: Filesystem corruption prevents boot
When the root filesystem shows errors on /dev/sda6, force a check and repair:
<code># umount /dev/sda6
# fsck.ext3 -y /dev/sda6
</code>
Scenario 2: “Argument list too long” and disk‑full errors
Saving a crontab can fail with “no space left on device” because /var is full. Confirm with df -h, then identify large directories with du. Deleting files in /var/spool/clientmqueue frees space:
<code># rm /var/spool/clientmqueue/*
</code>
For “argument list too long”, you can delete files in batches, use find, or write a shell script:
<code># rm [a-n]* -rf
# rm [o-z]* -rf
# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;
#!/bin/bash
RM_DIR='/var/spool/clientmqueue'
cd $RM_DIR
for i in $(ls); do
rm -f "$i"
done
</code>
Scenario 3: Inode exhaustion causing service failure
An Oracle database may refuse to start despite sufficient disk space because the filesystem's inodes are exhausted. Check with df -i and clear inode‑heavy directories:
<code># df -i
# dumpe2fs -h /dev/sda3 | grep 'Inode count'
# find /var/spool/clientmqueue/ -type f -exec rm -f {} \;
</code>
Scenario 4: Deleted files still occupying space
If large files are removed from /tmp but space is not released, an httpd process may still hold the deleted access_log open. Identify the open file with lsof, then truncate it or restart the service:
<code># lsof | grep delete
# echo "" > /tmp/access_log
</code>
Scenario 5: “Too many open files” in a Java web app
Check the current file‑descriptor limit with ulimit -n, then raise it by setting appropriate values in /etc/security/limits.conf. Remember to restart Tomcat after changing limits so they take effect.
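A sketch of the check-and-raise sequence. The `tomcat` user name and the 65535 value are illustrative assumptions, not values from the original article; pick limits that match your workload.

```shell
# Current soft limit on open file descriptors in this shell
ulimit -n

# Count descriptors actually open by a process (replace <PID>):
#   ls /proc/<PID>/fd | wc -l

# Raise the limit persistently in /etc/security/limits.conf, e.g.:
#   tomcat  soft  nofile  65535
#   tomcat  hard  nofile  65535
# Then open a fresh login session and restart Tomcat so the new limit applies.
```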
Scenario 6: “Read‑only file system” error
When a filesystem becomes read‑only due to corruption, identify and stop any processes using the affected partition, unmount it, run
fsck, and remount:
<code># fuser -m /dev/sdb1
# ps -ef | grep <PID>
# kill <PID>
# umount /dev/sdb1
# fsck -V -a /dev/sdb1
# mount /dev/sdb1 /www/data
</code>
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.