Essential Linux Ops: Proven Troubleshooting Steps for Common Failures
This guide outlines a systematic Linux operations troubleshooting framework—emphasizing error messages, log analysis, root‑cause isolation, and step‑by‑step solutions for six real‑world scenarios ranging from filesystem corruption to inode exhaustion and read‑only file‑system errors.
As a professional Linux operations engineer, having a clear troubleshooting framework is essential for quickly pinpointing and resolving issues.
First, pay close attention to error messages, as they often indicate where the problem lies.
Next, examine logs—both system logs (e.g., files under /var/log) and application‑specific logs—to gain deeper insight.
Then analyze and locate the root cause by correlating error messages, log entries, and contextual information.
Finally, address the core issue once its cause is identified.
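The four steps above can be sketched as a quick triage sequence. This is a minimal illustration, not a prescription: the `httpd` service name is an assumption, and log paths vary by distribution (Debian/Ubuntu use /var/log/syslog instead of /var/log/messages).

```shell
# 1. Capture the exact error message from the failing service.
systemctl status httpd 2>/dev/null | tail -n 5

# 2. Check kernel and system logs for related entries.
dmesg 2>/dev/null | tail -n 20
tail -n 50 /var/log/messages 2>/dev/null

# 3. Correlate: search the logs for the error keyword near the failure time.
grep -i "error" /var/log/messages 2>/dev/null | tail -n 10
```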
Scenario 1: Filesystem corruption prevents boot
When the root filesystem shows errors on /dev/sda6, force a check and repair:
<code># umount /dev/sda6
# fsck.ext3 -y /dev/sda6
</code>
Scenario 2: “Argument list too long” and disk‑full errors
Saving a crontab can fail with “no space left on device” because /var is full. Confirm with df -h, then identify large directories with du. Deleting files in /var/spool/clientmqueue frees space:
<code># rm /var/spool/clientmqueue/*
</code>
For “argument list too long”, you can delete files in batches, use find, or write a shell script:
<code># rm [a-n]* -rf
# rm [o-z]* -rf
# find /var/spool/clientmqueue -type f -print -exec rm -f {} \;
#!/bin/bash
RM_DIR='/var/spool/clientmqueue'
cd $RM_DIR
for i in $(ls); do
rm -f "$i"
done
</code>
Scenario 3: Inode exhaustion causing service failure
An Oracle database may refuse to start despite sufficient disk space because the filesystem's inodes are exhausted. Check with df -i and clear inode‑heavy directories:
<code># df -i
# dumpe2fs -h /dev/sda3 | grep 'Inode count'
# find /var/spool/clientmqueue/ -type f -exec rm -f {} \;
</code>
Scenario 4: Deleted files still occupying space
If large files are removed from /tmp but space is not released, an httpd process may still hold the deleted access_log open. Identify the open file with lsof, then truncate it or restart the service:
<code># lsof | grep delete
# echo "" > /tmp/access_log
</code>
Scenario 5: “Too many open files” in a Java web app
Check the current file‑descriptor limit with ulimit -n, then raise it by setting appropriate values in /etc/security/limits.conf. Remember to restart Tomcat after changing limits so they take effect.
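A sketch of the check-and-raise sequence. The `tomcat` user name and the 65535 value are illustrative assumptions, not values from the original article; pick limits that match your workload.

```shell
# Current soft limit on open file descriptors in this shell
ulimit -n

# Count descriptors actually open by a process (replace <PID>):
#   ls /proc/<PID>/fd | wc -l

# Raise the limit persistently in /etc/security/limits.conf, e.g.:
#   tomcat  soft  nofile  65535
#   tomcat  hard  nofile  65535
# Then open a fresh login session and restart Tomcat so the new limit applies.
```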
Scenario 6: “Read‑only file system” error
When a filesystem becomes read‑only due to corruption, identify and stop any processes using the affected partition, unmount it, run
fsck, and remount:
<code># fuser -m /dev/sdb1
# ps -ef | grep <PID>
# kill <PID>
# umount /dev/sdb1
# fsck -V -a /dev/sdb1
# mount /dev/sdb1 /www/data
</code>
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.