Essential Ops Troubleshooting: 10 Quick Fixes and 22 Common Failure Cases
This guide compiles the most frequent Linux and network problems faced by operations engineers—ranging from non‑executing shell scripts and cron output issues to read‑only filesystems, disk space leaks, and service start failures—providing clear causes and step‑by‑step solutions for each case.
As an operations engineer, you often encounter various issues; summarizing common faults and solutions helps develop good habits.
10 Problem‑Solving Techniques
1. Shell script does not execute
Problem: Script reports ":bad interpreter: No such file or directory".
Cause: The script was edited on Windows, leaving CRLF line endings (\r) that appear as ^M on Linux.
Solution:
Rewrite the script directly on Linux, or
Open it in vi and run :%s/\r//g or :%s/^M//g (enter ^M with Ctrl+V, Ctrl+M) to remove the stray characters.
Tip: Use sh -x script.sh to print each command as it runs and see where the script fails.
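The same cleanup can be done without opening an editor. The following is a minimal sketch, assuming tr and grep are available; the function names has_crlf and strip_cr are made up for this illustration and are not part of any standard tool.

```shell
#!/bin/sh
# Succeed if the file contains at least one carriage return (CR) character.
has_crlf() {
    # Command substitution strips trailing newlines only, so the CR survives.
    grep -q "$(printf '\r')" "$1"
}

# Write a copy of the input file with every carriage return removed.
# Usage: strip_cr INPUT OUTPUT
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, strip_cr script.sh script.unix.sh leaves the original untouched, so the result can be checked before it replaces the broken script.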
2. Controlling crontab output
Problem: /var/spool/clientmqueue grows beyond 100 GB.
Cause: Cron jobs produce output that is mailed to the cron user; sendmail is not running, so the mail files accumulate.
Solution:
To reclaim the space, delete the accumulated files from inside /var/spool/clientmqueue: ls | xargs rm -f
To stop the buildup, append >/dev/null 2>&1 to cron commands to discard their output.
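The order of the two redirections matters, which is easy to get wrong when editing a crontab. The sketch below uses a made-up function, noisy_job, as a stand-in for a real cron command and captures what would still reach cron's mail in each case.

```shell
#!/bin/sh
# Hypothetical job that writes to both stdout and stderr.
noisy_job() {
    echo "progress message"
    echo "warning message" >&2
}

# Correct order: stdout is sent to /dev/null first, then stderr follows it there.
silent=$(noisy_job >/dev/null 2>&1)

# Wrong order: stderr is attached to the original stdout before stdout is discarded,
# so the warning still escapes and would be mailed by cron.
leaky=$(noisy_job 2>&1 >/dev/null)

# In a crontab the correct form looks like this (the job path is an assumption):
#   0 2 * * * /opt/scripts/job.sh >/dev/null 2>&1
```

After running it, silent is empty while leaky still holds the stderr line.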
3. Telnet/SSH is slow
Problem: Telnet from host 10.50 to 10.52 is very slow, while ping works.
Cause: Reverse DNS lookup fails because the nameserver is not reachable.
Solution:
Add the correct hostname-IP mapping to /etc/hosts, and
Comment out the non-working nameserver in /etc/resolv.conf or use a functional one.
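Before editing either file, it can help to see what they currently contain. This is a small sketch, assuming the standard file formats; the function names and the optional file argument (handy for trying it on a copy) are inventions of this example.

```shell
#!/bin/sh
# Print each nameserver address listed in a resolv.conf-style file.
# Commented-out lines are skipped because their first field is not "nameserver".
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Succeed if the hosts file maps the given IP address to at least one name.
# Usage: has_hosts_entry IP [HOSTS_FILE]
has_hosts_entry() {
    awk -v ip="$1" '$1 == ip && NF > 1 { found = 1 } END { exit !found }' \
        "${2:-/etc/hosts}"
}
```

Each address printed by list_nameservers can then be tested for reachability, and has_hosts_entry confirms whether the peer already has a local mapping.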
4. Read‑only filesystem error
Problem: MySQL fails to create a table, reporting "ERROR 1005 (HY000): Can't create table … (errno: 30)".
Cause: The underlying OS reports error code 30 (read-only filesystem), possibly due to filesystem corruption, bad disk sectors, or incorrect fstab entries.
Solution:
Reboot the test machine to recover, or
Remount the filesystem with write permissions (e.g., mount -o remount,rw /dev/sdX).
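To confirm that a mount point really has flipped to read-only before remounting, the mount options in /proc/mounts can be inspected. The following is a sketch under that assumption; is_readonly is a hypothetical helper, and its second argument exists only so the check can be tried against a sample file.

```shell
#!/bin/sh
# Succeed if the given mount point is listed with the "ro" option.
# Usage: is_readonly MOUNT_POINT [MOUNTS_FILE]
is_readonly() {
    awk -v mp="$1" '
        $2 == mp {
            # If a mount point appears more than once, the last entry wins.
            ro = 0
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++) if (opts[i] == "ro") ro = 1
        }
        END { exit !ro }
    ' "${2:-/proc/mounts}"
}
```

If the check succeeds unexpectedly, the kernel log is usually the next place to look for the underlying disk or filesystem error.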
5. Deleted file does not free disk space
Problem: df -h shows 90 GB used, but du -sh /* totals only 30 GB.
Cause: A process still holds an open file descriptor to a deleted file.
Solution:
Restart the system or the affected service, or
Identify the holding process with /usr/sbin/lsof | grep deleted and release the space by truncating the file through its descriptor, e.g., echo > /proc/25575/fd/33, or kill the process.
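Where lsof is not installed, the same information is visible under /proc on Linux, because the symlink for a descriptor whose file was removed ends in " (deleted)". This is a minimal sketch under that assumption; list_deleted_fds is a name made up for this example.

```shell
#!/bin/sh
# Print "FD TARGET" for every descriptor of a process that points at a deleted file.
# Usage: list_deleted_fds PID
list_deleted_fds() {
    for fd in /proc/"$1"/fd/*; do
        # readlink fails if the descriptor vanished or the glob matched nothing.
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in
            *' (deleted)') echo "${fd##*/} $target" ;;
        esac
    done
}
```

The descriptor number in the first column is what goes into the /proc/PID/fd/N path used above. Truncating through that path frees the blocks but leaves the descriptor open in the process.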
6. Improving performance of find cleanup
Problem: A nightly find /tmp -name "picture_*" -mtime +1 -exec rm -f {} \; script causes high load.
Cause: Scanning a directory with many files is resource‑intensive.
Solution: Change to the directory first and use faster shell constructs, e.g.:
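<code>#!/bin/sh
# Delete picture files in /tmp that were last modified two days ago.
cd /tmp || exit 1
# Build the date as ls -l prints it (e.g. "Dec  5"); %e pads single-digit days with a space.
day=$(LC_ALL=C date -d "2 days ago" "+%b %e")
LC_ALL=C ls -l | grep "picture" | grep "$day" | awk '{print $NF}' | xargs rm -f</code>
7. Unable to obtain gateway MAC address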
Problem: ARP table shows incomplete entry for the gateway.
Solution: Bind the correct MAC address manually, e.g., arp -s 192.168.3.254 00:5e:00:01:64 (the address as printed has only five octets; supply the gateway's full six-octet MAC).
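A static entry is only useful if the hardware address is well formed, so a quick format check before calling arp -s may save a confusing error. The sketch below assumes the colon-separated notation; is_valid_mac is a hypothetical helper, not part of the arp tool.

```shell
#!/bin/sh
# Succeed if the argument is six colon-separated pairs of hex digits.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}
```

A five-octet value such as 00:5e:00:01:64 fails this check, while a full address such as 00:11:22:33:44:55 passes.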
8. HTTP service fails to start
Problem: Starting httpd reports "address already in use" on port 7080.
Cause: Port 7080 is defined in multiple configuration files (/etc/httpd/conf/httpd.conf and /etc/httpd/conf.d/t.10086.cn.conf).
Solution: Comment out the duplicate Listen 7080 line in the second file and restart the service.
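Duplicate directives like this are quicker to spot with a recursive search than by reading each file. The following is a sketch, assuming GNU grep; find_listen is a name invented for this example, and the pattern covers the plain, address-prefixed, and protocol-suffixed forms of Listen.

```shell
#!/bin/sh
# Print file:line:text for every active (uncommented) Listen directive on a port.
# Usage: find_listen PORT PATH...   (paths may be files or directories)
find_listen() {
    port=$1
    shift
    grep -rnE "^[[:space:]]*Listen[[:space:]]+([^[:space:]]*:)?${port}([[:space:]]|\$)" "$@"
}
```

For the case above, find_listen 7080 /etc/httpd/conf /etc/httpd/conf.d would be expected to list one line per file that declares the port.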
9. "Too many open files" error
Problem: Applications hit the "too many open files" limit.
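Solution: Increase limits in /etc/security/limits.conf and /root/.bash_profile:
<code># /etc/security/limits.conf
* soft nproc 65535
* hard nproc 65535
* soft nofile 65535
* hard nofile 65535
# /root/.bash_profile
ulimit -n 65535
ulimit -u 65535</code>
Log in again (or reboot) so the limits.conf values apply, or run the ulimit commands in the current shell.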
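To check whether the new limits took effect and how close a process is to them, the shell's own limits and the process's descriptor directory can be compared. This is a sketch assuming a Linux /proc filesystem; both function names are hypothetical.

```shell
#!/bin/sh
# Print the soft and hard open-file limits of the current shell.
show_nofile_limits() {
    echo "soft=$(ulimit -Sn) hard=$(ulimit -Hn)"
}

# Count the file descriptors a process currently holds (Linux /proc only).
# Usage: open_fd_count PID
open_fd_count() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}
```

A count that sits near the soft limit suggests a descriptor leak in the application, which raising the limit only postpones.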
10. ibdata1 and mysql‑bin logs consume disk space
Problem: ibdata1 >120 GB and mysql‑bin >80 GB fill the disk.
Cause: InnoDB shared tablespace grows without automatic shrinkage; binary logs accumulate.
Solution:
Dump and recreate the database to shrink ibdata1.
Manually purge old binary logs: PURGE MASTER LOGS TO 'mysql-bin.010'; or PURGE MASTER LOGS BEFORE '2010-12-22 13:00:00';
Set expire_logs_days=30 in /etc/my.cnf for automatic cleanup.
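Before purging, it may be worth previewing which binary logs fall outside the retention window. The sketch below only lists files and deletes nothing; old_binlogs is a made-up helper, and the data directory passed to it (for instance /var/lib/mysql) is an assumption that varies by installation.

```shell
#!/bin/sh
# List binary log files whose modification time is more than DAYS days old.
# This only prints names; remove logs with PURGE rather than rm so that the
# binary log index file stays consistent with what is on disk.
# Usage: old_binlogs DIR DAYS
old_binlogs() {
    find "$1" -maxdepth 1 -type f -name 'mysql-bin.[0-9]*' -mtime +"$2"
}
```

The newest file name in that listing gives a rough guide for the argument to PURGE MASTER LOGS TO.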
22 Common Failure Cases
1. Linux installer cannot find hard disk. Fix: Enter the BIOS/CMOS settings and set the disk mode to compatible.
2. Installation stops after partitioning. Fix: Ensure both root and swap partitions are created.
3. Missing or unwanted packages after installation. Fix: Gain deeper Linux knowledge and reinstall as needed.
4. Proxy server filter rules not taking effect. Fix: Verify module loading, correct default policies, correct iptables syntax, and rule order.
5. After proxy/firewall setup, Internet works but DMZ services do not. Fix: Disable iptables temporarily to test; adjust rules if needed.
6. iptables rules disappear after service restart. Fix: Set IPTABLES_SAVE_ON_RESTART="yes" in /etc/sysconfig/iptables-config and save rules with iptables-save > /etc/sysconfig/iptables.
7. VLAN cannot access external network. Fix: Configure the correct gateway for the VLAN.
8. named service fails to start. Fix: Ensure required files exist in /etc/named and /var/named, and that the named user has proper permissions.
9. DNS resolution fails. Fix: Check forward/reverse zone files, /etc/named.conf syntax, bind-chroot locations, and /etc/resolv.conf nameserver entries.
10. dhcpd reports "No subnet declaration for eth0". Fix: Assign an IP to eth0 that falls within a defined DHCP subnet.
11. Multiple DHCP scopes but only one distributes addresses. Fix: Provide a separate network interface for each scope (eth0, eth1, eth2) or use a super‑scope.
12. MySQL installation fails due to dependency issues. Fix: Install required libraries and follow the dependency chain in the correct order.
13. Web service returns no page despite connection. Fix: Correct the DocumentRoot path in httpd.conf (remove trailing slash).
14. Remote client cannot access Samba share. Fix: Disable iptables.
15. Samba returns "NT_STATUS_BAD_NETWORK_NAME". Fix: Ensure the shared directory exists.
16. Samba returns "NT_STATUS_ACCESS_DENIED". Fix: Verify username/password and disable firewall if needed.
17. Samba returns "NT_STATUS_LOGON_FAILURE". Fix: Grant the user access to the share.
18. FTP upload rejected. Fix: Grant write permission on the target directory for the FTP user.
19. root cannot log into FTP ("500 OOPS: cannot change directory:/root"). Fix: Disable SELinux or set SELINUX=disabled in /etc/selinux/config.
20. Mail client can send but not receive mail. Fix: Ensure the POP3 service is running.
21. NFS mount hangs. Fix: Start the portmap service.
22. NFS mount works locally but not from other clients. Fix: Disable iptables on the server.
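For case 19, it can be useful to confirm what the SELinux config file actually says after editing it. This is a small sketch, assuming the usual KEY=value layout of /etc/selinux/config; selinux_config_mode is a hypothetical helper, and the optional argument is there so it can be tried on a copy of the file.

```shell
#!/bin/sh
# Print the mode set by the SELINUX= line of an SELinux config file.
# Comment lines and the SELINUXTYPE= line are ignored.
# Usage: selinux_config_mode [CONFIG_FILE]
selinux_config_mode() {
    sed -n 's/^[[:space:]]*SELINUX=\([A-Za-z]*\).*/\1/p' "${1:-/etc/selinux/config}"
}
```

The file is read at boot, so the printed value reflects what the next restart will apply rather than the mode currently in force.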