Essential Linux Server Troubleshooting Checklist: 13 Practical Steps

When a Linux server fails, this 13-step checklist walks through problem context, user activity, process inspection, network services, resource usage, hardware, I/O performance, logs, and scheduled tasks to help you pinpoint and resolve the root cause quickly.

MaGe Linux Operations

Aug 5, 2014

When a server fault occurs, the cause is rarely obvious; start with this systematic 13‑step checklist.

1. Clarify the problem context

Before touching the server, establish what the failure looks like (no response, error messages), when it was first noticed, whether it can be reproduced, and whether it follows a pattern (e.g., hourly). Ask about recent platform changes and which user groups are affected. Then collect what is already available: infrastructure documentation, monitoring tools (Munin, Zabbix, Nagios, New Relic), and log sources (Loggly, Airbrake, Graylog).

2. Who is logged in?

$ w
$ last

Check which users are currently logged in and who has accessed the system recently, so that you do not debug on top of someone else's work.

3. What happened previously?

$ history

Review recent commands executed on the server; consider setting HISTTIMEFORMAT so that each entry shows a timestamp.
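
As a minimal sketch of the HISTTIMEFORMAT suggestion, assuming Bash is the login shell: the format string below is only one reasonable choice (ISO date plus time), and timestamps are only recorded for commands run after the variable is set.

```shell
# Bash-only setting: prefix each history entry with date and time.
# '%F %T ' is an assumed format (e.g. 2014-08-05 10:15:42); any strftime format works.
HISTTIMEFORMAT='%F %T '
export HISTTIMEFORMAT

# Then, in an interactive shell:
#   history | tail -n 50
```

To keep it across sessions, the same two lines could go in a shell startup file such as ~/.bashrc.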

4. What processes are running now?

$ pstree -a
$ ps aux

Use ps aux for detailed per-process output (owner, CPU, memory) and pstree -a for a condensed tree view of what is running and which process started it.
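
When the ps aux listing is long, it can help to rank processes by resident memory. The helper below is a hypothetical sketch (the name top_rss is not a standard tool); it assumes the usual ps aux column order, where RSS is the sixth field and the command starts at the eleventh.

```shell
# Hypothetical helper: print the N largest processes by resident memory (RSS, in KiB).
# Reads `ps aux` output on stdin; $1 = number of rows to show.
top_rss() {
    # Skip the header, keep RSS, PID and the command name, then sort by RSS descending.
    awk 'NR > 1 { print $6, $2, $11 }' | sort -rn | head -n "$1"
}

# Usage on a live box:
#   ps aux | top_rss 10
```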

5. Which network services are listening?

$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp

Run the commands separately to avoid an overwhelming list; verify that each listening port corresponds to an expected service and PID.
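
Checking each port against what you expect can be partly automated. The following is a sketch only: unexpected_ports is a hypothetical helper, the allow-list (22, 80, 443) is an assumption, and it relies on the usual netstat -ntlp column layout.

```shell
# Hypothetical helper: print listening TCP ports that are NOT on an allow-list.
# Reads `netstat -ntlp` output on stdin; the allowed ports are passed as arguments.
unexpected_ports() {
    allowed=" $* "
    awk -v allowed="$allowed" '
        $1 ~ /^tcp/ && $6 == "LISTEN" {
            n = split($4, parts, ":")   # local address, e.g. 0.0.0.0:22 or :::80
            port = parts[n]             # the port is the last colon-separated piece
            if (index(allowed, " " port " ") == 0)
                print port, $7          # port and PID/program name
        }'
}

# Usage on a live box (root is needed to see every PID):
#   netstat -ntlp | unexpected_ports 22 80 443
```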

6. CPU and memory status

$ free -m
$ uptime
$ top
$ htop

Check for free memory, swap activity, CPU core load, and overall system load averages.
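
Load averages only mean something relative to the number of cores. As a rough sketch (load_per_core is a hypothetical helper, and treating values above 1.00 as saturation is a rule of thumb, not a hard limit):

```shell
# Hypothetical helper: 1-minute load average divided by the core count.
# $1 = load average, $2 = number of CPU cores
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on a live Linux box:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

A load of 6.00 on a 4-core machine gives 1.50, which suggests more runnable or blocked work than the CPUs can absorb.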

7. Hardware inspection

$ lspci
$ dmidecode
$ ethtool eth0  # replace eth0 with the interface to inspect

Identify RAID cards, CPU details, free memory slots, NIC configuration, duplex mode, speed, and any TX/RX errors.

8. I/O performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio

Use these tools to assess disk usage and swap activity, to see whether CPU time is spent in system or user space or is stolen by the hypervisor (the st column on a VM), and to identify which process (e.g., MySQL, PHP) is driving I/O.
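
Swap activity is easy to miss in scrolling vmstat output. The sketch below counts samples with non-zero swap-in or swap-out; swap_active_samples is a hypothetical helper and assumes the standard vmstat layout, where si and so are the seventh and eighth columns.

```shell
# Hypothetical helper: count vmstat samples that show swap-in (si) or swap-out (so).
# Reads `vmstat <interval> <count>` output on stdin.
swap_active_samples() {
    # Lines 1-2 are headers; line 3 is the average since boot, so skip it too.
    awk 'NR > 3 && ($7 > 0 || $8 > 0) { n++ } END { print n + 0 }'
}

# Usage on a live box:
#   vmstat 2 10 | swap_active_samples
```

Anything above zero during the incident window suggests memory pressure worth following up with free -m and top.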

9. Mount points and filesystems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /  # careful: recursively scans the whole tree and can overload a struggling box

Check the number of mounted filesystems, dedicated service filesystems, mount options, remaining disk space, and whether large deleted files still occupy space.
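
To spot nearly full filesystems without reading the whole df listing, a small filter can help. This is a sketch: full_filesystems is a hypothetical helper, the 90 percent threshold is an assumption, and it expects df -P (POSIX format) output with mount points that contain no spaces.

```shell
# Hypothetical helper: list mount points at or above a usage threshold.
# $1 = threshold in percent; reads `df -P` output on stdin.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # "91%" -> "91"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a live box:
#   df -P | full_filesystems 90
```

If df reports a full filesystem but the files on it do not add up, lsof +L1 lists open files whose link count is below one, i.e. deleted files still held open by a process.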

10. Kernel, interrupts, and network tuning

$ sysctl -a | grep …
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack  # /proc/net/nf_conntrack on newer kernels
$ netstat
$ ss -s

Verify that interrupts are balanced across CPUs, and review swap settings, conntrack limits, and TCP timeout settings. Prefer ss -s over netstat for a faster connection summary.
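
For the conntrack limit, what matters is how close the table is to its maximum. A minimal sketch, assuming the nf_conntrack module is loaded and exposes its counters under /proc/sys/net/netfilter (conntrack_usage_pct is a hypothetical helper):

```shell
# Hypothetical helper: connection-tracking table usage as a whole percentage.
# $1 = currently tracked connections, $2 = configured maximum
conntrack_usage_pct() {
    awk -v count="$1" -v max="$2" 'BEGIN { printf "%d\n", count * 100 / max }'
}

# Usage on a live box (paths assume nf_conntrack is loaded):
#   conntrack_usage_pct \
#       "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#       "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

A value approaching 100 means new connections may be dropped, which usually shows up in dmesg as well.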

11. System and kernel logs

$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth.log  # Debian/Ubuntu; RHEL-family systems use /var/log/secure

Look for error or warning messages, hardware or filesystem issues, and correlate timestamps with earlier findings.
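
A first pass over the kernel log can be narrowed with a filter. The pattern list below is an assumption covering a few common failure signatures (OOM kills, disk I/O errors, filesystem errors, crashes); it is a starting point, not an exhaustive list, and kernel_alerts is a hypothetical helper.

```shell
# Hypothetical helper: keep only kernel log lines that match common failure signatures.
# Reads log text on stdin.
kernel_alerts() {
    grep -iE 'out of memory|oom-killer|i/o error|segfault|call trace|ext[234]-fs error|xfs.*error'
}

# Usage on a live box (-T prints human-readable timestamps for correlation):
#   dmesg -T | kernel_alerts
```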

12. Scheduled tasks

$ ls /etc/cron*  # then cat the files of interest
$ for user in $(cut -d: -f1 /etc/passwd); do crontab -l -u "$user"; done

Identify overly frequent jobs, hidden user crontabs, or backup tasks running during the failure.
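
One way to surface overly frequent jobs is to filter crontab lines whose minute field fires every minute. This is a sketch with a hypothetical helper (every_minute_jobs); it only handles user-crontab lines with five time fields and does not cover @-style shortcuts or the extra user column in /etc/crontab.

```shell
# Hypothetical helper: print crontab entries scheduled to run every minute.
# Reads crontab lines on stdin; comments and variable assignments are ignored.
every_minute_jobs() {
    awk '!/^[[:space:]]*#/ && NF >= 6 && ($1 == "*" || $1 == "*/1")'
}

# Usage on a live box (as root):
#   for user in $(cut -d: -f1 /etc/passwd); do
#       crontab -l -u "$user" 2>/dev/null | every_minute_jobs
#   done
```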

13. Application logs

Apache/Nginx: check access and error logs for 5xx errors or limit_zone issues.

MySQL: inspect mysql.log for corruption or InnoDB repair activity.

PHP‑FPM: enable and review slow‑log for PHP, MySQL, or memcache errors.

Varnish: use varnishlog and varnishstat to check hit/miss ratios.

HAProxy: verify backend health checks and queue sizes.
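
For the web server check, counting 5xx responses gives a quick signal before reading individual entries. The sketch below assumes the common "combined" access log format, where the status code is the ninth whitespace-separated field; count_5xx is a hypothetical helper and the log path in the usage line is an assumption that varies by distribution.

```shell
# Hypothetical helper: count responses with a 5xx status code.
# Reads access log lines (combined format) on stdin.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Usage on a live box (path is an assumption):
#   count_5xx < /var/log/nginx/access.log
```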

Conclusion

After following these steps you should know what processes are running, whether the issue relates to I/O, hardware, network, or system configuration, and you will have enough information to dig deeper and ultimately locate the root cause.
