Essential Checklist for Rapid Server Troubleshooting
This guide walks through a systematic, step-by-step process for diagnosing performance problems and failures on Linux servers. It starts with gathering context and checking who is logged in, then moves on to processes, network services, hardware, I/O, logs, cron jobs and application-level diagnostics.
1. Understand the problem context
Gather all known information about the server and the specific failure before jumping to the console.
What is the symptom? No response, error messages, etc.
When was the issue first observed?
Can it be reproduced?
Is there a pattern (e.g., hourly)?
What was the last change to the platform (code, server, etc.)?
Which user group is affected?
Is documentation of the physical and logical architecture available?
Is a monitoring system (Munin, Zabbix, Nagios, New Relic…) accessible?
Are logs (Loggly, Airbrake, Graylog…) available?
2. Identify who is online
$ w
$ last
Use these commands to see who is currently logged in and who logged in recently, so that you do not interfere with other people already working on the machine.
3. Review recent command history
$ history
Inspect previously executed commands and correlate them with user information; consider setting HISTTIMEFORMAT to show timestamps.
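As a minimal sketch of the timestamp setting mentioned above, assuming the shell in use is bash (other shells handle history differently), the variable can be set for the current session like this:

```shell
# Bash-specific: prefix each history entry with the date and time it ran.
# %F is the ISO date (YYYY-MM-DD) and %T is the time (HH:MM:SS).
# Entries recorded before timestamps were enabled have no reliable time.
HISTTIMEFORMAT='%F %T '
export HISTTIMEFORMAT
```

Running history afterwards shows the timestamp in front of each command. To make the setting permanent it would have to go into a shell startup file, which is a change to the server and is best agreed with its owners first.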
4. Examine running processes
$ pstree -a
$ ps aux
pstree -a gives a condensed tree of what is running and which process started which, while ps aux provides per-process detail such as owner, CPU and memory usage.
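Because ps aux output can be long, it may help to reduce it to the busiest processes. The helper below is a hypothetical sketch (the name top_by_cpu is not from the original article) that reads ps aux output on standard input and keeps the rows with the highest %CPU value, assuming the usual layout where %CPU is the third column.

```shell
# top_by_cpu N: read `ps aux` output on stdin and print the N rows
# with the highest %CPU (third column). Defaults to 5 rows.
top_by_cpu() {
    n="${1:-5}"
    # Drop the header row, sort numerically on column 3 in descending
    # order, then keep the first N rows.
    tail -n +2 | sort -k3,3nr | head -n "$n"
}
```

A typical invocation would be: ps aux | top_by_cpu 10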
5. Check listening network services
$ netstat -ntlp
$ netstat -nlp
$ netstat -nulp
Run the commands one at a time so the output stays readable; drop the -n option if you prefer resolved host and service names over numeric addresses and ports.
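To get a quick list of listening TCP ports from that output, a small filter can help. The function below is a hypothetical sketch (listening_tcp_ports is an illustrative name) and assumes the common Linux netstat layout, where the local address is the fourth column and the state is the sixth.

```shell
# listening_tcp_ports: read `netstat -ntl` (or -ntlp) output on stdin
# and print the unique listening TCP port numbers, sorted numerically.
listening_tcp_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
        # The port is whatever follows the last colon; this also
        # handles IPv6 addresses such as :::80.
        n = split($4, parts, ":")
        print parts[n]
    }' | sort -un
}
```

A typical invocation would be: netstat -ntl | listening_tcp_ports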
6. Inspect CPU and memory usage
$ free -m
$ uptime
$ top
$ htop
Is there free memory or swap activity?
Are CPU cores overloaded?
What is the overall load average and its source?
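One rough way to put the load average in context is to divide it by the number of CPU cores. The sketch below assumes a Linux system with /proc/loadavg and the nproc utility; load_per_core is a hypothetical helper name. Treat the result as a hint only: on Linux the load average also counts tasks blocked in uninterruptible I/O, so a high value does not necessarily mean the CPUs are busy.

```shell
# load_per_core LOAD CORES: print the load average divided by the
# number of cores, to two decimal places.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Only attempt a live reading where the expected sources exist.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    # The first field of /proc/loadavg is the 1-minute load average.
    load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

A value that stays well above 1.00 suggests more runnable or blocked tasks than cores, which is a reason to look at top and the I/O tools in more detail.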
7. Review hardware details
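$ lspci
$ dmidecode
$ ethtool <interface>
Check RAID cards, CPU, memory slots, NIC settings, and any hardware errors. ethtool needs an interface name as its argument, and dmidecode generally has to be run as root.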
8. Evaluate I/O performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io
These tools help pinpoint disk, memory and CPU bottlenecks.
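As one example of reading these tools, the sketch below counts how many vmstat samples show swap activity. It assumes the default procps vmstat layout, where si (swap in) and so (swap out) are the seventh and eighth columns; swap_active_samples is a hypothetical helper name.

```shell
# swap_active_samples: read `vmstat <interval> <count>` output on stdin
# and print how many samples show swap-in or swap-out activity.
swap_active_samples() {
    # Lines 1-2 are headers. Line 3 reports averages since boot rather
    # than a live sample, so counting starts at line 4.
    awk 'NR > 3 && ($7 > 0 || $8 > 0) { count++ } END { print count + 0 }'
}
```

A typical invocation would be: vmstat 2 10 | swap_active_samples. A result above zero means the system was swapping during the sampling window, which usually points to memory pressure rather than a pure disk problem.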
9. Inspect mount points and file systems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /*
How many file systems are mounted?
Are there dedicated file systems for services?
What mount options are used (noatime, read‑only, etc.)?
Is there sufficient disk space?
Are there large deleted files still consuming space?
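The disk space question can be answered quickly by filtering df output against a threshold. The function below is a hypothetical sketch (full_filesystems is an illustrative name) and expects the POSIX format produced by df -P, where the capacity is the fifth column and the mount point the sixth. Mount points containing spaces are not handled.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# the mount point and usage of every file system at or above
# THRESHOLD percent.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)          # "95%" -> "95"
        if (pct + 0 >= limit + 0) print $6, $5
    }'
}
```

A typical invocation would be: df -P | full_filesystems 90. For the last question in the list, lsof +L1 lists open files whose link count is below one, which typically means they were deleted while a process still holds them open.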
10. Check kernel, interrupts and network settings
$ sysctl -a | grep …
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack
$ netstat
$ ss -s
Are interrupts evenly distributed across CPUs?
What are the swap settings?
Is conntrack_max sufficient?
How are TCP timeout states configured?
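To judge whether the connection tracking limit is sufficient, it can help to express the current table size as a percentage of the maximum. The sketch below assumes a kernel that exposes the nf_conntrack counters under /proc/sys/net/netfilter; older systems using ip_conntrack expose them under different names, and the files are absent when connection tracking is not loaded. conntrack_usage_pct is a hypothetical helper name.

```shell
# conntrack_usage_pct COUNT MAX: print how full the connection
# tracking table is, as a whole-number percentage.
conntrack_usage_pct() {
    echo $(( $1 * 100 / $2 ))
}

# Assumed locations of the counters on nf_conntrack-based kernels.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max

# Only attempt a live reading when both files are present and readable.
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    pct="$(conntrack_usage_pct "$(cat "$count_file")" "$(cat "$max_file")")"
    echo "conntrack table: ${pct}% full"
fi
```

A table that is close to full is worth correlating with dropped-connection messages in the kernel log.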
11. Review system and kernel logs
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
Look for error or warning messages, hardware faults, and correlate timestamps with earlier findings.
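A simple keyword filter can serve as a first pass over long logs. The sketch below is an assumption-laden starting point: the keyword list is illustrative, not exhaustive, and suspicious_lines is a hypothetical helper name.

```shell
# suspicious_lines: read log text on stdin and print the lines that
# contain common failure keywords (case-insensitive).
suspicious_lines() {
    grep -iE 'error|fail|warn|out of memory|segfault|call trace'
}
```

A typical invocation would be: dmesg -T | suspicious_lines, where -T asks dmesg for human-readable timestamps that are easier to correlate with earlier findings. The filter will miss anything phrased differently, so it complements reading the logs rather than replacing it.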
12. Examine scheduled (cron) tasks
$ ls /etc/cron*
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done
Are any tasks running too frequently?
Do hidden cron jobs exist for certain users?
Was a backup job running when the failure occurred?
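To spot jobs that run very frequently, crontab output can be filtered for entries whose minute field matches every minute. The function below is a hypothetical sketch (minutely_entries is an illustrative name); it only recognises a minute field of * or */1 and ignores other ways of scheduling frequent runs.

```shell
# minutely_entries: read crontab text on stdin and print the entries
# whose minute field is "*" or "*/1". Comment lines never match because
# the pattern requires the line to begin with the minute field.
minutely_entries() {
    grep -E '^[[:space:]]*\*(/1)?[[:space:]]'
}
```

A typical invocation, run as root, would be: crontab -l -u someuser | minutely_entries, where someuser is a placeholder for an account name on the server.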
13. Analyze application logs
Check the logs of web servers (Apache, Nginx), databases (MySQL), PHP-FPM, Varnish, HAProxy and similar components, looking for 5xx errors, slow queries, and exhausted connection limits.
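For web server access logs, counting 5xx responses by status code gives a quick picture of server-side failures. The sketch below assumes the common or combined log format used by default in Apache and Nginx, where the status code is the ninth whitespace-separated field; a custom log format would need a different field number. count_5xx is a hypothetical helper name.

```shell
# count_5xx: read an access log on stdin and print each 5xx status
# code followed by the number of times it occurs, sorted by code.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { hits[$9]++ }
         END { for (code in hits) print code, hits[code] }' | sort
}
```

A typical invocation would be: count_5xx < access.log, with access.log standing in for the actual log path on the server.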
Conclusion
After following these steps you should know which services run on the server, whether the issue relates to I/O, hardware, network or configuration, and you may have identified the root cause or at least gathered enough evidence to continue deeper investigation.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
