
Essential Checklist for Rapid Server Troubleshooting

This guide walks you through a systematic, step‑by‑step process for diagnosing and resolving poor‑performance or failure incidents on Linux servers, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O, logs, cron jobs and application‑level diagnostics.

Programmer DD

Nov 23, 2019


1. Understand the problem context

Gather all known information about the server and the specific failure before jumping to the console.

What is the symptom? No response, error messages, etc.

When was the issue first observed?

Can it be reproduced?

Is there a pattern (e.g., hourly)?

What was the last change to the platform (code, server, etc.)?

Which user group is affected?

Is documentation of the physical and logical architecture available?

Is a monitoring system (Munin, Zabbix, Nagios, New Relic…) accessible?

Are logs (Loggly, Airbrake, Graylog…) available?

2. Identify who is online

$ w
$ last

w shows who is currently logged in and what they are running; last lists recent logins. Knowing who else is on the server helps you avoid interfering with their work.

3. Review recent command history

$ history

Inspect previously executed commands and correlate them with user information; consider setting HISTTIMEFORMAT to show timestamps.
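As a minimal sketch of the timestamp suggestion above, assuming the shell is bash (HISTTIMEFORMAT is a bash variable and other shells ignore it), the following makes each history entry carry a date and time. The format string is only one reasonable choice.

```shell
# Bash only: prefix each `history` entry with a date and time.
# '%F %T ' is a strftime format (YYYY-MM-DD HH:MM:SS plus a trailing space).
export HISTTIMEFORMAT='%F %T '
```

Running history afterwards prints entries with the timestamp in front. Entries recorded before the variable was set may not show their real execution time, so treat older timestamps with caution.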

4. Examine running processes

$ pstree -a
$ ps aux

pstree -a shows the process tree with each command's arguments, which makes parent/child relationships easy to read, while ps aux adds per-process detail such as owner, CPU and memory usage.
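When the ps aux listing is long, it can help to rank processes by memory. The sketch below assumes the usual ps aux column layout, where the fourth column is %MEM; the helper name top_by_mem is hypothetical and not part of any tool.

```shell
# top_by_mem [N]: read `ps aux` output on stdin and print the N rows
# with the highest %MEM (column 4). N defaults to 5.
top_by_mem() {
    # Drop the header line, sort numerically on column 4 in descending
    # order, then keep the first N rows.
    awk 'NR > 1' | sort -rn -k4,4 | head -n "${1:-5}"
}
```

Typical use would be: ps aux | top_by_mem 10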

5. Check listening network services

$ netstat -ntlp
$ netstat -nlp
$ netstat -nulp

Run the commands one at a time so the output stays readable. The -n option keeps addresses and ports numeric; drop it if you prefer resolved host and service names, at the cost of slower output.
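To get a quick list of listening ports from that output, a small filter can help. This is a sketch that assumes the local address sits in the fourth column, as it does in typical netstat -ntl output; the helper name listening_ports is hypothetical.

```shell
# listening_ports: read `netstat -ntl` style output on stdin and print
# the unique port numbers found in the local-address column (column 4).
listening_ports() {
    # Keep rows whose 4th field ends in ":<digits>", then take the text
    # after the last colon so IPv6 forms such as ":::80" also work.
    awk '$4 ~ /:[0-9]+$/ { n = split($4, parts, ":"); print parts[n] }' | sort -n | uniq
}
```

Typical use would be: netstat -ntl | listening_ports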

6. Inspect CPU and memory usage

$ free -m
$ uptime
$ top
$ htop

Is there any free memory left? Is the server swapping?

Are CPU cores overloaded?

What is the overall load average and its source?
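A load average only means something relative to the number of CPU cores. The sketch below divides the 1-minute load by the core count; it assumes a Linux host with /proc/loadavg and the nproc command, and the helper name load_per_core is hypothetical.

```shell
# load_per_core LOAD CORES: print LOAD divided by CORES, two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Live check, skipped on systems without /proc/loadavg or nproc.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
fi
```

As a rough rule of thumb, a value that stays well above 1.00 suggests more runnable or I/O-blocked tasks than the cores can serve, though the right threshold depends on the workload.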

7. Review hardware details

$ lspci
$ dmidecode
$ ethtool

Check RAID cards, CPU, memory slots, NIC settings, and any hardware errors. Note that dmidecode needs root privileges and ethtool expects an interface name as its argument.

8. Evaluate I/O performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io

These tools help pinpoint disk, memory and CPU bottlenecks.

9. Inspect mount points and file systems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /   # walks the whole tree; can be very slow on a busy server

How many file systems are mounted?

Are there dedicated file systems for services?

What mount options are used (noatime, read‑only, etc.)?

Is there sufficient disk space?

Are there large deleted files still consuming space?
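The disk-space question can be answered mechanically. The following sketch reads df -P output (the POSIX format, one line per file system) and prints mount points at or above a given usage percentage. The helper name full_filesystems and the 90% threshold are illustrative assumptions, and mount points containing spaces are not handled.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "<mount point> <use%>" for every file system at or above THRESHOLD.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # "95%" -> "95"
        if (use + 0 >= limit) print $6, $5
    }'
}

# Live check against the current host.
df -P | full_filesystems 90
```

For the deleted-files question, lsof +L1 lists open files whose link count is below one, which are files that were deleted but are still held open and still consuming space.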

10. Check kernel, interrupts and network settings

$ sysctl -a | grep …
$ cat /proc/interrupts
$ cat /proc/net/ip_conntrack
$ netstat
$ ss -s

Are interrupts evenly distributed across CPUs?

How is swappiness (vm.swappiness) configured?

Is conntrack_max sufficient?

How are TCP timeout states configured?
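To judge whether conntrack_max is sufficient, compare the current number of tracked connections with the limit. The sketch below assumes a kernel that exposes nf_conntrack_count and nf_conntrack_max under /proc/sys/net/netfilter, which differs from the legacy ip_conntrack path used above and is only present when connection tracking is loaded. The helper name conntrack_usage is hypothetical.

```shell
# conntrack_usage COUNT MAX: print table usage as an integer percentage.
conntrack_usage() {
    echo $(( $1 * 100 / $2 ))
}

# Assumed paths for kernels using nf_conntrack; skipped when absent.
count_file=/proc/sys/net/netfilter/nf_conntrack_count
max_file=/proc/sys/net/netfilter/nf_conntrack_max
if [ -r "$count_file" ] && [ -r "$max_file" ]; then
    echo "conntrack usage: $(conntrack_usage "$(cat "$count_file")" "$(cat "$max_file")")%"
fi
```

A table that is close to full is a plausible cause of dropped connections, so a high percentage here is worth correlating with the kernel log.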

11. Review system and kernel logs

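$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth

Look for error or warning messages and signs of hardware faults, and correlate their timestamps with your earlier findings. On Debian-based systems the equivalent files are usually /var/log/syslog and /var/log/auth.log.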
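Paging through large logs by hand is slow, so a first pass with a filter can narrow things down. The keyword list below is an illustrative assumption, not an exhaustive set of failure signatures, and the helper name suspicious_lines is hypothetical.

```shell
# suspicious_lines: read log text on stdin and print lines that mention
# common failure keywords (case-insensitive).
suspicious_lines() {
    grep -iE 'error|warn|fail|out of memory'
}
```

Typical use would be: dmesg | suspicious_lines | tail -n 50. Anything it surfaces still needs to be read in context, since a keyword match alone does not prove a fault.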

12. Examine scheduled (cron) tasks

$ ls /etc/cron*
$ for user in $(cut -f1 -d: /etc/passwd); do crontab -l -u "$user"; done

Are any tasks running too frequently?

Do hidden cron jobs exist for certain users?

Was a backup job running when the failure occurred?

13. Analyze application logs

Check logs of web servers (Apache/Nginx), databases (MySQL), PHP‑FPM, Varnish, HA‑Proxy, etc., focusing on 5xx errors, slow queries, or connection limits.
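For the 5xx check, a quick count per status code shows whether errors are isolated or widespread. This sketch assumes an Apache/Nginx access log in the common or combined format, where the status code is the ninth whitespace-separated field; custom log formats will need a different field number. The helper name count_5xx is hypothetical.

```shell
# count_5xx: read access-log lines on stdin and print "<status> <count>"
# for every 5xx status code, sorted by status.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { hits[$9]++ }
         END { for (status in hits) print status, hits[status] }' | sort
}
```

Typical use would be: count_5xx < access.log, where access.log stands for whatever path your web server writes to.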

Conclusion

After following these steps you should know which services run on the server, whether the issue relates to I/O, hardware, network or configuration, and you may have identified the root cause or at least gathered enough evidence to continue deeper investigation.


Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
