How to Diagnose Server Failures Within the First 5 Minutes
This guide walks you through a systematic, step‑by‑step process for quickly identifying the root cause of a server outage, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O performance, filesystem mounts, and relevant logs.
1. Clarify the Problem Context
Before touching the server, collect all known facts about the incident: what symptoms appear (no response, errors), when the issue was first noticed, reproducibility, any patterns (e.g., hourly), recent platform changes, affected user groups, available infrastructure documentation, and whether monitoring or log services (Munin, Zabbix, New Relic, Loggly, Graylog, etc.) are accessible.
2. Identify Who Is Logged In
$ w
$ last
These commands show current and recent users. Knowing who else is on the server lets you coordinate with them and avoid interfering with each other's work.
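As a rough illustration of this check, the sketch below filters `who`-style output for accounts other than your own. The function name `other_users` is hypothetical, and it assumes the username is the first whitespace-separated field, which is how `who` normally prints sessions.

```shell
# other_users ME: read `who` output on stdin and print each distinct
# username that is not ME. (Hypothetical helper, not a standard tool.)
other_users() {
    awk -v me="$1" 'NF && $1 != me { print $1 }' | sort -u
}

# Typical use on a live system:
#   who | other_users "$(id -un)"
```

An empty result suggests you are the only one with an open session, though it says nothing about automated jobs or deploy tooling.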
3. Review Command History
$ history
Inspect the commands recently executed on the server. Consider setting HISTTIMEFORMAT so that history entries include timestamps, which makes them easier to correlate with the incident timeline.
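A minimal sketch of the HISTTIMEFORMAT setting mentioned above, assuming the shell is bash. The format string is only one reasonable choice; any strftime-style pattern works.

```shell
# Show a date and time in front of every entry printed by `history`.
# %F is the ISO date (YYYY-MM-DD) and %T the 24-hour time (HH:MM:SS).
export HISTTIMEFORMAT='%F %T '
```

To make this permanent, the line would normally go in a shell startup file such as ~/.bashrc. Commands recorded before the variable was set have no stored timestamp, so the times shown for them are not reliable.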
4. Examine Running Processes
$ pstree -a
$ ps aux
ps aux provides a detailed list of every process and its owner, while pstree -a gives a clearer hierarchical view of which process started which.
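When the process list is long, it can help to pull out the heaviest entries first. The sketch below is one way to do that; `top_mem` is a hypothetical helper, and it relies on %MEM being the fourth column of `ps aux` output.

```shell
# top_mem [N]: read `ps aux` output on stdin and print the N processes
# (default 5) with the highest %MEM, largest first. Column 4 of
# `ps aux` is %MEM. (Hypothetical helper.)
top_mem() {
    awk 'NR > 1' | sort -k4,4nr | head -n "${1:-5}"
}

# Typical use:
#   ps aux | top_mem 10
```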
5. List Listening Network Services
$ netstat -ntlp
$ netstat -nulp
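$ netstat -nxlp
These list listening TCP (-t), UDP (-u), and Unix domain (-x) sockets along with the owning process (-p). Run the three commands separately to keep the output manageable. The -n option prints numeric addresses and ports instead of resolving them to names; omit it only if you prefer hostnames and service names.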
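To compare what is actually listening against what you expect, a short list of ports is easier to scan than the full table. The following is a sketch under the assumption that the input is standard `netstat -ntl` or `netstat -ntlp` output; `listening_ports` is a hypothetical name.

```shell
# listening_ports: read `netstat -ntl` (or -ntlp) output on stdin and print
# the distinct TCP ports in LISTEN state, in ascending order.
listening_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
        n = split($4, part, ":")   # the port follows the last colon
        print part[n]
    }' | sort -un
}

# Typical use:
#   netstat -ntl | listening_ports
```

On systems where netstat is not installed, `ss -ntlp` accepts the same flags, but its column layout differs, so this parser would need adjusting.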
6. Check CPU and Memory Usage
$ free -m
$ uptime
$ top
$ htop
Ask whether any free memory remains, whether the server is swapping, whether the CPU cores are saturated, and what the load average looks like.
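One common way to read the load average is to compare it with the number of CPU cores. The sketch below encodes that comparison; `load_status` is a hypothetical helper, and the one-per-core threshold is a rule of thumb rather than a hard limit.

```shell
# load_status LOAD CORES: print "saturated" when the load average exceeds
# the number of CPU cores, otherwise "ok". (Hypothetical helper; the
# one-per-core threshold is a rule of thumb, not a hard limit.)
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (load + 0 > cores + 0) print "saturated"; else print "ok"
    }'
}

# Typical use on Linux (1-minute load average versus core count):
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

On Linux the load average also counts processes blocked on I/O, so a "saturated" result may point at disks rather than CPU.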
7. Inspect Hardware Details
$ lspci
$ dmidecode
$ ethtool
Identify RAID cards, the CPU model, and empty memory slots, and verify NIC settings (duplex mode, speed, TX/RX errors). Note that ethtool expects an interface name as its argument.
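For the NIC check, the two values that most often reveal a bad negotiation are speed and duplex. This sketch extracts just those lines from `ethtool` output; `nic_summary` is a hypothetical helper and `eth0` is a placeholder interface name.

```shell
# nic_summary: read `ethtool IFACE` output on stdin and print only the
# negotiated speed and duplex lines, stripped of leading whitespace.
nic_summary() {
    sed -n -e 's/^[[:space:]]*\(Speed: .*\)$/\1/p' \
           -e 's/^[[:space:]]*\(Duplex: .*\)$/\1/p'
}

# Typical use (eth0 is a placeholder interface name):
#   ethtool eth0 | nic_summary
```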
8. Evaluate I/O Performance
$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio
Use these tools to detect disk saturation and swap activity, and to see which processes (e.g., MySQL, PHP) consume the most I/O.
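Swap activity shows up in the si and so columns of `vmstat`. The sketch below scans a run of samples for any non-zero value; `swap_activity` is a hypothetical helper and it assumes the default Linux `vmstat` column layout, where si and so are the seventh and eighth fields.

```shell
# swap_activity: read `vmstat` output on stdin and print "swapping" if any
# sample shows pages swapped in (si, column 7) or out (so, column 8),
# otherwise "no swap activity". Header lines are skipped because their
# first field is not a number.
swap_activity() {
    awk '$1 ~ /^[0-9]+$/ && ($7 > 0 || $8 > 0) { hit = 1 }
         END { print (hit ? "swapping" : "no swap activity") }'
}

# Typical use:
#   vmstat 2 10 | swap_activity
```

Keep in mind that the first sample `vmstat` prints is an average since boot, so a non-zero value there may reflect past rather than current swapping.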
9. Review Mount Points and Filesystems
$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
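$ lsof +D /
Determine how many filesystems are mounted, whether any service has a dedicated filesystem, which mount options are in use (noatime, defaults), how much disk space remains, and whether large deleted files still occupy space because a process holds them open. Be careful with lsof +D /: it walks the entire directory tree and can be slow and I/O-heavy on a busy server.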
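A quick way to answer the disk-space question is to flag only the filesystems that are nearly full. This is a sketch assuming POSIX-format `df -P` output; `full_filesystems` is a hypothetical helper and the 90 percent default is an arbitrary choice.

```shell
# full_filesystems [LIMIT]: read `df -P` output on stdin and print the mount
# point and usage of every filesystem at or above LIMIT percent (default 90).
# Assumes mount points contain no spaces.
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Typical use:
#   df -P | full_filesystems 85
# Deleted files still held open by a process can be listed with:
#   lsof +L1
```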
10. Check System Logs and Kernel Messages
$ dmesg
$ less /var/log/messages
$ less /var/log/secure
$ less /var/log/auth
Look for errors and warnings, hardware failures, and filesystem issues, and try to correlate their timestamps with your earlier findings. Which of these log files exist depends on the distribution.
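Kernel logs are noisy, so a keyword filter is a reasonable first pass. The sketch below is one such filter; `kernel_errors` is a hypothetical helper and the keyword list is an illustrative assumption, not an exhaustive set.

```shell
# kernel_errors: read kernel log lines on stdin and print only those that
# mention common failure keywords. The keyword list is an illustrative
# assumption; extend it for your hardware and filesystems.
kernel_errors() {
    grep -iE 'error|fail|out of memory|oom-killer|segfault'
}

# Typical use:
#   dmesg | kernel_errors | tail -n 20
```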
11. Analyse Application‑Specific Logs
Focus on obvious problems in typical LAMP stacks:
Apache/Nginx: search for 5xx errors and limit_zone issues.
MySQL: inspect mysql.log for corruption, InnoDB recovery, or query bottlenecks.
PHP‑FPM: enable and review slow‑log entries.
Varnish: check varnishlog and varnishstat for hit/miss ratios.
HAProxy: verify backend health checks and queue sizes.
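For the web server check in the list above, counting 5xx responses gives a quick sense of scale. The sketch assumes the access log uses the common or combined log format, where the status code is the ninth whitespace-separated field; `count_5xx` is a hypothetical helper and the log path shown is an assumption that varies by distribution and configuration.

```shell
# count_5xx: read access-log lines in the common/combined format on stdin
# and print how many have a 5xx status code. In that format the status is
# the 9th whitespace-separated field.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { n++ } END { print n + 0 }'
}

# Typical use (the log path is an assumption and varies by distribution):
#   count_5xx < /var/log/nginx/access.log
```

A custom log format would move the status field, so check the server's log format setting before trusting the count.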
Conclusion
After spending about five minutes on these steps, you should know which processes are running, which subsystem (I/O, hardware, network, or configuration) is likely responsible, and whether the issue matches a known pattern such as poor use of database indexes or too many Apache workers. This foundation lets you dig deeper or, in many cases, pinpoint the exact source of the failure.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.