
How to Diagnose Server Failures Within the First 5 Minutes

This guide walks you through a systematic, step‑by‑step process for quickly identifying the root cause of a server outage, covering everything from gathering context and checking who is logged in to inspecting processes, network services, hardware, I/O performance, filesystem mounts, and relevant logs.

ITPUB

Jun 23, 2018

1. Clarify the Problem Context

Before touching the server, collect all known facts about the incident: what symptoms appear (no response, errors), when the issue was first noticed, reproducibility, any patterns (e.g., hourly), recent platform changes, affected user groups, available infrastructure documentation, and whether monitoring or log services (Munin, Zabbix, New Relic, Loggly, Graylog, etc.) are accessible.

2. Identify Who Is Logged In

$ w
$ last

These commands show who is logged in now and who has logged in recently. Check this first so you do not interfere with a colleague who is already working on the same server.

3. Review Command History

$ history

Inspect recent commands executed on the server. Consider setting HISTTIMEFORMAT to include timestamps for better correlation.
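As a minimal sketch of the HISTTIMEFORMAT suggestion, assuming the login shell is bash: the format string below is one reasonable choice, not a required value, and entries recorded before the variable was set may not carry accurate times.

```shell
# Bash-specific: prefix each history entry with a date and time.
# %F is YYYY-MM-DD and %T is HH:MM:SS (strftime format codes).
export HISTTIMEFORMAT='%F %T '

# Show the 20 most recent entries with the new prefix.
history | tail -n 20
```

To make the setting permanent, the export line could go in the user's ~/.bashrc.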

4. Examine Running Processes

$ pstree -a
$ ps aux

ps aux provides a detailed flat list that includes each process's owner and resource usage, while pstree -a shows the process hierarchy along with command-line arguments.
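When the ps aux listing is long, it can help to look at the heaviest memory consumers first. The helper below is a hypothetical sketch (top_mem is not a standard command) that assumes the usual ps aux layout, where %MEM is the fourth column.

```shell
# top_mem [N]: read `ps aux` output on stdin and print the N rows
# with the highest %MEM (column 4). N defaults to 5.
top_mem() {
    n="${1:-5}"
    # Drop the header row, sort numerically on column 4 in descending order.
    tail -n +2 | sort -k4,4 -nr | head -n "$n"
}

# Usage: ps aux | top_mem 10
```

Changing the sort key to column 3 would rank by %CPU instead.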

5. List Listening Network Services

$ netstat -ntlp
$ netstat -nulp
$ netstat -nxlp

Run the three commands separately so the output stays manageable: -t lists TCP, -u UDP, and -x Unix domain sockets. The -n option prints numeric addresses and ports and skips name lookups; drop it only if you prefer to see host and service names.
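To get a quick list of which TCP ports are open, the netstat output can be reduced to port numbers. This is a rough sketch under the assumption that the output follows the common Linux net-tools layout (local address in column 4, state in column 6); listening_ports is a hypothetical helper name.

```shell
# listening_ports: read `netstat -ntl` (or -ntlp) output on stdin and
# print the unique local TCP ports in LISTEN state, sorted numerically.
listening_ports() {
    awk '$1 ~ /^tcp/ && $6 == "LISTEN" {
        n = split($4, parts, ":")   # the port is the text after the last colon
        print parts[n]
    }' | sort -nu
}

# Usage: netstat -ntl | listening_ports
```

An unexpected port in this list is a good prompt to go back to the full netstat -ntlp output and see which process owns it.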

6. Check CPU and Memory Usage

$ free -m
$ uptime
$ top
$ htop

Ask whether any memory is still free, whether the server is swapping, whether the CPU cores are saturated, and what the load average looks like relative to the number of cores.
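A load average is easier to judge once it is divided by the number of cores. The sketch below uses a hypothetical helper, load_per_core; the idea that a value well above 1.0 suggests saturation is a rule of thumb, and on Linux the load figure also counts tasks blocked on I/O.

```shell
# load_per_core LOAD CORES: print the load average divided by the core count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on Linux (assumes /proc/loadavg and nproc are available):
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

For example, a 1-minute load of 8.0 on a 4-core machine gives 2.00, meaning roughly twice as many runnable or blocked tasks as cores.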

7. Inspect Hardware Details

$ lspci
$ dmidecode
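$ ethtool eth0

Identify RAID cards, CPU model, empty memory slots, and verify NIC settings (duplex mode, speed, TX/RX errors). ethtool needs an interface name; eth0 here is a placeholder for your own. dmidecode usually requires root privileges.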
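The NIC check can be narrowed to the few lines that matter. This sketch assumes the usual ethtool output, where speed, duplex, and link state appear as indented "Key: value" lines; nic_link_summary is a hypothetical helper and eth0 is a placeholder interface name.

```shell
# nic_link_summary: read `ethtool <interface>` output on stdin and keep
# only the speed, duplex, and link-state lines, with indentation removed.
nic_link_summary() {
    grep -E '^[[:space:]]*(Speed|Duplex|Link detected):' | sed 's/^[[:space:]]*//'
}

# Usage: ethtool eth0 | nic_link_summary
```

A half-duplex link or a speed lower than the hardware supports often points to a negotiation or cabling problem.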

8. Evaluate I/O Performance

$ iostat -kx 2
$ vmstat 2 10
$ mpstat 2 10
$ dstat --top-io --top-bio

Use these tools to detect disk saturation, swap activity, and which processes (e.g., MySQL, PHP) consume the most I/O.
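To spot a saturated disk in the iostat output, one option is to filter on the utilisation column. The sketch below assumes that each device table starts with a header line beginning with "Device" and that %util is the last column, which holds for common sysstat versions but is worth checking on your system; busy_disks is a hypothetical helper.

```shell
# busy_disks [LIMIT]: read `iostat -kx` output on stdin and print the
# device name and %util for rows whose last column exceeds LIMIT (default 80).
busy_disks() {
    limit="${1:-80}"
    awk -v limit="$limit" '
        /^Device/ { in_table = 1; next }   # header row of a device table
        NF == 0   { in_table = 0; next }   # blank line ends the table
        in_table && $NF + 0 > limit { print $1, $NF }
    '
}

# Usage: iostat -kx 2 3 | busy_disks 80
```

Keep in mind that the first iostat report shows averages since boot, so the later reports are the ones that reflect current activity.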

9. Review Mount Points and Filesystems

$ mount
$ cat /etc/fstab
$ vgs
$ pvs
$ lvs
$ df -h
$ lsof +D /

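Determine the number of mounted filesystems, special service filesystems, mount options (noatime, defaults), remaining disk space, and whether large deleted files still occupy space. Note that lsof +D / walks the entire tree and can be very slow on a busy server; lsof +L1, which lists open files that have already been unlinked, is a faster way to find deleted files that still hold disk space.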
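Checking remaining disk space can be reduced to a simple threshold filter. This is a sketch that assumes POSIX-format output from df -P (capacity in column 5, mount point in column 6) and mount points without spaces; full_filesystems is a hypothetical helper.

```shell
# full_filesystems [LIMIT]: read `df -P` output on stdin and print the
# mount point and usage of filesystems at or above LIMIT percent (default 90).
full_filesystems() {
    limit="${1:-90}"
    awk -v limit="$limit" 'NR > 1 {
        use = $5
        sub(/%/, "", use)                 # strip the percent sign
        if (use + 0 >= limit) print $6, $5
    }'
}

# Usage: df -P | full_filesystems 90
```

The -P flag is used instead of -h here because it keeps each filesystem on a single line, which makes the output safer to parse.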

10. Check System Logs and Kernel Messages

$ dmesg
$ less /var/log/messages
$ less /var/log/secure
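$ less /var/log/auth.log

Look for error or warning messages, hardware failures, or filesystem issues, and try to correlate timestamps with earlier findings. Log paths vary by distribution: Red Hat-family systems typically use /var/log/messages and /var/log/secure, while Debian-family systems use /var/log/syslog and /var/log/auth.log.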
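Paging through a long log by hand is slow, so a first pass with a keyword filter can help. The keyword list below is an assumption chosen for illustration, not an exhaustive set, and suspicious_lines is a hypothetical helper.

```shell
# suspicious_lines: read log text on stdin and keep lines that mention
# common failure keywords (case-insensitive).
suspicious_lines() {
    grep -iE 'error|fail|warn|out of memory|segfault'
}

# Usage (dmesg -T prints human-readable timestamps on util-linux systems):
#   dmesg -T | suspicious_lines
#   suspicious_lines < /var/log/messages
```

The filter only narrows the search; reading the lines around each match in the full log is still needed to understand the sequence of events.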

11. Analyse Application‑Specific Logs

Focus on obvious problems in typical LAMP stacks:

Apache/Nginx: search for 5xx errors and limit_zone issues.

MySQL: inspect mysql.log for corruption, InnoDB recovery, or query bottlenecks.

PHP‑FPM: enable and review slow‑log entries.

Varnish: check varnishlog and varnishstat for hit/miss ratios.

HAProxy: verify backend health checks and queue sizes.
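For the web server check in the list above, counting 5xx responses per status code gives a quick picture of how bad things are. This sketch assumes the access log uses the common or combined log format, where the status code is the ninth whitespace-separated field; count_5xx is a hypothetical helper and the log path in the usage line is only an example.

```shell
# count_5xx: read an access log on stdin and print each 5xx status code
# with the number of times it appears, sorted by status code.
count_5xx() {
    awk '$9 ~ /^5[0-9][0-9]$/ { hits[$9]++ }
         END { for (status in hits) print status, hits[status] }' | sort
}

# Usage (example path, adjust to your server):
#   count_5xx < /var/log/nginx/access.log
```

If the log uses a custom format, the field number in the awk condition needs to change accordingly.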

Conclusion

After about five minutes of following these steps, you should know which processes are running, which subsystem (I/O, hardware, network, or configuration) is likely responsible, and whether the issue matches a known pattern such as excessive database indexing or too many Apache workers. From this foundation you can dig deeper or, in many cases, pinpoint the exact source of the failure.
