
How to Diagnose and Resolve Complex Server Outages: A Step‑by‑Step Ops Guide

This article walks through a systematic, multi‑stage approach to identifying, reproducing, and fixing server‑side problems using Linux tools such as strace, lsof, and netstat, illustrated with real‑world case studies and practical command examples.

Efficient Ops

Jun 23, 2016


How to Handle System Issues: A Structured Troubleshooting Process

When a system shows abnormal behavior, operators follow a repeatable workflow: discover the issue, attempt to reproduce it, examine logs, decide whether the problem can be solved by operations or requires developer involvement, and finally apply a fix.

Typical Workflow

Detect the problem through monitoring, user reports, or alerts.

Try to reproduce the issue; some bugs only appear after prolonged runtime.

Inspect error logs for critical clues.

If the cause is an environment or configuration bug, resolve it by upgrading libraries, adjusting versions, or fixing configs.

If the problem is a software bug, involve developers.

The process is illustrated in a flowchart (image omitted).

Problem Investigation Stages Form a Loop

Initially I only performed quick fixes without addressing root causes. Later, after a brief analysis, I learned to preserve the scene (the evidence available at the time of failure) so that I could separate noise from real causes. Even so, I sometimes hit dead ends and had to seek help from experts online.

The loop is shown in a diagram (image omitted).

1. How to Find the Root Cause?

I split the whole investigation into four stages; the first four steps constitute Stage 1. Below are two real cases that illustrate what I did in this stage.

Case 1: Car Forum System Anomaly

The forum showed “Server temporarily unavailable, please try again later” after a PHP upgrade and LVS migration.

The monitoring system raised a URL‑scan alert.

Checking /var/log/httpd/error.log revealed the following error (image omitted for brevity).

Case 2: Real‑Estate Backend Anomaly

After upgrading two machines to PHP 5.3.10, the service ran normally, but later the backend became inaccessible; restarting php‑fpm temporarily fixed it.

One day later the issue reappeared, and nginx error logs showed clear errors.

These logs were captured in screenshots (omitted).

1.1 Stage 1 – Simple Handling Does Not Solve the Problem

Both cases recovered after a restart, but the issue returned after a while. I increased the open‑file limit with ulimit -SHn 65535, yet the problem persisted.

Running cat /proc/sys/fs/file-nr showed more than 100,000 allocated file handles system-wide, close to the kernel maximum. This explains why raising the per-process limit with ulimit did not help.
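The two limits involved here are easy to confuse: ulimit -n is a per-process limit, while /proc/sys/fs/file-nr reports system-wide usage. The following is a minimal Python sketch of reading both, assuming a Linux host. The names FileNr, parse_file_nr, and usage_ratio are illustrative and do not come from the original investigation.

```python
import resource
from typing import NamedTuple


class FileNr(NamedTuple):
    allocated: int  # file handles currently allocated by the kernel
    unused: int     # allocated but unused handles
    maximum: int    # system-wide ceiling (fs.file-max)


def parse_file_nr(text: str) -> FileNr:
    """Parse the three whitespace-separated fields of /proc/sys/fs/file-nr."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return FileNr(allocated, unused, maximum)


def usage_ratio(nr: FileNr) -> float:
    """Fraction of the system-wide handle limit that is in use."""
    return (nr.allocated - nr.unused) / nr.maximum


if __name__ == "__main__":
    # Per-process limit: the value that `ulimit -n` changes.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print(f"per-process limit: soft={soft} hard={hard}")

    # System-wide usage: the same data that `cat /proc/sys/fs/file-nr` prints.
    with open("/proc/sys/fs/file-nr") as fh:
        nr = parse_file_nr(fh.read())
    print(f"system-wide: {nr.allocated}/{nr.maximum} ({usage_ratio(nr):.1%})")
```

If the system-wide ratio is close to 1.0 while the per-process limit is still far away, raising ulimit alone is unlikely to change anything.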

Key socket statistics collected (these are the fields of /proc/net/sockstat):

sockets: used – total number of protocol sockets in use.

TCP:inuse – number of TCP sockets currently in use (for example established or listening), not only listening ones.

TCP:orphan – TCP connections without an owning process.

TCP:tw – sockets in TIME_WAIT state.

TCP:alloc – allocated TCP sockets.

TCP:mem – socket buffer memory usage, counted in pages.
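These counters can be read programmatically from /proc/net/sockstat. The sketch below parses that layout into a nested dictionary; parse_sockstat and the SAMPLE numbers are made up for illustration and are not the values seen during the incident.

```python
from typing import Dict


def parse_sockstat(text: str) -> Dict[str, Dict[str, int]]:
    """Parse /proc/net/sockstat into {"TCP": {"inuse": 35, ...}, ...}."""
    stats: Dict[str, Dict[str, int]] = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        section, _, rest = line.partition(":")
        fields = rest.split()
        # Fields are alternating name/value pairs: "inuse 35 orphan 2 ..."
        stats[section.strip()] = {
            name: int(value) for name, value in zip(fields[0::2], fields[1::2])
        }
    return stats


# Sample in the layout the kernel uses; the numbers are invented.
SAMPLE = """\
sockets: used 1024
TCP: inuse 35 orphan 2 tw 410 alloc 980 mem 57
UDP: inuse 4 mem 1
"""

if __name__ == "__main__":
    tcp = parse_sockstat(SAMPLE)["TCP"]
    # A large gap between alloc and inuse can hint at sockets that were
    # created but never connected or closed.
    print("allocated but not in use:", tcp["alloc"] - tcp["inuse"])
```

On a live host, pass the contents of /proc/net/sockstat instead of SAMPLE.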

1.1.2 Interfering Factors – Preliminary Analysis

Possible causes considered:

PHP upgrade may have introduced the issue.

LVS only performs simple health checks and should not cause web‑service unavailability.

Custom .so extensions loaded by PHP added complexity and required cross‑team coordination.

1.1.3 Wrong Direction in Problem Judgment

Because of many interfering factors and limited familiarity with the system, I initially mis‑identified the root cause.

1.2 Stage 2 – Returning to the Right Path with Existing Tools

When PHP upgrade and LVS migration offered no clues, I turned to Linux built‑in tools:

strace – trace system calls and signals.

lsof – list open files.

gdb – GNU Debugger.

netstat – display network connections and statistics.

/proc – pseudo‑filesystem exposing kernel and process information.

1.2.2 Analyzing a Tricky Symptom

I examined the architecture, rolled back LVS, restored the original DNS, reverted PHP to the previous version, and checked system load, I/O pressure, and dmesg – all normal. However, httpd showed a large number of open sockets.

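Using lsof -p "$(pidof httpd | tr ' ' ',')" I listed all file descriptors of the httpd processes (screenshots omitted). pidof prints space-separated PIDs, while lsof -p expects a comma-separated list, hence the tr.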
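The same information lsof reports is available under /proc/<pid>/fd, where every descriptor is a symlink whose target names its kind. As a rough sketch (the function name classify_fds and its categories are my own, not part of the original toolset), descriptors can be grouped like this:

```python
import os
from collections import Counter


def classify_fds(fd_dir: str) -> Counter:
    """Count descriptors in a /proc/<pid>/fd directory by target kind."""
    kinds: Counter = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor went away between listdir() and readlink()
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return kinds


if __name__ == "__main__":
    # Inspect this script's own descriptors. For another process, pass
    # "/proc/<pid>/fd" (requires the same user or root).
    print(classify_fds(f"/proc/{os.getpid()}/fd"))
```

A socket count that dominates the result and keeps rising between runs matches the symptom described above.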

1.2.4 Seeking Help from the Internet

Searches on Google yielded a standard procedure for locating “can’t identify protocol” socket leaks:

Traverse /proc/<pid>/fd for each suspect process and count the entries; as a rule of thumb from that procedure, more than 10 entries, especially a count that keeps growing, likely indicates a socket leak.

Restart the program to restore service.

Run strace -p <pid> >>/tmp/stracelog.log 2>&1 to capture system calls.

Check /proc/<pid>/fd for increasing file counts; stop strace when the leak is observed.

Search the log for socket() calls whose descriptor never gets a matching close() to pinpoint the offending code.
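The last step of this procedure can be partly automated. The sketch below scans strace output and reports descriptors returned by socket() that are never closed afterwards. It assumes the default strace line format; the function name unclosed_sockets is hypothetical, and it only tracks socket() and close(), not dup() or accept().

```python
import re
from typing import Dict, Iterable

# Matches e.g. "socket(AF_INET, SOCK_DGRAM, IPPROTO_IP) = 19".
# A failed call returns -1, which (\d+) deliberately does not match.
SOCKET_RE = re.compile(r"\bsocket\(.*\)\s*=\s*(\d+)")
# Matches a successful close, e.g. "close(19)    = 0".
CLOSE_RE = re.compile(r"\bclose\((\d+)\)\s*=\s*0\b")


def unclosed_sockets(lines: Iterable[str]) -> Dict[int, str]:
    """Return {fd: socket() line} for sockets with no later successful close()."""
    open_sockets: Dict[int, str] = {}
    for raw in lines:
        line = raw.strip()
        opened = SOCKET_RE.search(line)
        if opened:
            open_sockets[int(opened.group(1))] = line
            continue
        closed = CLOSE_RE.search(line)
        if closed:
            # pop() with a default, because close() also hits non-socket fds.
            open_sockets.pop(int(closed.group(1)), None)
    return open_sockets


if __name__ == "__main__":
    # Path taken from the strace command in the steps above.
    with open("/tmp/stracelog.log") as fh:
        for fd, call in unclosed_sockets(fh).items():
            print(f"fd {fd} never closed: {call}")
```

Because descriptor numbers are reused after close(), the dictionary is keyed by the descriptor and overwritten on each new socket() call, so only the most recent owner of a number is reported.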

1.3 Stage 3 – Deep Dive with Analysis Tools

In the nginx + php‑cgi scenario, repeated testing showed that file descriptors 19 and 20 were created after a database connection and never closed, leading to the “can’t identify protocol” message.

Comparing with a normal open/close trace clarified the missing close() call.

1.3.4 Reproducing the Issue Consistently

Further investigation revealed that the problematic code called ioctl(SIOCGIFADDR) to obtain the IP address of eth1, and the socket created for this call was never closed.
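The extension's C source is not reproduced in this article, so the following Python sketch only imitates the pattern: create a socket, call ioctl(SIOCGIFADDR) on it, and then either close it or forget to. The function get_ip and its close_fd switch are hypothetical; socket.detach() is used so that Python does not close the descriptor automatically, mimicking how C code owns the value returned by socket().

```python
import fcntl
import os
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl request: get the IPv4 address of an interface


def get_ip(ifname: str, close_fd: bool = True) -> str:
    """Return the IPv4 address of ifname via ioctl(SIOCGIFADDR)."""
    # detach() hands over a raw descriptor that Python will not close for us.
    fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM).detach()
    try:
        # struct ifreq starts with a 16-byte interface name; the oversized
        # buffer leaves room for the kernel's reply.
        request = struct.pack("256s", ifname.encode()[:15])
        reply = fcntl.ioctl(fd, SIOCGIFADDR, request)
        # Bytes 20..24 hold sin_addr inside the returned sockaddr_in.
        return socket.inet_ntoa(reply[20:24])
    finally:
        if close_fd:  # close_fd=False reproduces the leak: one fd per call
            os.close(fd)
```

Calling get_ip("eth1", close_fd=False) once per request leaves one extra socket behind each time. Such a socket is never bound or connected, which is consistent with lsof labelling it "can't identify protocol".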

2. How to Properly Resolve the Problem

Once the root cause is identified, the remaining steps become straightforward.

2.1 Source‑Code Analysis

Search the PHP code base for references to eth1. The function get_eth1_ip_str() resides in the custom t_common.so extension.

Inspect the extension source to confirm it simply returns the IP address of eth1.

2.2 Verifying the Hypothesis

By inserting a long sleep(10000) call, I could observe the process state and confirm that the socket remained open.

2.3 Applying a Code Patch

After fixing the extension to correctly close the socket, I reran the test program (/usr/local/php/bin/php test.php xxx.xxx.169.114) and monitored lsof and /proc/<pid>/fd. No “can’t identify protocol” sockets appeared, and the production system remained stable.
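Watching /proc/<pid>/fd by hand works, but the check is easy to script. This is a small sketch under my own assumptions: the helper names sample_fd_counts and looks_leaky, the sampling interval, and the tolerance of 2 descriptors are all illustrative choices rather than values from the original verification.

```python
import os
import time
from typing import List


def sample_fd_counts(pid: int, samples: int = 5, interval: float = 1.0) -> List[int]:
    """Record the number of entries in /proc/<pid>/fd several times."""
    counts: List[int] = []
    for i in range(samples):
        counts.append(len(os.listdir(f"/proc/{pid}/fd")))
        if i < samples - 1:
            time.sleep(interval)
    return counts


def looks_leaky(counts: List[int], tolerance: int = 2) -> bool:
    """True if the count never drops and ends more than tolerance above the start."""
    never_drops = all(later >= earlier for earlier, later in zip(counts, counts[1:]))
    return never_drops and counts[-1] - counts[0] > tolerance


if __name__ == "__main__":
    # Sample this process as a demonstration; pass the PID of the PHP
    # worker under test instead.
    counts = sample_fd_counts(os.getpid(), samples=3, interval=0.5)
    print(counts, "leaky" if looks_leaky(counts) else "stable")
```

Run it while the test program is generating requests: before the patch the counts should climb steadily, and after the patch they should stay flat.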

3. Experience Summary

Collect information continuously.

Stay calm, analyze actively.

Make bold hypotheses and test them.

Summarize findings for future reference.

Additional insights:

Understand data flow and business flow; web traffic is simpler than mail traffic.

Identify key flow nodes quickly and simulate them to obtain first‑hand data.

Monitor program state via logs and tools like strace.

