How to Diagnose and Resolve Complex Server Outages: A Step‑by‑Step Ops Guide
This article walks through a systematic, multi‑stage approach to identifying, reproducing, and fixing server‑side problems using Linux tools such as strace, lsof, and netstat, illustrated with real‑world case studies and practical command examples.
How to Handle System Issues: A Structured Troubleshooting Process
When a system shows abnormal behavior, operators follow a repeatable workflow: discover the issue, attempt to reproduce it, examine logs, decide whether the problem can be solved by operations or requires developer involvement, and finally apply a fix.
Typical Workflow
Detect the problem through monitoring, user reports, or alerts.
Try to reproduce the issue; some bugs only appear after prolonged runtime.
Inspect error logs for critical clues.
If the cause is an environment or configuration bug, resolve it by upgrading libraries, adjusting versions, or fixing configs.
If the problem is a software bug, involve developers.
The process is illustrated in a flowchart (image omitted).
Problem Investigation Stages Form a Loop
Initially I only applied quick fixes without addressing root causes. Later, after a brief analysis, I preserved the scene of the failure (the evidence) so that I could separate noise from real causes, but I sometimes still hit dead ends and had to seek help from experts online.
The loop is shown in a diagram (image omitted).
1. How to Find the Root Cause?
I split the whole investigation into four stages; the first four steps of the typical workflow above constitute Stage 1. Below are two real cases that illustrate what I did in this stage.
Case 1: Car Forum System Anomaly
The forum showed “Server temporarily unavailable, please try again later” after a PHP upgrade and LVS migration.
The monitoring system raised a URL-scan alert.
Checking /var/log/httpd/error.log revealed the following error (image omitted for brevity).
Case 2: Real‑Estate Backend Anomaly
After upgrading two machines to PHP 5.3.10, the service ran normally, but later the backend became inaccessible; restarting php‑fpm temporarily fixed it.
One day later the issue reappeared, and nginx error logs showed clear errors.
(Log screenshots omitted.)
1.1 Stage 1 – Simple Handling Does Not Solve the Problem
Both cases recovered after a restart, but the issue returned after a while. I increased the open‑file limit with ulimit -SHn 65535, yet the problem persisted.
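One thing worth ruling out at this point is whether the raised limit actually reached the running daemon: ulimit changes the limit of the current shell and of processes started from it, not of a process that is already running. The sketch below is a hedged illustration, not part of the original investigation. It reads the "Max open files" row from the text of /proc/<pid>/limits; the function names are hypothetical.

```python
def max_open_files(limits_text):
    """Return (soft, hard) 'Max open files' limits from /proc/<pid>/limits text.

    Each value is an int, or None when the kernel reports 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Columns: "Max open files  <soft>  <hard>  files"
            soft, hard = line.split()[3:5]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' line found")


def read_limits(pid):
    """Read the limits of a running process (Linux only; pid comes from the caller)."""
    with open(f"/proc/{pid}/limits") as fh:
        return max_open_files(fh.read())
```

Calling read_limits() with the PID of a php-fpm or httpd worker shows whether the process really runs with the limit that was set in the shell.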
Running cat /proc/sys/fs/file-nr showed that the system had opened over 100 000 file descriptors, close to the maximum.
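The three numbers in /proc/sys/fs/file-nr are the allocated file handles, the allocated-but-unused handles, and the system-wide maximum. As a minimal sketch (helper names are hypothetical), the ratio of handles in use to the maximum can be computed like this:

```python
def parse_file_nr(text):
    """Split the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, unused, maximum = (int(field) for field in text.split())
    return allocated, unused, maximum


def handle_usage(text):
    """Fraction of the system-wide file-handle maximum currently in use."""
    allocated, unused, maximum = parse_file_nr(text)
    return (allocated - unused) / maximum


def current_handle_usage():
    """Read the live value (Linux only)."""
    with open("/proc/sys/fs/file-nr") as fh:
        return handle_usage(fh.read())
```

A value that keeps climbing toward 1.0 between restarts is consistent with the leak described here, whereas a stable value points elsewhere.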
Key socket statistics collected (the fields reported by /proc/net/sockstat):
sockets: used – total number of protocol sockets in use.
TCP:inuse – number of TCP sockets in use (listening or connected).
TCP:orphan – TCP connections without an owning process.
TCP:tw – sockets in TIME_WAIT state.
TCP:alloc – allocated TCP sockets.
TCP:mem – socket buffer memory usage, in pages.
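To track these counters over time rather than reading them by eye, the text of /proc/net/sockstat can be turned into a nested dictionary. This is a small sketch under the assumption that each line has the usual "Section: key value key value" layout; parse_sockstat is a hypothetical name.

```python
def parse_sockstat(text):
    """Parse /proc/net/sockstat text into {section: {counter: value}}."""
    stats = {}
    for line in text.splitlines():
        section, _, rest = line.partition(":")
        fields = rest.split()
        if not section.strip() or not fields:
            continue  # skip blank or malformed lines
        # Fields alternate between counter name and integer value.
        stats[section.strip()] = {
            name: int(value) for name, value in zip(fields[::2], fields[1::2])
        }
    return stats


def read_sockstat():
    """Read the live counters (Linux only)."""
    with open("/proc/net/sockstat") as fh:
        return parse_sockstat(fh.read())
```

Sampling read_sockstat() periodically makes trends visible; for instance, a gap between TCP alloc and TCP inuse that keeps widening may hint at sockets that are created but never used or closed.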
1.1.2 Interfering Factors – Preliminary Analysis
Possible causes considered:
PHP upgrade may have introduced the issue.
LVS only performs simple health checks and should not cause web‑service unavailability.
Custom .so extensions loaded by PHP added complexity and required cross‑team coordination.
1.1.3 Wrong Direction in Problem Judgment
Because of many interfering factors and limited familiarity with the system, I initially mis‑identified the root cause.
1.2 Stage 2 – Returning to the Right Path with Existing Tools
When PHP upgrade and LVS migration offered no clues, I turned to Linux built‑in tools:
strace – trace system calls and signals.
lsof – list open files.
gdb – GNU Debugger.
netstat – display network connections and statistics.
/proc – pseudo‑filesystem exposing kernel and process information.
1.2.2 Analyzing a Tricky Symptom
I examined the architecture, rolled back LVS, restored the original DNS, reverted PHP to the previous version, and checked system load, I/O pressure, and dmesg – all normal. However, httpd showed a large number of open sockets.
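Using lsof -p with the httpd PIDs (for example lsof -p $(pgrep -d, httpd), because lsof expects a comma-separated PID list when there are several workers), I listed all file descriptors of the httpd processes (screenshots omitted).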
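The same picture can be obtained without lsof by reading the symlinks in a process's fd directory, where sockets appear as targets of the form socket:[inode]. The following is a rough sketch of that idea, not the procedure used in the original case; classify_fds is a hypothetical helper that takes the directory path (for example /proc/<pid>/fd) as an argument.

```python
import os
from collections import Counter


def classify_fds(fd_dir):
    """Count the descriptors listed in fd_dir by kind, based on symlink targets."""
    kinds = Counter()
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # descriptor was closed while we were scanning
        if target.startswith("socket:"):
            kinds["socket"] += 1
        elif target.startswith("pipe:"):
            kinds["pipe"] += 1
        elif target.startswith("anon_inode:"):
            kinds["anon_inode"] += 1
        else:
            kinds["file"] += 1
    return kinds
```

Reading another user's /proc/<pid>/fd normally requires root, so this would be run with the same privileges as the lsof command.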
1.2.4 Seeking Help from the Internet
Searches on Google yielded a standard procedure for locating “can’t identify protocol” socket leaks:
Traverse /proc/<pid>/fd and count the entries; if a process's fd directory holds unexpectedly many of them (the procedure uses more than 10 as a rule of thumb), a socket leak is likely.
Restart the program to restore service.
Run strace -p <pid> >>/tmp/stracelog.log 2>&1 to capture system calls.
Check /proc/<pid>/fd for increasing file counts; stop strace when the leak is observed.
Search the log for socket() calls whose descriptors never appear in a matching close() call to pinpoint the offending code.
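The last step of this procedure can be automated. The sketch below scans strace output for successful socket() calls and removes each descriptor again when a successful close() on it is seen; whatever remains was opened but never closed during the trace. It assumes a single traced process (strace -p without -f) and the usual "call(args) = result" line layout; the function name is hypothetical.

```python
import re

# A successful socket() call ends in "= <fd>"; failures end in "= -1 E...".
_SOCKET_RE = re.compile(r"\bsocket\(.*\)\s+=\s+(\d+)\s*$")
_CLOSE_RE = re.compile(r"\bclose\((\d+)\)\s+=\s+0\b")


def unclosed_sockets(log_text):
    """Return {fd: (line_number, line)} for sockets opened but not closed."""
    open_fds = {}
    for lineno, line in enumerate(log_text.splitlines(), start=1):
        opened = _SOCKET_RE.search(line)
        if opened:
            # A reused descriptor number simply replaces the older entry.
            open_fds[int(opened.group(1))] = (lineno, line.strip())
            continue
        closed = _CLOSE_RE.search(line)
        if closed:
            open_fds.pop(int(closed.group(1)), None)
    return open_fds
```

Applied to the contents of /tmp/stracelog.log, the returned line numbers show where in the trace each leaked socket was created, and the system calls around those lines usually reveal what the socket was used for.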
1.3 Stage 3 – Deep Dive with Analysis Tools
In the nginx + php‑cgi scenario, repeated testing showed that file descriptors 19 and 20 were created after a database connection and never closed, leading to the “can’t identify protocol” message.
Comparing with a normal open/close trace clarified the missing close() call.
1.3.4 Reproducing the Issue Consistently
Further investigation revealed that the problematic code called ioctl(SIOCGIFADDR) to obtain the IP address of eth1, and the socket created for this call was never closed.
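To make the pattern concrete, here is a Python illustration of the same lookup. It is not the extension's code, which lives in a compiled .so; it only shows how an SIOCGIFADDR query needs a throwaway socket and how that socket can be closed on every path. The request value 0x8915 and the ifreq layout are Linux-specific, and the function names are hypothetical.

```python
import fcntl
import socket
import struct

SIOCGIFADDR = 0x8915  # Linux ioctl request: get the address of an interface


def ifreq_request(ifname):
    """Pack an interface name into the buffer handed to the ioctl."""
    return struct.pack("256s", ifname.encode()[:15])


def ipv4_from_ifreq(buf):
    """Extract the dotted-quad IPv4 address from the bytes the ioctl returns."""
    # 16 bytes of name, 2 of address family, 2 of port, then 4 address bytes.
    return socket.inet_ntoa(buf[20:24])


def interface_ipv4(ifname):
    """Return the IPv4 address of ifname, e.g. interface_ipv4("eth1")."""
    # The socket exists only to carry the ioctl. The with-block closes it on
    # every path, including when the ioctl raises OSError.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        reply = fcntl.ioctl(sock.fileno(), SIOCGIFADDR, ifreq_request(ifname))
    return ipv4_from_ifreq(reply)
```

Leaving out the close (here, the with-block) while keeping a reference to the socket reproduces the symptom: one more unconnected socket per call.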
2. How to Properly Resolve the Problem
Once the root cause is identified, the remaining steps become straightforward.
2.1 Source‑Code Analysis
Search the PHP code base for references to eth1. The function get_eth1_ip_str() resides in the custom t_common.so extension.
Inspect the extension source to confirm it simply returns the IP address of eth1.
2.2 Verifying the Hypothesis
By inserting a long sleep(10000) call, I could observe the process state and confirm that the socket remained open.
2.3 Applying a Code Patch
After fixing the extension to correctly close the socket, I reran the test program (/usr/local/php/bin/php test.php xxx.xxx.169.114) and monitored lsof and /proc/<pid>/fd. No “can’t identify protocol” sockets appeared, and the production system remained stable.
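A check of this kind can also be scripted as a small regression test: call the suspect function repeatedly and compare the number of open descriptors before and after. The sketch below is an assumption about how such a check might look, not the test program used above; both helper names are hypothetical, and it measures the current process only.

```python
import os


def open_fd_count():
    """Number of descriptors currently open in this process."""
    for fd_dir in ("/proc/self/fd", "/dev/fd"):
        if os.path.isdir(fd_dir):
            # The listing itself briefly opens one descriptor; that offset is
            # the same for every call, so differences remain meaningful.
            return len(os.listdir(fd_dir))
    raise RuntimeError("no fd directory available on this platform")


def fd_growth(func, repeat=100):
    """Call func() repeat times and return how many descriptors were left open."""
    before = open_fd_count()
    for _ in range(repeat):
        func()
    return open_fd_count() - before
```

A result of 0 means the calls left nothing behind; a result equal to repeat means one descriptor leaks per call.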
3. Experience Summary
Collect information continuously.
Stay calm, analyze actively.
Make bold hypotheses and test them.
Summarize findings for future reference.
Additional insights:
Understand data flow and business flow; web traffic is simpler than mail traffic.
Identify key flow nodes quickly and simulate them to obtain first‑hand data.
Monitor program state via logs and tools like strace.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.