Comprehensive Guide to Linux Problem Diagnosis and Troubleshooting
This article presents a systematic methodology and a curated set of Linux tools—including CPU, memory, disk I/O, network, load monitoring, and flame‑graph techniques—illustrated with a real‑world nginx case study to help engineers quickly locate and resolve performance issues.
Background
When monitoring plugins cannot immediately reveal the root cause of obscure Linux problems, deeper server‑side analysis is required. Accumulated technical experience and a broad knowledge of system subsystems are essential for effective troubleshooting.
Methodology
The analysis follows the 5W2H framework:
What – describe the observed phenomenon.
When – identify when it occurs.
Why – determine why it happens.
Where – locate the problematic component.
How much – quantify resource consumption.
How – propose remediation steps.
CPU Analysis
Key concepts include on‑CPU vs. off‑CPU time, processor, core, hardware thread, cache, CPI/IPC, scheduler, run queue, preemption, multi‑process/thread, and instruction length.
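As one concrete check, CPI/IPC can be measured with perf stat; a minimal sketch (the <pid> placeholder follows the convention used in the commands below, and counter availability depends on PMU/kernel support):
// Count cycles and instructions for a process over 10 seconds; "insn per cycle" is the IPC
perf stat -e cycles,instructions -p <pid> -- sleep 10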
Tools:
uptime, vmstat, mpstat, top, pidstat – basic CPU/load metrics.
perf – per‑function CPU usage, including kernel functions.
// View overall CPU usage
top
// Show per‑core statistics
mpstat -P ALL 1
// Display CPU usage and load average
vmstat 1
// Process‑specific CPU stats
pidstat -u 1 -p <pid>
// Profile function‑level CPU usage for a process
perf top -p <pid> -e cpu-clock
Memory Analysis
Important concepts: main memory, virtual memory, resident set, address space, OOM, page cache, page fault, swapping, allocator libraries (libc, glibc, libmalloc, mtmalloc), and the kernel SLUB allocator.
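Two quick checks connect these concepts to live data; a sketch assuming a typical Linux distribution (OOM-killer log wording and ps field names can vary slightly):
// Look for OOM-killer activity in the kernel log
dmesg | grep -i "out of memory"
// Show minor/major page-fault counters for a process
ps -o pid,min_flt,maj_flt,cmd -p <pid>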
Tools:
free, vmstat, top, pidstat, pmap – memory usage statistics.
valgrind – memory leak detection.
dtrace – dynamic tracing of kernel functions via D scripts.
// Show system memory usage
free -m
// Virtual memory statistics
vmstat 1
// System memory overview
top
// Process memory statistics (1‑second interval)
pidstat -p <pid> -r 1
// Process memory map details
pmap -d <pid>
// Detect memory leaks with valgrind
valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
Disk I/O Analysis
Disk subsystems are common performance bottlenecks due to mechanical latency. Understanding file systems, VFS, page cache, buffer cache, inode structures, and I/O schedulers is essential.
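As a quick orientation step, the active I/O scheduler for a device can be read from sysfs; a small sketch assuming sda as the device name (the scheduler in use is shown in brackets):
// Show the I/O scheduler configured for a block device
cat /sys/block/sda/queue/scheduler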
Tools:
iotop – real‑time I/O monitoring.
iostat – detailed I/O statistics.
pidstat – per‑process I/O.
perf – trace block I/O events.
// Monitor I/O activity
iotop
// Detailed I/O stats (10 samples)
iostat -d -x -k 1 10
// Process‑level I/O info
pidstat -d 1 -p <pid>
// Trace block request issues
perf record -e block:block_rq_issue -ag
perf report
Network Analysis
Network monitoring is complex because latency, blocking, collisions, packet loss, and external equipment (routers, switches, wireless) can affect measurements. Modern adaptive NICs adjust automatically to varying link conditions.
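Before digging into sockets, it often helps to confirm the negotiated link speed and interface-level errors or drops; a small sketch assuming eth0 as the interface name:
// Show negotiated speed, duplex, and link state (may require root)
ethtool eth0
// Show per-interface RX/TX counters, errors, and drops
ip -s link show eth0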
Tools:
netstat – socket statistics.
ss – socket summaries.
sar – network I/O and TCP/ETCP stats.
tcpdump – packet capture.
tcpflow – flow‑level capture.
// Show network statistics
netstat -s
// List current UDP connections
netstat -nu
// Show UDP port usage
netstat -apu
// Count connections per state
netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
// List all TCP connections
ss -t -a
// Show socket summary
ss -s
// Show all UDP sockets
ss -u -a
// TCP/ETCP stats
sar -n TCP,ETCP 1
// Network device I/O
sar -n DEV 1
// Capture packets to a specific host/port
tcpdump -i eth1 host 192.168.1.1 and port 80
// Capture flow data
tcpflow -cp host 192.168.1.1
System Load
Load measures the demand on the system, expressed as the number of tasks that are running or waiting to run; on Linux the count also includes tasks in uninterruptible sleep (typically blocked on I/O). The load average reports this value averaged over 1, 5, and 15 minutes.
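A useful rule of thumb is to compare the load average with the number of CPUs; a sustained 1-minute load well above the CPU count points to saturation. A minimal check:
// Load averages over 1, 5, and 15 minutes, plus runnable/total task counts
cat /proc/loadavg
// Number of online CPUs, for comparison
nproc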
// View load information
uptime
top
vmstat
// Summarize system call latency
strace -c -p <pid>
// Trace specific syscalls (e.g., epoll_wait)
strace -T -e epoll_wait -p <pid>
// Show kernel logs
dmesg
Flame Graphs
Flame graphs (created by Brendan Gregg) visualize CPU call stacks. The y‑axis shows stack depth; the x‑axis spans the collected samples (it does not represent time); the width of each frame indicates the proportion of samples in which that function appeared on the stack.
Types include on‑CPU, off‑CPU, memory, hot/cold, and differential graphs.
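For reference, the standard perf-based pipeline for a single on-CPU flame graph (independent of the demo scripts used below, and assuming the FlameGraph repository is available locally) looks like this:
// Sample on-CPU stacks at 99 Hz for 30 seconds, then fold and render them as an SVG
perf record -F 99 -p <pid> -g -- sleep 30
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > on_cpu.svg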
Installation
// Install systemtap and runtime
yum install systemtap systemtap-runtime
// Install kernel debug info matching the running kernel (example kernel version 2.6.18‑308.el5)
debuginfo-install --enablerepo=debuginfo search kernel
debuginfo-install --enablerepo=debuginfo search glibc
Clone demo repository
// Clone the repository containing flame‑graph scripts
git clone https://github.com/lidaohang/quick_location.git
cd quick_location
On‑CPU Flame Graph
// Generate user‑space on‑CPU graph
sh ngx_on_cpu_u.sh <pid>
cd ngx_on_cpu_u
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
// Generate kernel on‑CPU graph
sh ngx_on_cpu_k.sh <pid>
cd ngx_on_cpu_k
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
Off‑CPU Flame Graph
// Generate user‑space off‑CPU graph
sh ngx_off_cpu_u.sh <pid>
cd ngx_off_cpu_u
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
// Generate kernel off‑CPU graph
sh ngx_off_cpu_k.sh <pid>
cd ngx_off_cpu_k
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
Memory‑Level Flame Graph
// Generate memory‑level flame graph
sh ngx_on_memory.sh <pid>
cd ngx_on_memory
python -m SimpleHTTPServer 8088
# Open http://127.0.0.1:8088/<pid>.svg
Differential (Red‑Blue) Flame Graph
// Capture baseline profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks1
// Capture changed profile
perf record -F 99 -p <pid> -g -- sleep 30
perf script > out.stacks2
// Generate folded stacks
./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2
// Produce diff flame graph
./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg
Case Study: Nginx Cluster Anomaly (2017‑09‑25)
Monitoring reported a surge of 499 and 5xx responses and elevated CPU usage on the nginx cluster.
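Before drilling into system metrics, the error surge itself can be confirmed from the access log; a hedged sketch that assumes the default combined log format and log path:
// Count responses per HTTP status code (field 9 in the combined log format)
awk '{print $9}' /var/log/nginx/access.log | sort | uniq -c | sort -rn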
Metric Analysis
Request Volume – charts showed no spike; traffic actually decreased.
Response Time – increased, possibly due to nginx itself or upstream latency.
Upstream Response – upstream latency grew, suggesting backend delay affected nginx.
CPU Observation – top indicated high CPU usage by nginx workers.
Process‑Level CPU – perf top -p <pid> revealed most time spent in free, malloc, and JSON parsing.
Flame Graph – user‑CPU flame graph highlighted heavy JSON parsing as a hotspot.
Conclusion
The traffic anomaly stemmed from prolonged upstream response times, while the CPU bottleneck originated from intensive JSON parsing and memory allocation within nginx.
Resolution
The immediate fix was to disable the high‑CPU module, which lowered CPU usage and restored normal request flow. The upstream delay persisted because the backend service looped back to nginx.