
Comprehensive Guide to Linux Problem Diagnosis and Troubleshooting

This article presents a systematic methodology and a curated set of Linux tools—including CPU, memory, disk I/O, network, load monitoring, and flame‑graph techniques—illustrated with a real‑world nginx case study to help engineers quickly locate and resolve performance issues.


Background

When monitoring plugins cannot immediately reveal the root cause of obscure Linux problems, deeper server‑side analysis is required. Accumulated technical experience and a broad knowledge of system subsystems are essential for effective troubleshooting.

Methodology

The analysis follows the 5W2H framework:

What – describe the observed phenomenon.

When – identify when it occurs.

Why – determine why it happens.

Where – locate the problematic component.

Who – identify which process or component is responsible.

How much – quantify the resource consumption involved.

How – propose remediation steps.

CPU Analysis

Key concepts include on‑CPU vs. off‑CPU time, processor, core, hardware thread, cache, CPI/IPC, scheduler, run queue, preemption, multi‑process/thread, and instruction length.
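Of the concepts above, CPI/IPC is directly measurable with perf stat; a minimal sketch (the pid and duration are placeholders):

// Measure instructions per cycle (IPC) for a running process over 10 seconds
 perf stat -p <pid> -- sleep 10

In the output, the instructions counter is annotated with the measured IPC; values well below 1 often indicate stalls on memory or other off-core resources.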

Tools:

uptime, vmstat, mpstat, top, pidstat – basic CPU/load metrics.

perf – per‑function CPU usage, including kernel functions.

// View overall CPU usage
 top

// Show per‑core statistics
 mpstat -P ALL 1

// Display CPU usage and load average
 vmstat 1

// Process‑specific CPU stats
 pidstat -u 1 -p <pid>

// Profile function‑level CPU usage for a process
 perf top -p <pid> -e cpu-clock

Memory Analysis

Important concepts: main memory, virtual memory, resident set, address space, OOM, page cache, page fault, swapping, allocator libraries (libc, glibc, libmalloc, mtmalloc), and the kernel SLUB allocator.
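If the OOM killer is suspected, its activity is recorded in the kernel log; a quick check (the grep pattern matches the usual kernel message):

// Look for OOM-killer events in the kernel log
 dmesg | grep -i "out of memory"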

Tools:

free, vmstat, top, pidstat, pmap – memory usage statistics.

valgrind – memory leak detection.

dtrace – dynamic tracing of kernel functions via D scripts.

// Show system memory usage
 free -m

// Virtual memory statistics
 vmstat 1

// System memory overview
 top

// Process memory statistics (1‑second interval)
 pidstat -p <pid> -r 1

// Process memory map details
 pmap -d <pid>

// Detect memory leaks with valgrind
 valgrind --tool=memcheck --leak-check=full --log-file=./log.txt ./program
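dtrace appears in the tools list but not in the examples; a canonical D one-liner (on systems where DTrace is available) that counts read() syscalls per executable:

// Count read() syscalls per executable until Ctrl-C
 dtrace -n 'syscall::read:entry { @[execname] = count(); }'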

Disk I/O Analysis

Disk subsystems are common performance bottlenecks due to mechanical latency. Understanding file systems, VFS, page cache, buffer cache, inode structures, and I/O schedulers is essential.
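Since I/O schedulers are part of this picture, the scheduler active for a block device can be inspected and changed through sysfs; a sketch, assuming the device is sda:

// Show the available I/O schedulers (the active one is bracketed)
 cat /sys/block/sda/queue/scheduler

// Switch to another scheduler, e.g. deadline (takes effect immediately)
 echo deadline > /sys/block/sda/queue/scheduler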

Tools:

iotop – real‑time I/O monitoring.

iostat – detailed I/O statistics.

pidstat – per‑process I/O.

perf – trace block I/O events.

// Monitor I/O activity
 iotop

// Detailed I/O stats (10 samples)
 iostat -d -x -k 1 10

// Process‑level I/O info
 pidstat -d 1 -p <pid>

// Trace block request issues
 perf record -e block:block_rq_issue -ag
 perf report

Network Analysis

Network monitoring is complex because latency, blocking, collisions, packet loss, and external equipment (routers, switches, wireless) can affect measurements. Modern adaptive NICs adjust automatically to varying link conditions.
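Whether the NIC actually negotiated the expected speed and duplex can be checked with ethtool (eth1 is a placeholder interface):

// Show negotiated speed, duplex, and link status for an interface
 ethtool eth1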

Tools:

netstat – socket statistics.

ss – socket summaries.

sar – network I/O and TCP/ETCP stats.

tcpdump – packet capture.

tcpflow – flow‑level capture.

// Show network statistics
 netstat -s

// List current UDP connections
 netstat -nu

// Show UDP port usage
 netstat -apu

// Count connections per state
 netstat -a | awk '/^tcp/ {++S[$NF]} END {for(a in S) print a, S[a]}'
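On distributions where netstat is deprecated, the same per-state count can be produced with ss (a sketch assuming the default ss column layout, where the first column is the state):

// Count TCP connections per state using ss
 ss -tan | awk 'NR>1 {++S[$1]} END {for(a in S) print a, S[a]}'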

// List all TCP connections
 ss -t -a

// Show socket summary
 ss -s

// Show all UDP sockets
 ss -u -a

// TCP/ETCP stats
 sar -n TCP,ETCP 1

// Network device I/O
 sar -n DEV 1

// Capture packets to a specific host/port
 tcpdump -i eth1 host 192.168.1.1 and port 80

// Capture flow data
 tcpflow -cp host 192.168.1.1

System Load

Load measures the demand placed on the system. On Linux it counts tasks that are runnable or in uninterruptible sleep, so it reflects disk I/O waits as well as the run-queue length. The load average reports this figure averaged over 1, 5, and 15 minutes.
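The raw numbers behind uptime come from /proc/loadavg:

// 1/5/15-minute load averages, runnable/total tasks, last created PID
 cat /proc/loadavg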

// View load information
 uptime
 top
 vmstat

// Summarize system call latency
 strace -c -p <pid>

// Trace specific syscalls (e.g., epoll_wait)
 strace -T -e epoll_wait -p <pid>

// Show kernel logs
 dmesg

Flame Graphs

Flame graphs (created by Brendan Gregg) visualize sampled call stacks. The y‑axis shows stack depth; the x‑axis spans the collected samples, with stacks sorted alphabetically rather than by time; the width of a box is proportional to the number of samples in which that function appeared.

Types include on‑CPU, off‑CPU, memory, hot/cold, and differential graphs.
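The scripts used below wrap this workflow; the underlying pipeline with perf and the FlameGraph scripts is short (sampling frequency and duration are illustrative):

// Sample on-CPU stacks at 99 Hz for 30 seconds, then render an SVG
 perf record -F 99 -p <pid> -g -- sleep 30
 perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > on_cpu.svg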

Installation

// Install systemtap and runtime
 yum install systemtap systemtap-runtime

// Install kernel debug info matching the running kernel (example kernel version 2.6.18‑308.el5)
 debuginfo-install --enablerepo=debuginfo search kernel
 debuginfo-install --enablerepo=debuginfo search glibc

Clone demo repository

// Clone the repository containing flame‑graph scripts
 git clone https://github.com/lidaohang/quick_location.git
 cd quick_location

On‑CPU Flame Graph

// Generate user‑space on‑CPU graph
 sh ngx_on_cpu_u.sh <pid>
 cd ngx_on_cpu_u
 python -m SimpleHTTPServer 8088
 # Open http://127.0.0.1:8088/<pid>.svg
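 # On Python 3, the equivalent is: python3 -m http.server 8088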

// Generate kernel on‑CPU graph
 sh ngx_on_cpu_k.sh <pid>
 cd ngx_on_cpu_k
 python -m SimpleHTTPServer 8088
 # Open http://127.0.0.1:8088/<pid>.svg

Off‑CPU Flame Graph

// Generate user‑space off‑CPU graph
 sh ngx_off_cpu_u.sh <pid>
 cd ngx_off_cpu_u
 python -m SimpleHTTPServer 8088
 # Open http://127.0.0.1:8088/<pid>.svg

// Generate kernel off‑CPU graph
 sh ngx_off_cpu_k.sh <pid>
 cd ngx_off_cpu_k
 python -m SimpleHTTPServer 8088
 # Open http://127.0.0.1:8088/<pid>.svg

Memory‑Level Flame Graph

// Generate memory‑level flame graph
 sh ngx_on_memory.sh <pid>
 cd ngx_on_memory
 python -m SimpleHTTPServer 8088
 # Open http://127.0.0.1:8088/<pid>.svg

Differential (Red‑Blue) Flame Graph

// Capture baseline profile
 perf record -F 99 -p <pid> -g -- sleep 30
 perf script > out.stacks1

// Capture changed profile
 perf record -F 99 -p <pid> -g -- sleep 30
 perf script > out.stacks2

// Generate folded stacks
 ./FlameGraph/stackcollapse-perf.pl out.stacks1 > out.folded1
 ./FlameGraph/stackcollapse-perf.pl out.stacks2 > out.folded2

// Produce diff flame graph
 ./FlameGraph/difffolded.pl out.folded1 out.folded2 | ./FlameGraph/flamegraph.pl > diff2.svg
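In the resulting SVG, red frames mark functions that gained samples in the second profile and blue frames mark functions that lost samples, making regressions easy to spot.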

Case Study: Nginx Cluster Anomaly (2017‑09‑25)

Monitoring reported a surge of 499 and 5xx responses and elevated CPU usage on the nginx cluster.

Metric Analysis

Request Volume – charts showed no spike; traffic actually decreased.

Response Time – increased, possibly due to nginx itself or upstream latency.

Upstream Response – upstream latency grew, suggesting backend delay affected nginx.

CPU Observation – top indicated high CPU usage by nginx workers.

Process‑Level CPU – perf top -p <pid> revealed most time spent in free, malloc, and JSON parsing.

Flame Graph – user‑CPU flame graph highlighted heavy JSON parsing as a hotspot.

Conclusion

The traffic anomaly stemmed from prolonged upstream response times, while the CPU bottleneck originated from intensive JSON parsing and memory allocation within nginx.

Resolution

The immediate fix was to disable the module responsible for the heavy JSON parsing, which lowered CPU usage and restored normal request flow. The upstream delay persisted because the backend service itself called back into the same nginx cluster.

Tags: Linux, Flame Graph, Troubleshooting, CPU, Memory, Performance Analysis
Written by Linux Tech Enthusiast

Focused on sharing practical Linux technology content, covering Linux fundamentals, applications, tools, as well as databases, operating systems, network security, and other technical knowledge.
