How to Diagnose Linux Server CPU Spikes: A Practical Step‑by‑Step Guide
This article presents a systematic, evidence‑driven process for locating and resolving high CPU usage on Linux servers, covering environment preparation, layered troubleshooting from whole‑machine to thread level, concrete command examples, real‑world case studies, best‑practice recommendations, and monitoring configurations.
Overview
CPU spikes on Linux servers rarely have a single cause; they usually result from a chain of factors such as request amplification, a thread model running out of control, kernel soft-interrupt buildup, or disk/network jitter that ultimately drives up system-mode CPU time. The alert on the monitoring platform, however, typically says only that CPU usage exceeded 90%, so the cause has to be reconstructed from evidence collected on the host.
1. Preparation
Install the diagnostic toolkit on the target host (or ensure it is already present): procps-ng, sysstat, perf, strace, lsof, iotop, dstat, tcpdump, linux-cpupower. Package names vary by distribution.
Enable sysstat permanently so that historical data is retained (e.g., set HISTORY=28 in /etc/sysconfig/sysstat, or in /etc/sysstat/sysstat on Debian-family systems, and start the related timers).
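A quick way to confirm the toolkit is present is to probe for the commands rather than the packages. The sketch below is a minimal example, not part of the original procedure; the command names (for instance cpupower for linux-cpupower) are assumptions about what each package ships on your distribution.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Command names (not package names) assumed to come from the toolkit above.
missing_tools top mpstat sar pidstat perf strace lsof iotop dstat tcpdump cpupower
```

An empty result means every listed command was found; anything printed still needs to be installed before an incident, not during one.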
2. Capture the Current State
Run a fixed set of commands to snapshot the system at the moment of the alarm. Save the output to a timestamped directory for later replay.
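Saving each command's output to its own file can be done with a small wrapper. This is a sketch under assumptions: the function name snap, the file naming, and the directory layout in the usage comment are illustrative choices, not something the original prescribes.

```shell
# snap OUTDIR LABEL COMMAND...: run one command and store its stdout and
# stderr in OUTDIR/LABEL.txt. A failing or missing command is recorded in
# the file instead of aborting the whole snapshot.
snap() {
    outdir=$1; label=$2; shift 2
    mkdir -p "$outdir" || return 1
    {
        printf '### %s\n' "$*"
        "$@" 2>&1 || printf '### exit status: %s\n' "$?"
    } > "$outdir/$label.txt"
}

# Usage (paths are hypothetical):
#   dir=/var/tmp/cpu-snapshot-$(date +%Y%m%d-%H%M%S)
#   snap "$dir" uptime uptime
#   snap "$dir" mpstat mpstat -P ALL 1 3
#   snap "$dir" top    sh -c 'top -b -n 1 | head -40'
```

Commands that contain a pipe are passed through sh -c so the whole pipeline lands in one file. The commands to capture are the ones listed next.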
date
hostname -f
uptime
top -b -n 1 | head -40
mpstat -P ALL 1 3
vmstat 1 5
sar -u 1 5
sar -n DEV,EDEV,TCP,ETCP 1 5
free -h
df -h
dmesg -T | tail -200
If the host runs containers, also collect cgroup and pod-level information:
kubectl top pod -A --containers --sort-by=cpu | head -20
kubectl describe pod <pod-name> -n <namespace>
cat /sys/fs/cgroup/cpu.max 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu.stat 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.stat
3. Layered Diagnosis
3.1 Whole‑Machine vs. Process
If all cores show high usage and many business processes are busy, look at traffic volume, batch jobs, or host-level interrupt storms.
If only 1‑2 cores are hot, suspect a single‑thread hotspot, lock contention, or uneven CPU pinning.
If the machine CPU is moderate but a single process is high, jump to thread‑level analysis.
If %sys is high while the process view is quiet, investigate interrupts, network stack, or disk paths.
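The first two branches depend on how many cores are hot, which can be counted mechanically. The following is a minimal sketch; the function name core_spread and the 80% default threshold are assumptions made for illustration.

```shell
# Read "core busy_percent" pairs on stdin and summarise how the load is spread.
# The optional first argument is the busy threshold (default 80, an assumption).
core_spread() {
    awk -v hot="${1:-80}" '
        { total++; if ($2 + 0 >= hot) hotcores++ }
        END {
            if (total == 0)              print "no data"
            else if (hotcores == total)  print "all " total " cores hot"
            else if (hotcores > 0)       print hotcores " of " total " cores hot"
            else                         print "no hot cores"
        }'
}

# Usage, assuming English-locale mpstat output with %idle as the last column:
#   mpstat -P ALL 1 3 |
#     awk '$1 == "Average:" && $2 ~ /^[0-9]+$/ {print $2, 100 - $NF}' |
#     core_spread
```

A result such as "2 of 16 cores hot" points toward the single-thread or pinning branch, while "all 16 cores hot" points toward traffic, batch jobs, or interrupts.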
3.2 User‑Mode vs. System‑Mode
%usr + %nice high → application code hotspot (infinite loop, heavy JSON/regex processing, etc.).
%sys high → system calls, network packet processing, soft interrupts, disk I/O.
%soft high → bursts of small packets, connection storms, iptables/conntrack pressure.
%irq high → hardware interrupt issues.
%steal high → host-level CPU contention in virtualised environments.
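The mapping from CPU mode to suspect can be written down as a small triage helper. This is only a sketch: the function name and every threshold in it are illustrative assumptions and should be tuned against your own baseline.

```shell
# classify_cpu USR NICE SYS IRQ SOFT STEAL
# Arguments are percentages, e.g. taken from one mpstat row.
# All thresholds below are illustrative assumptions.
classify_cpu() {
    awk -v usr="$1" -v nice="$2" -v sys="$3" -v irq="$4" -v soft="$5" -v steal="$6" 'BEGIN {
        if (usr + nice >= 60) print "user: look for an application hotspot"
        if (sys >= 30)        print "sys: check system calls, network and disk paths"
        if (soft >= 15)       print "soft: check small-packet bursts, connection storms, conntrack"
        if (irq >= 10)        print "irq: check hardware interrupts"
        if (steal >= 10)      print "steal: check for contention on the hypervisor host"
    }'
}

# Usage: classify_cpu 5 0 72 1 40 0
```

More than one line can be printed, which is expected: a soft-interrupt storm, for example, usually shows up together with elevated system-mode time.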
3.3 Process‑Level Investigation
Identify the top CPU‑eating process and drill down:
ps -eo pid,ppid,user,psr,stat,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pidstat -u -t -p <PID> 1 5
cat /proc/<PID>/status
lsof -p <PID> | head -50
Newly launched process spikes → recent deployment or config change.
Process that spikes after days → thread leak, connection leak, GC degradation.
Single thread at 100% → infinite loop, spin lock, busy-wait.
For Java, map the hot thread to its hexadecimal ID and inspect the stack:
top -H -p <PID>
printf '0x%x\n' <TID>
jstack <PID> | grep -A20 <0xTID>
perf top -p <PID> -g
For Go, use pprof:
curl -s 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30' -o cpu.pprof
go tool pprof -top cpu.pprof
For Python, check for pure-Python loops or oversized JSON serialization:
top -H -p <PID>
strace -p <PID> -tt -T -f -c
3.4 Thread-Level Diagnosis
Thread‑level analysis often reveals the true root cause:
top -H -p <PID>
ps -Lp <PID> -o pid,tid,psr,%cpu,stat,comm --sort=-%cpu | head -20
pidstat -t -p <PID> 1 5
One thread fixed at 100% → infinite loop or spin lock.
Multiple threads high → thread‑pool overload or hot key.
High CPU + high context switches → lock contention or excessive wake‑ups.
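The first two patterns can be told apart by counting threads above a cut-off. The sketch below assumes a hypothetical helper name and a 90% threshold. Note that ps reports %cpu averaged over a thread's lifetime, so for a recent spike the interval-based figures from top -H or pidstat -t are a better input.

```shell
# Read "tid cpu_percent" pairs on stdin and report which pattern applies.
# The optional first argument is the cut-off (default 90, an assumption).
thread_pattern() {
    awk -v hot="${1:-90}" '
        $2 + 0 >= hot { n++; tids = tids " " $1 }
        END {
            if (n == 1)     print "single hot thread:" tids
            else if (n > 1) print n " hot threads:" tids
            else            print "no hot thread"
        }'
}

# Usage: ps -L -p <PID> -o tid=,pcpu= | thread_pattern
```

The printed thread IDs are the ones to carry into jstack, perf, or pprof in the next step.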
Use perf for lightweight sampling (15‑30 s is usually enough):
sudo perf top -p <PID> -g
sudo perf record -F 99 -p <PID> -g -- sleep 30
sudo perf report --stdio | head -80
If perf_event_paranoid blocks sampling, temporarily lower it with sysctl -w kernel.perf_event_paranoid=1 and restore the previous value after debugging.
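Restoring perf_event_paranoid is easy to forget during an incident, so the change can be wrapped around the command that needs it. This is a sketch under assumptions: the function name with_paranoid and the PARANOID_FILE override are hypothetical, the override exists only so the logic can be tried on a scratch file, and the real knob must be written as root.

```shell
# Path of the kernel knob; overridable so the logic can be tested on a scratch file.
PARANOID_FILE=${PARANOID_FILE:-/proc/sys/kernel/perf_event_paranoid}

# with_paranoid LEVEL COMMAND...: set the knob, run the command, then write
# the previous value back and return the command's exit status.
with_paranoid() {
    level=$1; shift
    old=$(cat "$PARANOID_FILE") || return 1
    printf '%s\n' "$level" > "$PARANOID_FILE" || return 1
    status=0
    "$@" || status=$?
    printf '%s\n' "$old" > "$PARANOID_FILE"
    return "$status"
}

# Usage (as root): with_paranoid 1 perf record -F 99 -p <PID> -g -- sleep 30
```

The sketch does not install a signal trap, so interrupting it with Ctrl-C leaves the lowered value in place; check the knob afterwards in that case.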
4. Real‑World Cases
Case 1 – Java Thread Loop
Scenario: After a deployment, an order‑service’s response time rose from 30 ms to 800 ms. Overall CPU was 45 % but one core stayed at 100 %. The root cause was a newly added rule‑engine loop that repeatedly matched a regex.
Investigation commands:
top -H -p 28461
ps -Lp 28461 -o pid,tid,pcpu,comm --sort=-pcpu | head
printf '0x%x\n' 28513
jstack 28461 | grep -A20 6f61
sudo perf top -p 28461 -g
Findings:
Thread 28513 consumed 99 % of a single core.
jstack showed the thread stuck in the rule-engine loop.
perf pinpointed the hotspot to string splitting and regex matching.
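The decimal-to-hex step in this case is mechanical and can be folded into a helper. The function name is hypothetical, and the sketch assumes a JDK whose thread dumps print the native thread ID as nid=0x... in hexadecimal, as the grep for 6f61 above implies.

```shell
# Convert a decimal Linux thread ID into the nid=0x... form found in jstack output.
tid_to_nid() {
    printf 'nid=0x%x\n' "$1"
}

# Usage: jstack 28461 | grep -A20 "$(tid_to_nid 28513)"
```

For thread 28513 the helper yields nid=0x6f61, which matches the value searched for in the investigation commands.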
Fix:
Remove the instance from the load‑balancer.
Roll back to the previous version and verify CPU drops.
Cache the rules and pre‑compile regexes.
Add timeout guards to the loop.
Before fix: core 99 %, P99 RT 1.2 s, order timeout 8.7 %
After fix: hotspot thread gone, overall CPU 31 %, P99 RT 85 ms
Case 2 – Soft-Interrupt Storm
Scenario: An API‑gateway node showed >70 % %sys while Nginx workers were idle. Requests timed out frequently.
Investigation commands:
top -b -n 1 | head -10
grep -E 'NET_RX|NET_TX' /proc/softirqs
sar -n DEV,EDEV,TCP,ETCP 1 5
ss -s
ethtool -S eth0 | grep -E 'drop|miss|queue'
Findings:
NET_RX soft interrupts were concentrated on CPU0/1.
Processes ksoftirqd/0 and ksoftirqd/1 continuously consumed CPU.
NIC had multiple queues but IRQ affinity was uneven, causing the load to concentrate on two cores.
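Skew of this kind is easier to see as per-CPU percentages than as raw counters. The sketch below uses a hypothetical function name and assumes CPU columns are numbered contiguously from 0. The counters in /proc/softirqs are cumulative since boot, so comparing two samples taken a few seconds apart gives a truer picture of the current distribution.

```shell
# Read /proc/softirqs-formatted text on stdin and print each CPU's share of
# one row (default NET_RX) as a percentage of that row's total.
softirq_share() {
    awk -v row="${1:-NET_RX}:" '
        $1 == row {
            for (i = 2; i <= NF; i++) total += $i
            if (total == 0) exit
            for (i = 2; i <= NF; i++)
                printf "CPU%d %.1f%%\n", i - 2, 100 * $i / total
        }'
}

# Usage: softirq_share NET_RX < /proc/softirqs
```

On a node like the one in this case, two CPUs would account for most of the total while the rest stay in the low single digits.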
Remediation script (rebalance IRQs and enable irqbalance):
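#!/usr/bin/env bash
set -euo pipefail
SERVICE="irqbalance"
NIC="eth0"
sudo systemctl enable --now "$SERVICE"
# Request 8 combined queues; the NIC must support at least that many.
sudo ethtool -L "$NIC" combined 8
# Allow every IRQ of this NIC on CPUs 0-3 (mask 0f); adjust the mask to your core count.
# A running irqbalance may later rewrite these masks.
for irq in $(grep "$NIC" /proc/interrupts | awk -F: '{print $1}'); do
  echo 0f | sudo tee "/proc/irq/${irq}/smp_affinity" > /dev/null
done
Result: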
Before: sys 72 %, NET_RX skewed to CPU0/1, request timeout 5.4 %
After: sys 28 %, soft-interrupts balanced, timeout 0.3 %
5. Best Practices & Safety
5.1 Performance Optimisation
Keep sysstat running permanently (e.g., HISTORY=28) so that historical CPU, I/O and network trends are available for post‑mortem analysis.
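Size CPU-bound thread pools at roughly 1 to 2 times the number of CPU cores, and monitor cswch/s and nvcswch/s (reported by pidstat -w) to catch over-provisioned pools that spend their time context switching.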
For network‑heavy nodes, balance soft‑interrupts across cores (enable irqbalance and set appropriate smp_affinity).
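An smp_affinity value is a hexadecimal bitmask with one bit per CPU, so an appropriate mask can be computed from a list of CPU numbers. The helper below is a minimal sketch with a hypothetical name; it covers CPUs 0-31 only, since larger systems use comma-separated 32-bit groups that it does not produce.

```shell
# cpu_mask CPU...: print the hexadecimal affinity mask for the given CPU numbers.
# Limited to CPUs 0-31 (one 32-bit group) in this sketch.
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}

# Usage: cpu_mask 0 1 2 3      (CPUs 0-3)
#        cpu_mask 4 5 6 7      (CPUs 4-7)
```

CPUs 0-3 give f, which is the same mask as the 0f written in the Case 2 remediation script; leading zeros are not significant.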
5.2 Security Measures
Restrict execution of heavy debugging tools (perf, strace, gdb) to a limited ops team (e.g., chmod 750 /usr/bin/perf).
Audit changes to sysctl, IRQ affinity and container CPU quotas.
Run snapshot scripts with read‑only permissions; avoid embedding automatic kill commands.
5.3 High‑Availability Considerations
Implement health‑check based traffic shedding before deep debugging.
Deploy critical services across multiple availability zones to avoid a single hot node dragging down the whole system.
Version‑control all CPU‑related configuration files and keep a backup before any change.
6. Monitoring & Alerting
Key metrics to watch:
node_cpu_seconds_total split by user, system, iowait, steal, softirq.
Load average (1- and 5-minute).
Top-N process CPU usage.
Network retransmits, drops, SYN backlog overflow.
Container throttling counters (container_cpu_cfs_throttled_periods_total).
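On the host itself, the throttling signal can be read from the cpu.stat file collected in the snapshot step. The sketch below uses a hypothetical function name and relies on the nr_periods and nr_throttled fields, which appear in cpu.stat when a CPU quota is in effect.

```shell
# Read cgroup cpu.stat text on stdin and print the percentage of CFS
# enforcement periods in which the group was throttled.
throttle_pct() {
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             print "n/a"
        }'
}

# Usage: throttle_pct < /sys/fs/cgroup/cpu.stat
```

A result of n/a means no periods were counted, typically because no quota is set for that cgroup. The figure is cumulative since the cgroup was created, so two readings are needed to judge the current rate.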
Example Prometheus rules (simplified):
groups:
  - name: cpu-hotspot
    rules:
      - alert: HostCpuHigh
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Host CPU sustained high"
      - alert: HostSoftIrqHigh
        expr: avg by(instance) (rate(node_cpu_seconds_total{mode="softirq"}[5m])) * 100 > 25
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Host soft-interrupt CPU high"
      - alert: ContainerCpuThrottlingHigh
        # Fraction of CFS periods that were throttled (0.2 = 20 %). A raw per-second
        # rate is capped at 10 per cgroup with the default 100 ms period.
        expr: sum by(pod,namespace) (rate(container_cpu_cfs_throttled_periods_total[5m])) / sum by(pod,namespace) (rate(container_cpu_cfs_periods_total[5m])) > 0.2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container CPU throttling high"
7. Post-mortem & Continuous Improvement
After a fix, verify three closure criteria:
Hot thread/function disappears or its contribution drops significantly (check perf report).
Key business endpoint returns to normal latency (e.g., curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://127.0.0.1:8080/health).
Monitoring metrics return to the 7‑day P95 baseline.
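Checking against a P95 baseline requires a percentile over raw samples. The helper below is a minimal sketch using the nearest-rank definition; the function name is hypothetical, and a monitoring system may interpolate differently, so small discrepancies against dashboard values are expected.

```shell
# Print the nearest-rank 95th percentile of the numbers read from stdin,
# one number per line. Exits non-zero on empty input.
p95() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1
            rank = int((NR * 95 + 99) / 100)   # ceiling of 0.95 * NR
            print v[rank]
        }'
}

# Usage, sampling the health endpoint 20 times:
#   for i in $(seq 1 20); do
#     curl -s -o /dev/null -w '%{time_total}\n' http://127.0.0.1:8080/health
#   done | p95
```

Running the same sampling before and after the fix gives two comparable numbers for the post-mortem record.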
Document the root cause, the evidence collected, and the remediation steps. Incorporate the lessons into SOPs and automate snapshot collection via alert‑triggered scripts.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
