How to Diagnose 100% CPU Spikes in 3 Minutes with 3 Simple Steps
This article walks you through a practical three‑step, three‑minute method for quickly identifying the root cause of a server CPU hitting 100%, covering process identification, thread and code pinpointing, and system‑level resource analysis across Linux environments.
Introduction
On a Friday afternoon a production server raised an alert: CPU usage had surged to 98% and stayed there for three minutes, threatening a service slowdown or outage. This article shares a "3‑step, 3‑minute" method for quickly locating the root cause of high CPU.
Understanding CPU Usage
CPU usage composition
Many think CPU usage is a simple percentage, but Linux reports multiple fields.
# top command example
%Cpu(s): 25.3 us, 5.2 sy, 0.0 ni, 68.1 id, 1.2 wa, 0.0 hi, 0.2 si, 0.0 st
Field meanings:
us : user‑space CPU time
sy : kernel‑space CPU time
ni : time spent on niced processes
id : idle CPU time
wa : I/O wait time
hi : hardware interrupt time
si : software interrupt time
st : stolen time (virtualized environments)
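The same breakdown can be derived by hand from the kernel counters that top reads. The sketch below assumes the standard /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal as the first eight counters of the "cpu" line); the function name cpu_breakdown is hypothetical and only used for illustration.

```shell
# cpu_breakdown PREV CUR
# PREV and CUR are two "cpu ..." summary lines from /proc/stat, taken some
# interval apart. Prints each state's share of the ticks in that interval.
# Assumed field order: user nice system idle iowait irq softirq steal.
cpu_breakdown() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i; next }
    {
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
      if (total == 0) { print "no ticks elapsed"; exit 1 }
      split("us ni sy id wa hi si st", name, " ")
      for (i = 2; i <= 9; i++)
        printf "%s=%.1f%s", name[i - 1], 100 * d[i] / total, (i < 9 ? " " : "\n")
    }'
}

# Usage on a live system:
#   prev=$(grep '^cpu ' /proc/stat); sleep 1; cur=$(grep '^cpu ' /proc/stat)
#   cpu_breakdown "$prev" "$cur"
```

Working from two samples matters because the counters are cumulative since boot; a single reading only gives the long-term average.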
Common high‑CPU types
Based on metrics, high CPU can be classified as:
us high : application code (loops, regex, etc.)
sy high : excessive system calls, network packet handling
wa high : I/O wait, CPU blocked on disk
si high : soft interrupts, often heavy network traffic
Key insight : a CPU that looks 100% used is not always doing work. When wa is high, the CPU is actually idle while tasks wait for I/O to complete.
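As a rough illustration, the classification can be turned into a small triage helper. The wa, si and sy thresholds below are the ones this article uses in Step 3 (20%, 10%, 30%); the 50% threshold for us and the function name classify_cpu are assumptions made for this sketch, not established limits.

```shell
# classify_cpu US SY WA SI   (percentages as shown on top's %Cpu(s) line)
# Thresholds: wa > 20, si > 10, sy > 30 follow the article's Step 3;
# us > 50 is an assumed cut-off for illustration only.
classify_cpu() {
  awk -v us="$1" -v sy="$2" -v wa="$3" -v si="$4" 'BEGIN {
    if (wa > 20)      print "wa high: I/O wait, check disks"
    else if (si > 10) print "si high: soft interrupts, check network"
    else if (sy > 30) print "sy high: system calls or kernel work"
    else if (us > 50) print "us high: application code"
    else              print "no dominant category"
  }'
}

# Usage with the sample top line above:
#   classify_cpu 25.3 5.2 1.2 0.2
```

The wait and interrupt checks come first on purpose: when they dominate, profiling application code is unlikely to help.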
Why the first 3 minutes matter
CPU spikes cause slower responses, queue buildup, monitoring gaps, and a narrow window for evidence collection. Quick containment beats perfect analysis.
Core Content: 3‑Step Diagnosis
Step 1 – Pinpoint the offending process (≈30 s)
Using top
# Launch top with full command lines (sorted by CPU by default)
top -c
# Key shortcuts:
# P: sort by CPU
# M: sort by memory
# c: show full command line
# 1: per‑CPU view
Important fields to watch:
PID and command
CPU% (note multi‑core can exceed 100%)
TIME+ cumulative CPU time
S column (R or D state)
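Because top reports per-process CPU% relative to a single core, it can help to normalize the figure against the core count before judging how saturated the machine is. A minimal sketch, with the hypothetical helper name normalize_cpu:

```shell
# normalize_cpu PCT CORES
# Converts a per-process CPU% from top (100% = one full core) into a share
# of the whole machine's CPU capacity.
normalize_cpu() {
  awk -v pct="$1" -v cores="$2" 'BEGIN {
    if (cores <= 0) { print "cores must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", pct / cores
  }'
}

# Usage on a live system (nproc prints the number of available cores):
#   normalize_cpu 350 "$(nproc)"
```

For example, 780% on an 8-core box works out to 97.5% of total capacity, which is effectively a saturated machine.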
Using htop (if installed)
htop
# F5: tree view
# F6: sort field
# F9: kill process
Using ps
# Top 10 CPU processes
ps aux | head -1; ps aux | sort -rn -k3 | head -10
# Show processes of a user
ps -u www-data -o pid,ppid,%cpu,%mem,cmd --sort=-%cpu | head -20
One‑click diagnostic script
#!/bin/bash
echo "========== CPU overall =========="
uptime
echo ""
echo "========== CPU detail =========="
top -bn1 | head -20
echo ""
echo "========== TOP 10 processes =========="
ps aux --sort=-%cpu | head -11
# ... (rest omitted for brevity)
Case: a Java process showed 350% CPU on a 4‑core box because four busy threads each consumed most of a core.
Step 2 – Drill to thread and code (≈90 s)
Identify the thread and source line causing the load.
Show threads of a process
# top in thread mode
top -H -p <PID>
# ps thread view
ps -Lp <PID> -o pid,lwp,ppid,pcpu,comm --sort=-pcpu | head -20
Record the highest‑CPU thread ID (LWP) and convert it to hex for the Java stack lookup.
# Convert thread ID to hex
printf "0x%x\n" 12345 # => 0x3039
Java thread to code
# Export Java thread stack
jstack <PID> > /tmp/jstack.log
# Search for hex thread ID
grep -A 20 "0x3039" /tmp/jstack.log
Typical high‑CPU patterns: infinite loops, costly regex, serialization, heavy SQL/ORM.
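The convert-and-grep steps above can be combined so that only the matching thread's stack is printed. This is a sketch under two assumptions: the dump identifies native threads as nid=0x... in hex (the form this article relies on; some JDK versions may print it differently), and thread entries are separated by blank lines. The function names tid_to_nid and stack_for_lwp are hypothetical.

```shell
# tid_to_nid LWP -> hex id in the form used by jstack's "nid=" field
tid_to_nid() { printf '0x%x\n' "$1"; }

# stack_for_lwp LWP DUMP_FILE
# Prints the thread entry whose nid matches LWP, stopping at the blank line
# that ends the entry. Matching "nid=0x3039 " with a trailing space avoids
# false hits on longer ids such as nid=0x30390.
stack_for_lwp() {
  awk -v nid="nid=$(tid_to_nid "$1")" '
    !show && index($0 " ", nid " ") { show = 1 }
    show && NF == 0 { exit }
    show { print }
  ' "$2"
}

# Usage:
#   jstack <PID> > /tmp/jstack.log
#   stack_for_lwp 12345 /tmp/jstack.log
```

Compared with grep -A 20, this prints the whole stack regardless of depth and never spills into the next thread.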
Python/Node.js
# py‑spy
py-spy top --pid <PID>
py-spy record -o profile.svg --pid <PID> --duration 30
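# Node.js
# Activate the inspector on a running process, then attach a debugger or profiler
kill -USR1 <PID>
# Or restart the app under the built‑in V8 profiler or clinic
node --prof app.js
clinic doctor -- node app.js
Go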
# pprof
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
C/C++
# perf
perf record -p <PID> -g -- sleep 30
perf report
Real‑world example: a Python data‑processing service spiked to 400% CPU because a pandas merge on 3 M rows caused the amount of work to explode; processing the data in chunks reduced usage to ~30%.
Step 3 – System‑level bottleneck analysis (≈60 s)
Sometimes the issue is not application code but resource limits.
I/O wait
# iostat
iostat -x 1 5
# Processes waiting on I/O
pidstat -d 1 5
iotop -o
%wa above 20% together with %util near 100% indicates a disk bottleneck.
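When iostat is not installed, a utilization figure comparable to %util can be approximated from two snapshots of /proc/diskstats. The sketch assumes the standard layout in which field 3 is the device name and field 13 is the cumulative time spent doing I/O in milliseconds; the function name disk_util is hypothetical.

```shell
# disk_util BEFORE_FILE AFTER_FILE INTERVAL_MS
# BEFORE_FILE and AFTER_FILE are copies of /proc/diskstats taken INTERVAL_MS
# apart. Prints "device util%" per device, where util% is the share of the
# interval during which the device had I/O in flight.
disk_util() {
  awk -v ms="$3" '
    BEGIN { if (ms <= 0) exit 1 }
    NR == FNR { before[$3] = $13; next }   # first file: remember io_ticks
    ($3 in before) { printf "%s %.1f\n", $3, 100 * ($13 - before[$3]) / ms }
  ' "$1" "$2"
}

# Usage on a live system:
#   cat /proc/diskstats > /tmp/d1; sleep 1; cat /proc/diskstats > /tmp/d2
#   disk_util /tmp/d1 /tmp/d2 1000
```

The result is only as accurate as the interval passed in, so treat it as a quick check rather than a replacement for iostat.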
Network soft‑interrupts
# softirqs
cat /proc/softirqs
watch -d "cat /proc/softirqs | head -20"
# Network traffic
iftop -i eth0
nload eth0
%si above 10% often points to heavy network traffic or a DDoS attack.
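The counters in /proc/softirqs are cumulative, so what matters is how fast they grow. The following sketch diffs two saved snapshots and ranks the softirq types by growth; it assumes the usual layout (a header row of CPU names, then one "TYPE: count count ..." row per softirq type), and softirq_delta is a hypothetical name.

```shell
# softirq_delta BEFORE_FILE AFTER_FILE
# Sums each softirq type across all CPUs in both snapshots and prints
# "TYPE delta", largest increase first.
softirq_delta() {
  awk '
    FNR == 1 { next }                        # skip the CPU header row
    {
      sub(/:$/, "", $1)                      # "NET_RX:" -> "NET_RX"
      sum = 0
      for (i = 2; i <= NF; i++) sum += $i
      if (NR == FNR) before[$1] = sum        # still reading the first file
      else printf "%s %d\n", $1, sum - before[$1]
    }' "$1" "$2" | sort -k2,2nr
}

# Usage on a live system:
#   cat /proc/softirqs > /tmp/s1; sleep 5; cat /proc/softirqs > /tmp/s2
#   softirq_delta /tmp/s1 /tmp/s2 | head -3
```

If NET_RX or NET_TX tops the list by a wide margin, network traffic is the likely driver of the %si figure.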
System calls
# pidstat
pidstat 1 5
# syscalls
strace -c -p <PID>
perf stat -e 'syscalls:sys_enter_*' -a sleep 10
%sy above 30% may indicate excessive system calls.
Context switches
# vmstat
vmstat 1 5
# per‑process switches
pidstat -w 1 5
More than 1 M context switches per second can be problematic.
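The system-wide switch rate that vmstat shows in its cs column can also be computed directly from the cumulative ctxt counter in /proc/stat. A minimal sketch; the helper names read_ctxt and ctxt_rate are hypothetical.

```shell
# read_ctxt -> cumulative number of context switches since boot
read_ctxt() { awk '$1 == "ctxt" { print $2 }' /proc/stat; }

# ctxt_rate BEFORE AFTER SECONDS -> context switches per second
ctxt_rate() {
  awk -v a="$1" -v b="$2" -v s="$3" 'BEGIN {
    if (s <= 0) exit 1
    printf "%d\n", (b - a) / s
  }'
}

# Usage on a live system:
#   a=$(read_ctxt); sleep 5; b=$(read_ctxt)
#   ctxt_rate "$a" "$b" 5
```

Once the system-wide rate looks suspicious, pidstat -w narrows it down to individual processes.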
Memory pressure
# free
free -h
# meminfo
grep -E "(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree)" /proc/meminfo
dmesg | grep -i "out of memory"
slabtop
Insufficient memory that forces swapping can inflate CPU usage.
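For a one-line view of memory pressure, the relevant /proc/meminfo fields can be condensed into two numbers. This sketch assumes the kernel exposes MemAvailable (present on reasonably recent kernels) and that values are in kB; mem_summary is a hypothetical name.

```shell
# mem_summary [FILE]
# Reads /proc/meminfo (or a saved copy) and prints the percentage of memory
# still available plus the amount of swap in use, in MiB.
mem_summary() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    $1 == "SwapTotal:"    { stotal = $2 }
    $1 == "SwapFree:"     { sfree = $2 }
    END {
      if (total == 0) exit 1
      printf "available=%.1f%% swap_used_mib=%d\n",
             100 * avail / total, (stotal - sfree) / 1024
    }' "${1:-/proc/meminfo}"
}

# Usage on a live system:
#   mem_summary
```

A low available percentage combined with growing swap usage suggests the CPU load may be a side effect of memory pressure rather than of the workload itself.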
Comprehensive diagnostic commands
# dstat
dstat -tcmndylsp --top-cpu --top-mem 5
# sar
sar -u 1 10
sar -q 1 10
sar -w 1 10
# One‑click report
(
echo "=== System Load ==="
uptime
echo ""
echo "=== CPU Stats ==="
mpstat -P ALL 1 1
echo ""
echo "=== Memory Stats ==="
free -h
echo ""
echo "=== I/O Stats ==="
iostat -x 1 1
echo ""
echo "=== Network Stats ==="
ss -s
echo ""
echo "=== Top Processes ==="
ps aux --sort=-%cpu | head -20
) | tee "cpu_diagnosis_$(date +%Y%m%d_%H%M%S).txt"
Practical case study
Background
An e‑commerce API gateway spiked to 98% CPU, response time rose from 50 ms to 5 s.
Diagnosis timeline
14:32:15 – alert, SSH in
Step 1 (25 s) – top -c shows nginx workers at 780% on an 8‑core box.
Step 2 (90 s) – top -H -p 18234 shows all workers at ~100%; strace -c -p 18234 shows many epoll_wait calls; netstat and iftop show normal traffic.
Step 3 (60 s) – iostat shows %wa 2%; softirqs modest; access log reveals heavy calls to /api/product/detail.
Further investigation finds PHP‑FPM processes all in “R” state, MySQL connection timeouts, and a primary DB failure.
Root cause
Database connection timeout causing each request to wait 30 s.
Hundreds of concurrent waiting requests created a vicious cycle.
MySQL primary was down; VIP not switched.
Remediation
Manual primary‑replica switch (3 min).
Restart PHP‑FPM (1 min).
Gradually restore traffic (5 min).
Total recovery time: 12 minutes.
Takeaways
3‑step method quickly surfaces the symptom.
Don’t stop at nginx; trace to backend and DB.
Preserve logs and strace output for post‑mortem.
Best practices & prevention
Daily monitoring
Essential CPU metrics (Prometheus example):
groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage >80% for 3 minutes"
      - alert: HighIOWait
        expr: avg by(instance) (irate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "I/O wait >20%"
Application‑level optimizations
Set proper timeouts (Python example):
import requests
# timeout=(connect timeout, read timeout), both in seconds
response = requests.get('http://api.example.com/data', timeout=(3, 10))
Offload CPU‑heavy work to worker threads (Node.js):
const { Worker } = require('worker_threads');
function heavyComputation(data) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./heavy_task.js', { workerData: data });
    worker.on('message', resolve);
    worker.on('error', reject);
  });
}
Use token‑bucket rate limiting (Go):
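import "golang.org/x/time/rate"
// Steady rate of 100 requests/s, with bursts of up to 200.
limiter := rate.NewLimiter(rate.Limit(100), 200)
// Inside an HTTP handler: reject the request once the bucket is empty.
if !limiter.Allow() {
	http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
	return
}
System‑level tweaks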
Raise process priority and bind to CPUs:
# Increase priority (a negative nice value requires root)
renice -n -10 -p <PID>
# Bind the running process to cores 0-3 (-p targets an existing PID)
taskset -cp 0-3 <PID>
Adjust network parameters (sysctl):
net.core.netdev_max_backlog = 5000
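net.core.somaxconn = 1024
Select an appropriate I/O scheduler (SSD → noop, HDD → cfq; kernels 5.0 and later ship only multi‑queue schedulers, where the counterparts are none and mq-deadline or bfq):
# Show current scheduler (the active one is shown in brackets)
cat /sys/block/sda/queue/scheduler
# Set scheduler (requires root)
echo noop > /sys/block/sda/queue/scheduler # SSD
echo cfq > /sys/block/sda/queue/scheduler # HDD
Summary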
CPU spikes are common in operations. The “3‑step, 3‑minute” method (quickly lock onto the process, pinpoint the thread and code, then examine system‑level resources) lets you contain and resolve most incidents efficiently.
Key points
Layered diagnosis from process → thread → code.
Comprehensive view: CPU, I/O, network, memory, context switches.
Preserve evidence for post‑mortem.
Tool‑first approach: keep one‑click scripts ready.
Further learning
Book: “Systems Performance” by Brendan Gregg.
Tools: perf, eBPF, flamegraph.
Practice: regular load testing and chaos drills.
MaGe Linux Operations
