How I Pinpointed the Real Culprit of a 100% CPU Spike in Production in Just 3 Minutes
When a production server hit 100% CPU at 3 AM, the author applied a three-minute, step-by-step method: quickly identifying the offending process, drilling into its threads, and pinpointing the problematic code. The article also shares useful shell commands, common pitfalls, and advanced safeguards such as cgroup limits and eBPF tracing.
Midnight Alarm
"System is dead! The website is unreachable!" At 3:17 AM a colleague’s call woke me up. Logging into the jump host, SSH to the production server was painfully slow, and top showed CPU usage stuck at 100%.
Thousands of users were waiting; every second of delay meant business loss.
Why 100% CPU Is Terrifying
In my ten years in operations, a saturated CPU has been one of the most common performance problems. It causes:
Skyrocketing response times that degrade the user experience.
Queue buildup and memory pressure, potentially triggering OOM kills.
Scheduler starvation, to the point that even SSH logins may fail.
Alert storms and cascading timeout errors.
My 3‑Minute Diagnosis Rule
After many battles I distilled a fast‑track method. The key is a disciplined, step‑by‑step approach.
Minute 1: Stop the Bleeding
# Identify the misbehaving process
top -c -b -n 1 | head -20
# If top is too slow, use a lightweight ps
ps aux | head -1; ps aux | sort -rn -k3 | head -10
Tip: top -b -n 1 runs in batch (non-interactive) mode and exits after one sample, saving resources.
Example: a Java process appeared to consume 99.8% CPU, but the real culprit lay deeper.
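The Minute 1 step can be wrapped in a small helper so it is ready before the next incident. This is only a sketch: the function name top_cpu is hypothetical, and it assumes ps aux style input in which the first line is the header and column 3 is %CPU.

```shell
# Hypothetical helper: print the header line, then the N busiest
# processes, from `ps aux`-style text on stdin (column 3 is %CPU).
top_cpu() {
    local n="${1:-10}" header
    IFS= read -r header || return 1   # first line is the column header
    printf '%s\n' "$header"
    sort -rn -k3,3 | head -n "$n"     # remaining lines, highest %CPU first
}
```

Typical use would be `ps aux | top_cpu 10`, which keeps the header on top instead of letting it sink into the sorted output.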
Minute 2: Deep Thread Digging
# Find which thread inside the Java process is hot
top -H -p <PID>
# Or use a more precise method
ps -mp <PID> -o THREAD,tid,time | sort -rn -k2 | head -20
Convert the thread ID to hexadecimal (many overlook this step):
printf "%x\n" <THREAD_ID>
Minute 3: Locate the Faulty Code
# Export the thread stack
jstack <PID> | grep -A 30 <HEX_THREAD_ID>
# For non‑Java processes, trace system calls
strace -p <PID> -c -f
Using this method, I once traced a spike to a scheduled task stuck in an infinite loop while parsing malformed JSON.
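The hexadecimal conversion and the stack lookup from Minutes 2 and 3 can be chained so that nothing is copied by hand. The sketch below is illustrative: tid_to_nid and thread_stack are hypothetical names, and it assumes the thread dump labels each thread with nid=0x followed by the hexadecimal ID and separates thread blocks with a blank line. Check the format your JDK prints before relying on it.

```shell
# Convert a decimal thread ID (as shown by top -H) to the
# hexadecimal nid form assumed to appear in the thread dump.
tid_to_nid() {
    printf '0x%x\n' "$1"
}

# Print the stack block of one thread from a thread dump on stdin.
# Assumption: a thread block ends at the first blank line.
thread_stack() {
    local nid
    nid=$(tid_to_nid "$1")
    awk -v nid="nid=$nid" '
        # match the exact nid so 0x1a2 does not also hit 0x1a2b
        index($0, nid " ") || $0 ~ (nid "$") { show = 1 }
        show && /^$/ { exit }
        show { print }'
}
```

Typical use would be `jstack <PID> | thread_stack <THREAD_ID>`, passing the decimal ID exactly as top -H shows it. Matching the full nid token avoids the false positives a bare grep for the hex digits can produce.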
Common Pitfalls
Pitfall 1: Ignoring Load Average
CPU usage alone doesn't tell the whole story. In one case the load average was 80 while CPU usage was only 30%; the real bottleneck turned out to be blocked disk I/O.
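A rough way to tell these situations apart is to compare the load average with the core count before looking at CPU usage. On Linux the load average also counts tasks in uninterruptible sleep, which is why blocked I/O inflates it. The sketch below is a rule of thumb only: the function name classify_load and the 80% cut-off are assumptions chosen for illustration, not values from the original incident.

```shell
# Hypothetical rule of thumb (thresholds are illustrative assumptions):
#   load per core <= 1                    -> not-saturated
#   load per core > 1 and CPU busy >= 80% -> cpu-bound
#   load per core > 1 and CPU busy < 80%  -> io-or-lock-bound (check iowait)
# Arguments: 1-minute load average, number of cores, CPU busy percent.
classify_load() {
    local load1="$1" cores="$2" cpu_busy="$3"
    awk -v l="$load1" -v c="$cores" -v b="$cpu_busy" 'BEGIN {
        if (l / c <= 1)   print "not-saturated"
        else if (b >= 80) print "cpu-bound"
        else              print "io-or-lock-bound"
    }'
}
```

On a live host the first two arguments could come from `cut -d' ' -f1 /proc/loadavg` and `nproc`.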
# Monitor three key metrics
uptime # check load average
iostat -x 1 # check I/O wait
vmstat 1 # overall system view
Pitfall 2: Brutally Killing Processes
Newcomers often kill the high‑CPU process outright, which can corrupt databases. Correct approach:
# Preserve the state
jstack <PID> > /tmp/jstack_$(date +%F-%T).txt
# Attempt graceful stop
kill -15 <PID>
# Wait 30 s, then force kill only if the process is still running
sleep 30
kill -0 <PID> 2>/dev/null && kill -9 <PID>
Pitfall 3: Ignoring Zombie Processes
# Check for zombie processes (state Z)
ps -eo pid,ppid,stat,comm | awk 'NR == 1 || $3 ~ /^Z/'
# Find the parent that created the zombie
ps -o ppid= -p <ZOMBIE_PID>
Zombie processes don't use CPU, but a large number of them signals a parent process that isn't reaping its children, which often precedes CPU spikes.
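When there are many zombies, grouping them by parent shows at a glance which process is failing to reap its children. A minimal sketch, assuming input lines of the form "PPID STAT"; the function name zombies_by_parent is hypothetical.

```shell
# Count zombie (state Z) children per parent PID.
# Reads "PPID STAT" pairs on stdin and prints "count ppid",
# with the worst offender first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}
```

Typical use would be `ps -e -o ppid= -o stat= | zombies_by_parent`.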
Advanced Tips: Prevention Beats Cure
1. Set CPU Limits
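# Use a cgroup to limit CPU (cgroup v1 path; 50000 µs out of the default 100000 µs period is half a core)
echo 50000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
# On cgroup v2 the equivalent setting lives in cpu.max
# echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# Or lower the scheduling priority with nice
nice -n 10 ./my_process.sh
2. Smart Alerting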
Don’t wait for 100% CPU to alert. My thresholds:
70% sustained 5 min → warning
85% sustained 3 min → critical
95% sustained 1 min → emergency
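The three tiers above can be expressed as a small check over recent samples. This is a sketch under stated assumptions: one CPU percentage per line, one sample per minute, oldest first. The function name alert_level is hypothetical, and a real monitoring system would normally evaluate such rules itself.

```shell
# Reads one CPU% sample per line on stdin (assumed: one per minute,
# oldest first) and prints the highest level whose threshold held
# for its whole window.
alert_level() {
    awk '
        { s[NR] = $1 + 0 }
        # returns 1 when the last mins samples are all at or above pct
        function sustained(pct, mins,    i) {
            if (NR < mins) return 0
            for (i = NR - mins + 1; i <= NR; i++) if (s[i] < pct) return 0
            return 1
        }
        END {
            if      (sustained(95, 1)) print "emergency"
            else if (sustained(85, 3)) print "critical"
            else if (sustained(70, 5)) print "warning"
            else                       print "ok"
        }'
}
```

Checking the most severe tier first means a short, sharp spike is reported as an emergency even when the longer windows have not filled yet.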
3. Automated Diagnosis Script
#!/bin/bash
# cpu_diagnose.sh
CPU_THRESHOLD=80
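# Total CPU in use = 100 - idle (reading only field 2 would capture user-space time alone)
CPU_IDLE=$(top -b -n1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)[% ]*id.*/\1/')
CPU_USAGE=$(echo "100 - $CPU_IDLE" | bc -l)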
if (( $(echo "$CPU_USAGE > $CPU_THRESHOLD" | bc -l) )); then
    DATE=$(date +%F-%T)
    DIAG_DIR="/var/log/cpu_diagnose/$DATE"
    mkdir -p "$DIAG_DIR"
    top -b -n 3 > "$DIAG_DIR/top.txt"
    ps aux > "$DIAG_DIR/ps.txt"
    iostat -x 3 3 > "$DIAG_DIR/iostat.txt"
    for pid in $(ps aux | grep java | grep -v grep | awk '{print $2}'); do
        jstack "$pid" > "$DIAG_DIR/jstack_$pid.txt" 2>&1
    done
    echo "Diagnostic information saved to $DIAG_DIR"
fi
New Trend: eBPF Is the Future
Traditional tools may fail under heavy load. eBPF (Extended Berkeley Packet Filter) offers low‑overhead, kernel‑level observability.
# Use bcc tools to trace CPU usage
/usr/share/bcc/tools/cpudist 10 1
# Flame-graph analysis (-f emits the folded stack format that flamegraph.pl reads)
/usr/share/bcc/tools/profile -F 99 -f -p <PID> 30 > out.stacks
flamegraph.pl < out.stacks > flamegraph.svg
Key advantages:
Very low overhead
No need to restart processes
Can trace kernel space
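Before rendering a flame graph, the captured stacks can be summarized right in the terminal. The sketch below assumes folded-format input: one semicolon-separated stack per line followed by a sample count, which bcc's profile tool emits with its -f flag. The function name hot_leaves is hypothetical.

```shell
# Reads folded stacks ("frame1;frame2;...;leaf count") on stdin and
# prints the N leaf frames with the most samples as "count leaf".
hot_leaves() {
    local n="${1:-10}"
    awk '
        NF == 0 { next }                  # skip blank lines
        {
            count = $NF                   # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)    # drop the trailing count
            k = split(stack, frames, ";")
            total[frames[k]] += count     # attribute samples to the leaf frame
        }
        END { for (f in total) print total[f], f }' | sort -rn | head -n "$n"
}
```

Typical use would be `hot_leaves 5 < out.stacks`. The leaf frame is where the CPU was actually executing when the sample was taken, so the top entries are a quick first guess at the hot code path.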
MaGe Linux Operations