
How I Pinpointed the Real Culprit of a 100% CPU Spike in Production in Just 3 Minutes

When a production server hit 100% CPU at 3 AM, the author used a three‑minute, step‑by‑step method: identify the offending process, drill into its threads, and pinpoint the problematic code. The walkthrough covers useful shell commands, common pitfalls, and advanced safeguards such as cgroup limits and eBPF tracing.

MaGe Linux Operations

Production CPU Spike: How I Found the Real Culprit in 3 Minutes

Midnight Alarm

"System is dead! The website is unreachable!" At 3:17 AM a colleague’s call woke me up. I logged into the jump host; SSH to the production server was painfully slow, and top showed CPU usage stuck at 100%.

Thousands of users were waiting; every second of delay meant business loss.

Why 100% CPU Is Terrifying

In ten years of operations work, I have found a saturated CPU to be one of the most common performance problems. The consequences:

Response times skyrocket, degrading user experience.

Request queues build up and memory pressure rises, potentially triggering the OOM killer.

The scheduler is starved, so even an SSH login may fail.

Alert storms and cascading timeout errors follow.

My 3‑Minute Diagnosis Rule

After many battles I distilled a fast‑track method. The key is a disciplined, step‑by‑step approach.

Minute 1: Quick Stop‑Bleeding

# Identify the misbehaving process
top -c -b -n 1 | head -20

# If top is too slow, use a lightweight ps
ps aux | head -1; ps aux | sort -rn -k3 | head -10

Tip: top -b -n 1 runs in batch (non‑interactive) mode and exits after one sample, saving resources.

In this incident, top showed a Java process at 99.8% CPU, but the real culprit lay deeper, inside one of its threads.

Minute 2: Deep Thread Digging

# Find which thread inside the Java process is hot
top -H -p <PID>

# Or use a more precise method
ps -mp <PID> -o THREAD,tid,time | sort -rn -k2 | head -20

Convert the thread ID to hexadecimal, because jstack typically reports native thread IDs in hex as nid=0x... (many overlook this step):

printf "%x\n" <THREAD_ID>

Minute 3: Locate the Faulty Code

# Export the thread stack
jstack <PID> | grep -A 30 "nid=0x<HEX_THREAD_ID>"

# For non‑Java processes, trace system calls
strace -p <PID> -c -f

Using this method I once traced a spike to a scheduled task stuck in an infinite loop while parsing malformed JSON.
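The hex conversion and the stack lookup can be folded into one helper that works on a saved dump. This is a minimal sketch, not the author's tooling: the function name stack_for_tid is made up here, and it assumes the dump marks threads with the nid=0x... form and separates them with blank lines.

```shell
# stack_for_tid TID DUMP_FILE
# Print the full stack block of the thread whose native id matches TID.
# TID is the decimal thread id shown by top -H; DUMP_FILE is a saved jstack dump.
stack_for_tid() {
    # Build the marker jstack is assumed to use, e.g. nid=0x3039
    nid=$(printf 'nid=0x%x' "$1")
    # RS="" puts awk in paragraph mode: one record per blank-line-separated block.
    # The trailing space avoids matching a longer id that shares the same prefix.
    awk -v nid="$nid" 'BEGIN { RS = "" } index($0, nid " ") { print; print "" }' "$2"
}

# Usage (placeholders as in the commands above):
#   jstack <PID> > /tmp/jstack.txt
#   stack_for_tid <THREAD_ID> /tmp/jstack.txt
```

Working from a saved file means the dump is taken once and can be searched for several hot threads without attaching to the JVM again.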

Common Pitfalls

Pitfall 1: Ignoring Load Average

CPU usage alone doesn’t tell the whole story. In one case the load average was 80 while CPU usage was only 30%; the cause turned out to be blocked disk I/O.

# Monitor three key metrics
uptime          # check load average
iostat -x 1    # check I/O wait
vmstat 1        # overall system view
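One rough way to read the load average is to divide it by the number of cores. On Linux the load average also counts tasks blocked in uninterruptible I/O, so a value far above 1 per core alongside low CPU usage points toward I/O rather than computation. The helper below is only a sketch; the name load_per_core is hypothetical.

```shell
# load_per_core LOAD CORES
# Print the load average normalised by core count, with two decimals.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Usage on a live Linux host:
#   load_per_core "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the figures from the case above, a load of 80 on a 16-core machine would give 5.00, far more runnable or blocked tasks than cores, even though the CPUs were mostly idle.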

Pitfall 2: Brutally Killing Processes

Newcomers often kill the high‑CPU process outright, which destroys the evidence and can corrupt data if the process is a database. A safer approach:

# Preserve the state
jstack <PID> > /tmp/jstack_$(date +%F-%T).txt
# Attempt graceful stop
kill -15 <PID>
# Wait 30 s, then force kill only if the process is still alive
sleep 30
kill -0 <PID> 2>/dev/null && kill -9 <PID>
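The same sequence can be wrapped in a function that polls instead of sleeping for a fixed time, so a process that exits quickly is not kept waiting. This is a sketch under stated assumptions: graceful_stop is a made-up name, the 30-second default mirrors the wait above, and state capture (the jstack step) is left to the caller.

```shell
# graceful_stop PID [TIMEOUT_SECONDS]
# Send SIGTERM, poll once per second, and fall back to SIGKILL after the timeout.
# Returns 0 if the process exited on its own, 1 if it had to be force-killed.
graceful_stop() {
    pid=$1
    timeout=${2:-30}
    # If the signal cannot be delivered, the process is already gone
    kill -15 "$pid" 2>/dev/null || return 0
    waited=0
    # kill -0 sends no signal; it only checks that the process still exists
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$waited" -ge "$timeout" ]; then
            kill -9 "$pid" 2>/dev/null
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Usage: graceful_stop <PID> 30
```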

Pitfall 3: Ignoring Zombie Processes

# Check for zombie processes
ps aux | grep defunct
# Find the parent that created the zombie
ps -ef | grep <ZOMBIE_PID> | grep -v grep

Zombie processes don’t use CPU, but a large number of them means the parent is failing to reap its children, and such parent‑process bugs often precede CPU spikes.
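When there are many zombies, grouping them by parent shows quickly which process is failing to reap its children. The filter below is a sketch; zombies_by_parent is a hypothetical name, and it expects "PPID STAT" pairs on standard input.

```shell
# zombies_by_parent
# Read "PPID STAT" pairs from stdin and print "COUNT PPID" for every parent
# that has zombie children, largest count first.
zombies_by_parent() {
    awk '$2 ~ /^Z/ { n[$1]++ } END { for (p in n) print n[p], p }' | sort -rn
}

# Usage (separate -o options keep the empty headers unambiguous):
#   ps -e -o ppid= -o stat= | zombies_by_parent
```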

Advanced Tips: Prevention Beats Cure

1. Set CPU Limits

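# Use cgroup (v1) to limit CPU: 50000 µs of quota per default 100000 µs period = 50% of one core
echo 50000 > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
# On cgroup v2 the equivalent is: echo "50000 100000" > /sys/fs/cgroup/myapp/cpu.max
# Or lower the scheduling priority (this does not cap usage)
nice -n 10 ./my_process.sh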
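The quota value is simple arithmetic: the allowed share of one core multiplied by the scheduling period. The helper below is a sketch of that calculation; cfs_quota_us is a made-up function name, and the 100000 µs default is the kernel's default CFS period.

```shell
# cfs_quota_us PERCENT [PERIOD_US]
# Print the quota in microseconds for a limit expressed as a percentage of one core.
# Values above 100 allow more than one core (150 means 1.5 cores).
cfs_quota_us() {
    percent=$1
    period=${2:-100000}
    echo $(( percent * period / 100 ))
}

# Usage (cgroup v1 path as in the command above):
#   echo "$(cfs_quota_us 50)" > /sys/fs/cgroup/cpu/myapp/cpu.cfs_quota_us
```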

2. Smart Alerting

Don’t wait for 100% CPU to alert. My thresholds:

70% sustained 5 min → warning

85% sustained 3 min → critical

95% sustained 1 min → emergency
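The "sustained" part of these thresholds can be sketched as a check over the most recent samples. The code below assumes one integer CPU percentage per minute, oldest first; the function names sustained and cpu_alert_level are hypothetical, and a real monitoring system would express this in its own rule language.

```shell
# sustained THRESHOLD N SAMPLE...
# Succeed only if at least N samples exist and the last N are all >= THRESHOLD.
sustained() {
    threshold=$1
    need=$2
    shift 2
    [ "$#" -ge "$need" ] || return 1
    # Drop older samples so that only the last $need remain
    while [ "$#" -gt "$need" ]; do shift; done
    for s in "$@"; do
        [ "$s" -ge "$threshold" ] || return 1
    done
    return 0
}

# cpu_alert_level SAMPLE...
# Map per-minute CPU samples to the three levels listed above, most severe first.
cpu_alert_level() {
    if   sustained 95 1 "$@"; then echo emergency
    elif sustained 85 3 "$@"; then echo critical
    elif sustained 70 5 "$@"; then echo warning
    else echo ok
    fi
}

# Usage: cpu_alert_level 60 72 75 80 71 78
```

Checking the most severe level first means a short, sharp spike is escalated immediately instead of being reported as a warning.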

3. Automated Diagnosis Script

#!/bin/bash
# cpu_diagnose.sh
CPU_THRESHOLD=80
# Busy CPU = 100 - idle, read from the second vmstat sample
# (the first sample reports averages since boot; assumes "id" is column 15)
CPU_USAGE=$(vmstat 1 2 | tail -1 | awk '{print 100 - $15}')
if (( CPU_USAGE > CPU_THRESHOLD )); then
    # Avoid colons in the directory name
    DATE=$(date +%F-%H%M%S)
    DIAG_DIR="/var/log/cpu_diagnose/$DATE"
    mkdir -p "$DIAG_DIR"
    top -b -n 3 > "$DIAG_DIR/top.txt"
    ps aux > "$DIAG_DIR/ps.txt"
    iostat -x 3 3 > "$DIAG_DIR/iostat.txt"
    # Dump thread stacks for every process whose name matches "java"
    for pid in $(pgrep java); do
        jstack "$pid" > "$DIAG_DIR/jstack_$pid.txt" 2>&1
    done
    echo "Diagnostic information saved to $DIAG_DIR"
fi
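The CPU check in a script like this can also be computed directly from two readings of the first line of /proc/stat, which avoids parsing the output of another tool. The function below is a sketch: cpu_busy_pct is a made-up name, and counting iowait as idle time is a choice made here, not something the article specifies.

```shell
# cpu_busy_pct FIRST_LINE SECOND_LINE
# Each argument is the aggregate "cpu ..." line of /proc/stat, taken at two moments.
# Prints the busy percentage over that interval, truncated to an integer.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i   # user through steal
            idle = $5 + $6                          # idle plus iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) { print 0; exit }
            printf "%d\n", 100 * (dt - (i1 - i0)) / dt
        }'
}

# Usage on a live Linux host:
#   a=$(head -1 /proc/stat); sleep 1; b=$(head -1 /proc/stat)
#   cpu_busy_pct "$a" "$b"
```

The guest columns are left out of the total because the kernel already includes guest time in the user column.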

New Trend: eBPF Is the Future

Traditional tools may fail under heavy load. eBPF (Extended Berkeley Packet Filter) offers low‑overhead, kernel‑level observability.

# Use bcc tools to trace CPU usage
/usr/share/bcc/tools/cpudist 10 1
# Flame‑graph analysis (-f emits folded stacks, the format flamegraph.pl expects)
/usr/share/bcc/tools/profile -F 99 -f -p <PID> 30 > out.stacks
flamegraph.pl < out.stacks > flamegraph.svg

Its main advantages:

Very low overhead

No need to restart processes

Can trace kernel space
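Before rendering a flame graph, a quick text summary of the hottest leaf functions is often enough to spot the culprit. The filter below is a sketch that assumes folded-stack input, one "frame;frame;frame COUNT" line per stack; hottest_frames is a hypothetical name.

```shell
# hottest_frames
# Read folded stacks from stdin and print "SAMPLES LEAF_FRAME" for the ten
# leaf frames that collected the most samples.
hottest_frames() {
    awk '{
            n = $NF                      # sample count is the last field
            sub(/ [0-9]+$/, "")          # strip the count, leaving the stack
            k = split($0, f, ";")        # the leaf frame is the last element
            hot[f[k]] += n
         }
         END { for (fn in hot) print hot[fn], fn }' | sort -rn | head -n 10
}

# Usage: hottest_frames < out.stacks
```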
