
How to Diagnose 100% CPU Spikes in 3 Minutes with 3 Simple Steps

This article walks you through a practical three‑step, three‑minute method for quickly identifying the root cause of a server CPU hitting 100%, covering process identification, thread and code pinpointing, and system‑level resource analysis across Linux environments.

MaGe Linux Operations

Introduction

On a Friday afternoon, a production server raised an alert: CPU usage had surged to 98% and stayed there for three minutes, threatening a slowdown or an outage. This article shares a "3‑step, 3‑minute" method for locating the root cause of high CPU quickly.

Understanding CPU Usage

CPU usage composition

CPU usage is often treated as a single percentage, but Linux breaks it down into several fields.

# top command example
%Cpu(s):  25.3 us,  5.2 sy,  0.0 ni, 68.1 id,  1.2 wa,  0.0 hi,  0.2 si,  0.0 st

Field meanings:

us : user‑space CPU time

sy : kernel‑space CPU time

ni : user‑space CPU time of niced (lowered‑priority) processes

id : idle CPU time

wa : I/O wait time

hi : hardware interrupt time

si : software interrupt time

st : stolen time (virtualized environments)
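
These percentages are derived from cumulative tick counters, so a reading only makes sense as the difference between two samples. The sketch below shows the arithmetic, assuming the aggregate `cpu` line of `/proc/stat` lists its counters in the order user, nice, system, idle, iowait, irq, softirq, steal. The function names are illustrative and not part of any tool.

```python
# Order of the first eight counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ["us", "ni", "sy", "id", "wa", "hi", "si", "st"]


def parse_cpu_line(line):
    """Turn 'cpu  u n s i w hi si st ...' into a dict of cumulative ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, map(int, parts[1:1 + len(FIELDS)])))


def cpu_breakdown(before, after):
    """Percentage of elapsed ticks spent in each state between two samples."""
    delta = {k: after[k] - before[k] for k in FIELDS}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {k: round(100.0 * v / total, 1) for k, v in delta.items()}


def read_cpu_sample(path="/proc/stat"):
    """Read one sample from a live Linux system (not called automatically)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

On a live host, call `read_cpu_sample()` twice about a second apart and pass both results to `cpu_breakdown`.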

Common high‑CPU types

Based on metrics, high CPU can be classified as:

us high : application code (loops, regex, etc.)

sy high : excessive system calls, network packet handling

wa high : I/O wait; the CPU sits idle while disk requests are outstanding

si high : soft interrupts, often heavy network traffic

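Key insight : a CPU figure near 100% does not always mean the CPU is doing work. Monitors that compute usage as 100 minus idle count iowait as busy, even though the CPU is only waiting for I/O to complete.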
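
As a rough illustration of this classification, the sketch below parses a `%Cpu(s):` summary line like the one shown earlier and flags which fields look elevated. The wa, si and sy thresholds are the approximate figures used later in this article. The `busy_us` cut-off and all function names are assumptions made for the example.

```python
import re

# Rough thresholds taken from this article; treat them as starting points.
THRESHOLDS = {"wa": 20.0, "si": 10.0, "sy": 30.0}
HINTS = {
    "wa": "I/O wait: check disks (iostat, iotop)",
    "si": "soft interrupts: check network traffic",
    "sy": "kernel time: check system call volume",
}


def parse_top_cpu_line(line):
    """Extract {'us': 25.3, 'sy': 5.2, ...} from top's '%Cpu(s):' line."""
    pairs = re.findall(r"([\d.]+)\s+([a-z]{2})\b", line)
    return {name: float(value) for value, name in pairs}


def classify(fields, busy_us=50.0):
    """Return hints for whichever fields exceed their threshold."""
    hints = [HINTS[k] for k, limit in THRESHOLDS.items()
             if fields.get(k, 0.0) > limit]
    if fields.get("us", 0.0) > busy_us:  # illustrative cut-off, not from the article
        hints.append("user time: profile application code")
    return hints
```

An empty result means none of the fields crossed a threshold, which is the case for the sample line above.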

Why the first 3 minutes matter

CPU spikes cause slower responses, queue buildup, monitoring gaps, and a narrow window for evidence collection. Quick containment beats perfect analysis.

Core Content: 3‑Step Diagnosis

Step 1 – Pinpoint the offending process (≈30 s)

Using top

# Launch top with full command lines (processes are sorted by %CPU by default)
top -c

# Key shortcuts:
# P: sort by CPU
# M: sort by memory
# c: show full command line
# 1: per‑CPU view

Important fields to watch:

PID and command

CPU% (note multi‑core can exceed 100%)

TIME+ cumulative CPU time

S column: process state (R = running, D = uninterruptible sleep, usually waiting on I/O)

Using htop (if installed)

htop
# F5: tree view
# F6: sort field
# F9: kill process

Using ps

# Top 10 CPU processes
ps aux | head -1; ps aux | sort -rn -k3 | head -10

# Show processes of a user
ps -u www-data -o pid,ppid,%cpu,%mem,cmd --sort=-%cpu | head -20

One‑click diagnostic script

#!/bin/bash
echo "========== CPU overall =========="
uptime
echo ""
echo "========== CPU detail =========="
top -bn1 | head -20
echo ""
echo "========== TOP 10 processes =========="
ps aux --sort=-%cpu | head -11
# ... (rest omitted for brevity)

Case: a Java process showed about 350% CPU on a 4‑core box because four threads were each consuming most of a core.
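
To make the multi-core percentage concrete, here is a small sketch of the conversion, assuming top is in its default mode where 100% equals one full core. The helper names are made up for illustration.

```python
import os


def busy_cores(process_cpu_percent):
    """How many cores' worth of CPU time the process is consuming."""
    return process_cpu_percent / 100.0


def share_of_machine(process_cpu_percent, cores=None):
    """Convert per-process %CPU (100 = one core) to a share of the whole host."""
    cores = cores or os.cpu_count() or 1
    return process_cpu_percent / cores
```

For the case above, `busy_cores(350)` gives 3.5 cores and `share_of_machine(350, 4)` gives 87.5% of the machine.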

Step 2 – Drill to thread and code (≈90 s)

Identify the thread and source line causing the load.

Show threads of a process

# top in thread mode
top -H -p <PID>

# ps thread view
ps -Lp <PID> -o pid,lwp,ppid,pcpu,comm --sort=-pcpu | head -20
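
The same per-thread figures can be derived from `/proc/<PID>/task/<TID>/stat`. The sketch below is one way to rank threads by CPU ticks consumed between two snapshots. It assumes the field layout documented in proc(5), where utime and stime are fields 14 and 15, and the function names are illustrative.

```python
def parse_task_stat(line):
    """Parse one /proc/<pid>/task/<tid>/stat line into (tid, name, cpu_ticks)."""
    # The thread name is wrapped in parentheses and may contain spaces or ')',
    # so split around the first '(' and the last ')'.
    left, _, rest = line.rpartition(")")
    tid_text, _, name = left.partition("(")
    fields = rest.split()
    utime, stime = int(fields[11]), int(fields[12])  # fields 14 and 15 in proc(5)
    return int(tid_text), name, utime + stime


def hottest_threads(before_lines, after_lines, top_n=3):
    """Rank threads by CPU ticks used between two snapshots of stat lines."""
    start = {tid: ticks for tid, _, ticks in map(parse_task_stat, before_lines)}
    ranked = []
    for tid, name, ticks in map(parse_task_stat, after_lines):
        # A thread created during the interval has no earlier sample.
        ranked.append((ticks - start.get(tid, 0), tid, name))
    ranked.sort(reverse=True)
    return [(tid, name, used) for used, tid, name in ranked[:top_n]]
```

Each snapshot is a list containing the single line read from every `stat` file under `/proc/<PID>/task/`.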

Record the thread ID (LWP) with the highest CPU usage and convert it to hexadecimal, because jstack prints each thread's native ID in hex in its nid field.

# Convert thread ID to hex
printf "0x%x\n" 12345   # => 0x3039

Java thread to code

# Export Java thread stack
jstack <PID> > /tmp/jstack.log
# Search for hex thread ID
grep -A 20 "0x3039" /tmp/jstack.log

Typical high‑CPU patterns: infinite loops, costly regex, serialization, heavy SQL/ORM.
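
The lookup of a thread ID in the jstack dump can also be scripted. This is a minimal sketch, assuming the dump separates thread entries with blank lines and prints the native ID in hex as `nid=0x...` on each header line. It is worth checking the format on your JDK version, and the function name is hypothetical.

```python
def find_thread_stack(jstack_text, lwp):
    """Return the jstack entry whose native id (nid) matches a decimal LWP."""
    needle = "nid=" + hex(lwp)  # e.g. 12345 -> "nid=0x3039"
    for entry in jstack_text.split("\n\n"):
        header = entry.strip().splitlines()[:1]
        # Compare whole tokens so 0x3039 does not also match 0x30391.
        if header and needle in header[0].split():
            return entry.strip()
    return None
```

Pass it the contents of `/tmp/jstack.log` and the LWP reported by `top -H`. It returns the matching stack as text, or `None` if no thread has that ID.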

Python/Node.js

# py‑spy
py-spy top --pid <PID>
py-spy record -o profile.svg --pid <PID> --duration 30

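# Node.js
kill -USR1 <PID>              # enable the inspector on a running process, then attach a debugger
node --prof app.js            # or restart with the built-in V8 sampling profiler
clinic doctor -- node app.js  # or restart under Clinic.js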

Go

# pprof
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

C/C++

# perf
perf record -p <PID> -g -- sleep 30
perf report

Real‑world example: a Python data‑processing service spiked to 400% CPU because a pandas merge on 3 M rows blew up the amount of work; processing the data in chunks brought usage down to about 30%.

Step 3 – System‑level bottleneck analysis (≈60 s)

Sometimes the issue is not application code but resource limits.

I/O wait

# iostat
iostat -x 1 5

# Processes waiting on I/O
pidstat -d 1 5
iotop -o

%wa above 20% together with %util near 100% indicates a disk bottleneck.
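
For reference, a figure close to iostat's %util can be approximated from two snapshots of `/proc/diskstats`. The sketch assumes the documented layout in which the 13th column is the cumulative milliseconds the device spent doing I/O. Function names are illustrative.

```python
def parse_diskstats(text):
    """Map device name -> cumulative milliseconds spent doing I/O."""
    busy_ms = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 14:
            busy_ms[fields[2]] = int(fields[12])  # "time spent doing I/Os (ms)"
    return busy_ms


def disk_util_percent(before, after, interval_s):
    """Approximate %util: device busy time divided by wall-clock time."""
    interval_ms = interval_s * 1000.0
    return {
        dev: round(min(100.0, 100.0 * (after[dev] - before[dev]) / interval_ms), 1)
        for dev in after
        if dev in before
    }
```

Read `/proc/diskstats` twice, `interval_s` seconds apart, and pass the two parsed snapshots in.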

Network soft‑interrupts

# softirqs
cat /proc/softirqs
watch -d "cat /proc/softirqs | head -20"
# Network traffic
iftop -i eth0
nload eth0

%si above 10% often points to heavy network traffic or a DDoS attack.
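
The counters in `/proc/softirqs` are cumulative since boot, so what matters is how fast they grow. The following sketch sums each softirq type across CPUs and turns two snapshots into per-second rates. The helper names are assumptions for the example.

```python
def parse_softirqs(text):
    """Sum each softirq type across all CPUs from /proc/softirqs content."""
    totals = {}
    for line in text.splitlines():
        name, sep, counts = line.partition(":")
        if sep and counts.split():  # the "CPU0 CPU1 ..." header has no colon
            totals[name.strip()] = sum(int(c) for c in counts.split())
    return totals


def softirq_rates(before, after, interval_s):
    """Per-second rate for each softirq type, highest first."""
    rates = {k: (after[k] - before.get(k, 0)) / interval_s for k in after}
    return sorted(rates.items(), key=lambda kv: kv[1], reverse=True)
```

A NET_RX or NET_TX entry at the top of the result supports the network-traffic explanation.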

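System calls and kernel time

# Per-process CPU split (%usr vs %system)
pidstat 1 5
# System call summary for one process (Ctrl-C stops tracing and prints the table; adds overhead)
strace -c -p <PID>
# System-wide system call counts over 10 seconds
perf stat -e 'syscalls:sys_enter_*' -a sleep 10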

%sy above 30% may indicate excessive system calls.

Context switches

# vmstat
vmstat 1 5
# per‑process switches
pidstat -w 1 5

More than 1 M switches/sec can be problematic.
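
The system-wide switch rate that vmstat reports can be reproduced from the `ctxt` counter in `/proc/stat`. The sketch below is an illustration with hypothetical function names. The same helper also works for the `processes` counter, which tracks forks since boot.

```python
def read_counter(stat_text, key="ctxt"):
    """Return a cumulative counter ('ctxt', 'processes', ...) from /proc/stat content."""
    for line in stat_text.splitlines():
        parts = line.split()
        if parts and parts[0] == key:
            return int(parts[1])
    raise KeyError(key)


def per_second(before_text, after_text, interval_s, key="ctxt"):
    """Rate of change of a /proc/stat counter between two snapshots."""
    delta = read_counter(after_text, key) - read_counter(before_text, key)
    return delta / interval_s
```

Take two reads of `/proc/stat` a known number of seconds apart and compare the result with whatever threshold suits your core count and workload.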

Memory pressure

# free
free -h
# meminfo
grep -E "(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree)" /proc/meminfo
dmesg | grep -i "out of memory"
slabtop

When memory runs short, the kernel spends CPU time reclaiming pages and swapping, which shows up as higher sy and wa.
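
A quick way to quantify memory pressure is to express available memory and swap use as percentages. The sketch below parses `/proc/meminfo` content for that purpose. The function names and the choice of these two ratios are assumptions, not an established metric.

```python
def parse_meminfo(text):
    """Map each /proc/meminfo key to its numeric value (kB for most keys)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            info[key] = int(rest.split()[0])
    return info


def memory_pressure(info):
    """Summarise available memory and swap use as percentages."""
    swap_total = info.get("SwapTotal", 0)
    swap_used = swap_total - info.get("SwapFree", 0)
    return {
        "available_pct": round(100.0 * info["MemAvailable"] / info["MemTotal"], 1),
        "swap_used_pct": round(100.0 * swap_used / swap_total, 1) if swap_total else 0.0,
    }
```

A low `available_pct` combined with a rising `swap_used_pct` suggests the CPU time is going into reclaim and swapping.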

Comprehensive diagnostic commands

# dstat
dstat -tcmndylsp --top-cpu --top-mem 5

# sar
sar -u 1 10
sar -q 1 10
sar -w 1 10

# One‑click report
(
  echo "=== System Load ==="
  uptime
  echo ""
  echo "=== CPU Stats ==="
  mpstat -P ALL 1 1
  echo ""
  echo "=== Memory Stats ==="
  free -h
  echo ""
  echo "=== I/O Stats ==="
  iostat -x 1 1
  echo ""
  echo "=== Network Stats ==="
  ss -s
  echo ""
  echo "=== Top Processes ==="
  ps aux --sort=-%cpu | head -20
) | tee cpu_diagnosis_$(date +%Y%m%d_%H%M%S).txt

Practical case study

Background

An e‑commerce API gateway spiked to 98% CPU, response time rose from 50 ms to 5 s.

Diagnosis timeline

14:32:15 – alert, SSH in

Step 1 (25 s) – top -c shows nginx workers at 780% on an 8‑core box.

Step 2 (90 s) – top -H -p 18234 shows all workers at ~100%; strace -c -p 18234 shows many epoll_wait calls; netstat and iftop show normal traffic.

Step 3 (60 s) – iostat shows %wa 2%; softirqs modest; access log reveals heavy calls to /api/product/detail.

Further investigation finds PHP‑FPM processes all in “R” state, MySQL connection timeouts, and a primary DB failure.

Root cause

Database connection timeout causing each request to wait 30 s.

Hundreds of concurrent requests piled up waiting on the database, keeping the workers busy in a loop.

MySQL primary was down; VIP not switched.

Remediation

Manual primary‑replica switch (3 min).

Restart PHP‑FPM (1 min).

Gradually restore traffic (5 min).

Total recovery time: 12 minutes, including the roughly 3 minutes of diagnosis.

Takeaways

3‑step method quickly surfaces the symptom.

Don’t stop at nginx; trace to backend and DB.

Preserve logs and strace output for post‑mortem.

Best practices & prevention

Daily monitoring

Essential CPU metrics (Prometheus example):

groups:
- name: cpu_alerts
  rules:
  - alert: HighCPUUsage
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "CPU usage >80% for 3 minutes"
  - alert: HighIOWait
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="iowait"}[5m])) * 100 > 20
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "I/O wait >20%"
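
The arithmetic behind the first expression may be easier to follow in plain code. The sketch below is a simplified stand-in for what the query computes for one instance. It ignores counter resets and the extrapolation Prometheus performs, and the sample data layout is an assumption.

```python
def per_second_rate(samples):
    """Average per-second increase over a list of (timestamp, counter) samples."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def cpu_usage_percent(idle_samples_by_core):
    """100 minus the average idle rate across cores, as a percentage."""
    rates = [per_second_rate(s) for s in idle_samples_by_core.values()]
    return 100.0 - (sum(rates) / len(rates)) * 100.0
```

Because the idle counter grows by at most one second per second per core, its rate is the idle fraction, and everything else (including iowait) counts as usage.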

Application‑level optimizations

Set proper timeouts (Python example):

import requests
# timeout=(connect, read) in seconds: fail fast instead of hanging on a dead backend
response = requests.get('http://api.example.com/data', timeout=(3, 10))

Offload CPU‑heavy work to worker threads (Node.js):

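const { Worker } = require('worker_threads');
// Run heavy_task.js off the main thread so the event loop stays responsive
function heavyComputation(data) {
  return new Promise((resolve, reject) => {
    const worker = new Worker('./heavy_task.js', { workerData: data });
    worker.on('message', resolve);
    worker.on('error', reject);
    // Reject if the worker dies without posting a result
    worker.on('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker stopped with exit code ${code}`));
    });
  });
}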

Use token‑bucket rate limiting (Go):

import "golang.org/x/time/rate"
// Allow 100 requests per second on average, with bursts of up to 200
limiter := rate.NewLimiter(rate.Limit(100), 200)
if !limiter.Allow() {
    http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
    return
}
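
For readers unfamiliar with the algorithm, here is a minimal token-bucket sketch showing the idea behind such a limiter. It is an illustration rather than a port of the Go package: it is not thread-safe, and the class name and injectable `clock` parameter are choices made for this example.

```python
import time


class TokenBucket:
    """Minimal token bucket: refills `rate` tokens per second up to `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = float(rate)
        self.burst = float(burst)
        self.clock = clock
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.last = clock()

    def allow(self):
        """Consume one token if available; False means the request is rejected."""
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With `TokenBucket(100, 200)`, a sudden burst of up to 200 requests passes, after which requests are admitted at about 100 per second. Rejected requests cost almost no CPU, which is the point during a spike.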

System‑level tweaks

Raise process priority and bind to CPUs:

# Increase priority (negative nice values require root)
renice -n -10 -p <PID>
# Bind a running process to cores 0‑3 (-p targets an existing PID)
taskset -cp 0-3 <PID>

Adjust network parameters (sysctl):

net.core.netdev_max_backlog = 5000
net.core.somaxconn = 1024

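Select an appropriate I/O scheduler. On older kernels the usual choices were noop for SSDs and cfq for HDDs; kernels 5.0 and later ship only the multi‑queue schedulers, where none and mq-deadline (or bfq) fill the corresponding roles:

# Show available schedulers (the active one is in brackets)
cat /sys/block/sda/queue/scheduler
# Set scheduler (legacy kernels)
echo noop > /sys/block/sda/queue/scheduler   # SSD
echo cfq  > /sys/block/sda/queue/scheduler   # HDD
# Set scheduler (multi-queue kernels)
echo none        > /sys/block/sda/queue/scheduler   # SSD
echo mq-deadline > /sys/block/sda/queue/scheduler   # HDD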

Summary

CPU spikes are common in operations. The “3‑step, 3‑minute” method—quickly lock the process, pinpoint the thread/code, and examine system‑level resources—lets you contain and resolve most incidents efficiently.

Key points

Layered diagnosis from process → thread → code.

Comprehensive view: CPU, I/O, network, memory, context switches.

Preserve evidence for post‑mortem.

Tool‑first approach: keep one‑click scripts ready.

Further learning

Book: “Systems Performance” by Brendan Gregg.

Tools: perf, eBPF, flamegraph.

Practice: regular load testing and chaos drills.

