Rapid CPU Spike Diagnosis: Resolve High CPU Usage in Under 5 Minutes
This guide presents a step‑by‑step, standardized process for detecting, analyzing, and fixing sudden CPU usage spikes on Linux servers, covering preparation, quick identification, deep thread‑level investigation, stack and system‑call analysis, flame‑graph generation, emergency mitigation, and best‑practice recommendations.
Overview
CPU usage spikes are common in production, often leaving operators scrambling to identify the root cause. This document defines a standardized SOP that enables a complete evidence chain within five minutes.
Technical Characteristics
Fast response: initial assessment completed within 5 minutes.
Evidence preservation: each step records data for later review.
Root‑cause orientation: goes beyond merely killing processes.
Standardized workflow: the SOP can be executed by any team member.
Applicable Scenarios
Production CPU alerts.
Performance regressions after a new release.
Periodic CPU spikes requiring post‑mortem analysis.
Post‑incident reporting.
Environment Requirements
OS: CentOS 7+/Ubuntu 18.04+/Debian 9+.
Kernel ≥ 3.10 (required for perf).
Tools: sysstat, perf, strace, htop, iotop, lsof.
Root or sudo privileges.
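Before an incident hits, it is worth confirming these prerequisites with a small pre-flight check. A sketch (the tool list mirrors the one above):

```shell
#!/usr/bin/env bash
# Pre-flight check (sketch): confirm the toolkit and kernel version are in place.
required="mpstat pidstat perf strace htop iotop lsof"
missing=""
for tool in $required; do
    command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done
if [ -n "$missing" ]; then
    echo "Missing tools:$missing"
else
    echo "All required tools present"
fi
# perf needs kernel support; the guide assumes >= 3.10
kernel=$(uname -r | cut -d. -f1-2)
echo "Kernel version: $kernel"
```

Running this from cron or a configuration-management check catches missing tools before they are needed under pressure.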
Step‑by‑Step Procedure
2.1 Preparation
Ensure all required tools are installed.
# CentOS/RHEL
sudo yum install -y sysstat perf strace htop iotop lsof
# Ubuntu/Debian
sudo apt install -y sysstat linux-tools-common linux-tools-$(uname -r) strace htop iotop lsof
# Verify installation
which mpstat pidstat perf strace
Create a directory to store the incident data and record the start time.
INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cd "$INCIDENT_DIR"
echo "=== CPU incident collection start ===" > timeline.log
echo "Start time: $(date '+%Y-%m-%d %H:%M:%S')" >> timeline.log
2.2 Fast Identification (≤ 2 min)
Collect overall load and per‑CPU usage.
# System load
uptime
# Per‑CPU usage
mpstat -P ALL 1 3
Identify the top CPU‑consuming process.
TOP_PID=$(ps aux --sort=-%cpu | awk 'NR==2{print $2}')
echo "Top PID: $TOP_PID" >> timeline.log
2.3 Thread‑Level Investigation
List threads of the suspect process and find the hottest thread.
top -Hp $TOP_PID -bn1 | head -20 > threads_snapshot.txt
pidstat -t -p $TOP_PID 1 5 > pidstat_threads.txt
Pick the hottest thread's TID from the pidstat output, then convert it to the hexadecimal format required by Java stack traces.
THREAD_ID=12346  # TID of the hottest thread (example value matching 0x303a)
printf "%x\n" $THREAD_ID  # prints 303a; jstack shows it as nid=0x303a
2.4 Stack Analysis
For Java processes, use jstack to dump the stack and search for the thread ID.
jstack $TOP_PID > jstack.txt
grep -A30 "nid=0x303a" jstack.txt
For native binaries, use pstack or gdb to obtain backtraces.
pstack $TOP_PID > pstack.txt
# or
gdb -p $TOP_PID -batch -ex "thread apply all bt" > gdb_bt.txt
2.5 System‑Call Profiling
Briefly trace system calls with strace (limit to 10 s).
timeout 10 strace -c -p $TOP_PID 2> strace_summary.txt
# Detailed trace if needed (strace writes to stderr, hence 2>)
timeout 5 strace -tt -T -p $TOP_PID 2> strace_detail.txt
2.6 Perf Flame‑Graph (Optional)
Collect perf data for 30 s and generate a flame‑graph if the FlameGraph scripts are available.
perf record -F 99 -p $TOP_PID -g -- sleep 30
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg
2.7 Emergency Mitigation
Lower the process priority:
renice 19 -p $TOP_PID
Limit CPU with cgroups (cgroup v1 interface; example caps at 50 %):
sudo cgcreate -g cpu:/limit_group
echo 50000 | sudo tee /sys/fs/cgroup/cpu/limit_group/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/limit_group/cpu.cfs_period_us
sudo cgclassify -g cpu:/limit_group $TOP_PID
If the issue persists, restart the service after data collection.
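The cgcreate example above uses the cgroup v1 controller paths. On distributions that have switched to cgroup v2 (the unified hierarchy), the same 50 % cap is expressed through the cpu.max file instead. A hedged sketch, assuming /sys/fs/cgroup is the v2 mount point, reusing the arbitrary group name limit_group, and intended to be run as root on the affected host:

```shell
# cgroup v2 sketch: a "quota period" pair written to cpu.max yields the same 50 % cap.
# Guarded so it degrades gracefully when run without root or on a cgroup v1 host.
CG=/sys/fs/cgroup/limit_group
TOP_PID="${TOP_PID:-$$}"  # PID from the identification step (falls back to shell PID here)
if mkdir -p "$CG" 2>/dev/null && [ -w "$CG/cpu.max" ]; then
    # 50000 us of CPU time per 100000 us period = 50 %; then move the process in
    { echo "50000 100000" > "$CG/cpu.max" && \
      echo "$TOP_PID" > "$CG/cgroup.procs" && \
      result="applied"; } 2>/dev/null || result="write failed (cpu controller enabled?)"
else
    result="not applied (needs root on a cgroup v2 host)"
fi
echo "cgroup v2 cap: $result"
```

Note that on v2 the cpu controller must be enabled in the parent's cgroup.subtree_control before cpu.max appears in the child group.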
2.8 Verification
Continuously monitor the process CPU after mitigation.
watch -n 1 "ps -p $TOP_PID -o %cpu,cmd"
Best Practices & Cautions
Always collect data before making changes.
Strace and perf can impact performance; limit their duration.
Avoid pausing production processes with gdb unless absolutely necessary.
Secure collected data – it may contain passwords or business‑critical information.
Automate the script via alertmanager webhooks for zero‑delay execution.
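As the last point suggests, the collection steps from section 2 can be strung together into a single script that an Alertmanager webhook (or an operator) fires immediately. A minimal sketch, with the optional tools guarded so it degrades gracefully when one is missing:

```shell
#!/usr/bin/env bash
# One-shot evidence collector (sketch); mirrors sections 2.1-2.3 of the guide.
INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cd "$INCIDENT_DIR" || exit 1
echo "=== CPU incident collection start: $(date '+%F %T') ===" > timeline.log
uptime > uptime.txt 2>/dev/null || true
# Identify the hottest process, falling back to the current shell if ps fails
TOP_PID=$(ps aux --sort=-%cpu 2>/dev/null | awk 'NR==2{print $2}')
TOP_PID="${TOP_PID:-$$}"
echo "Top PID: $TOP_PID" >> timeline.log
ps -p "$TOP_PID" -o pid,ppid,%cpu,%mem,cmd > process.txt 2>/dev/null || true
# Optional deeper snapshots; skipped silently when a tool is not installed
{ command -v mpstat  >/dev/null && mpstat -P ALL 1 3 > mpstat.txt; } || true
{ command -v pidstat >/dev/null && pidstat -t -p "$TOP_PID" 1 3 > pidstat_threads.txt; } || true
echo "Evidence collected in $INCIDENT_DIR"
```

Wiring this to an alert keeps the five-minute window realistic: the raw evidence is already on disk by the time a human opens a terminal.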
Monitoring Recommendations
Define Prometheus alerts for total CPU usage, system‑mode CPU, and iowait. Example thresholds: warning > 85 %, critical > 95 % for total CPU; system‑mode > 30 % as a warning.
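Expressed as Prometheus alerting rules, those thresholds might look like the sketch below. This assumes node_exporter's node_cpu_seconds_total metric; the for: durations are illustrative, and an iowait rule would follow the same pattern:

```shell
# Write an example rules file matching the thresholds above (sketch).
cat > cpu_alerts.yml <<'EOF'
groups:
  - name: cpu
    rules:
      - alert: HighCPUUsage        # total CPU > 85 % -> warning
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 5m
        labels:
          severity: warning
      - alert: CriticalCPUUsage    # total CPU > 95 % -> critical
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 95
        for: 2m
        labels:
          severity: critical
      - alert: HighSystemCPU       # system-mode CPU > 30 % -> warning
        expr: avg by (instance) (rate(node_cpu_seconds_total{mode="system"}[5m])) * 100 > 30
        for: 5m
        labels:
          severity: warning
EOF
echo "wrote cpu_alerts.yml"
```

Validate the file with promtool check rules before loading it into Prometheus.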
Summary
The presented SOP enables operators to capture a complete evidence chain within minutes, isolate the offending thread or system call, and apply safe mitigations while preserving data for post‑mortem analysis. Regular baseline collection and automated alert‑driven execution further reduce MTTR for CPU‑related incidents.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.