
Rapid CPU Spike Diagnosis: Resolve High CPU Usage in Under 5 Minutes

This guide presents a step‑by‑step, standardized process for detecting, analyzing, and fixing sudden CPU usage spikes on Linux servers, covering preparation, quick identification, deep thread‑level investigation, stack and system‑call analysis, flame‑graph generation, emergency mitigation, and best‑practice recommendations.

Raymond Ops

Overview

CPU usage spikes are common in production, often leaving operators scrambling to identify the root cause. This document defines a standardized SOP that enables a complete evidence chain within five minutes.

Technical Characteristics

Fast response: initial assessment completed within 5 minutes.

Evidence preservation: each step records data for later review.

Root‑cause orientation: goes beyond merely killing processes.

Standardized workflow: the SOP can be executed by any team member.

Applicable Scenarios

Production CPU alerts.

Performance regressions after a new release.

Periodic CPU spikes requiring post‑mortem analysis.

Post‑incident reporting.

Environment Requirements

OS: CentOS 7+/Ubuntu 18.04+/Debian 9+.

Kernel ≥ 3.10 (required for perf).

Tools: sysstat, perf, strace, htop, iotop, lsof.

Root or sudo privileges.

Step‑by‑Step Procedure

2.1 Preparation

Ensure all required tools are installed.

# CentOS/RHEL
sudo yum install -y sysstat perf strace htop iotop lsof

# Ubuntu/Debian
sudo apt install -y sysstat linux-tools-common linux-tools-$(uname -r) strace htop iotop lsof

# Verify installation
which mpstat pidstat perf strace

Create a directory to store the incident data and record the start time.

INCIDENT_DIR="/tmp/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"
cd "$INCIDENT_DIR"
echo "=== CPU incident collection start ===" > timeline.log
echo "Start time: $(date '+%Y-%m-%d %H:%M:%S')" >> timeline.log

2.2 Fast Identification (≤ 2 min)

Collect overall load and per‑CPU usage.

# System load
uptime

# Per‑CPU usage
mpstat -P ALL 1 3

Identify the top CPU‑consuming process.

TOP_PID=$(ps aux --sort=-%cpu | awk 'NR==2{print $2}')
echo "Top PID: $TOP_PID" >> timeline.log
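The quick-look commands above can be captured into the incident directory in one pass. A minimal sketch, assuming the $INCIDENT_DIR created in step 2.1 (it falls back to a temp directory for a dry run); file names are illustrative:

```shell
# Sketch: archive the quick-look output next to timeline.log.
# Assumes $INCIDENT_DIR from step 2.1; defaults to a temp dir for a dry run.
INCIDENT_DIR="${INCIDENT_DIR:-$(mktemp -d)}"
uptime > "$INCIDENT_DIR/uptime.txt"
# Header line plus the 10 busiest processes, sorted by %CPU
ps aux --sort=-%cpu | head -11 > "$INCIDENT_DIR/top_processes.txt"
# NR==2 is the first data row after the ps header, i.e. the top consumer
TOP_PID=$(awk 'NR==2{print $2}' "$INCIDENT_DIR/top_processes.txt")
echo "Top PID: $TOP_PID" | tee -a "$INCIDENT_DIR/timeline.log"
```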

2.3 Thread‑Level Investigation

List threads of the suspect process and find the hottest thread.

top -Hp $TOP_PID -bn1 | head -20 > threads_snapshot.txt
pidstat -t -p $TOP_PID 1 5 > pidstat_threads.txt
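Rather than eyeballing the snapshot, the hottest thread can be picked programmatically. A sketch assuming procps `ps` with `-L` thread output; the current shell's PID stands in for $TOP_PID in a dry run:

```shell
# Sketch: pull the busiest thread ID (TID) of the suspect process via ps -L.
# Falls back to the current shell's PID so the snippet runs standalone.
TOP_PID="${TOP_PID:-$$}"
# Sort threads by %CPU descending, keep the top TID
THREAD_ID=$(ps -L -p "$TOP_PID" -o tid,pcpu --no-headers | sort -k2 -nr | awk 'NR==1{print $1}')
echo "Hottest TID: $THREAD_ID"
```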

Convert the thread ID to the hexadecimal format required by Java stack traces.

printf "%x\n" $THREAD_ID   # e.g., 0x303a

2.4 Stack Analysis

For Java processes, use jstack to dump the stack and search for the thread ID.

jstack $TOP_PID > jstack.txt
grep -A30 "nid=0x303a" jstack.txt
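The hex conversion and the grep can be glued into one pass. A sketch: NID is an illustrative variable name, and THREAD_ID and jstack.txt come from the preceding steps (defaults are supplied so the snippet runs standalone):

```shell
# Sketch: derive the hex nid from the hot TID and pull its frames from the
# jstack dump. THREAD_ID and jstack.txt come from the steps above; the
# default 12346 reproduces the 0x303a example.
THREAD_ID="${THREAD_ID:-12346}"
NID=$(printf '%x' "$THREAD_ID")
grep -A30 "nid=0x$NID" jstack.txt 2>/dev/null || echo "nid=0x$NID not found"
```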

For native binaries, use pstack or gdb to obtain backtraces.

pstack $TOP_PID > pstack.txt
# or
gdb -p $TOP_PID -batch -ex "thread apply all bt" > gdb_bt.txt

2.5 System‑Call Profiling

Briefly trace system calls with strace (limit to 10 s).

# strace writes its trace to stderr, so capture with -o rather than stdout redirection
timeout 10 strace -c -p $TOP_PID -o strace_summary.txt
# Detailed trace if needed
timeout 5 strace -tt -T -p $TOP_PID -o strace_detail.txt

2.6 Perf Flame‑Graph (Optional)

Collect perf data for 30 s and generate a flame‑graph if the FlameGraph scripts are available.

perf record -F 99 -p $TOP_PID -g -- sleep 30
# stackcollapse-perf.pl and flamegraph.pl ship with Brendan Gregg's FlameGraph repo;
# ensure they are on $PATH or invoke them by full path
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg

2.7 Emergency Mitigation

Lower the process priority:

renice -n 19 -p $TOP_PID

Limit CPU with cgroups (v1 example capping at 50 %):

sudo cgcreate -g cpu:/limit_group
echo 50000 | sudo tee /sys/fs/cgroup/cpu/limit_group/cpu.cfs_quota_us
echo 100000 | sudo tee /sys/fs/cgroup/cpu/limit_group/cpu.cfs_period_us
sudo cgclassify -g cpu:/limit_group $TOP_PID
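On hosts running the unified cgroup‑v2 hierarchy (the default on most modern distributions), the same 50 % cap is expressed as a single cpu.max value instead of separate quota and period files. A sketch; the group name limit_group is illustrative and root is required:

```shell
# cgroup v2 equivalent of the 50 % cap above: 50000 µs of quota per
# 100000 µs period, written as one "cpu.max" value. Requires root;
# the group name "limit_group" is illustrative.
sudo mkdir -p /sys/fs/cgroup/limit_group
echo "50000 100000" | sudo tee /sys/fs/cgroup/limit_group/cpu.max
echo "$TOP_PID" | sudo tee /sys/fs/cgroup/limit_group/cgroup.procs
```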

If the issue persists, restart the service after data collection.

2.8 Verification

Continuously monitor the process CPU after mitigation.

watch -n 1 "ps -p $TOP_PID -o %cpu,cmd"
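For unattended verification, one polling iteration can be scripted instead of watching interactively. A sketch: the 20 % "back to normal" threshold is illustrative, and the current shell stands in for $TOP_PID in a dry run:

```shell
# Sketch: one iteration of a recovery check. 20 % is an illustrative
# threshold; the current shell's PID stands in for $TOP_PID.
TOP_PID="${TOP_PID:-$$}"
CPU=$(ps -p "$TOP_PID" -o %cpu --no-headers | tr -d ' ')
awk -v c="$CPU" 'BEGIN { if (c < 20) print "recovered"; else print "still hot" }'
```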

Best Practices & Cautions

Always collect data before making changes.

Strace and perf can impact performance; limit their duration.

Avoid pausing production processes with gdb unless absolutely necessary.

Secure collected data – it may contain passwords or business‑critical information.

Automate the script via alertmanager webhooks for zero‑delay execution.

Monitoring Recommendations

Define Prometheus alerts for total CPU usage, system‑mode CPU, and iowait. Example thresholds: warning > 85 %, critical > 95 % for total CPU; system‑mode > 30 % as a warning.
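The same thresholds can be sanity-checked locally without Prometheus. A rough sketch from two /proc/stat samples; it counts only user+nice+system as busy, so it is illustrative rather than a substitute for proper monitoring:

```shell
# Rough total-CPU% from two /proc/stat samples, compared against the
# 85 %/95 % thresholds above. Illustrative: busy = user + nice + system.
read -r _ u1 n1 s1 i1 _ < /proc/stat
sleep 1
read -r _ u2 n2 s2 i2 _ < /proc/stat
busy=$(( (u2 + n2 + s2) - (u1 + n1 + s1) ))
total=$(( busy + (i2 - i1) ))
[ "$total" -gt 0 ] || total=1   # guard against a zero-tick interval
pct=$(( 100 * busy / total ))
if   [ "$pct" -ge 95 ]; then echo "CRITICAL: ${pct}%"
elif [ "$pct" -ge 85 ]; then echo "WARNING: ${pct}%"
else echo "OK: ${pct}%"; fi
```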

Summary

The presented SOP enables operators to capture a complete evidence chain within minutes, isolate the offending thread or system call, and apply safe mitigations while preserving data for post‑mortem analysis. Regular baseline collection and automated alert‑driven execution further reduce MTTR for CPU‑related incidents.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Monitoring, Performance, Linux, Troubleshooting, CPU, Shell
Written by Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.