How to Quickly Diagnose and Fix High CPU Usage on Linux: 10 Root Causes & Step‑by‑Step Guide
This guide walks you through detecting, analyzing, and resolving Linux CPU spikes by monitoring overall load, pinpointing the offending process, drilling down with tools like top, ps, strace, perf, and sar, and applying targeted fixes for the ten most common causes.
Background and Scope
When a Linux server's CPU utilization climbs above 80%, a single process may be hogging the CPU, the system may turn sluggish, and services may start timing out. This guide assumes RHEL 7+/Ubuntu 18.04+ with root or sudo privileges and the common monitoring tools (top, perf, strace, sar, etc.). A single CPU core at 100% is treated as the baseline; on multi‑core systems each core is evaluated independently.
Tool Matrix
top – native real‑time process monitor.
htop – enhanced top (yum/apt install).
perf – CPU profiling and flame‑graph data collection (yum install perf; on Ubuntu, apt install linux-tools-common linux-tools-$(uname -r)).
sar – historical system statistics (sysstat package).
strace – system‑call tracing.
mpstat / iostat / vmstat – per‑core and I/O metrics.
Quick Checklist
Monitor overall CPU usage and load average.
Identify the process consuming the most CPU.
Analyze the process’s threads and system calls.
Collect historical performance data.
Generate and interpret flame graphs.
Determine the root cause (e.g., code inefficiency, lock contention, I/O wait).
Apply a fix and plan a rollback.
Set up long‑term monitoring and alerts.
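The first two checklist items can be captured in one quick snapshot before any deeper digging; a minimal sketch using only /proc and ps:

```shell
# Quick triage: overall load plus the top five CPU consumers.
echo "load average (1/5/15 min): $(cut -d ' ' -f1-3 /proc/loadavg)"
ps aux --sort=-%cpu | head -6   # header line + top 5 processes by %CPU
```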
Step‑by‑Step Implementation
Step 1 – Quick Diagnosis of Overall CPU State
Check load average:
# Method 1: uptime
uptime
# Method 2: view /proc/loadavg
cat /proc/loadavg
Typical output:
10:45:32 up 10 days, 3:20, 2 users, load average: 2.45, 2.30, 2.15
Interpretation: load average: 2.45, 2.30, 2.15 gives the 1‑, 5‑, and 15‑minute averages.
Load > number of CPU cores indicates saturation.
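That saturation rule can be checked in one step; a small sketch comparing the 1‑minute load average against the core count:

```shell
# Flag saturation when the 1-minute load exceeds the number of cores.
load1=$(cut -d ' ' -f1 /proc/loadavg)
cores=$(nproc)
awk -v l="$load1" -v c="$cores" 'BEGIN {
    printf "load=%.2f cores=%d ratio=%.2f -> %s\n",
           l, c, l / c, (l > c) ? "saturated" : "ok"
}'
```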
Show CPU core count:
nproc
# or
grep -c '^processor' /proc/cpuinfo
Show per‑core utilization (real‑time):
# Method 1: mpstat (requires sysstat)
sudo mpstat -P ALL 1 5
# Method 2: top
top
Sample mpstat output:
CPU %usr %nice %sys %iowait %irq %soft %guest %idle
0 45.2 0.0 8.5 2.1 0.0 0.1 0.0 44.1
1 78.9 0.0 15.3 2.1 0.0 0.1 0.0 3.6
2 12.3 0.0 5.6 1.2 0.0 0.0 0.0 80.9
Key fields: %usr – user‑space CPU (application). %sys – kernel CPU. %iowait – CPU waiting for I/O. %idle – idle CPU.
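When sysstat is not installed, comparable per‑core numbers can be derived straight from /proc/stat; a sketch that samples twice, one second apart, and computes busy time from the jiffy deltas:

```shell
# Busy % per core = 1 - (idle+iowait delta) / (total jiffies delta).
cat /proc/stat > /tmp/stat.sample
sleep 1
awk 'NR==FNR { if ($1 ~ /^cpu[0-9]/) { idle[$1]=$5+$6; tot[$1]=$2+$3+$4+$5+$6+$7+$8 }; next }
     $1 ~ /^cpu[0-9]/ {
         dt = $2+$3+$4+$5+$6+$7+$8 - tot[$1]   # total jiffies elapsed
         di = $5+$6 - idle[$1]                 # idle + iowait jiffies elapsed
         if (dt > 0) printf "%s busy %.1f%%\n", $1, 100 * (dt - di) / dt
     }' /tmp/stat.sample /proc/stat
```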
Step 2 – Locate High‑CPU Process
Use top and sort by CPU:
top
Interactive shortcuts: Shift+P – sort by CPU (the default). Shift+M – sort by memory. d, then 1 – set a one‑second refresh interval. q – quit.
Sample top output:
PID USER PR NI VIRT RES %CPU %MEM TIME+ COMMAND
5678 www 20 0 512m 256m 95.2 5.6 45:23 java -jar app.jar
1234 mysql 20 0 1.5g 800m 12.3 22.1 102:45 /usr/sbin/mysqld
Static snapshot with ps:
# Top 10 CPU consumers
ps aux --sort=-%cpu | head -10
# Inspect a specific PID
ps -p 5678 -o pid,cmd,%cpu,%mem
# Show all threads of a PID
ps -p 5678 -L -o pid,tid,cmd,%cpu
Step 3 – Deep Dive with System‑Call Tracing
Trace all syscalls of the offending process:
# Full trace with statistics
sudo strace -p 5678 -e trace=all -c
# Focus on time‑consuming calls
sudo strace -p 5678 -c -S time
Typical strace -c output highlights high‑frequency calls such as futex, poll, read, and write. A high %time indicates a hotspot that needs optimization.
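To see that summary format without attaching to a production PID, strace can be pointed at a short, harmless command instead; a sketch that skips gracefully where strace or ptrace is unavailable:

```shell
# Demo of the -c summary table on a short-lived command.
command -v strace >/dev/null 2>&1 || { echo "strace not installed"; exit 0; }
strace -c -o /tmp/strace_summary.txt true 2>/dev/null \
    || { echo "ptrace not permitted here"; exit 0; }
# The last row totals all syscalls; high %time rows are the hotspots.
grep -E 'time|total' /tmp/strace_summary.txt | head -3
```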
Step 4 – Generate CPU Flame Graph (Java Example)
Install the FlameGraph tools and collect perf data:
# Clone the FlameGraph repository
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph
# Record 30 seconds of CPU samples for PID 5678
sudo perf record -F 99 -p 5678 -g -- sleep 30
# Convert to a flame‑graph
sudo perf script > out.perf
stackcollapse-perf.pl out.perf | flamegraph.pl > cpu_flame.svg
Interpretation:
Width = total CPU time spent in a function (wider = hotter).
Height = call‑stack depth.
Clickable regions allow zooming for detailed analysis.
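The capture steps above can be bundled into a small reusable wrapper; a sketch (the flame.sh name is illustrative, and perf plus the FlameGraph scripts must be on PATH) that is written to disk and syntax‑checked here rather than run, since perf needs root and a live PID:

```shell
# Write the capture sequence to a script and verify it parses.
cat > /tmp/flame.sh <<'EOF'
#!/bin/sh
# Usage: flame.sh PID [seconds]
pid=${1:?usage: flame.sh PID [seconds]}
secs=${2:-30}
sudo perf record -F 99 -p "$pid" -g -- sleep "$secs"
sudo perf script > out.perf
stackcollapse-perf.pl out.perf | flamegraph.pl > "cpu_flame_${pid}.svg"
echo "wrote cpu_flame_${pid}.svg"
EOF
chmod +x /tmp/flame.sh
sh -n /tmp/flame.sh && echo "flame.sh parses cleanly"
```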
Step 5 – Ten Common Root Causes and Diagnosis Commands
1. CPU‑intensive business code – Symptom: high %usr. Command: perf report. Fix: optimise algorithm or add caching.
2. Infinite loop / recursion – Symptom: single process at 100% CPU, no I/O wait. Command: strace -p PID -c. Fix: review code, add logging.
3. Lock contention – Symptom: high %sys, many futex calls. Command: strace -p PID -e futex. Fix: reduce critical sections, use finer‑grained locks.
4. Frequent context switches – Symptom: high CPU without a single saturated process, many runnable processes. Command: vmstat 1 (watch the cs column) or pidstat -w 1. Fix: limit thread count or set CPU affinity.
5. Excessive interrupts – Symptom: high %irq/%soft. Command: cat /proc/interrupts. Fix: tune NIC or storage driver parameters.
6. I/O‑bound workload masquerading as CPU load – Symptom: high %iowait. Command: iostat -x 1. Fix: optimise I/O path or upgrade to SSD.
7. Kernel memory reclamation (kswapd) – Symptom: high %sys, low free memory. Command: vmstat 1 (watch the si/so swap columns and free memory). Fix: add RAM or improve application memory usage.
8. Network packet processing – Symptom: high %irq/%soft, large traffic. Command: sar -n DEV 1 10. Fix: adjust NIC interrupt coalescing, add queues.
9. Scheduler jitter – Symptom: erratic CPU usage, high cache misses. Command: perf stat -p PID -e cache-misses. Fix: bind process to CPUs, improve memory access patterns.
10. Kernel bug / driver issue – Symptom: unexplained CPU spikes. Command: sudo dmesg | tail. Fix: upgrade kernel or driver.
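For cause #4, the system‑wide context‑switch rate can also be measured without sysstat; a sketch using the monotonic ctxt counter in /proc/stat:

```shell
# ctxt counts context switches since boot; the delta over one second
# gives the current switch rate.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/sec: $((c2 - c1))"
```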
Step 6 – Apply Fixes
Option A – Process‑level optimisation (no restart)
# Limit CPU with cgroup
cgcreate -g cpu:/limited_app
cgset -r cpu.cfs_quota_us=80000 /limited_app # 80% of one CPU
cgexec -g cpu:/limited_app /opt/app/start.sh
# Set CPU affinity
taskset -pc 0,1,2 5678 # bind to CPUs 0‑2
# Adjust nice level
nice -n 10 /opt/app/start.sh # lower priority
renice -n 10 -p 5678
Option B – System‑wide tuning
# Set the CPU frequency governor to performance mode
sudo cpupower frequency-set -g performance
# or write the governor directly through sysfs
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# Change the I/O scheduler (newer blk-mq kernels use mq-deadline)
echo deadline | sudo tee /sys/block/sda/queue/scheduler
# Reduce unnecessary interrupts (example for NIC)
sudo ethtool -C eth0 rx-usecs 500
Option C – Application code optimisation (Python example)
# Replace threads with processes to avoid GIL contention
from multiprocessing import Process

def worker(data):
    # CPU‑intensive work here
    pass

if __name__ == '__main__':
    processes = [Process(target=worker, args=(chunk,)) for chunk in data_chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Or use NumPy/C extensions, which release the GIL while running native code
import numpy as np
result = np.dot(matrix1, matrix2)
Step 7 – Set Up Continuous Monitoring & Alerts
Prometheus node exporter configuration (simplified):
# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
# alert_rules.yml
groups:
  - name: cpu_alerts
    rules:
      - alert: CPUHigh
        expr: (100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
      - alert: LoadAverage
        expr: node_load1 > (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 1.5)
        for: 5m
        annotations:
          summary: "High load average on {{ $labels.instance }}"
Grafana dashboards should display:
CPU usage trend (last 24 h).
Per‑core distribution.
Top 10 processes by CPU.
Load‑average and scheduler latency.
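For the dashboard panels above, it helps to precompute the utilisation expression that the CPUHigh alert uses so panels and alerts share one series; a sketch of a Prometheus recording rule (the file and rule names are illustrative):

```yaml
# recording_rules.yml
groups:
  - name: cpu_recording
    rules:
      # 0.0-1.0 fraction of non-idle CPU per instance, same expression
      # family as the CPUHigh alert above
      - record: instance:cpu_utilisation:ratio
        expr: 1 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m]))
```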
Performance Targets
Business‑level CPU: reduce 20‑50% via algorithmic improvements or caching.
System‑level CPU: keep kernel (%sys) overhead below 10%.
I/O wait: stay under 5%.
Best Practices
Layered investigation – start with global load, then process, finally code.
Sample‑driven analysis – use perf or flame graphs before any optimisation.
Isolate problematic processes with cgroups.
Weekly review of sar history to spot trends.
Balance multi‑process/thread scaling with lock contention awareness.
Version‑control kernel, driver, and application changes for easy rollback.
Validate optimisations with load testing.
Cheat‑Sheet of Common Commands
# Real‑time monitoring
top
htop
watch 'mpstat -P ALL 1 1'
# Historical data
sar -u -f /var/log/sa/saXX
# Deep process analysis
ps aux --sort=-%cpu | head
ps -eLf
top -H -p PID
ps -p PID -L -o tid,%cpu,cmd
# System‑call tracing
strace -p PID -c -S time
sudo perf record -p PID -g
# Flame graph generation
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg
# Interrupts & load
cat /proc/interrupts
vmstat 1
uptime
# Scheduler statistics
cat /proc/sched_debug
pidstat -w 1
In summary, diagnosing Linux CPU spikes requires a hierarchical approach: check overall load, isolate the offending process, then drill down with tracing and profiling tools. Mastering the ten root‑cause patterns and the associated commands lets you pinpoint the problem within minutes, while long‑term sar and Prometheus monitoring helps prevent future incidents.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.