
How to Quickly Diagnose and Fix High CPU Usage on Linux: 10 Root Causes & Step‑by‑Step Guide

This guide walks you through detecting, analyzing, and resolving Linux CPU spikes by monitoring overall load, pinpointing the offending process, drilling down with tools like top, ps, strace, perf, and sar, and applying targeted fixes for the ten most common causes.

Raymond Ops

Background and Scope

When a Linux server shows CPU utilization above 80%, a single process may be hogging the CPU, the system may become sluggish, or services may start timing out. This guide assumes RHEL 7+/Ubuntu 18.04+ with root or sudo privileges and the common monitoring tools (top, perf, strace, sar, etc.). Utilization is measured per core: a single core pegged at 100% is the baseline, and on multi‑core systems each core is evaluated independently.

Tool Matrix

top – native real‑time process monitor.

htop – enhanced top (yum/apt install).

perf – CPU flame‑graph generation (yum install perf; on Ubuntu, apt install linux-tools-common linux-tools-$(uname -r)).

sar – historical system statistics (sysstat package).

strace – system‑call tracing.

mpstat / iostat / vmstat – per‑core and I/O metrics.

Quick Checklist

Monitor overall CPU usage and load average.

Identify the process consuming the most CPU.

Analyze the process’s threads and system calls.

Collect historical performance data.

Generate and interpret flame graphs.

Determine the root cause (e.g., code inefficiency, lock contention, I/O wait).

Apply a fix and plan a rollback.

Set up long‑term monitoring and alerts.

Step‑by‑Step Implementation

Step 1 – Quick Diagnosis of Overall CPU State

Check load average:

# Method 1: uptime
uptime

# Method 2: view /proc/loadavg
cat /proc/loadavg

Typical output:

10:45:32 up 10 days, 3:20, 2 users, load average: 2.45, 2.30, 2.15

Interpretation: load average: 2.45, 2.30, 2.15 – 1‑, 5‑, 15‑minute averages.

Load > number of CPU cores indicates saturation.

Show CPU core count:

nproc
# or
grep -c '^processor' /proc/cpuinfo
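The two checks above can be combined into a single saturation test. The helper below is a hypothetical sketch (not a standard tool); it compares the 1‑minute load average against the core count using integer arithmetic, since plain sh cannot compare floats:

```shell
# Hypothetical helper: flag saturation when 1-minute load exceeds core count.
is_saturated() {
    # $1 = 1-minute load average, $2 = number of CPU cores
    load_x100=$(printf '%s' "$1" | awk '{printf "%d", $1 * 100}')
    cores_x100=$(( $2 * 100 ))
    if [ "$load_x100" -gt "$cores_x100" ]; then
        echo SATURATED
    else
        echo OK
    fi
}

# Live check on Linux:
# is_saturated "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
is_saturated 2.45 2   # prints SATURATED on a 2-core box
```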

Show per‑core utilization (real‑time):

# Method 1: mpstat (requires sysstat)
sudo mpstat -P ALL 1 5

# Method 2: top
top

Sample mpstat output:

CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %guest  %idle
0    45.2   0.0    8.5     2.1    0.0   0.1    0.0    44.1
1    78.9   0.0   15.3     2.1    0.0   0.1    0.0     3.6
2    12.3   0.0    5.6     1.2    0.0   0.0    0.0    80.9

Key fields: %usr – user‑space CPU (application). %sys – kernel CPU. %iowait – CPU waiting for I/O. %idle – idle CPU.
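If sysstat is not installed, the same busy percentage can be derived directly from /proc/stat. The sketch below (an assumption, not part of the guide's toolset) diffs two samples of the aggregate "cpu" line, whose fields are user, nice, system, idle, iowait, irq, softirq, steal:

```shell
# Rough busy% from two /proc/stat "cpu" lines taken ~1s apart.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        { total=0; for (i=2; i<=NF; i++) total+=$i
          idle=$5+$6                       # idle + iowait fields
          if (NR==1) { t1=total; i1=idle }
          else       { dt=total-t1; di=idle-i1
                       printf "%.1f\n", dt ? 100*(dt-di)/dt : 0 } }'
}

# Live usage on Linux:
# s1=$(grep '^cpu ' /proc/stat); sleep 1; s2=$(grep '^cpu ' /proc/stat)
# cpu_busy_pct "$s1" "$s2"
```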

Step 2 – Locate High‑CPU Process

Use top and sort by CPU:

top

Interactive shortcuts:

Shift+P – sort by CPU (default).

Shift+M – sort by memory.

d, then 1 – set the refresh interval to one second.

q – quit.

Sample top output:

PID USER   PR NI   VIRT   RES  %CPU %MEM TIME+ COMMAND
5678 www    20  0  512m  256m  95.2  5.6 45:23 java -jar app.jar
1234 mysql  20  0  1.5g  800m  12.3 22.1 102:45 /usr/sbin/mysqld

Static snapshot with ps:

# Top 10 CPU consumers
ps aux --sort=-%cpu | head -10

# Inspect a specific PID
ps -p 5678 -o pid,cmd,%cpu,%mem

# Show all threads of a PID
ps -p 5678 -L -o pid,tid,cmd,%cpu
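For the Java process in the sample top output, the hottest thread found with ps/top can be matched against a thread dump: jstack prints native thread ids in hex ("nid=0x…"), while ps and top report them in decimal. A small conversion helper (hypothetical, for illustration):

```shell
# Convert a decimal TID to the hex "nid" form used in jstack output.
tid_to_nid() { printf '0x%x\n' "$1"; }

# Example workflow for PID 5678 from the top output above:
# top -H -p 5678                  # note the hottest TID, e.g. 5690
# tid_to_nid 5690                 # -> 0x163a
# jstack 5678 | grep -A 20 "nid=$(tid_to_nid 5690)"
```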

Step 3 – Deep Dive with System‑Call Tracing

Trace all syscalls of the offending process:

# Full trace with statistics
sudo strace -p 5678 -e trace=all -c

# Focus on time‑consuming calls
sudo strace -p 5678 -c -S time

Typical strace -c output highlights high‑frequency calls such as futex, poll, read, and write. A high %time indicates a hotspot that needs optimization.
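When comparing several strace -c summaries, it helps to extract the dominant syscall mechanically. The awk sketch below (an assumption about the summary's column layout: %time is field 1, the syscall name is the last field) prints the call with the highest %time from a saved summary file:

```shell
# Print the syscall with the highest %time from a saved `strace -c` summary.
top_syscall() {
    awk '/^ *[0-9]/ { if ($1+0 > max) { max=$1+0; name=$NF } }
         END { print name }' "$1"
}

# Usage: strace -p 5678 -c -o summary.txt ... ; top_syscall summary.txt
```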

Step 4 – Generate CPU Flame Graph (Java Example)

Install the FlameGraph tools and collect perf data:

# Clone the FlameGraph repository
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph

# Record 30 seconds of CPU samples for PID 5678
sudo perf record -F 99 -p 5678 -g -- sleep 30

# Convert to a flame‑graph
sudo perf script > out.perf
stackcollapse-perf.pl out.perf | flamegraph.pl > cpu_flame.svg

Interpretation:

Width = total CPU time spent in a function (wider = hotter).

Height = call‑stack depth.

Clickable regions allow zooming for detailed analysis.

Step 5 – Ten Common Root Causes and Diagnosis Commands

1. CPU‑intensive business code – Symptom: high %usr. Command: perf report. Fix: optimise algorithm or add caching.

2. Infinite loop / recursion – Symptom: single process at 100% CPU, no I/O wait. Command: strace -p PID -c. Fix: review code, add logging.

3. Lock contention – Symptom: high %sys, many futex calls. Command: strace -p PID -e futex. Fix: reduce critical sections, use finer‑grained locks.

4. Frequent context switches – Symptom: high CPU but not fully loaded, many processes. Command: vmstat 1 (watch the cs column) or pidstat -w 1. Fix: limit thread count or set CPU affinity.

5. Excessive interrupts – Symptom: high %irq/%soft. Command: cat /proc/interrupts. Fix: tune NIC or storage driver parameters.

6. I/O‑bound workload masquerading as CPU load – Symptom: high %iowait. Command: iostat -x 1. Fix: optimise I/O path or upgrade to SSD.

7. Kernel memory reclamation (kswapd) – Symptom: high %sys, low free memory. Command: vmstat 1 (watch the free, si, and so columns). Fix: add RAM or improve application memory usage.

8. Network packet processing – Symptom: high %irq/%soft, large traffic. Command: sar -n DEV 1 10. Fix: adjust NIC interrupt coalescing, add queues.

9. Scheduler jitter – Symptom: erratic CPU usage, high cache misses. Command: perf stat -p PID -e cache-misses. Fix: bind process to CPUs, improve memory access patterns.

10. Kernel bug / driver issue – Symptom: unexplained CPU spikes. Command: sudo dmesg | tail. Fix: upgrade kernel or driver.
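For cause 4 above (frequent context switches), the system‑wide rate can be estimated without sysstat by diffing the kernel's cumulative counter from /proc/stat. A hypothetical helper:

```shell
# Context switches per second from two "ctxt N" readings out of /proc/stat.
ctxt_per_sec() {
    # $1, $2 = two "ctxt N" lines; $3 = seconds between the two readings
    printf '%s\n%s\n' "$1" "$2" |
        awk -v secs="$3" 'NR==1 {a=$2} NR==2 {print int(($2-a)/secs)}'
}

# Live usage on Linux:
# c1=$(grep ^ctxt /proc/stat); sleep 5; c2=$(grep ^ctxt /proc/stat)
# ctxt_per_sec "$c1" "$c2" 5
```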

Step 6 – Apply Fixes

Option A – Process‑level optimisation (no restart)

# Limit CPU with a cgroup (cgroup v1, libcgroup tools)
sudo cgcreate -g cpu:/limited_app
sudo cgset -r cpu.cfs_quota_us=80000 /limited_app   # 80% of one CPU (default period 100000us)
sudo cgexec -g cpu:/limited_app /opt/app/start.sh

# Set CPU affinity
taskset -pc 0,1,2 5678   # bind to CPUs 0‑2

# Adjust nice level
nice -n 10 /opt/app/start.sh   # lower priority
renice -n 10 -p 5678
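The quota value in the cgset call above assumes the default cpu.cfs_period_us of 100000. A small helper (hypothetical, not part of libcgroup) makes the arithmetic explicit for other targets:

```shell
# Derive cpu.cfs_quota_us for a target CPU percentage.
cfs_quota_for() {
    # $1 = target percent of one CPU, $2 = cfs period in us (default 100000)
    period=${2:-100000}
    echo $(( $1 * period / 100 ))
}

cfs_quota_for 80    # prints 80000, matching the cgset example above
cfs_quota_for 150   # prints 150000, i.e. 1.5 cores
```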

Option B – System‑wide tuning

# Disable CPU frequency scaling (set the performance governor)
sudo cpupower frequency-set -g performance
# or, without cpupower:
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

# Change I/O scheduler (use "deadline" on older, non-multiqueue kernels)
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# Reduce unnecessary interrupts (example for NIC)
sudo ethtool -C eth0 rx-usecs 500
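Before tuning interrupt coalescing, it is worth confirming which IRQ is actually loading the CPUs (causes 5 and 8 above). The awk sketch below (an assumption about the /proc/interrupts layout: numeric IRQ label, then one count column per CPU, then text columns) sums the per‑CPU counts and prints the busiest IRQ number:

```shell
# Print the IRQ number with the highest total count, /proc/interrupts-style
# text on stdin; non-numeric trailing columns (chip, device) are skipped.
busiest_irq() {
    awk 'NR>1 && $1 ~ /^[0-9]+:$/ {
             sum=0; for (i=2; i<NF; i++) if ($i+0==$i) sum+=$i
             if (sum>max) { max=sum; irq=$1 }
         } END { sub(/:$/, "", irq); print irq }'
}

# Live usage: busiest_irq < /proc/interrupts
```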

Option C – Application code optimisation (Python example)

# Replace threads with processes to avoid GIL contention
from multiprocessing import Process

def worker(data):
    # CPU‑intensive work here
    pass

if __name__ == '__main__':
    data_chunks = [...]  # split the workload into chunks beforehand
    processes = [Process(target=worker, args=(chunk,)) for chunk in data_chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()

# Or use NumPy/C extensions, which release the GIL in native code
import numpy as np
result = np.dot(matrix1, matrix2)   # runs in native code

Step 7 – Set Up Continuous Monitoring & Alerts

Prometheus node exporter configuration (simplified):

# prometheus.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']

# alert_rules.yml
groups:
  - name: cpu_alerts
    rules:
      - alert: CPUHigh
        expr: (100 - avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
      - alert: LoadAverage
        expr: node_load1 > (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 1.5)
        for: 5m
        annotations:
          summary: "High load average on {{ $labels.instance }}"

Grafana dashboards should display:

CPU usage trend (last 24 h).

Per‑core distribution.

Top 10 processes by CPU.

Load‑average and scheduler latency.

Performance Targets

Business‑level CPU: reduce 20‑50% via algorithmic improvements or caching.

System‑level CPU: keep %sys below 10%.

I/O wait: stay under 5%.
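These targets can be checked mechanically against mpstat output. A sketch (assuming the mpstat column order shown in Step 1, with %iowait as the fifth column) that prints each core exceeding the 5% I/O‑wait budget:

```shell
# Print CPU ids whose %iowait exceeds the budget; mpstat-style lines on stdin
# with columns "CPU %usr %nice %sys %iowait ..." (header on line 1).
iowait_over_budget() {
    awk -v limit=5 'NR>1 && $5+0 > limit { print $1 }'
}

# Live usage: mpstat -P ALL 1 1 | tail -n +4 | iowait_over_budget
```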

Best Practices

Layered investigation – start with global load, then process, finally code.

Sample‑driven analysis – use perf or flame graphs before any optimisation.

Isolate problematic processes with cgroups.

Weekly review of sar history to spot trends.

Balance multi‑process/thread scaling with lock contention awareness.

Version‑control kernel, driver, and application changes for easy rollback.

Validate optimisations with load testing.

Cheat‑Sheet of Common Commands

# Real‑time monitoring
top            # or: htop
watch 'mpstat -P ALL 1 1'

# Historical data
sar -u -f /var/log/sa/saXX

# Deep process analysis
ps aux --sort=-%cpu | head -10
ps -eLf
top -H -p PID
ps -p PID -L -o tid,%cpu,cmd

# System‑call tracing
strace -p PID -c -S time
sudo perf record -p PID -g

# Flame graph generation
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg

# Interrupts & load
cat /proc/interrupts
vmstat 1
uptime

# Scheduler statistics
cat /proc/sched_debug
pidstat -w 1

In summary, diagnosing Linux CPU spikes requires a hierarchical approach: check overall load, isolate the offending process, then drill down with tracing and profiling tools. Mastering the ten root‑cause patterns and the associated commands lets you pinpoint the problem within minutes, while long‑term sar and Prometheus monitoring helps prevent future incidents.

Written by Raymond Ops – Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.