
10 Proven Causes of Linux CPU Spikes and How to Diagnose Them Fast

Learn a step‑by‑step Linux CPU high‑usage diagnostic guide covering ten root causes, quick monitoring commands, deep analysis with top, ps, strace, perf, and flamegraphs, plus practical remediation and long‑term monitoring setup using sar and Prometheus to prevent future spikes.

MaGe Linux Operations

Applicable Scenarios and Prerequisites

Application scenario: overall CPU usage stays above 80%, a single process unexpectedly consumes a large share of CPU, or the system responds slowly.

Prerequisites: RHEL 7+ or Ubuntu 18.04+, root or sudo rights, and common monitoring tools (top, perf, strace).

Performance baseline: one fully loaded core equals 100%. On multi‑core systems each core is counted independently, so a multithreaded process can show more than 100% in top (up to N × 100% on N cores).

Environment and Version Matrix

Tools/commands for different distributions:

top – native on RHEL 7/8 and Ubuntu 18.04 – real‑time process monitoring.

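htop – yum (RHEL 7, from the EPEL repository), dnf (RHEL 8, from EPEL), apt (Ubuntu) – enhanced top.

perf – yum install perf (RHEL 7), dnf install perf (RHEL 8), apt install linux-tools-common linux-tools-$(uname -r) (Ubuntu) – CPU profiling and flame‑graph input.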

sar – sysstat package – historical system statistics.

strace – yum/apt – system call tracing.

mpstat – sysstat – per‑core CPU monitoring.

iostat – sysstat – I/O‑related CPU usage.

Quick Checklist

Step 1 : Monitor overall CPU usage, load average, and per‑core distribution in real time.

Step 2 : Identify the process consuming the most CPU.

Step 3 : Deep‑dive into the problematic process’s threads and system calls.

Step 4 : Collect historical performance data to spot patterns.

Step 5 : Analyze code hotspots with flame graphs.

Step 6 : Determine root cause (business logic, infinite loop, I/O wait, kernel bug, etc.).

Step 7 : Implement fix and rollback strategy.

Step 8 : Establish long‑term monitoring and alerts.

Implementation Steps

Step 1: Quickly Diagnose Overall CPU State

Check system load average:

# Method 1: use uptime
uptime

# Method 2: view /proc/loadavg
cat /proc/loadavg

Expected output:

10:45:32 up 10 days, 3:20, 2 users, load average: 2.45, 2.30, 2.15

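Parameter explanation: load average: 2.45, 2.30, 2.15 – average load over the last 1, 5, and 15 minutes.

Load = tasks running on or waiting for a CPU + tasks in uninterruptible sleep (usually blocked on I/O).

Judgment rule : a load that stays above the number of CPU cores indicates saturation. Because tasks blocked on I/O also count toward the load, check %iowait before concluding that the CPU itself is the bottleneck.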

Check CPU core count:

nproc
# or
grep -c '^processor' /proc/cpuinfo
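The judgment rule above can be sketched in Python as a rough check. The function names, the sample string, and the core count of 4 are illustrative assumptions; only the layout of /proc/loadavg (three load averages first) is standard.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg text."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen

def load_per_core(load, cores):
    """Normalize a load average by core count; values above 1.0 suggest saturation."""
    return load / cores

# Sample text in /proc/loadavg format; on a live system read the file instead
sample = "2.45 2.30 2.15 3/512 12345"
one, five, fifteen = parse_loadavg(sample)
cores = 4  # assumed core count; os.cpu_count() or nproc reports the real one
print(f"1-min load per core: {load_per_core(one, cores):.2f}")
```

Comparing the 1‑minute and 15‑minute values per core also hints at whether the pressure is rising or fading.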

View per‑core CPU usage (real‑time):

# Method 1: mpstat (requires sysstat)
sudo mpstat -P ALL 1 5

# Method 2: top
top

mpstat expected output:

CPU   %usr  %nice  %sys  %iowait  %irq  %soft  %guest  %idle
 0    45.2   0.0    8.5     2.1    0.0   0.1    0.0    44.1
 1    78.9   0.0   15.3     2.1    0.0   0.1    0.0     3.6
 2    12.3   0.0    5.6     1.2    0.0   0.0    0.0    80.9

Key fields:

%usr: user‑mode CPU (business processes).

%sys: kernel‑mode CPU (system calls and other kernel work).

%iowait: time the CPU was idle while I/O requests were outstanding.

%idle: idle CPU with no outstanding I/O.
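Tools such as mpstat derive these percentages from the cumulative per‑core counters in /proc/stat. A minimal sketch of that calculation follows; the two snapshot lines are made‑up values and the helper names are hypothetical.

```python
# Counter order on each "cpuN" line of /proc/stat (first eight fields)
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")

def parse_cpu_line(line):
    """Parse one 'cpuN ...' line from /proc/stat into a dict of tick counters."""
    parts = line.split()
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))

def cpu_percentages(before, after):
    """Percentage of elapsed CPU time spent in each mode between two snapshots."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total == 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}

# Two hypothetical snapshots of the same core, taken one second apart
t0 = parse_cpu_line("cpu1 1000 0 200 8000 50 0 10 0 0 0")
t1 = parse_cpu_line("cpu1 1079 0 215 8004 52 0 10 0 0 0")
pct = cpu_percentages(t0, t1)
print({k: round(v, 1) for k, v in pct.items()})
```

Because the counters are cumulative since boot, a single reading says little; the difference between two readings is what reflects current usage.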

Step 2: Locate High‑CPU Process

Use top to find high‑CPU processes:

top

Interactive commands:

Shift+P: sort by CPU usage (default).

Shift+M: sort by memory usage.

d, then 1 and Enter: set refresh interval to 1 second.

q: quit.

Sample output:

PID USER   PR NI   VIRT   RES  %CPU %MEM TIME+ COMMAND
5678 www    20  0  512m  256m  95.2  5.6 45:23 java -jar app.jar
1234 mysql  20  0  1.5g  800m  12.3 22.1 102:45 /usr/sbin/mysqld

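Use ps for a static snapshot (ps reports %CPU averaged over the lifetime of the process, so a recent spike can look smaller than it does in top):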

# List all processes sorted by CPU (top 10)
ps aux --sort=-%cpu | head -10

# Show CPU usage of a specific PID
ps -p 5678 -o pid,cmd,%cpu,%mem

# Show all threads of a process
ps -p 5678 -L -o pid,tid,cmd,%cpu

Sample ps output:

PID CMD                %CPU %MEM
5678 java -jar app.jar 95.2  5.6

Thread‑level hotspot (multithreaded apps):

# List thread info for the process (avoids grep matching unrelated lines)
ps -Lf -p 5678

# Real‑time thread CPU monitoring
top -H -p 5678
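For a JVM process such as the java -jar app.jar example, a common follow‑up (an assumption here, not part of the steps above) is to match the hottest thread against the nid field of a thread dump, which jstack prints in hexadecimal. The sketch below ranks threads from ps‑style output and converts the TID; the sample output and helper names are hypothetical.

```python
def parse_thread_cpu(ps_output):
    """Parse `ps -L -o tid,%cpu` style output into (tid, cpu) tuples, hottest first."""
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        tid, cpu = line.split()[:2]
        rows.append((int(tid), float(cpu)))
    return sorted(rows, key=lambda r: r[1], reverse=True)

def tid_to_nid(tid):
    """Format a decimal thread ID as hexadecimal, the form used for nid=0x..."""
    return hex(tid)

# Hypothetical output of: ps -L -p 5678 -o tid,%cpu
sample = """
  TID %CPU
 5678  0.1
 5690 93.4
 5691  1.7
"""
hottest_tid, cpu = parse_thread_cpu(sample)[0]
print(f"hottest thread {hottest_tid} ({cpu}% CPU) -> nid={tid_to_nid(hottest_tid)}")
```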

Step 3: Deep Analysis of System Calls

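Trace system calls with strace (tracing slows the target noticeably, so keep sessions short in production):

# Summarize syscalls of the process and all its threads (press Ctrl+C to detach and print the summary)
sudo strace -f -p 5678 -c

# Focus on time‑consuming syscalls (sort the summary by time)
sudo strace -f -p 5678 -c -S time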

Sample strace -c output:

% time   seconds  usecs/call   calls   errors  syscall
------ ----------- ----------- --------- --------- --------
 45.20   2.451234      245    10000      0    futex
 32.10   1.743210      174    10010      0    poll
 15.30   0.831245       83    10020      0    read
  5.40   0.293215       29    10030      0    write

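Parameter notes:

futex: thread synchronization; a very high call count points to lock contention.

poll/epoll: waiting in I/O multiplexing.

By default strace -c counts CPU time spent in the kernel, so syscalls with a high %time are where to look first. A high share is a lead to investigate, not proof of a problem.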
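When the summary is long, it can help to rank it programmatically. The following is a minimal sketch that parses text in the strace -c layout shown above; the parser is an illustration written for this article, not part of strace.

```python
def parse_strace_summary(text):
    """Parse `strace -c` summary rows into dicts, sorted by %time descending."""
    rows = []
    for line in text.strip().splitlines():
        parts = line.split()
        # Data rows start with a numeric %time; header and separator lines do not
        if len(parts) < 5 or not parts[0].replace(".", "", 1).isdigit():
            continue
        if parts[-1] == "total":  # skip the totals row that strace appends
            continue
        rows.append({
            "pct_time": float(parts[0]),
            "seconds": float(parts[1]),
            "calls": int(parts[3]),
            "syscall": parts[-1],  # last column, whether or not errors is shown
        })
    return sorted(rows, key=lambda r: r["pct_time"], reverse=True)

sample = """
% time   seconds  usecs/call   calls   errors  syscall
------ ----------- ----------- --------- --------- --------
 45.20   2.451234      245    10000      0    futex
 32.10   1.743210      174    10010      0    poll
 15.30   0.831245       83    10020      0    read
  5.40   0.293215       29    10030      0    write
"""
for row in parse_strace_summary(sample)[:3]:
    print(f"{row['syscall']:8s} {row['pct_time']:5.1f}%  {row['calls']} calls")
```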

Profile CPU hotspots with perf:

# Install perf (RHEL/CentOS)
sudo yum install -y perf
# or Ubuntu (the tools package must match the running kernel)
sudo apt-get install -y linux-tools-common linux-tools-$(uname -r)

# Sample a specific process at 99 Hz for 30 seconds, with call stacks
sudo perf record -F 99 -g -p 5678 -- sleep 30

# Generate report (reads perf.data in the current directory)
sudo perf report

# Export call stacks for flame‑graph processing (see Step 5)
sudo perf script > out.perf

perf report navigation: ↓↑: scroll. Enter: expand details. q: quit.

Step 4: Collect Historical CPU Data

Use sar to view historical CPU data:

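# Install sysstat and enable its data collection
sudo yum install -y sysstat
sudo systemctl enable --now sysstat

# Today’s CPU history (10‑minute interval by default)
# Data files live in /var/log/sa on RHEL and /var/log/sysstat on Ubuntu
sar -u -f /var/log/sa/sa$(date +%d)

# Past 7 days at roughly hourly resolution
for i in $(seq 1 7); do sar -u -i 3600 -f /var/log/sa/sa$(date -d "$i day ago" +%d); done

# Real‑time: 10 samples at 1‑second intervals
sar -u 1 10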

Sample sar output:

10:30:00 AM  CPU  %user  %nice %system %iowait %steal %idle
10:31:00 AM  all   45.23   0.00   8.45   2.34   0.00  43.98
10:32:00 AM  all   48.12   0.01   9.21   1.87   0.00  40.79

Identify CPU peak patterns:

# Show the 20 most recent samples from today's file
sar -u -f /var/log/sa/sa$(date +%d) | tail -20
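To spot peaks without reading every row, the sar output can be filtered by %idle. This is a rough sketch under the assumption that the output uses the sar -u column layout shown above with a dot as decimal separator; the 20% idle threshold and the third sample row are made up for illustration.

```python
def busy_samples(sar_text, idle_threshold=20.0):
    """Return (timestamp, busy%) for 'all' rows whose %idle is below the threshold."""
    hits = []
    for line in sar_text.strip().splitlines():
        parts = line.split()
        if "all" not in parts or parts[0].startswith("Average"):
            continue  # skip headers, blank lines and the trailing Average row
        idx = parts.index("all")
        timestamp = " ".join(parts[:idx])  # "10:31:00 AM" or "10:31:00", depending on locale
        idle = float(parts[-1])            # %idle is the last column of sar -u
        if idle < idle_threshold:
            hits.append((timestamp, round(100.0 - idle, 2)))
    return hits

sample = """
10:30:00 AM  CPU  %user  %nice %system %iowait %steal %idle
10:31:00 AM  all   45.23   0.00   8.45   2.34   0.00  43.98
10:32:00 AM  all   48.12   0.01   9.21   1.87   0.00  40.79
10:33:00 AM  all   75.10   0.00  11.20   1.20   0.00  12.50
"""
print(busy_samples(sample))
```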

Step 5: Analyze Code Hotspots (Flame Graph)

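Generate CPU flame graph for a Java application (perf can only name JIT‑compiled Java frames if the JVM exports its symbols, for example with perf‑map‑agent and the -XX:+PreserveFramePointer flag; otherwise those frames show up as unknown or as raw addresses):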

# Step 1: install FlameGraph tool
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph

# Step 2: sample with perf
sudo perf record -F 99 -p 5678 -g -- sleep 30

# Step 3: export call stacks
sudo perf script > out.perf

# Step 4: generate flame graph
stackcollapse-perf.pl out.perf | flamegraph.pl > cpu_flame.svg

# Step 5: open cpu_flame.svg in a web browser

Flame‑graph interpretation:

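Horizontal width: share of samples in which the function was on the stack (wider = more CPU time). The left‑to‑right order is alphabetical, not chronological.

Vertical height: call‑stack depth; the top frame of each column is the function that was running on the CPU.

The SVG is interactive: click a frame to zoom in, and use Reset Zoom to return.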
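The width rule can be made concrete with the folded‑stack format that stackcollapse-perf.pl emits (semicolon‑separated frames followed by a sample count). The sketch below computes each function's share of samples, which corresponds to its width in the graph; the stacks and function names are invented for illustration.

```python
from collections import Counter

def frame_shares(folded_text):
    """Return each function's share of samples in which it appears on the stack."""
    inclusive = Counter()
    total = 0
    for line in folded_text.strip().splitlines():
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        total += count
        for frame in set(stack.split(";")):  # count a recursive frame once per stack
            inclusive[frame] += count
    return {frame: n / total for frame, n in inclusive.items()}

# Hypothetical folded stacks in the "frame;frame;frame count" format
folded = """
java;main;handleRequest;parseJson 60
java;main;handleRequest;queryDb 25
java;main;gcWork 15
"""
shares = frame_shares(folded)
for frame, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{frame:15s} {share:6.1%}")
```

A frame that is wide only because its children are wide is usually not the culprit; look for wide frames near the top of a column.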

Step 6: Ten Major Root Causes of CPU Spikes

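Each root cause is listed with its symptoms, diagnostic command, judgment standard, and quick fix. The numeric thresholds are rules of thumb, not hard limits.

1. Business‑code CPU intensive
Symptoms: CPU stays high, %usr high.
Diagnostic command: perf report
Judgment standard: flame graph shows business functions at more than 70% of samples.
Quick fix: optimize the algorithm or add a cache.

2. Infinite loop / recursion
Symptoms: a single process sits at 100% CPU with no I/O wait.
Diagnostic command: strace -p PID -c
Judgment standard: one syscall repeats at an extremely high rate, or almost no syscalls appear while CPU stays at 100% (a pure user‑space loop).
Quick fix: inspect the code, add logging.

3. Lock contention
Symptoms: %sys high, many futex calls.
Diagnostic command: strace -f -p PID -e trace=futex
Judgment standard: high‑frequency futex calls with noticeable latency.
Quick fix: reduce critical sections, use finer‑grained locks.

4. Frequent context switches
Symptoms: CPU high but not saturated, many processes.
Diagnostic command: vmstat 1 (cs column) or pidstat -w 1
Judgment standard: cs (context switches) > 50000/s.
Quick fix: limit the thread count or set CPU affinity.

5. Excessive interrupt handling
Symptoms: %irq / %soft high.
Diagnostic command: cat /proc/interrupts
Judgment standard: network or storage interrupts dominate.
Quick fix: adjust NIC / storage driver parameters.

6. I/O‑bound (hidden CPU)
Symptoms: %iowait high, many processes waiting for I/O.
Diagnostic command: iostat -x 1
Judgment standard: %iowait > 20%.
Quick fix: optimize the I/O pattern or switch to SSD.

7. Kernel memory reclaim
Symptoms: %sys high, low free memory.
Diagnostic command: vmstat 1 | awk '{print $8}' (prints the so, swap‑out, column)
Judgment standard: kswapd consumes CPU.
Quick fix: add memory or tune application memory usage.

8. Network packet processing
Symptoms: %irq / %soft high, high network traffic.
Diagnostic command: sar -n DEV 1 10
Judgment standard: rxpck/s > 100k.
Quick fix: adjust NIC interrupt coalescing, add queues.

9. Process scheduling jitter
Symptoms: CPU usage fluctuates, low cache hit rate.
Diagnostic command: perf stat -p PID -e cache-references,cache-misses
Judgment standard: cache misses > 30% of cache references.
Quick fix: bind to CPUs, improve the memory access pattern.

10. Kernel bug / driver issue
Symptoms: CPU high with no obvious load.
Diagnostic command: sudo dmesg | tail
Judgment standard: kernel warning/error logs.
Quick fix: upgrade the kernel / driver version.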
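As one worked example of these judgment standards, the context‑switch rate from root cause 4 can be derived from the cumulative ctxt counter in /proc/stat. The sketch below uses two made‑up snapshots taken one second apart; the helper names are hypothetical and the 50000/s figure is the rule of thumb quoted above.

```python
def read_ctxt(stat_text):
    """Extract the cumulative context-switch counter from /proc/stat text."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")

def context_switch_rate(before, after, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after) - read_ctxt(before)) / interval_s

# Hypothetical, abbreviated /proc/stat contents sampled one second apart
snap0 = "cpu  100 0 50 900 0 0 0 0 0 0\nctxt 1200000\nbtime 1700000000\n"
snap1 = "cpu  180 0 90 930 0 0 0 0 0 0\nctxt 1265000\nbtime 1700000000\n"

rate = context_switch_rate(snap0, snap1, 1.0)
verdict = "above" if rate > 50000 else "below"
print(f"{rate:.0f} context switches/s ({verdict} the 50000/s rule of thumb)")
```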

Step 7: Implement Fixes

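Solution A – Process‑level optimization (cgclassify, taskset, and renice act on a running process without a restart):

# 1. Limit process CPU using cgroup (libcgroup tools, cgroup v1 syntax)
sudo cgcreate -g cpu:/limited_app
sudo cgset -r cpu.cfs_quota_us=80000 limited_app    # 80% of one CPU with the default 100000 µs period
sudo cgexec -g cpu:/limited_app /opt/app/start.sh   # start a new process inside the group
sudo cgclassify -g cpu:/limited_app 5678            # or move the running PID 5678 into it

# 2. Set CPU affinity to avoid migration
sudo taskset -pc 0,1,2 5678   # bind PID 5678 to CPUs 0,1,2

# 3. Adjust process priority
nice -n 10 /opt/app/start.sh   # start a new process at lower priority
sudo renice -n 10 -p 5678      # lower the priority of the running PID 5678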
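The quota value follows from simple arithmetic: the CFS quota is the allowed CPU time per scheduling period, and the default period is 100000 µs. A small sketch of the conversion (the function name is hypothetical):

```python
def cfs_quota_us(cpu_percent, period_us=100_000):
    """Convert a CPU limit, in percent of one core, into a CFS quota in microseconds."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return int(period_us * cpu_percent / 100)

print(cfs_quota_us(80))    # quota for 80% of one core
print(cfs_quota_us(250))   # quota for 2.5 cores
```

Values above 100% are valid and let a multithreaded process use more than one core.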

Solution B – System‑level tuning:

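# 1. Set the CPU frequency governor to performance (not persistent across reboots)
# cpupower ships with kernel-tools on RHEL and linux-tools on Ubuntu
sudo cpupower frequency-set -g performance

# 2. Change I/O scheduler (list the available ones first; multi‑queue kernels offer mq-deadline instead of deadline)
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler

# 3. Reduce the interrupt rate via NIC interrupt coalescing
sudo ethtool -C eth0 rx-usecs 500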

Solution C – Application code optimization (Python example):

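# Issue: CPU‑bound threads contend for the GIL
# Fix 1: use multiprocessing instead of multithreading
from multiprocessing import Process

def worker(data):
    # CPU‑intensive work (placeholder computation)
    sum(x * x for x in data)

if __name__ == '__main__':
    data_chunks = [range(1_000_000)] * 4   # example input, one chunk per process
    processes = [Process(target=worker, args=(chunk,)) for chunk in data_chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()   # wait for all workers to finish

# Fix 2: use C extensions or NumPy, which run heavy computation in native code
import numpy as np
matrix1 = np.random.rand(500, 500)   # example inputs
matrix2 = np.random.rand(500, 500)
result = np.dot(matrix1, matrix2)   # executed by the underlying BLAS library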

Step 8: Set Up Monitoring and Alerts

Prometheus monitoring configuration:

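# prometheus.yml (assumes node_exporter is listening on localhost:9100)
rule_files:
- alert_rules.yml
scrape_configs:
- job_name: 'node'
  static_configs:
  - targets: ['localhost:9100']

# alert_rules.yml
groups:
- name: cpu_alerts
  rules:
  - alert: CPUHigh
    # rate() smooths over the window; irate() is better suited to graphs than to alerts
    expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
    for: 5m
    annotations:
      summary: "High CPU on {{ $labels.instance }}"
  - alert: LoadAverage
    # on (instance) is needed because the two sides carry different label sets
    expr: node_load1 > on (instance) (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 1.5)
    for: 5m
    annotations:
      summary: "High load average on {{ $labels.instance }}"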
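The CPUHigh expression can be read as: take the per‑second growth of each core's idle counter, average across cores, and subtract from 100. The following sketch mirrors that arithmetic on made‑up counter samples; it is an illustration of the formula, not how Prometheus evaluates it internally.

```python
def cpu_busy_percent(idle_before, idle_after, interval_s):
    """Mirror of 100 - avg(rate(idle)) * 100 for a single instance.

    idle_before and idle_after map each core label to its cumulative idle seconds.
    """
    rates = [
        (idle_after[core] - idle_before[core]) / interval_s  # idle seconds per second
        for core in idle_before
    ]
    return 100.0 - (sum(rates) / len(rates)) * 100.0

# Hypothetical idle counters for a 2-core instance, sampled 60 seconds apart
before = {"0": 5000.0, "1": 5200.0}
after = {"0": 5006.0, "1": 5212.0}

busy = cpu_busy_percent(before, after, 60.0)
print(f"busy: {busy:.1f}% -> alert condition {'met' if busy > 80 else 'not met'}")
```

The for: 5m clause then requires the condition to hold continuously for five minutes before the alert fires, which filters out short bursts.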

Grafana dashboard metrics:

CPU usage trend (last 24 h).

Per‑core CPU distribution.

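Top 10 processes by CPU (needs a per‑process exporter such as process‑exporter; node_exporter does not expose per‑process metrics).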

Load average and scheduling latency.

Performance Optimization Benchmarks

CPU optimization goals:

Business CPU: reduce 20‑50% via algorithm/cache improvements.

System CPU: keep below 10% for normal ops.

I/O wait: stay under 5% with efficient disk/network config.

Best Practices

Layered troubleshooting – start with global load, then process, finally code.

Sampling‑based analysis – use perf/flame‑graph to locate real hotspots before optimizing.

Isolate problematic processes with cgroup to protect critical services.

Regular audits – weekly sar review to catch abnormal trends.

Load balancing – leverage multi‑process/thread to fully utilize cores while avoiding lock contention.

Version control – record kernel/driver/application versions for each optimization.

Test verification – stress‑test new fixes to ensure no side effects.

Appendix: Common Command Quick Reference

# Real‑time monitoring
top / htop / watch 'mpstat -P ALL 1 1'

# Historical data
sar -u -f /var/log/sa/saXX

# Deep process analysis
ps aux / ps -eLf / top -H -p PID / ps -L -p PID -o tid,%cpu,cmd

# System call tracing / CPU sampling
sudo strace -f -p PID -c -S time / sudo perf record -p PID -g -- sleep 30

# Flame graph
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > cpu_flame.svg

# Interrupt / load
cat /proc/interrupts / vmstat 1 / uptime

# Scheduler stats (on newer kernels sched_debug is at /sys/kernel/debug/sched/debug)
cat /proc/sched_debug / pidstat -w 1

Summary: CPU spikes require a layered approach – first examine system‑wide load and core distribution, then pinpoint the offending process with top/ps, and finally drill into code using perf and flame graphs. Knowing the ten major root causes and their diagnostic commands lets you locate the problem in minutes, while long‑term sar/Prometheus monitoring helps catch trends early and prevent issues.
