10 Proven Causes of Linux CPU Spikes and How to Diagnose Them Fast
Learn a step‑by‑step Linux CPU high‑usage diagnostic guide covering ten root causes, quick monitoring commands, deep analysis with top, ps, strace, perf, and flamegraphs, plus practical remediation and long‑term monitoring setup using sar and Prometheus to prevent future spikes.
Applicable Scenarios and Prerequisites
Application scenario: overall CPU usage above 80%, a process unexpectedly consuming excessive CPU, or sluggish system response.
Prerequisites: Linux RHEL 7+/Ubuntu 18.04+, root or sudo rights, common monitoring tools (top, perf, strace).
Performance baseline: one fully loaded core equals 100%. On multi‑core systems each core is counted separately, so a multithreaded process can show more than 100% in top (up to N × 100% on N cores).
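As a rough illustration of this baseline, the sketch below converts a per‑process figure scaled so that 100 means one full core (as top reports it by default) into a share of the whole machine. The function name normalize_cpu is made up for this example and does not come from any tool.

```python
def normalize_cpu(process_pct: float, cores: int) -> float:
    """Convert a %CPU value where 100 = one full core into a share of total capacity."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return process_pct / cores


if __name__ == "__main__":
    # A process showing 380% on an 8-core host uses just under half the machine.
    print(normalize_cpu(380.0, 8))
```

A value near 100 after normalization means the whole host is busy, while a raw 100% on a 16‑core host is only one core.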
Environment and Version Matrix
Tools/commands for different distributions:
top – native on RHEL 7/8 and Ubuntu 18.04 – real‑time process monitoring.
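htop – yum (RHEL 7, from EPEL), dnf (RHEL 8, from EPEL), apt (Ubuntu) – enhanced top.
perf – yum install perf (RHEL 7), dnf install perf (RHEL 8), apt install linux-tools-generic (Ubuntu) – sampling profiler, source data for CPU flame graphs.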
sar – sysstat package – historical system statistics.
strace – yum/apt – system call tracing.
mpstat – sysstat – per‑core CPU monitoring.
iostat – sysstat – I/O‑related CPU usage.
Quick Checklist
Step 1 : Monitor overall CPU usage and load balance in real time.
Step 2 : Identify the process consuming the most CPU.
Step 3 : Deep‑dive into the problematic process’s threads and system calls.
Step 4 : Collect historical performance data to spot patterns.
Step 5 : Analyze code hotspots with flame graphs.
Step 6 : Determine root cause (business logic, infinite loop, I/O wait, kernel bug, etc.).
Step 7 : Implement fix and rollback strategy.
Step 8 : Establish long‑term monitoring and alerts.
Implementation Steps
Step 1: Quickly Diagnose Overall CPU State
Check system load average:
# Method 1: use uptime
uptime
# Method 2: view /proc/loadavg
cat /proc/loadavg
Expected output:
10:45:32 up 10 days, 3:20, 2 users, load average: 2.45, 2.30, 2.15
Parameter explanation:
load average: 2.45, 2.30, 2.15 – 1‑minute, 5‑minute, 15‑minute average load.
Load = tasks running on a CPU + tasks waiting in the run queue; on Linux, tasks in uninterruptible sleep (usually blocked on I/O) are counted as well.
Judgment rule: a load that stays above the number of CPU cores indicates saturation.
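The judgment rule can be sketched in a few lines of Python. This is a minimal example, assuming the standard /proc/loadavg layout (three load figures followed by scheduler counters); the helper names and the sample line are illustrative only.

```python
import os


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg content."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def is_saturated(load: float, cores: int) -> bool:
    """Apply the rule of thumb: load above the core count suggests saturation."""
    return load > cores


if __name__ == "__main__":
    sample = "2.45 2.30 2.15 3/412 28761"  # illustrative /proc/loadavg line
    load1, load5, load15 = parse_loadavg(sample)
    cores = os.cpu_count() or 1
    print(f"load1={load1} cores={cores} saturated={is_saturated(load1, cores)}")
```

On a live host, replace the sample string with the contents of /proc/loadavg. Comparing the 1‑minute and 15‑minute figures also shows whether load is rising or falling.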
Check CPU core count:
nproc
# or
grep -c '^processor' /proc/cpuinfo
View per‑core CPU usage (real‑time):
# Method 1: mpstat (requires sysstat)
sudo mpstat -P ALL 1 5
# Method 2: top
top   # then press 1 to toggle the per‑core view
mpstat expected output:
CPU %usr %nice %sys %iowait %irq %soft %guest %idle
0 45.2 0.0 8.5 2.1 0.0 0.1 0.0 44.1
1 78.9 0.0 15.3 2.1 0.0 0.1 0.0 3.6
2 12.3 0.0 5.6 1.2 0.0 0.0 0.0 80.9
Key fields:
%usr: user‑mode CPU (business processes).
%sys: system‑mode CPU (kernel calls).
%iowait: CPU idle while waiting for outstanding I/O.
%idle: idle CPU.
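To spot an unevenly loaded core without reading every row, the mpstat output can be parsed by its header. The sketch below assumes the column layout shown in the sample (real mpstat output also carries a leading timestamp, which the parser skips by aligning columns from the right). The function names and the 10% idle threshold are assumptions for illustration.

```python
SAMPLE = """\
CPU %usr %nice %sys %iowait %irq %soft %guest %idle
0 45.2 0.0 8.5 2.1 0.0 0.1 0.0 44.1
1 78.9 0.0 15.3 2.1 0.0 0.1 0.0 3.6
2 12.3 0.0 5.6 1.2 0.0 0.0 0.0 80.9
"""


def parse_mpstat(text: str) -> dict[str, dict[str, float]]:
    """Map each CPU label to its percentage fields, using the header row for names."""
    rows: dict[str, dict[str, float]] = {}
    header: list[str] = []
    for line in text.strip().splitlines():
        tokens = line.split()
        if "CPU" in tokens and "%idle" in tokens:
            # Keep only the columns from CPU onward (drops a leading timestamp).
            header = tokens[tokens.index("CPU"):]
            continue
        if not header or len(tokens) < len(header):
            continue
        values = tokens[-len(header):]  # align from the right
        rows[values[0]] = {name: float(v) for name, v in zip(header[1:], values[1:])}
    return rows


def busy_cores(rows: dict[str, dict[str, float]], idle_below: float = 10.0) -> list[str]:
    """Return the CPUs whose %idle is under the (assumed) threshold."""
    return [cpu for cpu, fields in rows.items() if fields["%idle"] < idle_below]


if __name__ == "__main__":
    print(busy_cores(parse_mpstat(SAMPLE)))
```

In the sample, only core 1 is nearly out of idle time, which points to a single hot thread or a pinned process rather than system‑wide saturation.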
Step 2: Locate High‑CPU Process
Use top to find high‑CPU processes:
top
Interactive commands:
Shift+P: sort by CPU usage (default).
Shift+M: sort by memory usage.
d 1: set refresh interval to 1 second.
q: quit.
Sample output:
PID USER PR NI VIRT RES %CPU %MEM TIME+ COMMAND
5678 www 20 0 512m 256m 95.2 5.6 45:23 java -jar app.jar
1234 mysql 20 0 1.5g 800m 12.3 22.1 102:45 /usr/sbin/mysqld
Use ps for a static snapshot (ps reports %CPU averaged over the process lifetime, so it can lag behind top):
# List all processes sorted by CPU (header + top 10)
ps aux --sort=-%cpu | head -11
# Show CPU usage of a specific PID
ps -p 5678 -o pid,cmd,%cpu,%mem
# Show all threads of a process
ps -p 5678 -L -o pid,tid,cmd,%cpu
Sample ps output:
PID CMD %CPU %MEM
5678 java -jar app.jar 95.2 5.6
Thread‑level hotspot (multithreaded apps):
# List thread info for the process (avoids matching unrelated lines with grep)
ps -Lf -p 5678
# Real‑time thread CPU monitoring
top -p 5678 -H
Step 3: Deep Analysis of System Calls
Trace system calls with strace:
# Trace all syscalls of a process and its threads (-f) and summarize; press Ctrl+C to detach and print the table
sudo strace -f -p 5678 -c
# Sort the summary by time spent in each syscall
sudo strace -f -p 5678 -c -S time
Sample strace -c output:
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- --------
45.20 2.451234 245 10000 0 futex
32.10 1.743210 174 10010 0 poll
15.30 0.831245 83 10020 0 read
5.40 0.293215 29 10030 0 write
Parameter notes:
futex: thread synchronization/lock contention.
poll/epoll: I/O multiplexing wait.
Syscalls with a high %time are the first candidates for optimization.
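When the summary is long, ranking it programmatically helps. The following sketch parses strace -c output shaped like the sample above and returns the syscalls ordered by their share of time. It is an illustrative helper, not part of strace, and it tolerates the errors column being blank, as strace prints it when a syscall had no failures.

```python
SAMPLE = """\
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- --------
45.20 2.451234 245 10000 0 futex
32.10 1.743210 174 10010 0 poll
15.30 0.831245 83 10020 0 read
5.40 0.293215 29 10030 0 write
"""


def parse_strace_summary(text: str) -> list[tuple[str, float, int]]:
    """Return (syscall, percent_time, calls) tuples, highest time share first."""
    results = []
    for line in text.splitlines():
        tokens = line.split()
        if len(tokens) < 5:
            continue
        try:
            pct = float(tokens[0])
            calls = int(tokens[3])
        except ValueError:
            continue  # header or separator line
        if tokens[-1] == "total":
            continue
        results.append((tokens[-1], pct, calls))
    return sorted(results, key=lambda row: row[1], reverse=True)


if __name__ == "__main__":
    for name, pct, calls in parse_strace_summary(SAMPLE):
        print(f"{name:8s} {pct:6.2f}% {calls} calls")
```

With the sample data, futex and poll together account for more than three quarters of syscall time, which fits the lock‑contention and I/O‑wait readings given in the parameter notes.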
Generate CPU flame graph with perf:
# Install perf (RHEL/CentOS)
sudo yum install -y perf
# or Ubuntu
sudo apt-get install -y linux-tools-generic
# Sample a specific process at 99 Hz for 30 seconds, recording call stacks (-g)
sudo perf record -F 99 -g -p 5678 -- sleep 30
# Generate report
sudo perf report
# Export to flame‑graph format (requires processing)
sudo perf script > out.perf
perf report navigation:
↓↑: scroll.
Enter: expand details.
q: quit.
Step 4: Collect Historical CPU Data
Use sar to view historical CPU data:
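# Install sysstat and enable data collection (cron or service)
sudo yum install -y sysstat
sudo systemctl enable --now sysstat
# Today’s CPU history (10‑minute interval); on Ubuntu the files live in /var/log/sysstat/
sar -u -f /var/log/sa/sa$(date +%d)
# Past 7 days: one file per day of the month (sa01 = the 1st)
for i in 1 2 3 4 5 6 7; do sar -u -f /var/log/sa/sa$(date -d "$i days ago" +%d); done
# Real‑time 1‑second interval
sar -u 1 10
Sample sar output: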
10:30:00 AM CPU %user %nice %system %iowait %steal %idle
10:31:00 AM all 45.23 0.00 8.45 2.34 0.00 43.98
10:32:00 AM all 48.12 0.01 9.21 1.87 0.00 40.79
Identify CPU peak patterns:
# View the 20 most recent samples from today's file
sar -u -f /var/log/sa/sa$(date +%d) | tail -20
Step 5: Analyze Code Hotspots (Flame Graph)
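Generate a CPU flame graph for a Java application (JIT‑compiled frames appear as unresolved addresses unless the JVM runs with -XX:+PreserveFramePointer and a perf map file is available, for example via perf-map-agent):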
# Step 1: install FlameGraph tool
git clone https://github.com/brendangregg/FlameGraph.git
export PATH=$PATH:$(pwd)/FlameGraph
# Step 2: sample with perf
sudo perf record -F 99 -p 5678 -g -- sleep 30
# Step 3: export call stacks
sudo perf script > out.perf
# Step 4: generate flame graph
stackcollapse-perf.pl out.perf | flamegraph.pl > cpu_flame.svg
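# Step 5: open cpu_flame.svg in a browser
Flame‑graph interpretation:
Horizontal width: share of samples in which the function was on the stack (wider = more CPU time); it is not a call count, and left‑to‑right order carries no time meaning.
Vertical height: call‑stack depth.
Interactive zoom and reset available.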
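The meaning of frame width can be checked numerically. stackcollapse-perf.pl emits one line per unique stack in the form frame1;frame2;frame3 count, and the sketch below computes, for each function, the fraction of samples in which it appears anywhere on the stack. The sample stacks and the function name inclusive_share are invented for illustration.

```python
from collections import Counter

SAMPLE = """\
java;main;handle;parse 60
java;main;handle;render 30
java;main;gc 10
"""


def inclusive_share(folded: str) -> dict[str, float]:
    """Fraction of samples in which each function appears on the stack."""
    counts: Counter[str] = Counter()
    total = 0
    for line in folded.strip().splitlines():
        stack, _, count_text = line.rpartition(" ")
        samples = int(count_text)
        total += samples
        for func in set(stack.split(";")):  # count recursive frames once per sample
            counts[func] += samples
    return {func: count / total for func, count in counts.items()}


if __name__ == "__main__":
    shares = inclusive_share(SAMPLE)
    for func, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{func:8s} {share:.0%}")
```

A function such as handle covers most of the graph's width because its children do, so the usual next step is to follow the widest frame upward until the width drops sharply.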
Step 6: Ten Major Root Causes of CPU Spikes
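Each root cause is listed with its symptoms, diagnostic command, judgment standard, and quick fix.
1. Business‑code CPU intensive
Symptoms: CPU stays high, %usr high.
Diagnostic command: perf report
Judgment standard: flame graph shows business functions taking >70% of samples.
Quick fix: optimize algorithm / add cache.
2. Infinite loop / recursion
Symptoms: single process at 100% CPU, no I/O wait.
Diagnostic command: strace -p PID -c
Judgment standard: one syscall repeats at extremely high frequency, or almost no syscalls appear at all (a pure user‑space loop).
Quick fix: inspect code, add logging.
3. Lock contention
Symptoms: %sys high, many futex calls.
Diagnostic command: strace -f -p PID -e trace=futex -T
Judgment standard: high‑frequency futex calls with noticeable latency.
Quick fix: reduce critical sections, use finer‑grained locks.
4. Frequent context switches
Symptoms: CPU high but not saturated, many processes.
Diagnostic command: vmstat 1 (cs column)
Judgment standard: cs (context switches) > 50000/s.
Quick fix: limit thread count / set CPU affinity.
5. Excessive interrupt handling
Symptoms: %irq / %soft high.
Diagnostic command: cat /proc/interrupts
Judgment standard: network or storage interrupts dominate.
Quick fix: adjust NIC / storage driver parameters.
6. I/O‑bound (hidden CPU)
Symptoms: %iowait high, many processes waiting for I/O.
Diagnostic command: iostat -x 1
Judgment standard: %iowait > 20%.
Quick fix: optimize I/O pattern or switch to SSD.
7. Kernel memory reclaim
Symptoms: %sys high, low free memory.
Diagnostic command: vmstat 1 | awk '{print $7, $8}' (si and so columns)
Judgment standard: sustained non‑zero swap‑in/swap‑out and kswapd consuming CPU in top.
Quick fix: add memory or tune application memory usage.
8. Network packet processing
Symptoms: %irq / %soft high, high network traffic.
Diagnostic command: sar -n DEV 1 10
Judgment standard: rxpck/s > 100k.
Quick fix: adjust NIC interrupt coalescing, add queues.
9. Process scheduling jitter
Symptoms: CPU usage fluctuates, low cache hit rate.
Diagnostic command: perf stat -p PID -e cache-references,cache-misses
Judgment standard: cache misses > 30% of cache references.
Quick fix: bind the process to CPUs, improve memory access pattern.
10. Kernel bug / driver issue
Symptoms: no obvious load but CPU high.
Diagnostic command: sudo dmesg | tail
Judgment standard: kernel warning/error logs.
Quick fix: upgrade kernel / driver version.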
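The numeric judgment standards above lend themselves to a first‑pass triage script. The sketch below encodes only the four thresholds that the list states explicitly (context switches, %iowait, rxpck/s, cache‑miss ratio). The dictionary keys and the function name are hypothetical, and a match is a hint to run the listed diagnostic command, not a diagnosis.

```python
def triage(metrics: dict[str, float]) -> list[str]:
    """Return root-cause hints whose stated thresholds are exceeded."""
    hints = []
    if metrics.get("cs_per_sec", 0) > 50_000:       # vmstat cs column
        hints.append("4. Frequent context switches")
    if metrics.get("iowait_pct", 0) > 20:            # iostat / mpstat %iowait
        hints.append("6. I/O-bound")
    if metrics.get("rxpck_per_sec", 0) > 100_000:    # sar -n DEV rxpck/s
        hints.append("8. Network packet processing")
    if metrics.get("cache_miss_pct", 0) > 30:        # cache-misses / cache-references
        hints.append("9. Process scheduling jitter")
    return hints


if __name__ == "__main__":
    observed = {"cs_per_sec": 72_000, "iowait_pct": 3.1, "rxpck_per_sec": 140_000}
    for hint in triage(observed):
        print(hint)
```

Causes without a numeric standard, such as lock contention or kernel bugs, still need the manual checks described in the list.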
Step 7: Implement Fixes
Solution A – Process‑level optimization (no restart):
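# 1. Limit process CPU using cgroup (cgroup v1 syntax, libcgroup tools)
sudo cgcreate -g cpu:/limited_app
sudo cgset -r cpu.cfs_quota_us=80000 /limited_app # 80% of one CPU with the default 100000 µs period
sudo cgclassify -g cpu:/limited_app 5678 # move the running PID 5678 into the group (no restart)
sudo cgexec -g cpu:/limited_app /opt/app/start.sh # or launch a new process inside the group
# 2. Set CPU affinity to avoid migration
taskset -pc 0,1,2 5678 # bind PID 5678 to CPUs 0,1,2
# 3. Adjust process priority
nice -n 10 /opt/app/start.sh # lower priority at launch
renice -n 10 -p 5678 # lower priority of the running PID
Solution B – System‑level tuning: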
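# 1. Set the CPU frequency governor to performance (cpupower ships with kernel-tools on RHEL and linux-tools on Ubuntu; takes effect immediately, not persistent across reboots)
sudo cpupower frequency-set -g performance
# 2. Change I/O scheduler (check the available names first; blk‑mq kernels offer mq-deadline instead of deadline)
cat /sys/block/sda/queue/scheduler
echo mq-deadline | sudo tee /sys/block/sda/queue/scheduler
# 3. Reduce unnecessary interrupts by raising NIC interrupt coalescing
sudo ethtool -C eth0 rx-usecs 500
Solution C – Application code optimization (Python example):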
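# Issue: CPU‑bound threads contend for the global interpreter lock (GIL)
# Fix 1: use multiprocessing instead of multithreading
from multiprocessing import Process

def worker(data):
    # CPU‑intensive work goes here
    pass

if __name__ == '__main__':
    data_chunks = [range(0, 1000), range(1000, 2000)]  # placeholder input split into chunks
    processes = [Process(target=worker, args=(chunk,)) for chunk in data_chunks]
    for p in processes:
        p.start()
    for p in processes:
        p.join()  # wait for all workers to finish

# Fix 2: use C extensions or NumPy, which release the GIL inside native code
import numpy as np
matrix1 = np.random.rand(500, 500)  # placeholder operands
matrix2 = np.random.rand(500, 500)
result = np.dot(matrix1, matrix2)  # runs in BLAS, which may use multiple threads
Step 8: Set Up Monitoring and Alerts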
Prometheus monitoring configuration:
# prometheus.yml
rule_files:
  - alert_rules.yml
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
# alert_rules.yml
groups:
  - name: cpu_alerts
    rules:
      - alert: CPUHigh
        expr: (100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
      - alert: LoadAverage
        expr: node_load1 > (count by (instance) (node_cpu_seconds_total{mode="idle"}) * 1.5)
        for: 5m
        annotations:
          summary: "High load average on {{ $labels.instance }}"
Grafana dashboard metrics:
CPU usage trend (last 24 h).
Per‑core CPU distribution.
Top 10 processes by CPU.
Load balance and scheduling latency.
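The CPU usage trend and the CPUHigh expression both reduce to one calculation: the non‑idle share of CPU time between two readings of the kernel's counters. The sketch below reproduces that arithmetic from two snapshots of the aggregate cpu line in /proc/stat. The snapshot values and function names are made up for illustration, and only the first eight fields (user through steal) are summed, on the assumption that guest time is already included in user.

```python
def cpu_fields(stat_text: str) -> list[int]:
    """Return user..steal counters from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(value) for value in line.split()[1:9]]
    raise ValueError("no aggregate cpu line found")


def busy_percent(before: str, after: str) -> float:
    """Percentage of non-idle CPU time between two /proc/stat snapshots."""
    first, second = cpu_fields(before), cpu_fields(after)
    total = sum(second) - sum(first)
    idle = second[3] - first[3]  # fourth counter is idle time
    if total <= 0:
        return 0.0
    return 100.0 * (total - idle) / total


if __name__ == "__main__":
    snap1 = "cpu  1000 0 500 8000 500 0 0 0 0 0"  # illustrative values
    snap2 = "cpu  1600 0 700 8100 600 0 0 0 0 0"
    print(busy_percent(snap1, snap2))
```

As in the alert expression, I/O wait counts as non‑idle here, so a host stuck on slow disks can cross the 80% threshold without any process appearing busy in top.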
Performance Optimization Benchmarks
CPU optimization goals:
Business CPU: reduce 20‑50% via algorithm/cache improvements.
System CPU: keep below 10% for normal ops.
I/O wait: stay under 5% with efficient disk/network config.
Best Practices
Layered troubleshooting – start with global load, then process, finally code.
Sampling‑based analysis – use perf/flame‑graph to locate real hotspots before optimizing.
Isolate problematic processes with cgroup to protect critical services.
Regular audits – weekly sar review to catch abnormal trends.
Load balancing – leverage multi‑process/thread to fully utilize cores while avoiding lock contention.
Version control – record kernel/driver/application versions for each optimization.
Test verification – stress‑test new fixes to ensure no side effects.
Appendix: Common Command Quick Reference
# Real‑time monitoring
top / htop / watch 'mpstat -P ALL 1 1'
# Historical data
sar -u -f /var/log/sa/saXX
# Deep process analysis
ps aux / ps -eLf / top -H -p PID / ps -L -p PID -o tid,%cpu,cmd
# System call tracing
strace -p PID -c -S time / sudo perf record -p PID -g
# Flame graph
sudo perf script | stackcollapse-perf.pl | flamegraph.pl
# Interrupt / load
cat /proc/interrupts / vmstat 1 / uptime
# Scheduler stats (on kernel 5.13+ the file is /sys/kernel/debug/sched/debug)
cat /proc/sched_debug / pidstat -w 1
Summary
CPU spikes require a layered approach: first examine system‑wide load and core distribution, then pinpoint the offending process with top/ps, and finally drill into code using perf and flame graphs. Knowing the ten major root causes and their diagnostic commands lets you locate the problem in minutes, while long‑term sar/Prometheus monitoring helps catch trends early and prevent issues.
MaGe Linux Operations