How to Diagnose Linux Server CPU Spikes: A Practical Step‑by‑Step Guide

This article presents a systematic, evidence‑driven process for locating and resolving high CPU usage on Linux servers, covering environment preparation, layered troubleshooting from whole‑machine to thread level, concrete command examples, real‑world case studies, best‑practice recommendations, and monitoring configurations.

Raymond Ops

Overview

CPU spikes on Linux servers rarely have a single cause. They usually result from a chain of factors, such as request amplification, a threading model that has grown out of control, kernel soft-interrupt buildup, or disk/network jitter that eventually pushes the CPU into system-mode usage. The alert on the monitoring platform, however, typically shows only that CPU usage exceeds 90%, so the cause has to be reconstructed from evidence.

1. Preparation

Install the diagnostic toolkit on the target host (or ensure it is already present): procps-ng, sysstat, perf, strace, lsof, iotop, dstat, tcpdump, linux-cpupower.

Enable sysstat permanently so that historical data is retained (e.g., set HISTORY=28 in /etc/sysconfig/sysstat and start the related timers). Package names and the sysstat configuration path vary by distribution.

2. Capture the Current State

Run a fixed set of commands to snapshot the system at the moment of the alarm. Save the output to a timestamped directory for later replay.

date
hostname -f
uptime
top -b -n 1 | head -40
mpstat -P ALL 1 3
vmstat 1 5
sar -u 1 5
sar -n DEV,EDEV,TCP,ETCP 1 5
free -h
df -h
dmesg -T | tail -200
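The capture step can be wrapped in a small script so that every alarm produces the same set of files. The following is a minimal sketch, not the article's own tooling: the output root /var/tmp/cpu-snapshots and the function names snapshot_dir, run_and_save and collect_snapshot are illustrative assumptions.

```shell
#!/usr/bin/env bash
# Sketch: save each diagnostic command's output into a timestamped directory.
# SNAP_ROOT and all function names are illustrative assumptions.

SNAP_ROOT="${SNAP_ROOT:-/var/tmp/cpu-snapshots}"

# Create and print a directory named after the host and the current UTC time.
snapshot_dir() {
  local dir
  dir="${SNAP_ROOT}/$(hostname)-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir" && printf '%s\n' "$dir"
}

# Run one command, writing stdout and stderr to <dir>/<name>.txt.
# A missing or failing tool is noted in the file instead of aborting the run.
run_and_save() {
  local dir="$1" name="$2"
  shift 2
  if ! "$@" > "${dir}/${name}.txt" 2>&1; then
    printf 'command failed or not installed: %s\n' "$*" >> "${dir}/${name}.txt"
  fi
}

# Collect a subset of the commands listed above.
collect_snapshot() {
  local dir
  dir="$(snapshot_dir)" || return 1
  run_and_save "$dir" uptime  uptime
  run_and_save "$dir" top     top -b -n 1
  run_and_save "$dir" mpstat  mpstat -P ALL 1 3
  run_and_save "$dir" vmstat  vmstat 1 5
  run_and_save "$dir" sar_cpu sar -u 1 5
  run_and_save "$dir" dmesg   dmesg -T
  printf 'snapshot written to %s\n' "$dir"
}
```

Calling collect_snapshot from an interactive shell or an alert hook writes one file per command. Recording a failure in the file rather than stopping keeps the snapshot usable on hosts where part of the toolkit is missing.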

If the host runs containers, also collect cgroup and kubelet information:

kubectl top pod -A --containers | sort -k4 -nr | head -20
kubectl describe pod <pod-name> -n <namespace>
cat /sys/fs/cgroup/cpu.max 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
cat /sys/fs/cgroup/cpu.stat 2>/dev/null || cat /sys/fs/cgroup/cpu/cpu.stat
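The raw cgroup files are easier to read once converted into a core count and a throttling percentage. A minimal sketch, assuming the cgroup v2 cpu.max layout ("quota period" or "max period") and the nr_periods / nr_throttled fields of cpu.stat; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: interpret cgroup CPU files. Function names are illustrative assumptions.

# Convert a cpu.max line ("<quota> <period>" or "max <period>") into cores.
cpu_limit_cores() {
  awk '{
    if ($1 == "max") { print "unlimited"; exit }
    printf "%.2f\n", $1 / $2
  }'
}

# Share of CFS periods in which the cgroup was throttled, from cpu.stat text.
throttle_ratio() {
  awk '$1 == "nr_periods"   { p = $2 }
       $1 == "nr_throttled" { t = $2 }
       END { if (p > 0) printf "%.1f%%\n", 100 * t / p; else print "n/a" }'
}

# Usage on a cgroup v2 host:
#   cpu_limit_cores < /sys/fs/cgroup/cpu.max
#   throttle_ratio  < /sys/fs/cgroup/cpu.stat
```

A container that looks idle at host level can still be throttled heavily if its quota is small, so a high ratio here is worth checking before moving on to process-level analysis.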

3. Layered Diagnosis

3.1 Whole‑Machine vs. Process

If all cores are busy and many application processes are active, look at traffic surges, batch jobs, or host-level interrupt storms.

If only 1‑2 cores are hot, suspect a single‑thread hotspot, lock contention, or uneven CPU pinning.

If the machine CPU is moderate but a single process is high, jump to thread‑level analysis.

If %sys is high while the process view is quiet, investigate interrupts, network stack, or disk paths.
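The first branch of this decision (all cores hot versus one or two) can be expressed as a small classifier. This is a sketch under stated assumptions: the input is one "core busy-percent" pair per line, prepared by the reader from mpstat output, and the 90% threshold and the function name classify_cores are hypothetical choices.

```shell
#!/usr/bin/env bash
# Sketch: classify per-core load. Input lines: "<core-id> <busy-percent>".
# The HOT_PCT default of 90 is an assumed threshold, not a standard value.
classify_cores() {
  awk -v hot="${HOT_PCT:-90}" '
    NF >= 2 { n++; if ($2 + 0 >= hot) h++ }
    END {
      if (n == 0)      print "no-data"
      else if (h == 0) print "no-hot-core"
      else if (h == n) print "all-cores-hot"
      else if (h <= 2) print "few-cores-hot"
      else             print "partial"
    }'
}

# Example: printf '0 100\n1 12\n2 8\n3 15\n' | classify_cores
```

The label only points to the next step in the list above: "all-cores-hot" suggests traffic or batch load, while "few-cores-hot" suggests a single-thread hotspot, lock contention, or CPU pinning.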

3.2 User‑Mode vs. System‑Mode

%usr + %nice high → application code hotspot (infinite loop, heavy JSON/regex processing, etc.).

%sys high → system calls, network packet processing, soft-interrupts, disk I/O.

%soft high → bursts of small packets, connection storms, iptables/conntrack pressure.

%irq high → hardware interrupt issues.

%steal high → host-level CPU contention in virtualised environments.
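When sysstat is not installed, the same mode split can be derived from two samples of the aggregate "cpu" line in /proc/stat. A minimal sketch, assuming the standard field order (user, nice, system, idle, iowait, irq, softirq, steal); the function name cpu_mode_pct is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: percentage of CPU time per mode between two /proc/stat "cpu" lines.
# Reads the earlier sample first, then the later one, on stdin.
cpu_mode_pct() {
  awk '
    $1 == "cpu" {
      if (!seen) { for (i = 2; i <= 9; i++) a[i] = $i; seen = 1; next }
      total = 0
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total <= 0) { print "no-delta"; exit }
      split("usr nice sys idle iowait irq soft steal", name, " ")
      out = ""
      for (i = 2; i <= 9; i++)
        out = out (i > 2 ? " " : "") sprintf("%s=%.1f", name[i - 1], 100 * d[i] / total)
      print out
      exit
    }'
}

# Usage:
#   { grep '^cpu ' /proc/stat; sleep 5; grep '^cpu ' /proc/stat; } | cpu_mode_pct
```

The counters in /proc/stat are cumulative since boot, which is why the sketch works on the difference between two samples rather than on a single reading.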

3.3 Process‑Level Investigation

Identify the top CPU‑eating process and drill down:

ps -eo pid,ppid,user,psr,stat,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pidstat -u -t -p <PID> 1 5
cat /proc/<PID>/status
lsof -p <PID> | head -50

A newly launched process that spikes → recent deployment or config change.

A process that spikes after running for days → thread leak, connection leak, GC degradation.

A single thread at 100% → infinite loop, spin lock, busy-wait.
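A thread leak shows up as a thread count that keeps growing over hours or days. One way to collect that evidence is to sample the Threads: field of /proc/<PID>/status at intervals. A sketch only: the function names thread_count and watch_threads and the 60-second default interval are assumptions.

```shell
#!/usr/bin/env bash
# Sketch: track a process's thread count over time.

# Print the thread count recorded in a /proc/<PID>/status file.
thread_count() {
  awk '$1 == "Threads:" { print $2 }' "$1"
}

# Append "<epoch-seconds> <thread-count>" to a log until the process exits.
watch_threads() {
  local pid="$1" log="$2" interval="${3:-60}"
  while [ -r "/proc/${pid}/status" ]; do
    printf '%s %s\n' "$(date +%s)" "$(thread_count "/proc/${pid}/status")" >> "$log"
    sleep "$interval"
  done
}

# Usage: watch_threads <PID> /tmp/threads.log 60 &
```

A steadily rising second column in the log supports the thread-leak hypothesis; a flat count shifts suspicion to connection leaks or GC degradation.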

For Java, map the hot thread to its hexadecimal ID and inspect the stack:

top -H -p <PID>
printf '0x%x\n' <TID>
jstack <PID> | grep -A20 <0xTID>
perf top -p <PID> -g
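The TID-to-hex conversion and the lookup can be combined so that a saved thread dump is searched directly. A minimal sketch, assuming the dump prints native thread IDs as nid=0x... in hexadecimal (as the commands above imply; some JDK releases may format this field differently, so check your own dump first). The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: find a hot thread in a saved jstack dump by its OS thread ID.

# Convert a decimal TID from top/ps into the hex form used in the dump.
tid_to_nid() {
  printf '0x%x\n' "$1"
}

# Print the matching thread header plus N following lines (default 20).
# The trailing space in the pattern prevents 0x6f61 from matching 0x6f610.
hot_thread_stack() {
  local dump="$1" tid="$2"
  grep -A "${3:-20}" "nid=$(tid_to_nid "$tid") " "$dump"
}

# Usage:
#   jstack <PID> > /tmp/jstack.txt
#   hot_thread_stack /tmp/jstack.txt <TID>
```

Saving the dump to a file first means the same evidence can be searched again for other threads without pausing the JVM a second time.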

For Go, use pprof:

curl -s 'http://127.0.0.1:6060/debug/pprof/profile?seconds=30' -o cpu.pprof
go tool pprof -top cpu.pprof

For Python, check for pure-Python loops or oversized JSON serialization. A hot thread combined with a nearly empty syscall summary from strace points to a CPU-bound loop in user space:

top -H -p <PID>
strace -c -f -p <PID>

3.4 Thread‑Level Diagnosis

Thread‑level analysis often reveals the true root cause:

top -H -p <PID>
ps -Lp <PID> -o pid,tid,psr,%cpu,stat,comm --sort=-%cpu | head -20
pidstat -u -w -t -p <PID> 1 5

One thread fixed at 100% → infinite loop or spin lock.

Multiple threads high → thread‑pool overload or hot key.

High CPU + high context switches → lock contention or excessive wake‑ups.
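Context-switch rates can also be read straight from /proc when pidstat is unavailable. A minimal sketch, assuming the voluntary_ctxt_switches and nonvoluntary_ctxt_switches fields of a /proc/<PID>/task/<TID>/status file; the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: per-thread context-switch rates from two samples of a status file.

# Print "<voluntary> <nonvoluntary>" counters from a status file.
ctxt_switches() {
  awk '$1 == "voluntary_ctxt_switches:"    { v = $2 }
       $1 == "nonvoluntary_ctxt_switches:" { n = $2 }
       END { print v + 0, n + 0 }' "$1"
}

# Per-second rates between two samples taken <seconds> apart.
# Arguments: "<vol1> <nonvol1>" "<vol2> <nonvol2>" <seconds>
ctxt_rate() {
  echo "$1 $2 $3" | awk '{ printf "%.1f %.1f\n", ($3 - $1) / $5, ($4 - $2) / $5 }'
}

# Usage:
#   s=/proc/<PID>/task/<TID>/status
#   a="$(ctxt_switches "$s")"; sleep 5; b="$(ctxt_switches "$s")"
#   ctxt_rate "$a" "$b" 5
```

A high voluntary rate means the thread keeps blocking and being woken, which fits lock contention or excessive wake-ups. A high nonvoluntary rate means it is being preempted, which fits too many runnable threads competing for the same cores.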

Use perf for lightweight sampling (15‑30 s is usually enough):

sudo perf top -p <PID> -g
sudo perf record -F 99 -p <PID> -g -- sleep 30
sudo perf report --stdio | head -80

If perf_event_paranoid blocks sampling, temporarily lower it with sysctl -w kernel.perf_event_paranoid=1 and restore after debugging.
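Restoring the setting is easy to forget during an incident, so the lower-and-restore step can be wrapped around the profiling command. A sketch under assumptions: the helper name with_temp_value is hypothetical, and it handles a failing command but not an interrupt signal such as Ctrl-C.

```shell
#!/usr/bin/env bash
# Sketch: run a command with a temporary value in a sysctl-style file,
# then write the previous value back, even if the command fails.
with_temp_value() {
  local file="$1" value="$2" old rc=0
  shift 2
  old="$(cat "$file")" || return 1
  printf '%s\n' "$value" > "$file" || return 1
  "$@" || rc=$?
  printf '%s\n' "$old" > "$file"
  return "$rc"
}

# Usage (as root):
#   with_temp_value /proc/sys/kernel/perf_event_paranoid 1 \
#     perf record -F 99 -p <PID> -g -- sleep 30
```

Reading the old value first, rather than assuming a distribution default, means the host returns to whatever hardening level it had before the debugging session.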

4. Real‑World Cases

Case 1 – Java Thread Loop

Scenario: After a deployment, an order‑service’s response time rose from 30 ms to 800 ms. Overall CPU was 45 % but one core stayed at 100 %. The root cause was a newly added rule‑engine loop that repeatedly matched a regex.

Investigation commands:

top -H -p 28461
ps -Lp 28461 -o pid,tid,pcpu,comm --sort=-pcpu | head
printf '0x%x\n' 28513
jstack 28461 | grep -A20 6f61
sudo perf top -p 28461 -g

Findings:

Thread 28513 consumed 99 % of a single core. jstack showed the thread stuck in the rule‑engine loop. perf pinpointed the hotspot to string splitting and regex matching.

Fix:

Remove the instance from the load‑balancer.

Roll back to the previous version and verify CPU drops.

Cache the rules and pre‑compile regexes.

Add timeout guards to the loop.

Before fix: core 99 %, P99 RT 1.2 s, order timeout 8.7 %
After fix: hotspot thread gone, overall CPU 31 %, P99 RT 85 ms

Case 2 – Soft‑Interrupt Storm

Scenario: An API‑gateway node showed >70 % %sys while Nginx workers were idle. Requests timed out frequently.

Investigation commands:

top -b -n 1 | head -10
grep -E 'NET_RX|NET_TX' /proc/softirqs
sar -n DEV,EDEV,TCP,ETCP 1 5
ss -s
ethtool -S eth0 | grep -E 'drop|miss|queue'

Findings:

NET_RX soft-interrupts were concentrated on CPU0/1.

Processes ksoftirqd/0 and ksoftirqd/1 continuously consumed CPU.

NIC had multiple queues but IRQ affinity was uneven, causing the load to concentrate on two cores.
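The skew described in these findings can be quantified by turning the NET_RX row of /proc/softirqs into per-CPU shares. A minimal sketch, assuming the columns map to CPU0, CPU1, ... in order; the function name net_rx_share is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: each CPU's share of NET_RX soft-interrupts.
# Reads /proc/softirqs by default, or a saved copy passed as the first argument.
net_rx_share() {
  awk '$1 == "NET_RX:" {
    total = 0
    for (i = 2; i <= NF; i++) total += $i
    if (total == 0) { print "no NET_RX activity"; exit }
    for (i = 2; i <= NF; i++) printf "CPU%d %.1f%%\n", i - 2, 100 * $i / total
  }' "${1:-/proc/softirqs}"
}

# Usage: net_rx_share            # live counters since boot
#        net_rx_share saved.txt  # a snapshot taken during the incident
```

The counters are cumulative since boot, so on a long-running host it is more telling to compare two snapshots taken a few seconds apart than to read a single one.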

Remediation script (rebalance IRQs and enable irqbalance):

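#!/usr/bin/env bash
# Enable irqbalance, set the NIC to 8 combined queues, and allow its IRQs on CPUs 0-3.
set -euo pipefail
SERVICE="irqbalance"
NIC="eth0"
sudo systemctl enable --now "$SERVICE"
# 8 must not exceed the "Combined" maximum reported by: ethtool -l "$NIC"
sudo ethtool -L "$NIC" combined 8
# Hex mask 0f = CPUs 0-3. A running irqbalance may later rewrite these masks.
for irq in $(grep "$NIC" /proc/interrupts | awk -F: '{print $1}'); do
  echo 0f | sudo tee "/proc/irq/${irq}/smp_affinity" > /dev/null
done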
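The value written to smp_affinity is a hexadecimal bitmask with one bit per CPU, so 0f selects CPUs 0 to 3. A small sketch for building such a mask from a list of CPU numbers follows; the function name cpu_mask is hypothetical, and the shell arithmetic limits it to CPU numbers below 63.

```shell
#!/usr/bin/env bash
# Sketch: build an smp_affinity hex mask from CPU numbers (each below 63).
cpu_mask() {
  local mask=0 cpu
  for cpu in "$@"; do
    mask=$(( mask | (1 << cpu) ))
  done
  printf '%x\n' "$mask"
}

# Usage: cpu_mask 0 1 2 3   # CPUs 0-3
#        cpu_mask 4 5 6 7   # CPUs 4-7
```

On hosts with many cores, writing a CPU list such as 0-3 to the neighbouring smp_affinity_list file avoids the mask arithmetic altogether.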

Result:

Before: sys 72 %, NET_RX skewed to CPU0/1, request timeout 5.4 %
After: sys 28 %, soft‑interrupts balanced, timeout 0.3 %

5. Best Practices & Safety

5.1 Performance Optimisation

Keep sysstat running permanently (e.g., HISTORY=28) so that historical CPU, I/O and network trends are available for post‑mortem analysis.

Size thread pools at roughly 1-2× the number of CPU cores, and monitor cswch/s and nvcswch/s (pidstat -w) to catch over-provisioning.

For network‑heavy nodes, balance soft‑interrupts across cores (enable irqbalance and set appropriate smp_affinity).

5.2 Security Measures

Restrict execution of heavy debugging tools (perf, strace, gdb) to a limited ops team (e.g., chmod 750 /usr/bin/perf, with the binary's group set to that team).

Audit changes to sysctl, IRQ affinity and container CPU quotas.

Run snapshot scripts with read‑only permissions; avoid embedding automatic kill commands.

5.3 High‑Availability Considerations

Implement health‑check based traffic shedding before deep debugging.

Deploy critical services across multiple availability zones to avoid a single hot node dragging down the whole system.

Version‑control all CPU‑related configuration files and keep a backup before any change.

6. Monitoring & Alerting

Key metrics to watch:

node_cpu_seconds_total split by user, system, iowait, steal, softirq.

Load average (1‑ and 5‑minute).

Top‑N process CPU usage.

Network retransmits, drops, SYN backlog overflow.

Container throttling counters (container_cpu_cfs_throttled_periods_total).

Example Prometheus rules (simplified):

groups:
- name: cpu-hotspot
  rules:
  - alert: HostCpuHigh
    expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "Host CPU sustained high"
  - alert: HostSoftIrqHigh
    expr: avg by(instance) (rate(node_cpu_seconds_total{mode="softirq"}[5m])) * 100 > 25
    for: 3m
    labels:
      severity: warning
    annotations:
      summary: "Host soft‑interrupt CPU high"
  - alert: ContainerCpuThrottlingHigh
    expr: sum by(pod,namespace) (rate(container_cpu_cfs_throttled_periods_total[5m])) / sum by(pod,namespace) (rate(container_cpu_cfs_periods_total[5m])) > 0.2
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Container CPU throttling increasing"

7. Post‑mortem & Continuous Improvement

After a fix, verify three closure criteria:

Hot thread/function disappears or its contribution drops significantly (check perf report).

Key business endpoint returns to normal latency, e.g.:

curl -s -o /dev/null -w "%{http_code} %{time_total}\n" http://127.0.0.1:8080/health

Monitoring metrics return to the 7‑day P95 baseline.
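A single curl call says little about tail latency, so the endpoint check can be repeated and summarised as a percentile. A minimal sketch, assuming the nearest-rank percentile definition and an integer percentile argument; the function names sample_latency and percentile are hypothetical, and the URL is the health endpoint used above.

```shell
#!/usr/bin/env bash
# Sketch: sample endpoint latency and report a nearest-rank percentile.

# Print the p-th percentile (default 95) of numbers read one per line.
percentile() {
  LC_ALL=C sort -n | awk -v p="${1:-95}" '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1
      rank = int((p * NR + 99) / 100)   # ceil(p * NR / 100) for integer p
      if (rank < 1) rank = 1
      print v[rank]
    }'
}

# Request the URL n times (default 20) and print each total time in seconds.
sample_latency() {
  local url="$1" n="${2:-20}" i=0
  while [ "$i" -lt "$n" ]; do
    curl -s -o /dev/null -w '%{time_total}\n' "$url"
    i=$(( i + 1 ))
  done
}

# Usage: sample_latency http://127.0.0.1:8080/health 50 | percentile 95
```

Comparing this figure before and after the fix gives a local cross-check on the latency the monitoring platform reports.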

Document the root cause, the evidence collected, and the remediation steps. Incorporate the lessons into SOPs and automate snapshot collection via alert‑triggered scripts.
