Master Linux CPU & Memory Bottleneck Diagnosis: Commands, Scripts, and Best Practices
This comprehensive guide walks Linux operators through systematic CPU and memory troubleshooting, detailing command sequences, deep metric interpretations, diagnostic scripts, and preventive tuning for modern multi‑core, cgroup‑v2 environments.
The article addresses the common challenge of server resource bottlenecks, focusing on CPU and memory issues in modern Linux environments (Ubuntu 24.04 LTS, RHEL 9.4, kernel 6.x) where multi‑core CPUs and cgroup v2 are standard.
CPU Troubleshooting
Command workflow: uptime → top/htop → ps aux --sort=-%cpu → pidstat → top -H → perf top → strace -c → /proc/<pid>/stat.
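Uptime shows the 1‑, 5‑, and 15‑minute load averages; compare them against the CPU core count reported by nproc. A load below the core count is generally healthy, a sustained load above roughly 0.7 × cores warrants attention, and a load above the core count means runnable tasks are queuing.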
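The load-versus-cores comparison can be automated. The following is a minimal sketch, assuming the 0.7 and 1.0 per-core thresholds from the rule of thumb above; the function name classify_load is illustrative and not part of the original scripts.

```shell
# classify_load LOAD CORES -> prints "ok", "watch", or "overloaded".
# Thresholds (0.7 and 1.0 per core) are rule-of-thumb assumptions.
classify_load() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio > 1.0)      print "overloaded"
        else if (ratio > 0.7) print "watch"
        else                  print "ok"
    }'
}

# Live usage: 1-minute load average from /proc/loadavg against nproc.
if [ -r /proc/loadavg ]; then
    read -r load1 _ < /proc/loadavg
    echo "load1=$load1 cores=$(nproc) status=$(classify_load "$load1" "$(nproc)")"
fi
```

On Linux the load average also counts tasks in uninterruptible sleep, so a "watch" result may point to I/O pressure rather than CPU pressure.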
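Top provides real‑time process and thread metrics; the key fields on the %Cpu(s) line are us (user), sy (system), id (idle), wa (I/O wait), hi (hardware interrupts), si (software interrupts), and st (time stolen by the hypervisor).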
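The same breakdown can be derived without top by diffing two samples of the aggregate cpu line in /proc/stat. This is a rough sketch; the function name cpu_breakdown is an assumption, and nice, irq, and softirq time are included in the total but not printed separately.

```shell
# cpu_breakdown PREV CUR: each argument is a full "cpu ..." line from /proc/stat.
# Columns: $2 user, $3 nice, $4 system, $5 idle, $6 iowait, $7 irq, $8 softirq, $9 steal.
cpu_breakdown() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        NR == 1 { for (i = 2; i <= 9; i++) prev[i] = $i; next }
        {
            for (i = 2; i <= 9; i++) { d[i] = $i - prev[i]; total += d[i] }
            if (total == 0) total = 1   # identical samples: avoid division by zero
            printf "us=%.1f sy=%.1f id=%.1f wa=%.1f st=%.1f\n",
                100 * d[2] / total, 100 * d[4] / total, 100 * d[5] / total,
                100 * d[6] / total, 100 * d[9] / total
        }'
}

# Live usage: two samples taken one second apart.
if [ -r /proc/stat ]; then
    prev=$(grep '^cpu ' /proc/stat)
    sleep 1
    cur=$(grep '^cpu ' /proc/stat)
    cpu_breakdown "$prev" "$cur"
fi
```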
Pidstat offers per‑process or per‑user CPU statistics (install the sysstat package first).
Thread‑level analysis with top -H and pidstat -t helps locate hot threads.
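When neither top -H nor pidstat -t is available, per-thread CPU time can be read from /proc/<pid>/task/<tid>/stat. The sketch below is one possible approach, not the article's script; thread_ticks is a hypothetical helper name.

```shell
# thread_ticks PID: prints "TID TICKS COMM" per thread, busiest first.
# TICKS is utime + stime (fields 14 and 15 of stat), cumulative since thread start.
thread_ticks() (
    for dir in /proc/"$1"/task/[0-9]*; do
        stat=$(cat "$dir/stat" 2>/dev/null) || continue   # thread may have exited
        comm=${stat#*\(}; comm=${comm%\)*}                 # text between the parentheses
        rest=${stat##*\) }                                 # fields from "state" onward
        set -- $rest                                       # now $12 = utime, $13 = stime
        echo "${dir##*/} $(( ${12} + ${13} )) $comm"
    done | sort -k2,2 -rn
)

# Example: threads of the current shell.
thread_ticks "$$"
```

Because the tick counts are cumulative, running the function twice a few seconds apart and comparing the values shows which threads are hot right now.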
Perf (install via apt install linux-tools-common linux-tools-$(uname -r) on Ubuntu, apt install linux-perf on Debian, or dnf install perf on RHEL) captures hotspot functions and can generate flame graphs using Brendan Gregg’s FlameGraph repository.
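A flame graph run typically chains perf record, perf script, and the two FlameGraph Perl scripts. The following is a hedged sketch: the checkout location ($HOME/FlameGraph), the output paths under /tmp, and the function name make_flamegraph are all assumptions for illustration.

```shell
# make_flamegraph PID [SECONDS] [FLAMEGRAPH_DIR]
# Samples PID at 99 Hz with call graphs, then renders an SVG flame graph.
make_flamegraph() {
    fg_dir=${3:-$HOME/FlameGraph}   # assumed checkout of the FlameGraph repository
    # Bail out early when the tools are missing instead of failing midway.
    [ -x "$fg_dir/flamegraph.pl" ] || { echo "FlameGraph checkout not found in $fg_dir" >&2; return 2; }
    command -v perf >/dev/null 2>&1 || { echo "perf is not installed" >&2; return 2; }
    perf record -F 99 -g -p "$1" -o "/tmp/perf-$1.data" -- sleep "${2:-30}" || return 1
    perf script -i "/tmp/perf-$1.data" \
        | "$fg_dir/stackcollapse-perf.pl" \
        | "$fg_dir/flamegraph.pl" > "/tmp/flame-$1.svg"
    echo "/tmp/flame-$1.svg"
}

# Usage (typically as root): make_flamegraph 1234 30
```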
# Example CPU diagnostic script (check_cpu.sh)
#!/bin/bash
set -euo pipefail
echo "=== CPU Diagnosis ==="
date
# System load overview
echo "[1] System load"
uptime
# CPU core count
echo "[2] CPU cores"
nproc
# Top 10 CPU‑hungry processes
echo "[3] Top CPU processes"
# awk reads all input, so ps never receives SIGPIPE (head would trip pipefail)
ps aux --sort=-%cpu | awk 'NR <= 11'
# Thread‑level snapshot
echo "[4] Top threads"
# With -H, the first column of top is the thread ID (TID)
top -b -n 1 -H | awk '$1 ~ /^[0-9]+$/ {printf "CPU:%-6s TID:%-8s CMD:%s\n", $9, $1, $12}' | sort -t: -k2 -rn | awk 'NR <= 10'
# Process state summary
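echo "[5] Process states"
# stat= suppresses the header; only the first state letter is counted
ps -eo stat= | awk '{c[substr($1,1,1)]++} END {printf "Running(R):%d Sleeping(S):%d Uninterruptible(D):%d Stopped(T):%d Zombie(Z):%d\n", c["R"], c["S"], c["D"], c["T"], c["Z"]}'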
echo "=== Diagnosis Complete ==="
Memory Troubleshooting
Command workflow: free -m → top → ps aux --sort=-%mem → vmstat → pmap -x → smem → slabtop.
Free shows total, used, free, buffers, cache, and available memory; free -h adds human‑readable units.
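/proc/meminfo fields (MemTotal, MemFree, MemAvailable, Buffers, Cached, etc.) are decoded, with formulas for actual usable memory: MemAvailable is the kernel's own estimate, and usage can be expressed as (1 - MemAvailable / MemTotal) × 100.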
Vmstat reports processes, memory, swap, I/O, system, and CPU statistics; key indicators for bottlenecks include non‑zero si/so (swap activity) and high wa (I/O wait).
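The two vmstat indicators can be checked mechanically. This sketch locates columns by header name rather than position, since newer procps releases append columns; the 20% wa threshold and the function name vmstat_flags are assumptions.

```shell
# vmstat_flags [WA_MAX]: reads vmstat output on stdin and reports suspicious samples.
vmstat_flags() {
    awk -v wa_max="${1:-20}" '
        $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }   # header row
        !("si" in col) || $1 !~ /^[0-9]+$/ { next }                  # skip banner lines
        {
            if ($(col["si"]) + $(col["so"]) > 0)
                print "swap activity: si=" $(col["si"]) " so=" $(col["so"])
            if ($(col["wa"]) > wa_max)
                print "high iowait: wa=" $(col["wa"])
        }'
}

# Usage: vmstat 1 5 | vmstat_flags 20
```

The first vmstat sample reports averages since boot, so it is usually the later samples that matter.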
Pmap reveals per‑process memory mappings, RSS vs. VSZ, and helps spot leaks by repeated sampling.
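Repeated sampling is easy to script. As a lighter-weight stand-in for pmap -x, the sketch below reads VmRSS from /proc/<pid>/status at fixed intervals; the helper names rss_kb and watch_rss are hypothetical.

```shell
# rss_kb PID: resident set size in kB, as reported by the kernel.
rss_kb() { awk '/^VmRSS:/ {print $2}' "/proc/$1/status" 2>/dev/null; }

# watch_rss PID [SAMPLES] [INTERVAL_SECONDS]: print RSS and its drift from the first sample.
watch_rss() (
    pid=$1; n=${2:-5}; interval=${3:-60}; i=0; first=
    while [ "$i" -lt "$n" ]; do
        cur=$(rss_kb "$pid")
        [ -n "$cur" ] || exit 1          # process gone, or a kernel thread without VmRSS
        [ -n "$first" ] || first=$cur
        echo "$(date +%T) pid=$pid rss_kb=$cur delta_kb=$(( cur - first ))"
        i=$(( i + 1 ))
        if [ "$i" -lt "$n" ]; then sleep "$interval"; fi
    done
)

# Example: three samples of the current shell, one second apart.
watch_rss "$$" 3 1
```

A delta that keeps climbing across many samples under steady load suggests a leak; pmap -x on the same PID then shows which mapping is growing.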
Smem provides enhanced memory reports, including PSS and USS, with optional bar or pie charts.
Slabtop inspects kernel slab caches (e.g., dentry, inode_cache) for abnormal growth.
# Example memory diagnostic script (check_memory.sh)
#!/bin/bash
set -euo pipefail
echo "=== Memory Diagnosis ==="
date
# Overview
free -h
# Detailed /proc/meminfo parsing
awk '/^MemTotal:/ {total=$2} /^MemFree:/ {free=$2} /^MemAvailable:/ {avail=$2} END {printf "Total: %.2f GB\nFree: %.2f GB\nAvailable: %.2f GB\nUsage: %.1f%%\n", total/1024/1024, free/1024/1024, avail/1024/1024, (1-avail/total)*100}' /proc/meminfo
# Top memory consumers
# awk reads all input, so ps never receives SIGPIPE (head would trip pipefail)
ps aux --sort=-%mem | awk 'NR <= 11'
# Swap analysis
free | awk '/Swap:/ {if($2>0) printf "Swap usage: %.1f%%\n", $3/$2*100; else print "No swap configured"}'
# Slab cache snapshot
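# -o prints once and exits (without it slabtop stays interactive); reading slab data needs root
slabtop -o -s c 2>/dev/null | awk 'NR <= 20' || echo "slabtop unavailable (run as root)"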
echo "=== Diagnosis Complete ==="
Common Scenarios & Solutions
High load with mostly idle CPU : Check for processes in uninterruptible sleep (D state) via ps aux, I/O wait with iostat, and virtualization limits such as CPU steal time (st in top).
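Tasks stuck in uninterruptible sleep can be listed with a one-line filter. This is a small sketch; list_dstate is an illustrative name, and the wchan column (the kernel function the task is waiting in) may show "-" on hardened kernels.

```shell
# list_dstate: reads "STAT PID WCHAN COMMAND" rows on stdin and keeps D-state tasks.
list_dstate() { awk '$1 ~ /^D/'; }

# Live usage, when a procps-style ps is available.
if ps -eo stat,pid,wchan:32,comm >/dev/null 2>&1; then
    ps -eo stat,pid,wchan:32,comm | list_dstate
fi
```

If the same PIDs appear across several runs, pairing the output with iostat -x 1 helps confirm whether a slow or saturated device is behind the load.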
Memory growth without OOM : Distinguish reclaimable page‑cache growth (which can be released for a test by writing to /proc/sys/vm/drop_caches) from true leaks; use repeated ps sampling, or valgrind for native binaries.
Frequent swap usage : Tune vm.swappiness, identify swap‑using processes, and consider service scaling or cgroup limits.
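To identify which processes are actually using swap, the VmSwap field in /proc/<pid>/status can be aggregated. A minimal sketch follows; swap_by_process is a hypothetical helper, and processes that exit mid-scan are silently skipped.

```shell
# swap_by_process [N]: top N swap users as "KB<TAB>PID<TAB>NAME".
swap_by_process() {
    awk '/^Name:/ {name = $2}
         /^Pid:/  {pid = $2}
         /^VmSwap:/ && $2 > 0 {printf "%d\t%s\t%s\n", $2, pid, name}' \
        /proc/[0-9]*/status 2>/dev/null | sort -rn | head -n "${1:-10}"
}

swap_by_process 10
echo "vm.swappiness = $(cat /proc/sys/vm/swappiness 2>/dev/null || echo unknown)"
```

An empty list with non-zero swap usage in free usually means the swapped pages belong to processes that have since been swapped back in or exited.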
Java heap issues : Inspect JVM flags (-Xms, -Xmx), monitor with jstat -gc, and enable heap dumps on OOM with -XX:+HeapDumpOnOutOfMemoryError.
Nginx/PHP‑FPM memory : Review worker_processes, worker_rlimit_nofile, and PHP‑FPM pm.max_children settings; use pm.max_requests to mitigate leaks.
Database memory : Adjust innodb_buffer_pool_size (MySQL) and shared_buffers / work_mem (PostgreSQL) based on physical RAM.
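Sizing from physical RAM is simple arithmetic. The sketch below uses 60% for innodb_buffer_pool_size and 25% for shared_buffers purely as commonly cited starting points for a dedicated database host; these percentages are assumptions, not figures from this article, and pct_of_ram_mb is an illustrative helper.

```shell
# pct_of_ram_mb PERCENT [MEMTOTAL_KB]: PERCENT of RAM in MiB.
# MEMTOTAL_KB defaults to MemTotal from /proc/meminfo.
pct_of_ram_mb() {
    total_kb=${2:-$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)}
    echo $(( total_kb * $1 / 100 / 1024 ))
}

# Assumed starting points; tune against the real workload.
echo "innodb_buffer_pool_size = $(pct_of_ram_mb 60)M"
echo "shared_buffers = $(pct_of_ram_mb 25)MB"
```

On hosts shared with an application tier, the percentages would need to come down so that the database, the application, and the page cache all fit.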
Resource Limits & Control
ulimit : Show and modify per‑shell limits (open files, processes, virtual memory).
systemd service limits : Use LimitNOFILE, LimitNPROC, MemoryMax, CPUQuota in unit files.
cgroup v2 : Create a cgroup directory, set memory.max and pids.max, and assign processes via cgroup.procs or systemd-run.
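The systemd and cgroup v2 controls above can be sketched as two small helpers. The limit values, the unit name myapp.service, and the cgroup name demo are placeholders chosen for illustration; both helpers take a target directory so they can be tried against a scratch path first.

```shell
# write_limits_dropin DIR: write a systemd drop-in with the four directives named above.
write_limits_dropin() {
    mkdir -p "$1" || return 1
    cat > "$1/limits.conf" <<'EOF'
[Service]
LimitNOFILE=65536
LimitNPROC=4096
MemoryMax=2G
CPUQuota=150%
EOF
}

# cg_create ROOT NAME MEM_MAX PIDS_MAX: create a cgroup directory and set its limits.
cg_create() {
    mkdir -p "$1/$2" || return 1
    echo "$3" > "$1/$2/memory.max" || return 1
    echo "$4" > "$1/$2/pids.max" || return 1
    echo "$1/$2"
}

# On a real host, as root:
#   write_limits_dropin /etc/systemd/system/myapp.service.d
#   systemctl daemon-reload && systemctl restart myapp.service
#   cg=$(cg_create /sys/fs/cgroup demo 512M 100) && echo "$$" > "$cg/cgroup.procs"
#   systemd-run --scope -p MemoryMax=512M -p TasksMax=100 ./job
```

On a live cgroup v2 hierarchy, memory.max and pids.max only exist in the child if the memory and pids controllers are enabled in the parent's cgroup.subtree_control.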
Comprehensive Monitoring Scripts
Two Bash scripts are provided:
resource_monitor.sh : Periodic alerts for CPU, memory, swap, disk, and load thresholds, logging to /var/log/resource_alert.log.
full_diagnosis.sh : Gathers system info, CPU/memory stats, process snapshots, I/O, network, kernel, and recent OOM logs into a timestamped directory for post‑mortem analysis.
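The full scripts are not reproduced here, but the threshold check at the core of a monitor like resource_monitor.sh might look like the sketch below. The function name alert_if_over, the 90% memory threshold, and the RESOURCE_ALERT_LOG variable are assumptions; the log defaults to /tmp so the sketch runs without root, whereas the article's script logs to /var/log/resource_alert.log.

```shell
# alert_if_over NAME VALUE LIMIT LOGFILE: append a timestamped line when VALUE > LIMIT.
alert_if_over() {
    if awk -v v="$2" -v l="$3" 'BEGIN {exit !(v > l)}'; then
        echo "$(date '+%F %T') ALERT $1=$2 exceeds $3" >> "$4"
    fi
}

# Example check: memory usage derived from MemAvailable.
log=${RESOURCE_ALERT_LOG:-/tmp/resource_alert.log}
mem_pct=$(awk '/^MemTotal:/ {t = $2} /^MemAvailable:/ {a = $2}
               END {if (t > 0) printf "%.1f", (1 - a / t) * 100; else print 0}' /proc/meminfo)
alert_if_over mem_used_pct "$mem_pct" 90 "$log"
```

Run from cron or a systemd timer, one such call per metric (CPU, swap, disk, load) gives the periodic alerting described above.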
Prevention & Optimization
Kernel sysctl tuning : Adjust vm.swappiness, dirty ratios, fs.file-max, network backlog, and shared memory limits.
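Before changing any of these settings it helps to record the current values. The sketch below reads a handful of keys straight from /proc/sys; the key list is an assumed mapping of the categories above (net.core.somaxconn for network backlog, kernel.shmmax for shared memory), and show_sysctls is an illustrative name.

```shell
# show_sysctls [ROOT]: print selected keys read from ROOT (default /proc/sys).
show_sysctls() {
    for key in vm.swappiness vm.dirty_ratio vm.dirty_background_ratio \
               fs.file-max net.core.somaxconn kernel.shmmax; do
        path="${1:-/proc/sys}/$(echo "$key" | tr . /)"   # vm.swappiness -> vm/swappiness
        if [ -r "$path" ]; then
            echo "$key = $(cat "$path")"
        else
            echo "$key = (unavailable)"
        fi
    done
}

show_sysctls
```

Changes made with sysctl -w last until reboot; persistent values belong in a file under /etc/sysctl.d/ and are loaded with sysctl --system.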
Service configuration : Optimize Nginx workers, PHP‑FPM pools, and database buffers according to workload and hardware.
Monitoring stack : Deploy Prometheus + node_exporter, Grafana, Alertmanager, plus exporters for MySQL, Nginx, and cAdvisor for container metrics.
By following the structured command order, using the provided scripts, and applying the recommended system and service tunings, operators can quickly pinpoint CPU or memory bottlenecks, reduce mean‑time‑to‑resolution, and proactively prevent resource‑related incidents.
Ops Community
A leading IT operations community where professionals share and grow together.
