
Master Linux CPU & Memory Bottleneck Diagnosis: Commands, Scripts, and Best Practices

This comprehensive guide walks Linux operators through systematic CPU and memory troubleshooting, detailing command sequences, deep metric interpretations, diagnostic scripts, and preventive tuning for modern multi‑core, cgroup‑v2 environments.

Ops Community

The article addresses the common challenge of server resource bottlenecks, focusing on CPU and memory issues in modern Linux environments (Ubuntu 24.04 LTS, RHEL 9.4, kernel 6.x) where multi‑core CPUs and cgroup v2 are standard.

CPU Troubleshooting

Command workflow: uptime → top/htop → ps aux --sort=-%cpu → pidstat → top -H → perf top → strace -c → /proc/<pid>/stat.

Uptime shows load averages; compare them against the CPU core count (a load below the number of cores is generally normal, while a sustained load above ~0.7 × cores warrants attention).
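As a rough sketch of that rule of thumb (the 0.7 multiplier is a heuristic, not a hard limit):

```shell
#!/bin/bash
# Compare the 1-minute load average against 0.7 x core count
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load1" -v c="$cores" 'BEGIN {
    t = c * 0.7
    if (l > t) printf "WARNING: load %.2f exceeds %.2f (%d cores)\n", l, t, c
    else       printf "OK: load %.2f within %.2f (%d cores)\n", l, t, c
}'
```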

Top provides real‑time process and thread metrics; key fields (the %Cpu(s) breakdown: us, sy, id, wa, hi, si, st) are explained.

Pidstat offers per‑process or per‑thread CPU statistics (part of the sysstat package, which must be installed first).

Thread‑level analysis with top -H and pidstat -t helps locate hot threads.
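For example, using the current shell's PID as a stand-in for the process under investigation:

```shell
pid=$$   # substitute the PID of the suspect process
top -H -b -n 1 -p "$pid" | head -20   # one batch iteration, thread view
pidstat -t -p "$pid" 1 3              # per-thread CPU over three 1-second samples
printf '%x\n' "$pid"                  # hex thread ID, handy for matching Java thread dumps
```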

Perf (packaged as linux-perf on Debian, linux-tools-$(uname -r) on Ubuntu, and perf on RHEL) captures hotspot functions and can generate flame graphs using Brendan Gregg’s FlameGraph repository.
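A typical capture sequence might look like the following (the 99 Hz sampling rate and 30-second window are conventional choices, not requirements):

```shell
# Sample all CPUs at 99 Hz with call-graph data for 30 seconds
perf record -F 99 -a -g -- sleep 30
perf report --stdio | head -30   # text summary of the hottest functions

# Render a flame graph from the same perf.data
git clone https://github.com/brendangregg/FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > cpu.svg
```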

# Example CPU diagnostic script (check_cpu.sh)
#!/bin/bash
set -euo pipefail

echo "=== CPU Diagnosis ==="
date

# System load overview
echo "[1] System load"
uptime

# CPU core count
echo "[2] CPU cores"
nproc

# Top 10 CPU‑hungry processes
echo "[3] Top CPU processes"
ps aux --sort=-%cpu | head -11

# Thread‑level snapshot
echo "[4] Top threads"
ps -eLo pcpu,pid,lwp,comm --sort=-pcpu | head -11

# Process state summary
echo "[5] Process states"
ps aux | awk 'NR>1 {c[substr($8,1,1)]++} END {printf "Running(R):%d Sleeping(S):%d Stopped(T):%d Zombie(Z):%d\n", c["R"], c["S"], c["T"], c["Z"]}'

echo "=== Diagnosis Complete ==="

Memory Troubleshooting

Command workflow: free -m → top → ps aux --sort=-%mem → vmstat → pmap -x → smem → slabtop.

Free shows total, used, free, buffers, cache, and available memory; free -h adds human‑readable units.

/proc/meminfo fields (MemTotal, MemFree, MemAvailable, Buffers, Cached, etc.) are decoded, with formulas for actual usable memory.

Vmstat reports processes, memory, swap, I/O, system, and CPU statistics; key indicators for bottlenecks include non‑zero si/so (swap activity) and high wa (I/O wait).
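For instance:

```shell
# Five samples at 1-second intervals (the first line is an average since boot)
vmstat 1 5
# Watch si/so (pages swapped in/out per second) and wa (% CPU waiting on I/O);
# sustained non-zero si/so indicates genuine memory pressure
```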

Pmap reveals per‑process memory mappings, RSS vs. VSZ, and helps spot leaks by repeated sampling.
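A minimal sampling loop, using the current shell as a placeholder target:

```shell
pid=$$   # substitute the suspect PID
for i in 1 2 3; do
    date '+%T'
    pmap -x "$pid" | tail -n 1   # "total" line: Kbytes, RSS, Dirty
    sleep 2                      # use minutes-long intervals for slow leaks
done
# An RSS figure that climbs monotonically across samples suggests a leak
```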

Smem provides enhanced memory reports, including PSS and USS, with optional bar or pie charts.

Slabtop inspects kernel caches (e.g., dentry_cache, inode_cache) for abnormal growth.

# Example memory diagnostic script (check_memory.sh)
#!/bin/bash
set -euo pipefail

echo "=== Memory Diagnosis ==="
date

# Overview
free -h

# Detailed /proc/meminfo parsing
awk '/^MemTotal:/ {total=$2} /^MemFree:/ {free=$2} /^MemAvailable:/ {avail=$2} END {printf "Total: %.2f GB\nFree: %.2f GB\nAvailable: %.2f GB\nUsage: %.1f%%\n", total/1024/1024, free/1024/1024, avail/1024/1024, (1-avail/total)*100}' /proc/meminfo

# Top memory consumers
ps aux --sort=-%mem | head -11

# Swap analysis
free | awk '/Swap:/ {if($2>0) printf "Swap usage: %.1f%%\n", $3/$2*100; else print "No swap configured"}'

# Slab cache snapshot
slabtop -o -s c | head -20

echo "=== Diagnosis Complete ==="

Common Scenarios & Solutions

High load with high CPU idle: Check for processes in uninterruptible sleep (D state) via ps aux, I/O wait with iostat, and virtualization limits (CPU steal, st in top).

Memory growth without OOM: Distinguish cache growth (drop caches) from true leaks; use repeated ps sampling or valgrind for native binaries.
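To rule out cache growth, the caches can be dropped and the effect observed (root required; safe, but performance dips briefly while caches refill):

```shell
free -h                                      # note buff/cache vs. available
sync                                         # flush dirty pages to disk first
echo 3 | sudo tee /proc/sys/vm/drop_caches   # drop page cache, dentries, inodes
free -h                                      # a true leak leaves "available" low
```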

Frequent swap usage: Tune vm.swappiness, identify swap‑using processes, and consider service scaling or cgroup limits.
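A sketch of both steps (the value 10 is a common starting point, not a universal answer):

```shell
cat /proc/sys/vm/swappiness        # default is typically 60
sudo sysctl vm.swappiness=10       # prefer reclaiming cache over swapping
# Top 5 swap-consuming processes, read from /proc/<pid>/status
for f in /proc/[0-9]*/status; do
    awk '/^Name:/ {n=$2} /^VmSwap:/ {print $2, n}' "$f"
done 2>/dev/null | sort -rn | head -5
```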

Java heap issues: Inspect JVM flags (-Xms, -Xmx), monitor with jstat -gc, and enable heap dumps on OOM.
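For example (the PID discovery, heap sizes, and dump path below are illustrative):

```shell
pid=$(pgrep -f java | head -n 1)   # first Java process found, if any
jstat -gc "$pid" 1000 5            # GC statistics every second, five samples
# Flags worth setting before trouble strikes (sizes are examples only):
# java -Xms2g -Xmx2g -XX:+HeapDumpOnOutOfMemoryError \
#      -XX:HeapDumpPath=/var/log/heapdumps ...
```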

Nginx/PHP‑FPM memory: Review worker_processes, worker_rlimit_nofile, and PHP‑FPM pm.max_children settings; use pm.max_requests to mitigate leaks.
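A rough sizing rule is pm.max_children ≈ RAM available to PHP ÷ average worker RSS; the average can be measured directly:

```shell
# Average resident size of running PHP-FPM workers (if any)
ps --no-headers -o rss -C php-fpm | awk '
    {sum += $1; n++}
    END {if (n) printf "workers: %d, avg RSS: %.0f MB\n", n, sum / n / 1024
         else   print "no php-fpm processes found"}'
```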

Database memory: Adjust innodb_buffer_pool_size (MySQL) and shared_buffers / work_mem (PostgreSQL) based on physical RAM.
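As an illustrative starting point for a dedicated 16 GB database host (the figures below are assumptions, not recommendations from the article):

```ini
# MySQL (e.g. /etc/mysql/conf.d/memory.cnf): buffer pool ~60-70% of RAM
[mysqld]
innodb_buffer_pool_size = 10G

# PostgreSQL (postgresql.conf): shared_buffers ~25% of RAM; work_mem is per sort
# shared_buffers = 4GB
# work_mem = 16MB
```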

Resource Limits & Control

ulimit: Show and modify per‑shell limits (open files, processes, virtual memory).

systemd service limits: Use LimitNOFILE, LimitNPROC, MemoryMax, CPUQuota in unit files.
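A sketch of a drop-in override (the service name myapp is a placeholder):

```ini
# /etc/systemd/system/myapp.service.d/limits.conf
[Service]
LimitNOFILE=65536
LimitNPROC=4096
MemoryMax=2G
CPUQuota=150%
```

Apply with systemctl daemon-reload followed by systemctl restart myapp.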

cgroup v2: Create a cgroup directory, set memory.max and pids.max, and assign processes via cgroup.procs or systemd-run.
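Both approaches are sketched below; they require root, and the cgroup name demo and the limits are arbitrary:

```shell
# Manual cgroup v2 setup under /sys/fs/cgroup
sudo mkdir -p /sys/fs/cgroup/demo
echo 512M | sudo tee /sys/fs/cgroup/demo/memory.max
echo 100  | sudo tee /sys/fs/cgroup/demo/pids.max
echo $$   | sudo tee /sys/fs/cgroup/demo/cgroup.procs   # move the current shell in

# Or let systemd create and manage the cgroup for a one-off command
sudo systemd-run --scope -p MemoryMax=512M -p TasksMax=100 sleep 60
```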

Comprehensive Monitoring Scripts

Two Bash scripts are provided:

resource_monitor.sh: Periodic alerts for CPU, memory, swap, disk, and load thresholds, logging to /var/log/resource_alert.log.

full_diagnosis.sh: Gathers system info, CPU/memory stats, process snapshots, I/O, network, kernel, and recent OOM logs into a timestamped directory for post‑mortem analysis.

Prevention & Optimization

Kernel sysctl tuning: Adjust vm.swappiness, dirty ratios, fs.file-max, network backlog, and shared memory limits.
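A sketch of such a drop-in file (values are illustrative starting points, not one-size-fits-all):

```ini
# /etc/sysctl.d/99-tuning.conf (apply with: sysctl --system)
vm.swappiness = 10
vm.dirty_ratio = 15
vm.dirty_background_ratio = 5
fs.file-max = 2097152
net.core.somaxconn = 4096
```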

Service configuration: Optimize Nginx workers, PHP‑FPM pools, and database buffers according to workload and hardware.

Monitoring stack: Deploy Prometheus + node_exporter, Grafana, Alertmanager, plus exporters for MySQL, Nginx, and cAdvisor for container metrics.

By following the structured command order, using the provided scripts, and applying the recommended system and service tunings, operators can quickly pinpoint CPU or memory bottlenecks, reduce mean‑time‑to‑resolution, and proactively prevent resource‑related incidents.
