Master Linux Performance: Optimize CPU, Memory, and I/O with Proven Tools
This guide explains Linux performance optimization by defining key metrics such as throughput and latency, clarifying average load, detailing CPU context switches, describing common performance analysis tools, and providing practical methods for diagnosing and improving CPU, memory, and I/O bottlenecks in production environments.
Part 1 Linux Performance Optimization
1 Performance Optimization
Performance Metrics
High concurrency and fast response correspond to two core metrics: throughput and latency.
Performance problems arise when system resources hit a bottleneck while request handling is still too slow to support more requests. Performance analysis aims to locate these bottlenecks and mitigate them.
Application load: throughput and latency, which directly affect end‑user experience.
System resources: resource utilization and saturation.
Key steps include selecting metrics, setting performance goals, conducting benchmarks, analyzing bottlenecks, and monitoring with alerts.
Understanding "Average Load"
Average load is the average number of runnable and uninterruptible processes over a time interval; it is not directly comparable to CPU utilization.
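Uninterruptible processes are executing a critical kernel path (e.g., waiting for I/O to complete) and cannot be interrupted until it finishes. This state is a protection mechanism for processes and hardware devices.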
When is Average Load Reasonable?
Monitor average load in production, compare it against historical trends, and alert when it exceeds a threshold (e.g., 70% of the CPU count).
CPU‑intensive workloads raise load and align with CPU usage.
I/O‑intensive workloads raise load without high CPU usage.
A large number of processes waiting to be scheduled on the CPU also raises load, and in this case CPU usage is high as well.
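The 70% rule of thumb above can be sketched in a few lines. This is a minimal illustration, not a monitoring implementation: the function names and the sample /proc/loadavg line are made up for the example, and the threshold is simply the 70% figure quoted in the text.

```python
def parse_loadavg(text):
    """Parse the first three fields of /proc/loadavg (1, 5, 15 minute averages)."""
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def is_overloaded(load1, cpus, threshold=0.7):
    """True when the 1-minute load exceeds threshold * number of logical CPUs."""
    return load1 / cpus > threshold


# Hypothetical sample line in the /proc/loadavg format.
SAMPLE = "1.60 0.90 0.50 2/345 6789"

load1, load5, load15 = parse_loadavg(SAMPLE)
# With 2 logical CPUs: 1.60 / 2 = 0.80, which is above the 0.70 threshold.
print(is_overloaded(load1, cpus=2))
```

On a live system, os.getloadavg() and os.cpu_count() from the standard library provide the same inputs without parsing /proc by hand.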
2 CPU
CPU Context Switch (Part 1)
A CPU context switch saves the previous task's CPU context (registers and program counter), loads the new task's context, and jumps to the location the new program counter points to.
Switch types:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel and user space. A system call involves two CPU context switches: on entry, the user‑mode registers are saved and the kernel‑mode context is loaded; on return, the saved user‑mode registers are restored.
System calls are privileged mode switches, not full process switches.
Thread Context Switch
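Two cases: if the threads belong to the same process, they share virtual memory, so only thread‑private data such as the stack and registers is switched (lightweight); if they belong to different processes, the switch costs the same as a process context switch. Same‑process thread switches therefore consume fewer resources.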
Interrupt Context Switch
Only kernel‑mode state is saved; interrupt handling has higher priority than process switches.
CPU Context Switch (Part 2)
Use vmstat 5 to view overall context switches:
vmstat 5 # output every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
...
cs: context switches per second
in: interrupts per second
r: length of the run queue (running and runnable processes)
b: processes in uninterruptible sleep
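The kernel exposes a cumulative context‑switch counter on the ctxt line of /proc/stat, so a rate comparable to the cs column can be approximated by sampling that counter twice. The sketch below assumes that approach; the function names and the two sample snapshots are invented for illustration.

```python
def read_ctxt(stat_text):
    """Return the cumulative context-switch count from /proc/stat content."""
    for line in stat_text.splitlines():
        if line.startswith("ctxt "):
            return int(line.split()[1])
    raise ValueError("no ctxt line found")


def switches_per_second(before, after, interval_s):
    """Context switches per second between two /proc/stat snapshots."""
    return (read_ctxt(after) - read_ctxt(before)) / interval_s


# Hypothetical snapshots taken 5 seconds apart (heavily truncated).
SAMPLE_T0 = "cpu  100 0 50 1000 10 0 5 0 0 0\nctxt 120000\nbtime 1700000000\n"
SAMPLE_T5 = "cpu  110 0 55 1480 12 0 5 0 0 0\nctxt 122500\nbtime 1700000000\n"

# (122500 - 120000) / 5 = 500 switches per second
print(switches_per_second(SAMPLE_T0, SAMPLE_T5, interval_s=5))
```

To use it on a real machine, read /proc/stat, sleep for the interval, read it again, and pass both strings in.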
Use pidstat -w 5 to see per‑process context switches: cswch/s is voluntary switches per second, and nvcswch/s is involuntary (non‑voluntary) switches per second.
pidstat -w 5
14:51:16      UID       PID   cswch/s nvcswch/s  Command
14:51:21        0         1      0.80      0.00  systemd
...
What to Do When an Application Hits 100% CPU?
Linux schedules tasks in short time slices using a tick timer. CPU usage is calculated from /proc/stat differences over an interval.
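The "difference over an interval" calculation can be sketched as follows. It assumes the standard field order of the aggregate cpu line in /proc/stat (user, nice, system, idle, iowait, irq, softirq, steal) and, as a simplifying choice for this example, counts iowait as idle time. The function names and sample snapshots are illustrative only.

```python
def parse_cpu_line(stat_text):
    """Return the first 8 counters of the aggregate 'cpu' line in /proc/stat.

    guest and guest_nice are skipped because they are already included
    in user and nice.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:9]]
    raise ValueError("no aggregate cpu line found")


def cpu_usage_percent(before, after):
    """Percentage of non-idle CPU time between two /proc/stat snapshots."""
    deltas = [b - a for a, b in zip(parse_cpu_line(before), parse_cpu_line(after))]
    total = sum(deltas)
    idle = deltas[3] + deltas[4]  # idle + iowait (assumption: iowait counts as idle)
    if total == 0:
        return 0.0
    return 100.0 * (total - idle) / total


# Hypothetical snapshots: 1000 ticks elapsed, 600 of them idle or iowait.
SAMPLE_T0 = "cpu  1000 0 500 8000 500 0 0 0 0 0\n"
SAMPLE_T1 = "cpu  1300 0 600 8500 600 0 0 0 0 0\n"

print(cpu_usage_percent(SAMPLE_T0, SAMPLE_T1))  # 40.0
```

Tools such as top report iowait separately rather than folding it into idle, so their figures can differ from this sketch.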
Tools: top, ps, and perf (top/record/report) to pinpoint hot functions.
perf top -g -p <PID>
When System CPU Is High but No High‑CPU Process Is Visible
Investigate processes in the Running state, check for short‑lived processes that start and exit between samples, and use pstree to find their parent processes (e.g., stress commands launched by php‑fpm).
Uninterruptible and Zombie Processes
Process states:
R – Running/Runnable
D – Uninterruptible (usually I/O)
Z – Zombie (exited but not reaped)
S – Interruptible sleep
I – Idle (kernel threads)
T – Stopped/Traced
X – Dead
Uninterruptible processes may indicate I/O problems; zombies consume PID space.
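The state letter listed above is the third field of /proc/<PID>/stat. A small sketch of counting states is shown below; the helper names and sample lines are hypothetical, and the lines are truncated to their first few fields. The command name is wrapped in parentheses and may itself contain spaces or parentheses, so the sketch splits after the last closing parenthesis.

```python
from collections import Counter


def state_of(stat_line):
    """Extract the state letter from one /proc/<PID>/stat line."""
    rest = stat_line[stat_line.rfind(")") + 1:].split()
    return rest[0]


def count_states(stat_lines):
    """Count how many processes are in each state."""
    return Counter(state_of(line) for line in stat_lines)


# Hypothetical, truncated /proc/<PID>/stat lines.
SAMPLES = [
    "1 (systemd) S 0 1 1 0",
    "812 (kworker/0:1) I 2 0 0 0",
    "4021 (app (worker)) D 1 4021 4021 0",
    "4090 (defunct-child) Z 4021 4021 4021 0",
]

counts = count_states(SAMPLES)
print(counts["D"], counts["Z"])  # 1 1
```

A persistently non-zero D or Z count is the signal to move on to pstree and find the parent process.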
CPU Performance Indicators
CPU usage (user, system, iowait, soft/hard IRQ, steal/guest)
Average load (ideal ≈ number of logical CPUs)
Context switches (voluntary vs involuntary)
Cache hit rate (L1/L2/L3)
Performance Tools
Average load: uptime, then mpstat and pidstat to locate heavy processes.
Context switches: vmstat → pidstat → thread‑level pidstat -t.
High CPU process: top → perf top.
High system CPU without a culprit: re‑examine top, focus on Running processes, use perf record or execsnoop.
Uninterruptible/Zombie cases: top → pstree → source inspection.
Soft‑IRQ spikes: top, /proc/softirqs, sar, tcpdump.
3 Memory
How Linux Memory Works
Linux provides each process with an isolated virtual address space, split into kernel and user regions. Physical memory is allocated on demand, and virtual addresses are translated to physical addresses through page tables that the kernel maintains and the MMU uses.
When a page is not present, a page‑fault occurs, the kernel allocates a physical page, updates the page table, and resumes the process.
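On‑demand allocation can be observed from user space by counting the process's minor page faults while it touches freshly allocated memory. The following is a rough sketch under the assumption of a Linux system; the function names are made up, and the exact fault count depends on the allocator, page size, and transparent huge pages, so no specific number is claimed.

```python
import resource

PAGE_SIZE = resource.getpagesize()


def minor_faults():
    """Minor page faults taken by this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def faults_while_touching(size_bytes):
    """Allocate a buffer, write one byte per page, and return the fault delta."""
    before = minor_faults()
    buf = bytearray(size_bytes)
    for offset in range(0, size_bytes, PAGE_SIZE):
        buf[offset] = 1  # touching a page forces it to be backed by physical memory
    return minor_faults() - before


delta = faults_while_touching(64 * 1024 * 1024)
print(f"minor page faults while touching 64 MiB: {delta}")
```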
Linux uses multi‑level page tables and HugePages to reduce overhead.
Virtual Memory Layout
Read‑only segment (code, constants)
Data segment (globals)
Heap (dynamic allocation, grows upward)
Memory‑mapped region (shared libraries, mmap, grows downward)
Stack (local variables, fixed size, typically 8 MiB)
Allocation and Release
For small allocations (<128 KB), malloc() uses brk() to move the program break; freed memory is not returned to the system immediately but cached for reuse.
For large allocations (>128 KB), malloc() uses mmap() to map a new region; the memory is returned to the kernel on free, so reusing it causes page faults again.
How to View Memory Usage
free: overall system memory. top/ps: per‑process VIRT, RES, SHR, %MEM.
Buffers and Cache
Buffers cache raw disk blocks; cache stores file data. Both improve read/write performance.
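Both figures are exposed in /proc/meminfo as the Buffers and Cached fields. The sketch below parses that format; the function name is illustrative and the sample values are invented for the example rather than measured.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


# Hypothetical excerpt in the /proc/meminfo format.
SAMPLE = (
    "MemTotal:        8000000 kB\n"
    "MemFree:          103388 kB\n"
    "Buffers:          145412 kB\n"
    "Cached:           511056 kB\n"
)

info = parse_meminfo(SAMPLE)
print(info["Buffers"], info["Cached"])  # 145412 511056
```

On a real system, pass the contents of /proc/meminfo instead of the sample string.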
Improving Cache Efficiency
Install bcc tools: cachestat, cachetop. Use pcstat to inspect file cache size.
# install Go
export GOPATH=~/go
export PATH=~/go/bin:$PATH
go get golang.org/x/sys/unix
go get github.com/tobert/pcstat/pcstat
Example: use dd to generate a 512 MiB file, drop caches, and measure the cache hit rate.
O_DIRECT Bypassing Cache
Running a containerized application that opens its file with O_RDONLY|O_DIRECT shows low cache hit rates and slow reads, confirming that direct I/O bypasses the page cache.
Memory Leaks
Leaks occur when heap allocations are never freed. A related bug, accessing memory outside an allocated region, can crash the program.
Detect leaks with bcc memleak:
/usr/share/bcc/tools/memleak -a -p $(pidof app)
Swap Usage
When memory is tight, Linux swaps out anonymous pages. Swap activity can be observed via free, sar -r -S, and vmstat. The swappiness parameter (0‑100) controls how aggressively swap is used.
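Swap activity can also be derived by hand from the cumulative pswpin and pswpout page counters in /proc/vmstat. The sketch below converts the difference between two snapshots into KiB per second; the function names, the sample snapshots, and the 4 KiB page size are assumptions made for the example.

```python
def parse_vmstat(text):
    """Parse /proc/vmstat content ('name value' per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            counters[parts[0]] = int(parts[1])
    return counters


def swap_rates_kib(before, after, interval_s, page_kib=4):
    """Return (swap-in, swap-out) in KiB/s between two /proc/vmstat snapshots."""
    b, a = parse_vmstat(before), parse_vmstat(after)
    swap_in = (a["pswpin"] - b["pswpin"]) * page_kib / interval_s
    swap_out = (a["pswpout"] - b["pswpout"]) * page_kib / interval_s
    return swap_in, swap_out


# Hypothetical snapshots taken 5 seconds apart.
SAMPLE_T0 = "pswpin 1000\npswpout 4000\n"
SAMPLE_T5 = "pswpin 1050\npswpout 4500\n"

# 50 pages in and 500 pages out over 5 s at 4 KiB per page
print(swap_rates_kib(SAMPLE_T0, SAMPLE_T5, interval_s=5))  # (40.0, 400.0)
```

Sustained non-zero values in both directions suggest memory pressure. The current swappiness setting can be read from /proc/sys/vm/swappiness.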
Analyzing High Swap
Create swap if it is missing, generate memory pressure with a large dd read, then watch sar, cachetop, and /proc/zoneinfo to understand the pressure on memory zones.
Memory Performance Tools
Use free, top, vmstat, pidstat for a broad view, then drill down with memleak, cachestat, perf, and numactl for NUMA‑aware analysis.
Original source: https://www.ctq6.cn/ (author: mikelLam)
