Master Linux Performance: Key Metrics, Tools, and Optimization Techniques
This guide explains Linux performance optimization by defining core metrics such as throughput, latency, and average load, describing how to select and benchmark indicators, outlining essential analysis tools like vmstat, pidstat, and perf, and providing practical CPU and memory tuning strategies to eliminate bottlenecks.
Linux Performance Optimization
Performance Optimization
Performance Indicators
High concurrency and fast response correspond to two core metrics: throughput and latency.
Application load perspective: directly impacts end‑user experience.
System resource perspective: resource utilization, saturation, etc.
The essence of a performance problem is that system resources have reached a bottleneck while request processing is still too slow to handle more traffic. Performance analysis is about locating the bottleneck and mitigating it.
Select metrics to evaluate application and system performance.
Set performance goals for the application and the system.
Conduct performance benchmark tests.
Analyze and locate bottlenecks.
Monitor and set alerts.
Different performance problems require different analysis tools. Below are common Linux performance tools and the problem types they address.
Understanding “Average Load”
Average load: the average number of runnable and uninterruptible processes per unit time, i.e., the average number of active processes. It is not directly tied to CPU utilization.
Uninterruptible processes are those in a kernel‑mode critical path (e.g., waiting for I/O). They represent a protection mechanism for processes and hardware.
When Is Average Load Reasonable?
In production, monitor average load over time. When a clear upward trend appears, investigate promptly, optionally setting a threshold (e.g., load > 70% of CPU count).
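The threshold rule above can be sketched in one command; the 70% figure and the choice of the 1-minute load are illustrative, assuming procfs is mounted:

```shell
# Compare the 1-minute load average against 70% of the CPU count
# (the 0.7 factor is a rule of thumb, not a kernel default).
awk -v cpus="$(nproc)" '{
  thr = cpus * 0.7
  status = ($1 > thr) ? "ALERT" : "OK"
  print status ": 1-min load " $1 " vs threshold " thr
}' /proc/loadavg
```

In a real deployment this check would run from a monitoring agent rather than ad hoc.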
In practice, average load is often confused with CPU usage. They are not equivalent:
CPU‑intensive processes raise both load and CPU usage.
I/O‑intensive processes raise load while CPU usage may stay low.
Processes waiting for CPU raise both load and CPU usage.
High load may be caused by CPU‑bound or I/O‑bound workloads; tools like mpstat and pidstat help identify the source.
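As a rough sketch of where those tools get their numbers: both read per-CPU tick counters from /proc/stat, so a quick user-versus-iowait comparison is possible with coreutils alone (field positions follow the proc(5) layout):

```shell
# /proc/stat "cpu" line: user nice system idle iowait irq softirq ...
# High iowait ticks relative to user ticks hints at an I/O-bound load.
awk '/^cpu / {print "user=" $2, "iowait=" $6}' /proc/stat
```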
CPU
CPU Context Switch (Upper)
A CPU context switch saves the previous task’s registers and program counter, then loads the new task’s context and jumps to its entry point. The saved context resides in the kernel until the task is rescheduled.
Context switches are classified by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. A system call performs two context switches:
Save user‑mode instruction pointer, load kernel‑mode pointer, and jump to kernel code.
After the call returns, restore the saved user registers and resume user execution.
This is a privileged‑mode switch, not a full process switch. A full process switch additionally requires the kernel to save the current process’s virtual memory and stack state before loading the next process’s context.
Process switches occur when the scheduler reallocates CPU time, e.g., on time‑slice rotation, resource shortage, an explicit sleep, priority preemption, or hardware interrupt handling.
Thread Context Switch
Two cases:
Threads belong to the same process – only thread‑private data and registers change; virtual memory stays the same.
Threads belong to different processes – same cost as a process switch.
Switching between threads of the same process consumes fewer resources, which is why multithreading is advantageous.
Interrupt Context Switch
Interrupt switches involve only kernel‑mode state (CPU registers, kernel stack, hardware interrupt parameters). Interrupt handling has higher priority than process execution, so interrupt and process switches never occur simultaneously.
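Interrupt activity is visible directly in procfs; a quick look at the per-CPU hard-interrupt counters (soft IRQs live in /proc/softirqs):

```shell
# Header row lists CPUs; each subsequent row is one interrupt source
# with per-CPU delivery counts.
head -5 /proc/interrupts
```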
CPU Context Switch (Lower)
Use vmstat to view overall context‑switch statistics:
vmstat 5 # output a data set every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
0 0 0 103388 145412 511076 0 0 0 2 450 1176 1 1 99 0 0
cs – context switches per second.
in – interrupts per second.
r – length of the run queue (processes ready or running).
b – processes in uninterruptible sleep.
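The cs column is derived from the cumulative ctxt counter in /proc/stat; a minimal sketch that samples it over one second (assuming procfs):

```shell
# Read the context-switch counter twice, one second apart, and diff.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```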
To inspect per‑process switches, use pidstat -w:
pidstat -w 5
Typical output shows voluntary (cswch) and involuntary (nvcswch) switches per second.
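pidstat reads these per-process counters from procfs; the same numbers can be pulled directly, here for the current shell:

```shell
# voluntary_ctxt_switches: the process gave up the CPU (e.g., blocked on I/O).
# nonvoluntary_ctxt_switches: the scheduler preempted it.
grep ctxt_switches /proc/$$/status
```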
CPU Usage Scenarios
When an application’s CPU usage hits 100%:
sudo docker run --name nginx -p 10000:80 -itd feisky/nginx
sudo docker run --name phpfpm -itd --network container:nginx feisky/php-fpm
ab -c 10 -n 100 http://IP:10000/ # basic load test
Increasing the request count reveals which processes drive the load.
top shows the PHP‑FPM processes spiking CPU.
perf top -g -p PID # locate hot functions
In the example, a stray sqrt call inside a million‑iteration loop caused the spike; removing it restored performance.
If CPU usage is high but no process appears responsible, examine the run queue. A large number of processes in the Running state can indicate short‑lived processes or crash‑restart loops. Use pstree to find parent processes, then check the source code for missing wait() calls or signal handling.
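A quick way to spot the zombies such missing wait() calls leave behind (state codes match the reference list that follows):

```shell
# List zombie processes (state Z) with their parent PIDs;
# an empty result means none are present.
ps -eo stat,ppid,pid,comm | awk '$1 ~ /^Z/'
```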
Process States (Reference)
R – Running/Runnable: in the run queue.
D – Disk Sleep: uninterruptible I/O wait.
Z – Zombie: exited but not yet reaped by its parent.
S – Interruptible Sleep: waiting for an event.
I – Idle: kernel threads in uninterruptible sleep (does not count toward load).
T – Stopped/Traced: paused or being debugged.
X – Dead: already terminated.
CPU Performance Metrics
CPU usage – user, system, iowait, soft/hard IRQ, steal/guest.
Average load – ideal value equals number of logical CPUs; higher values indicate overload.
Process context switches – voluntary and involuntary; excessive switches waste CPU cycles.
CPU cache hit rate – a higher hit rate means better performance (L1/L2 caches are per core; L3 is shared across cores).
Performance Tools
Average load case: uptime → mpstat / pidstat to locate heavy processes.
Context‑switch case: vmstat → pidstat -w → pidstat -t for a thread‑level view.
High‑CPU process case: top → perf top → perf record / perf report -g.
High system‑CPU case: top (no obvious culprit) → re‑examine the run queue → execsnoop to catch short‑lived processes.
Uninterruptible/zombie case: monitor iowait, try strace (it may fail on D‑state processes), then use perf to trace sys_read() → new_sync_read / blkdev_direct_IO.
Soft‑IRQ case: top → /proc/softirqs → sar -n DEV → tcpdump to identify a SYN flood.
CPU Optimization
Application level: compiler flags (e.g., gcc -O2), algorithm improvements, asynchronous I/O, replacing processes with threads, effective caching.
System level: CPU affinity/binding, exclusive CPU allocation, priority adjustment with nice, cgroup limits, NUMA‑aware placement, interrupt load balancing (irqbalance).
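A few of those knobs as concrete commands (the PID and CPU lists are placeholders; taskset ships with util-linux and renice with most base systems, but verify on your distribution):

```shell
nice -n 10 sh -c 'echo "started at lower priority"'    # lower scheduling priority
taskset -c 0 sh -c 'echo "pinned to CPU 0"' \
  2>/dev/null || echo "taskset not installed"          # CPU affinity
# renice -n -5 -p 1234                                 # raise priority of PID 1234 (needs root)
```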
Understand TPS/QPS, concurrency, and response time relationships for capacity planning.
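Those quantities are tied together by Little's law: concurrency ≈ throughput × response time. A toy calculation (the numbers are illustrative):

```shell
# 500 requests/s at 20 ms per request needs about 10 requests in flight.
awk 'BEGIN { tps = 500; rt = 0.02; printf "concurrency needed: %.0f\n", tps * rt }'
```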
Memory
How Linux Memory Works
Memory Mapping
Only the kernel can access physical RAM directly. Each process receives an isolated, contiguous virtual address space. The kernel maintains page tables that map virtual pages to physical frames; the MMU translates addresses on the fly.
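The granularity of that translation is the page; its size is queryable from userspace:

```shell
getconf PAGE_SIZE   # translation unit in bytes, typically 4096
```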
Virtual Memory Layout
Read‑only segment – code and constants.
Data segment – global variables.
Heap – dynamically allocated memory, grows upward.
Memory‑mapped region – shared libraries, mmap files, grows downward.
Stack – local variables and call frames, typically 8 MiB.
Memory Allocation & Release
malloc uses two kernel mechanisms:
brk() for small allocations (< 128 KiB), which moves the program break.
mmap() for large allocations (≥ 128 KiB), which creates private anonymous mappings in the memory‑mapped region.
Both allocate only virtual memory; physical pages are committed on first access, via a page fault.
echo -16 > /proc/$(pidof PID)/oom_adj # lower the OOM score (oom_adj range: -17 to 15) to shield a process from the OOM killer

How to Observe Memory Usage
free – system‑wide memory statistics.
top / ps – per‑process VIRT, RES, SHR, %MEM.
Buffer = cache for raw disk blocks; Cache = cache for file data. Both accelerate reads and writes.
Cache Hit Rate
Cache hit rate = (requests served from cache) / (total requests). Higher rates mean better performance. Tools such as cachestat, cachetop, and pcstat expose hit statistics.
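The formula itself is simple arithmetic; with made-up numbers:

```shell
# 980 of 1000 reads served from the page cache.
awk 'BEGIN { hits = 980; total = 1000; printf "cache hit rate: %.1f%%\n", 100 * hits / total }'
```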
# Install Go and pcstat
export GOPATH=~/go
export PATH=~/go/bin:$PATH
go get golang.org/x/sys/unix
go get github.com/tobert/pcstat/pcstat

# dd cache test
dd if=/dev/sda1 of=file bs=1M count=512 # create a 512 MiB file
echo 3 > /proc/sys/vm/drop_caches # clear caches
pcstat file # verify the file is uncached
cachetop 5
dd if=file of=/dev/null bs=1M # first read – low hit rate
dd if=file of=/dev/null bs=1M # second read – high hit rate

O_DIRECT to Bypass Cache
cachetop 5
sudo docker run --privileged --name=app -itd feisky/app:io-direct
sudo docker logs app
# The test shows ~0.9 s per 32 MiB read, confirming direct I/O.
strace -p $(pgrep app) # shows openat with O_RDONLY|O_DIRECT

Memory Leak Detection
Memory leaks cause continuous growth of the process’s resident set, eventually exhausting RAM.
# Run a container that leaks memory
sudo docker run --name=app -itd feisky/app:mem-leak
vmstat 3
/usr/share/bcc/tools/memleak -a -p $(pidof app)
The output shows allocations from the fibonacci function that are never freed; fixing the code eliminates the leak.
Why Swap Grows
When memory is scarce, the kernel reclaims cache/buffers, swaps out anonymous pages, or kills processes via OOM. Swap stores rarely used pages on disk; they are read back on demand.
Swap Principle
Swap‑out – move inactive pages to disk, freeing RAM.
Swap‑in – read swapped pages back when accessed.
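Swap usage can be checked straight from procfs (assuming /proc/meminfo is available); the si/so columns of vmstat report the same movement as a per-second rate:

```shell
# Total vs. free swap, in kB.
awk '/^(SwapTotal|SwapFree)/ {print $1, $2, $3}' /proc/meminfo
```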
NUMA and Swap
On NUMA systems each node has local memory. Swap may increase even with free memory on other nodes. Use numactl --hardware to view per‑node usage.
Swappiness
/proc/sys/vm/swappiness (0–100) controls how aggressively the kernel uses swap. Higher values favor swapping out anonymous pages; lower values favor reclaiming the file cache instead.
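Reading and changing it (the write requires root; the value 10 is just an example):

```shell
cat /proc/sys/vm/swappiness           # current value
# echo 10 > /proc/sys/vm/swappiness   # favor file cache over swapping
# sysctl vm.swappiness=10             # equivalent; add to /etc/sysctl.conf to persist
```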
Fast Memory Bottleneck Analysis
Typical workflow:
Run free and top to get a global view.
Use vmstat and pidstat over a period to spot trends.
Drill down with allocation analysis, cache/buffer inspection, or per‑process memory profiling.
Common optimization ideas:
Disable swap or lower swappiness.
Reduce dynamic allocations (memory pools, HugePages).
Leverage caches (in‑process buffers, external caches like Redis).
Apply cgroup limits to prevent runaway processes.
Adjust /proc/pid/oom_adj to protect critical services.
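A sketch of the cgroup idea, assuming cgroup v2 mounted at /sys/fs/cgroup (the group name and the 512M limit are illustrative; all commands require root, so they are shown commented out):

```shell
# mkdir /sys/fs/cgroup/app
# echo 512M > /sys/fs/cgroup/app/memory.max    # hard memory limit for the group
# echo <pid> > /sys/fs/cgroup/app/cgroup.procs # move a process into the group
```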
Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.