Master Linux Performance: Key Metrics, Tools, and Optimization Strategies
This guide explains Linux performance optimization: it defines key metrics such as throughput and latency, interprets load average, analyzes CPU context switches, memory management, and I/O behavior, and recommends practical tools and techniques, including vmstat, pidstat, perf, and dstat, to identify and resolve bottlenecks.
Linux Performance Optimization
Performance Optimization
Performance Indicators
High concurrency and fast response correspond to two core performance indicators: throughput and latency.
Application load perspective: directly affects end‑user experience.
System resource perspective: resource utilization, saturation, etc.
Performance problems arise when system resources hit a bottleneck while request processing is not fast enough to handle more requests. Performance analysis means finding the bottleneck in the application or system and trying to avoid or mitigate it.
Select metrics to evaluate application and system performance.
Set performance goals for applications and systems.
Conduct performance benchmark testing.
Locate bottlenecks through performance analysis.
Monitor performance and set alerts.
Different performance problems require different analysis tools. Below are common Linux performance tools and the types of performance issues they address.
Understanding "Average Load"
Average Load: the average number of processes in the runnable or uninterruptible state per unit time, i.e., the average number of active processes. It is not directly equivalent to CPU utilization.
Uninterruptible processes are in a kernel-mode critical section (for example, waiting on disk I/O). The uninterruptible state protects the process's interaction with hardware from being disrupted mid-operation.
When Is Average Load Reasonable?
In production, monitor average load over time. If the load shows a clear upward trend, investigate promptly. A common rule of thumb is that a load higher than 70% of the number of CPUs may indicate a problem.
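That rule of thumb is easy to express in code. The helper below is a hypothetical sketch, not from the original text; it compares the 1-minute load average against 70% of the CPU count:

```python
def load_is_high(loadavg_1min: float, n_cpus: int, threshold: float = 0.7) -> bool:
    """Flag a load average that exceeds `threshold` (70% by default) per CPU."""
    return loadavg_1min > n_cpus * threshold

# Example: on a 4-CPU machine, a load of 3.5 exceeds 4 * 0.7 = 2.8.
print(load_is_high(3.5, 4))   # True
print(load_is_high(2.0, 4))   # False

# On Linux, the live values come from the standard library:
# import os; one, five, fifteen = os.getloadavg()
```

In production you would feed `os.getloadavg()` readings into such a check periodically rather than acting on a single sample.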
Average load is often confused with CPU utilization, but they are not equivalent:
CPU‑intensive processes raise both load and CPU usage.
I/O‑intensive processes raise load while CPU usage may stay low.
Heavy process scheduling raises both load and CPU usage.
High load can be caused by CPU-bound work, I/O contention, or a mix of both. Tools such as mpstat and pidstat help pinpoint the source.
CPU
CPU Context Switch (Upper)
CPU context switch saves the previous task's registers and program counter, then loads the new task's context and jumps to its entry point. The saved context resides in the kernel until the task is scheduled again.
Context switches are categorized by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. Transition from user to kernel mode occurs via a system call.
A system call performs two context switches:
Save user‑mode instruction pointer, load kernel‑mode instruction pointer, and jump to kernel code.
After the call returns, restore the saved user registers and resume user space.
System calls do not switch virtual memory or other user-space resources, so they differ from a full process switch and are often called privileged-mode switches.
A process switch, by contrast, happens only in kernel mode and must save the process's virtual memory and user stack in addition to its kernel state.
Switches happen when a process exhausts its CPU time slice, is blocked waiting for resources, voluntarily sleeps, is preempted by a higher-priority process, or when a hardware interrupt occurs.
Thread Context Switch
Thread switches come in two forms:
Threads within the same process share virtual memory; only thread‑private data and registers need to be switched.
Threads belonging to different processes require a full process‑level switch.
Switching between threads of the same process consumes fewer resources, which is why multithreading can be advantageous.
Interrupt Context Switch
Interrupt context switches involve only kernel‑mode state (CPU registers, kernel stack, hardware interrupt parameters) and do not affect user‑mode.
Interrupt handling has higher priority than processes, so interrupt and process context switches never occur simultaneously.
CPU Context Switch (Lower)
Use vmstat to view overall context-switch statistics:
vmstat 5 # output every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
cs: context switches per second.
in: interrupts per second.
r: length of the runnable queue (processes running or waiting for a CPU).
b: number of processes in uninterruptible sleep.
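A vmstat data line like the one above can be mapped onto its column names programmatically. This is a small sketch with the field positions assumed from the header shown in the sample output:

```python
# Column names from the vmstat header line shown above.
VMSTAT_HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

def parse_vmstat_line(line: str) -> dict:
    """Map one vmstat data line onto the column names from its header."""
    values = [int(v) for v in line.split()]
    return dict(zip(VMSTAT_HEADER, values))

row = parse_vmstat_line("1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0")
print(row["r"], row["b"], row["in"], row["cs"])   # 1 0 1 1
```

A monitoring script would apply this to each periodic sample and alert when cs or r trends upward.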
To see per-process details, use pidstat -w:
pidstat -w 5
Key fields:
cswch/s: voluntary switches per second.
nvcswch/s: involuntary switches per second.
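The same voluntary/involuntary split is exposed per process in /proc/&lt;pid&gt;/status. The sketch below parses the two counters from a sample snippet (the values are made up; real numbers come from the live file):

```python
# Hypothetical excerpt of /proc/<pid>/status; only the relevant lines shown.
SAMPLE_STATUS = """\
Name:\tapp
voluntary_ctxt_switches:\t150
nonvoluntary_ctxt_switches:\t7
"""

def ctxt_switches(status_text: str) -> dict:
    """Extract context-switch counters from /proc/<pid>/status content."""
    counters = {}
    for line in status_text.splitlines():
        if "ctxt_switches" in line:
            key, value = line.split(":")
            counters[key] = int(value)   # int() tolerates the tab whitespace
    return counters

print(ctxt_switches(SAMPLE_STATUS))
# {'voluntary_ctxt_switches': 150, 'nonvoluntary_ctxt_switches': 7}
```

Many voluntary switches suggest the process is waiting on resources; many involuntary ones suggest CPU contention.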
Example commands for deeper analysis:
pidstat -w -u 1 # identify short‑lived high‑CPU processes
perf top -g -p <PID> # locate hot functions in a process
strace -p <PID> # trace system calls (requires root)
CPU Performance Metrics
CPU utilization (user, system, iowait, soft/hard IRQ, steal/guest).
Average load (ideal value equals number of logical CPUs).
Process context switches (voluntary vs. involuntary).
CPU cache hit rate (L1/L2/L3).
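CPU utilization in the list above is derived from deltas of the jiffy counters on the `cpu` line of /proc/stat. A minimal sketch, using two hypothetical samples rather than the live file:

```python
def cpu_utilization(prev: list, curr: list) -> float:
    """Overall CPU utilization (%) between two /proc/stat 'cpu' samples.

    Each sample is the list of jiffy counters:
    user, nice, system, idle, iowait, irq, softirq, steal.
    Utilization = 1 - (delta of idle + iowait) / (delta of total).
    """
    idle_delta = (curr[3] + curr[4]) - (prev[3] + prev[4])
    total_delta = sum(curr) - sum(prev)
    return (1.0 - idle_delta / total_delta) * 100.0

# Two hypothetical samples taken one interval apart:
prev = [100, 0, 50, 800, 50, 0, 0, 0]
curr = [160, 0, 70, 850, 60, 0, 0, 0]
print(round(cpu_utilization(prev, curr), 1))   # 57.1
```

This is the same calculation top and mpstat perform internally, which is why utilization is always reported over an interval, never at an instant.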
Performance Tools
Average load case: uptime, then mpstat and pidstat to find the culprit.
Context-switch case: vmstat, then pidstat, then thread-level pidstat (pidstat -wt).
High CPU usage case: top, then perf top, then perf record/report.
System-wide high CPU case: examine top output, focus on processes in the Running state, then use perf to catch short-lived execs.
Uninterruptible/Zombie process case: monitor iowait, then use top, pidstat, strace, and perf to trace I/O paths.
Soft-IRQ case: top, then /proc/softirqs, then sar, then tcpdump to identify SYN flood attacks.
Choose tools based on the metric you need to investigate.
CPU Optimization
Application optimization: compiler flags (e.g., gcc -O2), algorithm improvements, asynchronous processing, replacing processes with threads, effective caching.
System optimization: CPU pinning, CPU isolation, priority adjustment with nice, cgroup limits, NUMA awareness, interrupt load balancing (e.g., irqbalance).
Throughput concepts: the difference between TPS, QPS, concurrency, and response time.
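These throughput terms are tied together by Little's law: average concurrency equals throughput (QPS) times average response time. A small sketch with hypothetical numbers:

```python
def concurrency(qps: float, avg_response_time_s: float) -> float:
    """Little's law: requests in flight = arrival rate * time in system."""
    return qps * avg_response_time_s

# 200 requests/s with a 50 ms average response time keeps ~10 requests in flight.
print(concurrency(200, 0.05))   # 10.0
```

The same relation works in reverse: if a service must sustain a given QPS, the formula gives the minimum concurrency (worker threads, connections) it needs at a given response time.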
Memory
How Linux Memory Works
Memory Mapping
Linux provides each process with an isolated virtual address space. The kernel maintains page tables that map virtual addresses to physical memory. Page faults trigger allocation of physical pages.
Virtual Memory Layout
Read‑only segment (code, constants).
Data segment (global variables).
Heap (dynamic allocation, grows upward).
Memory‑mapped region (shared libraries, shared memory, grows downward).
Stack (local variables, call context, fixed size, typically 8 MiB).
Memory Allocation & Release
Allocation
brk() for small allocations (< 128 KB), by moving the top of the heap (the program break).
mmap() for large allocations (> 128 KB), using anonymous memory mappings.
Both allocate virtual memory; physical pages are committed on first access (page fault).
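That lazy commitment can be sketched with Python's mmap module: the mapping below reserves virtual address space up front, and the kernel supplies zeroed physical pages only when the bytes are first touched:

```python
import mmap

SIZE = 1 << 20  # 1 MiB of anonymous, private memory (like a large malloc)

buf = mmap.mmap(-1, SIZE)   # fd = -1 -> anonymous mapping, not file-backed

# Virtual memory exists now, but physical pages are faulted in on first access:
buf[0] = 0x41           # touching page 0 triggers a page fault; page committed
buf[SIZE - 1] = 0x42    # touching the last page commits that page too

print(buf[0], buf[SIZE - 1])   # 65 66
buf.close()
```

Untouched pages in the middle of the mapping consume no physical memory, which is why a process's virtual size (VIRT) can far exceed its resident size (RES).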
Release
Cache reclamation (LRU of page cache).
Swap out rarely used pages.
OOM killer terminates memory-hogging processes (adjustable via /proc/*/oom_adj):
echo -16 > /proc/$(pidof XXX)/oom_adj
Viewing Memory Usage
free: overall system memory.
top/ps: per‑process memory (VIRT, RES, SHR, %MEM).
What are Buffer and Cache? Buffers cache raw block-device data, including filesystem metadata; Cache is the page cache, which holds file data read from or written to the filesystem.
Cache Hit Rate
Cache hit rate is the proportion of requests served directly from cache. Higher hit rates improve performance.
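The hit rate itself is just hits over total lookups. A tiny sketch with hypothetical counter values (tools like cachestat report the real numbers):

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache, as a percentage."""
    total = hits + misses
    if total == 0:
        return 0.0   # no lookups yet; avoid dividing by zero
    return hits / total * 100.0

# e.g. 75 hits and 25 misses -> 75% of reads never touched the disk.
print(cache_hit_rate(75, 25))   # 75.0
```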
Tools such as cachestat and cachetop (from the bcc toolkit) monitor cache behavior.
Memory Leak Detection
Use memleak from bcc to trace allocations that are never freed:
/usr/share/bcc/tools/memleak -a -p $(pidof app)
Identify the leaking function (e.g., a faulty fibonacci implementation) and add proper deallocation.
Swap Growth
When memory is tight, the kernel swaps out anonymous pages. Swap usage can be examined with free and swapon, and monitored over time via sar -r -S.
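Swap usage can also be read directly from /proc/meminfo. This sketch parses SwapTotal and SwapFree from a sample snippet (the values are hypothetical):

```python
# Hypothetical excerpt of /proc/meminfo; only the swap lines shown.
SAMPLE_MEMINFO = """\
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
"""

def swap_used_kb(meminfo_text: str) -> int:
    """Swap in use = SwapTotal - SwapFree, both reported in kB."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, rest = line.split(":")
        fields[key] = int(rest.split()[0])   # drop the trailing 'kB' unit
    return fields["SwapTotal"] - fields["SwapFree"]

print(swap_used_kb(SAMPLE_MEMINFO))   # 1048574
```

Sampling this value periodically is an easy way to catch the gradual swap growth the text describes before it hurts latency.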
NUMA architectures may cause swap to grow even when total memory is abundant; analyze per-node memory with numactl --hardware and adjust /proc/sys/vm/swappiness accordingly.
Memory Performance Tools
Common tools include free, top, vmstat, pidstat, dstat, perf, and bcc utilities (cachestat, memleak, pcstat).
Quick Memory Bottleneck Analysis
Start with free and top for a high-level view.
Use vmstat and pidstat over time to spot trends.
Drill down with detailed tools (allocation tracing, cache analysis, per-process inspection).
Typical optimization ideas: disable swap or lower swappiness, use memory pools or HugePages, increase cache usage, apply cgroup limits, and adjust OOM scores for critical services.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.