Unlock Linux Performance: Master Metrics, Tools, and Optimization Techniques
This guide explains Linux performance optimization by defining key metrics such as throughput, latency, average load, and CPU usage, describing how to select and interpret tools like vmstat, pidstat, perf, and dstat, and offering concrete steps to diagnose and fix CPU, memory, I/O, and context‑switch bottlenecks.
Linux Performance Optimization Overview
High concurrency and low latency are the usual goals, and they are measured by throughput and latency respectively. A performance problem exists when a system resource hits a bottleneck and request handling is still too slow to meet the goal.
Key Performance Indicators
Application load impact on user experience
System resource usage (utilization, saturation)
Typical Analysis Workflow
Select metrics and set performance goals.
Run benchmarks (e.g., ab -c 10 -n 100 http://host:port/).
Locate bottlenecks with broad‑scope tools (top, vmstat, pidstat).
Drill down using per‑process/thread tools (pidstat -w, pidstat -u, perf).
Apply targeted fixes (code changes, configuration, resource limits) and re‑measure.
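As a rough illustration of the first workflow step, the sketch below turns a list of per‑request durations into the two headline metrics, throughput and latency. The function name summarize and the nearest‑rank 95th percentile are assumptions made for this example; they are not taken from ab or any other benchmark tool.

```python
def summarize(latencies_s, wall_clock_s):
    """Return throughput (requests/s), mean latency and p95 latency (seconds).

    latencies_s  -- per-request durations in seconds (hypothetical input)
    wall_clock_s -- total duration of the benchmark run in seconds
    """
    if not latencies_s or wall_clock_s <= 0:
        raise ValueError("need at least one sample and a positive duration")
    ordered = sorted(latencies_s)
    n = len(ordered)
    # Nearest-rank percentile: ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * n + 99) // 100
    return {
        "throughput_rps": n / wall_clock_s,
        "mean_s": sum(ordered) / n,
        "p95_s": ordered[rank - 1],
    }


if __name__ == "__main__":
    # Made-up samples: 19 fast requests and one slow outlier over 2 seconds.
    print(summarize([0.010] * 19 + [0.200], wall_clock_s=2.0))
```

Reporting a percentile next to the mean is a deliberate choice here: a single slow request barely moves throughput but can dominate the mean.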
Understanding Average Load
Average load is the average number of runnable or uninterruptible processes over a time interval. It is not directly equivalent to CPU utilization. Uninterruptible (D) processes are typically waiting for I/O.
A practical rule of thumb: keep average load below 70% of the number of CPU cores. Sudden spikes indicate a potential bottleneck.
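The 70% rule of thumb can be checked with a few lines of Python. This is a minimal sketch: the helper names load_per_cpu and is_overloaded are made up for the example, and the 0.7 threshold is simply the rule quoted above, not a kernel constant.

```python
import os


def load_per_cpu(load_1m, cpu_count):
    """Average load normalised by the number of logical CPUs."""
    return load_1m / cpu_count


def is_overloaded(load_1m, cpu_count, threshold=0.7):
    """True when the per-CPU load exceeds the rule-of-thumb threshold."""
    return load_per_cpu(load_1m, cpu_count) > threshold


if __name__ == "__main__":
    one, five, fifteen = os.getloadavg()  # 1-, 5- and 15-minute averages
    cpus = os.cpu_count() or 1
    print(f"load {one:.2f} on {cpus} CPUs -> {load_per_cpu(one, cpus):.2f} per CPU")
    print("investigate" if is_overloaded(one, cpus) else "ok")
```

Comparing the 1‑, 5‑ and 15‑minute values also shows the trend: a 1‑minute figure well above the 15‑minute one suggests load is rising.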
CPU Metrics and Context Switching
Use vmstat 5 to view overall context‑switch and interrupt rates:
vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
...
Key fields:
cs: context switches per second (system‑wide total; vmstat does not split voluntary from involuntary)
in: interrupts per second
r: length of the run queue (processes running or waiting for a CPU)
b: blocked processes in uninterruptible sleep
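To work with these fields programmatically, the vmstat text can be parsed by column name. The sketch below assumes the default layout shown above, where the column‑name row starts with "r b"; parse_vmstat is a hypothetical helper, not part of any vmstat tooling.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
"""


def parse_vmstat(text):
    """Parse vmstat text output into a list of dicts keyed by column name."""
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    # The column-name row is the first one that starts with "r b".
    header = next((cols for cols in lines if cols[:2] == ["r", "b"]), None)
    if header is None:
        raise ValueError("no vmstat column header found")
    rows = []
    for cols in lines:
        # Keep only fully numeric rows with the same width as the header;
        # this skips the banner row and any repeated headers.
        if len(cols) == len(header) and all(c.isdigit() for c in cols):
            rows.append(dict(zip(header, map(int, cols))))
    return rows


if __name__ == "__main__":
    for row in parse_vmstat(SAMPLE):
        print(f"run queue={row['r']} blocked={row['b']} "
              f"interrupts/s={row['in']} ctx switches/s={row['cs']}")
```

Keying on the header row rather than fixed positions keeps the parser working if a vmstat version adds or reorders columns.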
For per‑process details, use pidstat -w 5 (context switches per process) or pidstat -w -u 1 (context switches together with CPU usage); add -t to break the figures down by thread. Example output:
pidstat -w 5
14:51:16 UID PID cswch/s nvcswch/s Command
14:51:21 0 1 0.80 0.00 systemd
14:51:21 0 6 1.40 0.00 ksoftirqd/0
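14:51:21 0 9 32.67 0.00 rcu_sched
cswch/s counts voluntary switches (the task gave up the CPU, typically to wait for a resource), while nvcswch/s counts involuntary ones (the scheduler preempted it). High cs together with a large r value often means excessive scheduling overhead.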
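The per‑process output can be ranked to find the tasks that switch most often. The following is a sketch under the assumption that the output has the column layout shown above; parse_pidstat_w and busiest are illustrative names, not pidstat features.

```python
SAMPLE = """\
14:51:16 UID PID cswch/s nvcswch/s Command
14:51:21 0 1 0.80 0.00 systemd
14:51:21 0 6 1.40 0.00 ksoftirqd/0
14:51:21 0 9 32.67 0.00 rcu_sched
"""


def parse_pidstat_w(text):
    """Parse `pidstat -w` text into dicts with pid, cswch, nvcswch, command."""
    rows, header = [], None
    for line in text.strip().splitlines():
        cols = line.split()
        if "cswch/s" in cols:          # column-name row
            header = cols
            continue
        if header is None or len(cols) < len(header) or cols[0] == "Average:":
            continue
        # Split at most len(header)-1 times so a command with spaces stays whole.
        rec = dict(zip(header, line.split(None, len(header) - 1)))
        try:
            rows.append({
                "pid": int(rec["PID"]),
                "cswch": float(rec["cswch/s"]),
                "nvcswch": float(rec["nvcswch/s"]),
                "command": rec["Command"].strip(),
            })
        except (KeyError, ValueError):
            continue                   # unexpected layout, skip the row
    return rows


def busiest(rows, n=3):
    """Return the n rows with the most context switches per second."""
    return sorted(rows, key=lambda r: r["cswch"] + r["nvcswch"], reverse=True)[:n]


if __name__ == "__main__":
    for row in busiest(parse_pidstat_w(SAMPLE)):
        print(row)
```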
CPU Usage Analysis
Quick view with top or ps. For deeper insight, use perf:
perf top -g -p <PID>
perf record -g -p <PID>
perf report -g
Example: a sysbench benchmark showed high CPU usage; perf top revealed the hot functions sqrt and add_function. Removing a stray test loop in sqrt dramatically improved Nginx throughput.
Diagnosing High System CPU When No Process Stands Out
When top shows high overall CPU but no single process dominates, examine the run‑queue length (r) and the number of processes in Running state. Short‑lived or repeatedly crashing processes can be missed. Use pstree -aps to trace parent‑child relationships and uncover hidden load sources.
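One way to count processes per state without relying on top's refresh interval is to read /proc directly. This is a minimal sketch that assumes a Linux /proc filesystem; state_from_stat and count_states are hypothetical helper names.

```python
import os
from collections import Counter


def state_from_stat(stat_line):
    """Extract the one-letter state (R, S, D, Z, ...) from a /proc/<pid>/stat line."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ")", so split after the LAST ")" instead of on whitespace.
    return stat_line[stat_line.rindex(")") + 1:].split()[0]


def count_states(proc_root="/proc"):
    """Count processes by state across all numeric entries under proc_root."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                counts[state_from_stat(fh.read())] += 1
        except (OSError, ValueError, IndexError):
            continue  # the process exited between listdir() and open()
    return counts


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        print(dict(count_states()))
```

Running this in a tight loop and watching the R count gives a crude view of short‑lived processes that appear and vanish between top refreshes.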
Memory Management Fundamentals
Linux gives each process an isolated virtual address space split into kernel and user regions. Physical memory is allocated on demand via page faults, and the MMU translates virtual to physical addresses using multi‑level page tables.
Typical virtual‑memory layout (low to high addresses):
Read‑only segment (code, constants)
Data segment (globals)
Heap (dynamic allocations via brk() for < 128 KB, mmap() for larger blocks)
Memory‑mapped region (shared libraries, mmap)
Stack (local variables, call frames)
Memory Allocation & Reclamation
Allocation: brk() expands the heap; freed memory stays cached for reuse instead of going straight back to the system. mmap() creates a new mapping; freed memory is returned to the system, so the next allocation triggers page faults again.
Reclamation: LRU cache eviction, swapping out rarely used pages, and the OOM killer for runaway processes. Adjust /proc/sys/vm/swappiness (0‑100) to control swap aggressiveness.
Monitoring Memory Usage
Common commands:
free – overall memory and swap.
top / ps – per‑process VIRT, RES, SHR, %MEM.
pidstat -r 1 10 – per‑process page‑fault rates and RSS.
pidstat -r 1 10
UID PID minflt/s majflt/s VSZ RSS %MEM Command
0 1 0.20 0.00 191256 3064 0.01 systemd
Detecting Memory Leaks
Use BCC's memleak to trace allocations that are never freed:
/usr/share/bcc/tools/memleak -a -p $(pidof app)
Identify the leaking function (e.g., a Fibonacci routine) and add the missing free() calls.
Swap and NUMA Considerations
When swap usage rises despite free memory, NUMA effects may be involved. View per‑node memory with numactl --hardware and control local vs. remote reclamation via /proc/sys/vm/zone_reclaim_mode (0 = allow remote reclamation, 1/2/4 = restrict to local).
Swap aggressiveness is tuned with /proc/sys/vm/swappiness. Even with swappiness=0, swap can occur if free memory + file pages fall below the high watermark.
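The swappiness=0 caveat boils down to a single comparison. The sketch below restates it as code; anon_reclaim_forced and read_swappiness are names invented for this example, and all quantities are assumed to be in pages for one memory zone.

```python
def anon_reclaim_forced(free_pages, file_pages, high_watermark_pages):
    """True when free + file-backed pages fall below the high watermark.

    This is the condition described above under which swapping can still
    happen even with swappiness set to 0.
    """
    return free_pages + file_pages < high_watermark_pages


def read_swappiness(path="/proc/sys/vm/swappiness"):
    """Return the current vm.swappiness value as an integer."""
    with open(path) as fh:
        return int(fh.read().strip())


if __name__ == "__main__":
    # Made-up page counts for illustration only.
    print(anon_reclaim_forced(free_pages=1000, file_pages=500,
                              high_watermark_pages=2000))
```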
Performance Tool Matrix
The matrix maps metrics (CPU, memory, I/O, latency, etc.) to appropriate tools such as top, vmstat, pidstat, perf, dstat, BCC utilities (memleak, cachetop), and others.
Practical Optimization Steps
Compile with gcc -O2 (or higher) for better code generation.
Prefer asynchronous I/O to avoid blocking.
Use multithreading instead of multiprocess to reduce context‑switch overhead.
Bind high‑priority services to specific CPUs (CPU affinity) and raise the nice value of less critical workloads so that they run at lower priority.
Apply cgroups to cap memory and CPU usage.
Enable NUMA‑aware memory allocation (e.g., numactl --cpunodebind together with --membind or --localalloc, so that a process runs on the node that holds its memory).
Balance interrupt handling across CPUs with irqbalance.
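The affinity and nice steps in the list above can also be applied from inside a process. This is a minimal Linux‑only sketch using the standard os module; the wrapper names pin_to_cpus and deprioritize are assumptions made for illustration.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict pid (0 = the calling process) to the given CPU ids.

    Only CPUs the process is already allowed to use are kept.
    Returns the affinity set that is in effect afterwards.
    """
    allowed = os.sched_getaffinity(pid)
    target = set(cpus) & allowed
    if not target:
        raise ValueError(f"none of {sorted(cpus)} is in the allowed set {sorted(allowed)}")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)


def deprioritize(increment=5):
    """Add increment to the calling process's nice value and return the new value.

    A higher nice value means a lower scheduling priority.
    """
    return os.nice(increment)


if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    print("pinned to", pin_to_cpus([first_cpu]))
    print("nice value is now", deprioritize(5))
```

The same effect is available from the shell with taskset and renice; doing it in code is mainly useful when a service wants to pin its own worker processes.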
Typical Analysis Workflow
Run broad tools ( top, vmstat, pidstat) to locate the symptom.
Drill down with pidstat -d, pidstat -w, or perf to pinpoint the offending process or thread.
If I/O is the bottleneck, check /proc/interrupts for interrupt distribution, use strace to see the system calls being issued, and use perf record -g to capture the call stacks behind them.
Apply targeted fixes (code changes, configuration tweaks, resource limits) and re‑measure.
CPU Performance Metrics
CPU usage: user, system, iowait, soft/hard IRQ, steal/guest.
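Average load – a value equal to the number of logical CPUs means every CPU is fully occupied; sustained values above roughly 70% of that count deserve investigation.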
Context switches – voluntary vs. involuntary; excessive switches waste CPU cycles.
CPU cache hit rate – higher is better (L1/L2 per core, L3 shared).
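The CPU usage categories above come from the cumulative counters on the first line of /proc/stat, and a percentage is the difference between two samples divided by the total elapsed ticks. The sketch below shows that arithmetic; parse_cpu_line and usage_percent are illustrative names, and the field order follows the documented /proc/stat layout.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line):
    """Turn the aggregate 'cpu ...' line of /proc/stat into a dict of tick counts."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))


def usage_percent(before, after):
    """Percentage of time spent in each category between two samples."""
    # guest and guest_nice are already counted inside user and nice,
    # so only the first eight fields go into the total.
    keys = FIELDS[:8]
    delta = {k: after.get(k, 0) - before.get(k, 0) for k in keys}
    total = sum(delta.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {k: 100.0 * v / total for k, v in delta.items()}


if __name__ == "__main__":
    import time
    with open("/proc/stat") as fh:
        first = parse_cpu_line(fh.readline())
    time.sleep(1)
    with open("/proc/stat") as fh:
        second = parse_cpu_line(fh.readline())
    print({k: round(v, 1) for k, v in usage_percent(first, second).items()})
```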
Memory Performance Metrics
Used / free memory, buffers, cache, swap.
Per‑process VIRT, RES, SHR, %MEM.
Page‑fault rates (minor vs. major).
Cache and buffer hit ratios (use BCC tools cachetop, cachestat).
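Most of the system‑wide memory figures listed above are exposed in /proc/meminfo, which is the same source free reads. A small parsing sketch follows; parse_meminfo, swap_used_kib and available_fraction are hypothetical helpers, and values are taken as the kB figures the file reports.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (kB where a unit is given)."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[key.strip()] = int(fields[0])
    return info


def swap_used_kib(info):
    """Swap currently in use."""
    return info["SwapTotal"] - info["SwapFree"]


def available_fraction(info):
    """Share of total memory the kernel estimates is available without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


if __name__ == "__main__":
    with open("/proc/meminfo") as fh:
        mem = parse_meminfo(fh.read())
    print(f"swap used: {swap_used_kib(mem)} kB, "
          f"available: {available_fraction(mem):.0%}, "
          f"buffers: {mem.get('Buffers', 0)} kB, cached: {mem.get('Cached', 0)} kB")
```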
Key Tools Overview
top / htop – real‑time CPU & memory.
vmstat – system‑wide CPU, memory, I/O, and interrupt statistics.
pidstat – per‑process/thread CPU, memory, I/O, and context‑switch metrics.
perf – hardware‑level profiling, hot‑function identification.
dstat – combined CPU, disk, network, and other resources.
BCC utilities ( memleak, cachetop, cachestat) – deep kernel‑space insight.
Example: High System CPU Without Visible Culprit
When top shows high CPU but no process dominates, check the run‑queue length (r) and look for many processes in Running state. Short‑lived or repeatedly crashing processes may be invisible to top. Use pstree -aps to find parent processes (e.g., a stress test launched by php‑fpm) and then profile the offending binary with perf.
Example: Direct I/O Causing High iowait
In a container running an app that opens its files with O_DIRECT, iowait spikes while CPU usage stays low. strace -p <PID> shows openat(..., O_RDONLY|O_DIRECT). Because O_DIRECT bypasses the page cache, every read goes to disk; removing the flag restores cache usage and reduces iowait.
Example: Memory Leak Detection
Run the leaking application in a Docker container, then execute:
/usr/share/bcc/tools/memleak -a -p $(pidof app)
The output points to the leaking function; adding the missing free() eliminates the leak.
Example: Swap Increase Diagnosis
Monitor swap with free and sar -r -S 1. If swap rises while buffers dominate memory, use cachetop to check cache hit rate. Low hit rate indicates heavy I/O; investigate the responsible process with pidstat -d and strace or perf record.
Quick Reference Commands
vmstat 2 – system snapshot every 2 seconds.
pidstat -w 5 – context switches per second.
pidstat -u 1 10 – CPU usage per process.
pidstat -r 1 10 – memory usage per process.
perf top -g -p <PID> – live hot‑function view.
strace -p <PID> – trace system calls.
numactl --hardware – NUMA node layout.
cat /proc/interrupts – interrupt distribution.
pstree -aps <PID> – process ancestry.
Optimization Checklist
Compile with optimization flags (e.g., -O2).
Use asynchronous I/O and event‑driven designs.
Prefer threads over processes to reduce context switches.
Set CPU affinity and appropriate nice values.
Apply cgroups limits for CPU and memory.
Enable NUMA‑aware allocations.
Balance interrupts with irqbalance.
Disable or minimize swap; tune swappiness if swap is required.
Monitor cache hit rates and avoid O_DIRECT unless necessary.
Detect and fix memory leaks early with BCC tools.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.