Master Linux Performance: Key Metrics, Tools, and Optimization Techniques
This guide explains Linux performance optimization by defining core metrics such as throughput, latency, and average load, describing how to select and benchmark indicators, outlining essential analysis tools like vmstat, pidstat, and perf, and providing practical CPU and memory tuning strategies to eliminate bottlenecks.
Linux Performance Optimization
Performance Optimization
Performance Indicators
High concurrency and fast response correspond to two core metrics: throughput and latency.
Application load perspective: directly impacts end‑user experience.
System resource perspective: resource utilization, saturation, etc.
The essence of a performance problem is that system resources have reached a bottleneck while request processing is still too slow to handle more traffic. Performance analysis is about locating the bottleneck and mitigating it.
Select metrics to evaluate application and system performance.
Set performance goals for the application and the system.
Conduct performance benchmark tests.
Analyze and locate bottlenecks.
Monitor and set alerts.
Different performance problems require different analysis tools. Below are common Linux performance tools and the problem types they address.
Understanding “Average Load”
Average load: the average number of runnable and uninterruptible processes per unit time, i.e., the average number of active processes. It is not directly tied to CPU utilization.
Uninterruptible processes are those in a kernel‑mode critical path (e.g., waiting for I/O). They represent a protection mechanism for processes and hardware.
When Is Average Load Reasonable?
In production, monitor average load over time. When a clear upward trend appears, investigate promptly, optionally setting a threshold (e.g., load > 70% of CPU count).
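The threshold rule above can be sketched in one command; the 70% figure and the choice of the 1-minute load are illustrative, assuming procfs is mounted:

```shell
# Compare the 1-minute load average against 70% of the CPU count
# (the 0.7 factor is a rule of thumb, not a kernel default).
awk -v cpus="$(nproc)" '{
  thr = cpus * 0.7
  status = ($1 > thr) ? "ALERT" : "OK"
  print status ": 1-min load " $1 " vs threshold " thr
}' /proc/loadavg
```

In a real deployment this check would run from a monitoring agent rather than ad hoc.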
In practice, average load is often confused with CPU usage. They are not equivalent:
CPU‑intensive processes raise both load and CPU usage.
I/O‑intensive processes raise load while CPU usage may stay low.
Processes waiting for CPU raise both load and CPU usage.
High load may be caused by CPU‑bound or I/O‑bound workloads; tools like mpstat and pidstat help identify the source.
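As a rough sketch of where those tools get their numbers: both read per-CPU tick counters from /proc/stat, so a quick user-versus-iowait comparison is possible with coreutils alone (field positions follow the proc(5) layout):

```shell
# /proc/stat "cpu" line: user nice system idle iowait irq softirq ...
# High iowait ticks relative to user ticks hints at an I/O-bound load.
awk '/^cpu / {print "user=" $2, "iowait=" $6}' /proc/stat
```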
CPU
CPU Context Switch (Upper)
A CPU context switch saves the previous task’s registers and program counter, then loads the new task’s context and jumps to its entry point. The saved context resides in the kernel until the task is rescheduled.
Context switches are classified by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. A system call performs two context switches:
Save user‑mode instruction pointer, load kernel‑mode pointer, and jump to kernel code.
After the call returns, restore the saved user registers and resume user execution.
This is a privileged‑mode switch, not a full process switch. A full process switch additionally requires the kernel to save the current process’s virtual memory and stack state before loading the next process’s context.
Process switches occur when the scheduler reallocates CPU time, e.g., on time‑slice rotation, resource shortage, an explicit sleep, priority preemption, or hardware interrupt handling.
Thread Context Switch
Two cases:
Threads belong to the same process – only thread‑private data and registers change; virtual memory stays the same.
Threads belong to different processes – same cost as a process switch.
Switching between threads of the same process consumes fewer resources, which is why multithreading is advantageous.
Interrupt Context Switch
Interrupt switches involve only kernel‑mode state (CPU registers, kernel stack, hardware interrupt parameters). Interrupt handling has higher priority than process execution, so interrupt and process switches never occur simultaneously.
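Interrupt activity is visible directly in procfs; a quick look at the per-CPU hard-interrupt counters (soft IRQs live in /proc/softirqs):

```shell
# Header row lists CPUs; each subsequent row is one interrupt source
# with per-CPU delivery counts.
head -5 /proc/interrupts
```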
CPU Context Switch (Lower)
Use vmstat to view overall context‑switch statistics:
vmstat 5 # output a data set every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
0 0 0 103388 145412 511076 0 0 0 2 450 1176 1 1 99 0 0
cs – context switches per second.
in – interrupts per second.
r – length of the run queue (processes ready or running).
b – processes in uninterruptible sleep.
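The cs column is derived from the cumulative ctxt counter in /proc/stat; a minimal sketch that samples it over one second (assuming procfs):

```shell
# Read the context-switch counter twice, one second apart, and diff.
c1=$(awk '/^ctxt/ {print $2}' /proc/stat)
sleep 1
c2=$(awk '/^ctxt/ {print $2}' /proc/stat)
echo "context switches/s: $((c2 - c1))"
```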
To inspect per‑process switches, use pidstat -w:
pidstat -w 5
Typical output shows voluntary (cswch) and involuntary (nvcswch) switches per second.
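pidstat reads these per-process counters from procfs; the same numbers can be pulled directly, here for the current shell:

```shell
# voluntary_ctxt_switches: the process gave up the CPU (e.g., blocked on I/O).
# nonvoluntary_ctxt_switches: the scheduler preempted it.
grep ctxt_switches /proc/$$/status
```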
CPU Usage Scenarios
When an application’s CPU usage hits 100%:
sudo docker run --name nginx -p 10000:80 -itd feisky/nginx
sudo docker run --name phpfpm -itd --network container:nginx feisky/php-fpm
ab -c 10 -n 100 http://IP:10000/ # basic load test
Increasing the request count reveals which processes drive the load.
top shows the PHP‑FPM processes spiking CPU.
perf top -g -p PID # locate hot functions
In the example, a stray sqrt call inside a million‑iteration loop caused the spike; removing it restored performance.
If CPU usage is high but no process appears responsible, examine the run queue. A large number of processes in the Running state can indicate short‑lived processes or crash‑restart loops. Use pstree to find parent processes, then check the source code for missing wait() calls or signal handling.
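A quick way to spot the zombies such missing wait() calls leave behind (state codes match the reference list that follows):

```shell
# List zombie processes (state Z) with their parent PIDs;
# an empty result means none are present.
ps -eo stat,ppid,pid,comm | awk '$1 ~ /^Z/'
```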
Process States (Reference)
R – Running/Runnable: in the run queue.
D – Disk Sleep: uninterruptible I/O wait.
Z – Zombie: exited but not yet reaped by its parent.
S – Interruptible Sleep: waiting for an event.
I – Idle: kernel threads in uninterruptible sleep (does not count toward load).
T – Stopped/Traced: paused or being debugged.
X – Dead: already terminated.
CPU Performance Metrics
CPU usage – user, system, iowait, soft/hard IRQ, steal/guest.
Average load – ideal value equals number of logical CPUs; higher values indicate overload.
Process context switches – voluntary and involuntary; excessive switches waste CPU cycles.
CPU cache hit rate – a higher hit rate means better performance (L1/L2 caches are per core; L3 is shared across cores).
Performance Tools
Average load case: uptime → mpstat / pidstat to locate heavy processes.
Context‑switch case: vmstat → pidstat -w → pidstat -t for a thread‑level view.
High‑CPU process case: top → perf top → perf record / perf report -g.
High system‑CPU case: top (no obvious culprit) → re‑examine the run queue → execsnoop to catch short‑lived processes.
Uninterruptible/zombie case: monitor iowait, try strace (it may fail on D‑state processes), then use perf to trace sys_read() → new_sync_read / blkdev_direct_IO.
Soft‑IRQ case: top → /proc/softirqs → sar -n DEV → tcpdump to identify a SYN flood.
CPU Optimization
Application level: compiler flags (e.g., gcc -O2), algorithm improvements, asynchronous I/O, replacing processes with threads, effective caching.
System level: CPU affinity/binding, exclusive CPU allocation, priority adjustment with nice, cgroup limits, NUMA‑aware placement, interrupt load balancing (irqbalance).
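A few of those knobs as concrete commands (the PID and CPU lists are placeholders; taskset ships with util-linux and renice with most base systems, but verify on your distribution):

```shell
nice -n 10 sh -c 'echo "started at lower priority"'    # lower scheduling priority
taskset -c 0 sh -c 'echo "pinned to CPU 0"' \
  2>/dev/null || echo "taskset not installed"          # CPU affinity
# renice -n -5 -p 1234                                 # raise priority of PID 1234 (needs root)
```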
Understand TPS/QPS, concurrency, and response time relationships for capacity planning.
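Those quantities are tied together by Little's law: concurrency ≈ throughput × response time. A toy calculation (the numbers are illustrative):

```shell
# 500 requests/s at 20 ms per request needs about 10 requests in flight.
awk 'BEGIN { tps = 500; rt = 0.02; printf "concurrency needed: %.0f\n", tps * rt }'
```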
Memory
How Linux Memory Works
Memory Mapping
Only the kernel can access physical RAM directly. Each process receives an isolated, contiguous virtual address space. The kernel maintains page tables that map virtual pages to physical frames; the MMU translates addresses on the fly.
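The granularity of that translation is the page; its size is queryable from userspace:

```shell
getconf PAGE_SIZE   # translation unit in bytes, typically 4096
```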
Virtual Memory Layout
Read‑only segment – code and constants.
Data segment – global variables.
Heap – dynamically allocated memory, grows upward.
Memory‑mapped region – shared libraries, mmap files, grows downward.
Stack – local variables and call frames, typically 8 MiB.
Memory Allocation & Release
malloc uses two kernel mechanisms:
brk() for small allocations (< 128 KiB), which moves the program break.
mmap() for large allocations (≥ 128 KiB), which creates private anonymous mappings in the memory‑mapped region.
Both allocate only virtual memory; physical pages are committed on first access, via a page fault.
echo -16 > /proc/$(pidof PID)/oom_adj # lower the OOM score (oom_adj range: -17 to 15) to shield a process from the OOM killer

How to Observe Memory Usage
free – system‑wide memory statistics.
top / ps – per‑process VIRT, RES, SHR, %MEM.
Buffer = cache for raw disk blocks; Cache = cache for file data. Both accelerate reads and writes.
Cache Hit Rate
Cache hit rate = (requests served from cache) / (total requests). Higher rates mean better performance. Tools such as cachestat, cachetop, and pcstat expose hit statistics.
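The formula itself is simple arithmetic; with made-up numbers:

```shell
# 980 of 1000 reads served from the page cache.
awk 'BEGIN { hits = 980; total = 1000; printf "cache hit rate: %.1f%%\n", 100 * hits / total }'
```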
# Install Go and pcstat
export GOPATH=~/go
export PATH=~/go/bin:$PATH
go get golang.org/x/sys/unix
go get github.com/tobert/pcstat/pcstat

# dd cache test
dd if=/dev/sda1 of=file bs=1M count=512 # create a 512 MiB file
echo 3 > /proc/sys/vm/drop_caches # clear caches
pcstat file # verify the file is uncached
cachetop 5
dd if=file of=/dev/null bs=1M # first read – low hit rate
dd if=file of=/dev/null bs=1M # second read – high hit rate

O_DIRECT to Bypass Cache
cachetop 5
sudo docker run --privileged --name=app -itd feisky/app:io-direct
sudo docker logs app
# The test shows ~0.9 s per 32 MiB read, confirming direct I/O.
strace -p $(pgrep app) # shows openat with O_RDONLY|O_DIRECT

Memory Leak Detection
Memory leaks cause continuous growth of the process’s resident set, eventually exhausting RAM.
# Run a container that leaks memory
sudo docker run --name=app -itd feisky/app:mem-leak
vmstat 3
/usr/share/bcc/tools/memleak -a -p $(pidof app)
The output shows allocations from the fibonacci function that are never freed; fixing the code eliminates the leak.
Why Swap Grows
When memory is scarce, the kernel reclaims cache/buffers, swaps out anonymous pages, or kills processes via OOM. Swap stores rarely used pages on disk; they are read back on demand.
Swap Principle
Swap‑out – move inactive pages to disk, freeing RAM.
Swap‑in – read swapped pages back when accessed.
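Swap usage can be checked straight from procfs (assuming /proc/meminfo is available); the si/so columns of vmstat report the same movement as a per-second rate:

```shell
# Total vs. free swap, in kB.
awk '/^(SwapTotal|SwapFree)/ {print $1, $2, $3}' /proc/meminfo
```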
NUMA and Swap
On NUMA systems each node has local memory. Swap may increase even with free memory on other nodes. Use numactl --hardware to view per‑node usage.
Swappiness
/proc/sys/vm/swappiness (0–100) controls how aggressively the kernel uses swap. Higher values favor swapping out anonymous pages; lower values favor reclaiming the file cache instead.
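Reading and changing it (the write requires root; the value 10 is just an example):

```shell
cat /proc/sys/vm/swappiness           # current value
# echo 10 > /proc/sys/vm/swappiness   # favor file cache over swapping
# sysctl vm.swappiness=10             # equivalent; add to /etc/sysctl.conf to persist
```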
Fast Memory Bottleneck Analysis
Typical workflow:
Run free and top to get a global view.
Use vmstat and pidstat over a period to spot trends.
Drill down with allocation analysis, cache/buffer inspection, or per‑process memory profiling.
Common optimization ideas:
Disable swap or lower swappiness.
Reduce dynamic allocations (memory pools, HugePages).
Leverage caches (in‑process buffers, external caches like Redis).
Apply cgroup limits to prevent runaway processes.
Adjust /proc/pid/oom_adj to protect critical services.
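A sketch of the cgroup idea, assuming cgroup v2 mounted at /sys/fs/cgroup (the group name and the 512M limit are illustrative; all commands require root, so they are shown commented out):

```shell
# mkdir /sys/fs/cgroup/app
# echo 512M > /sys/fs/cgroup/app/memory.max    # hard memory limit for the group
# echo <pid> > /sys/fs/cgroup/app/cgroup.procs # move a process into the group
```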
Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.