Master Linux Performance: Key Metrics, Tools, and Optimization Strategies
This guide explains Linux performance optimization: it defines key metrics such as throughput and latency, interprets load average, analyzes CPU context switches, memory management, and I/O behavior, and recommends practical tools and techniques, including vmstat, pidstat, perf, and dstat, to identify and resolve bottlenecks.
Linux Performance Optimization
Performance Optimization
Performance Indicators
High concurrency and fast response correspond to two core performance indicators: throughput and latency.
Application load perspective: directly affects end‑user experience.
System resource perspective: resource utilization, saturation, etc.
Performance problems arise when system resources hit a bottleneck while request processing is not fast enough to handle more requests. Performance analysis means finding the bottleneck in the application or system and trying to avoid or mitigate it.
Select metrics to evaluate application and system performance.
Set performance goals for applications and systems.
Conduct performance benchmark testing.
Locate bottlenecks through performance analysis.
Monitor performance and set alerts.
Different performance problems require different analysis tools. Below are common Linux performance tools and the types of performance issues they address.
Understanding "Average Load"
Average Load: the average number of processes in the runnable or uninterruptible state per unit time, i.e., the average number of active processes. It is not directly equivalent to CPU utilization.
Uninterruptible processes are in a kernel-mode critical section (for example, waiting on disk I/O). The uninterruptible state protects the process's interaction with hardware from being disrupted mid-operation.
When Is Average Load Reasonable?
In production, monitor average load over time. If the load shows a clear upward trend, investigate promptly. A common rule of thumb is that a load higher than 70% of the number of CPUs may indicate a problem.
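That rule of thumb is easy to express in code. The helper below is a hypothetical sketch, not from the original text; it compares the 1-minute load average against 70% of the CPU count:

```python
def load_is_high(loadavg_1min: float, n_cpus: int, threshold: float = 0.7) -> bool:
    """Flag a load average that exceeds `threshold` (70% by default) per CPU."""
    return loadavg_1min > n_cpus * threshold

# Example: on a 4-CPU machine, a load of 3.5 exceeds 4 * 0.7 = 2.8.
print(load_is_high(3.5, 4))   # True
print(load_is_high(2.0, 4))   # False

# On Linux, the live values come from the standard library:
# import os; one, five, fifteen = os.getloadavg()
```

In production you would feed `os.getloadavg()` readings into such a check periodically rather than acting on a single sample.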
Average load is often confused with CPU utilization, but they are not equivalent:
CPU‑intensive processes raise both load and CPU usage.
I/O‑intensive processes raise load while CPU usage may stay low.
Heavy process scheduling raises both load and CPU usage.
High load can be caused by CPU-bound work, I/O contention, or a mix of both. Tools such as mpstat and pidstat help pinpoint the source.
CPU
CPU Context Switch (Upper)
CPU context switch saves the previous task's registers and program counter, then loads the new task's context and jumps to its entry point. The saved context resides in the kernel until the task is scheduled again.
Context switches are categorized by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. Transition from user to kernel mode occurs via a system call.
A system call performs two context switches:
Save user‑mode instruction pointer, load kernel‑mode instruction pointer, and jump to kernel code.
After the call returns, restore the saved user registers and resume user space.
System calls do not switch virtual memory or other user-space resources, so they differ from a full process switch and are often called privileged-mode switches.
A process switch, by contrast, happens only in kernel mode and must save the process's virtual memory and user stack in addition to its kernel state.
Switches happen when a process exhausts its CPU time slice, is blocked waiting for resources, voluntarily sleeps, is preempted by a higher-priority process, or when a hardware interrupt occurs.
Thread Context Switch
Thread switches come in two forms:
Threads within the same process share virtual memory; only thread‑private data and registers need to be switched.
Threads belonging to different processes require a full process‑level switch.
Switching between threads of the same process consumes fewer resources, which is why multithreading can be advantageous.
Interrupt Context Switch
Interrupt context switches involve only kernel‑mode state (CPU registers, kernel stack, hardware interrupt parameters) and do not affect user‑mode.
Interrupt handling has higher priority than processes, so interrupt and process context switches never occur simultaneously.
CPU Context Switch (Lower)
Use vmstat to view overall context-switch statistics:
vmstat 5 # output every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0
cs: context switches per second.
in: interrupts per second.
r: length of the runnable queue (processes running or waiting for a CPU).
b: number of processes in uninterruptible sleep.
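A vmstat data line like the one above can be mapped onto its column names programmatically. This is a small sketch with the field positions assumed from the header shown in the sample output:

```python
# Column names from the vmstat header line shown above.
VMSTAT_HEADER = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()

def parse_vmstat_line(line: str) -> dict:
    """Map one vmstat data line onto the column names from its header."""
    values = [int(v) for v in line.split()]
    return dict(zip(VMSTAT_HEADER, values))

row = parse_vmstat_line("1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0")
print(row["r"], row["b"], row["in"], row["cs"])   # 1 0 1 1
```

A monitoring script would apply this to each periodic sample and alert when cs or r trends upward.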
To see per-process details, use pidstat -w:
pidstat -w 5
Key fields:
cswch/s: voluntary switches per second.
nvcswch/s: involuntary switches per second.
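The same voluntary/involuntary split is exposed per process in /proc/&lt;pid&gt;/status. The sketch below parses the two counters from a sample snippet (the values are made up; real numbers come from the live file):

```python
# Hypothetical excerpt of /proc/<pid>/status; only the relevant lines shown.
SAMPLE_STATUS = """\
Name:\tapp
voluntary_ctxt_switches:\t150
nonvoluntary_ctxt_switches:\t7
"""

def ctxt_switches(status_text: str) -> dict:
    """Extract context-switch counters from /proc/<pid>/status content."""
    counters = {}
    for line in status_text.splitlines():
        if "ctxt_switches" in line:
            key, value = line.split(":")
            counters[key] = int(value)   # int() tolerates the tab whitespace
    return counters

print(ctxt_switches(SAMPLE_STATUS))
# {'voluntary_ctxt_switches': 150, 'nonvoluntary_ctxt_switches': 7}
```

Many voluntary switches suggest the process is waiting on resources; many involuntary ones suggest CPU contention.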
Example commands for deeper analysis:
pidstat -w -u 1 # identify short‑lived high‑CPU processes
perf top -g -p <PID> # locate hot functions in a process
strace -p <PID> # trace system calls (requires root)
CPU Performance Metrics
CPU utilization (user, system, iowait, soft/hard IRQ, steal/guest).
Average load (ideal value equals number of logical CPUs).
Process context switches (voluntary vs. involuntary).
CPU cache hit rate (L1/L2/L3).
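CPU utilization in the list above is derived from deltas of the jiffy counters on the `cpu` line of /proc/stat. A minimal sketch, using two hypothetical samples rather than the live file:

```python
def cpu_utilization(prev: list, curr: list) -> float:
    """Overall CPU utilization (%) between two /proc/stat 'cpu' samples.

    Each sample is the list of jiffy counters:
    user, nice, system, idle, iowait, irq, softirq, steal.
    Utilization = 1 - (delta of idle + iowait) / (delta of total).
    """
    idle_delta = (curr[3] + curr[4]) - (prev[3] + prev[4])
    total_delta = sum(curr) - sum(prev)
    return (1.0 - idle_delta / total_delta) * 100.0

# Two hypothetical samples taken one interval apart:
prev = [100, 0, 50, 800, 50, 0, 0, 0]
curr = [160, 0, 70, 850, 60, 0, 0, 0]
print(round(cpu_utilization(prev, curr), 1))   # 57.1
```

This is the same calculation top and mpstat perform internally, which is why utilization is always reported over an interval, never at an instant.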
Performance Tools
Average load case: uptime, then mpstat and pidstat to find the culprit.
Context-switch case: vmstat, then pidstat, then thread-level pidstat (pidstat -wt).
High CPU usage case: top, then perf top, then perf record/report.
System-wide high CPU case: examine top output, focus on processes in the Running state, then use perf to catch short-lived execs.
Uninterruptible/Zombie process case: monitor iowait, then use top, pidstat, strace, and perf to trace I/O paths.
Soft-IRQ case: top, then /proc/softirqs, then sar, then tcpdump to identify SYN flood attacks.
Choose tools based on the metric you need to investigate.
CPU Optimization
Application optimization: compiler flags (e.g., gcc -O2), algorithm improvements, asynchronous processing, replacing processes with threads, effective caching.
System optimization: CPU pinning, CPU isolation, priority adjustment with nice, cgroup limits, NUMA awareness, interrupt load balancing (e.g., irqbalance).
Throughput concepts: the difference between TPS, QPS, concurrency, and response time.
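These throughput terms are tied together by Little's law: average concurrency equals throughput (QPS) times average response time. A small sketch with hypothetical numbers:

```python
def concurrency(qps: float, avg_response_time_s: float) -> float:
    """Little's law: requests in flight = arrival rate * time in system."""
    return qps * avg_response_time_s

# 200 requests/s with a 50 ms average response time keeps ~10 requests in flight.
print(concurrency(200, 0.05))   # 10.0
```

The same relation works in reverse: if a service must sustain a given QPS, the formula gives the minimum concurrency (worker threads, connections) it needs at a given response time.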
Memory
How Linux Memory Works
Memory Mapping
Linux provides each process with an isolated virtual address space. The kernel maintains page tables that map virtual addresses to physical memory. Page faults trigger allocation of physical pages.
Virtual Memory Layout
Read‑only segment (code, constants).
Data segment (global variables).
Heap (dynamic allocation, grows upward).
Memory‑mapped region (shared libraries, shared memory, grows downward).
Stack (local variables, call context, fixed size, typically 8 MiB).
Memory Allocation & Release
Allocation
brk() for small allocations (< 128 KB), by moving the top of the heap (the program break).
mmap() for large allocations (> 128 KB), using anonymous memory mappings.
Both allocate virtual memory; physical pages are committed on first access (page fault).
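That lazy commitment can be sketched with Python's mmap module: the mapping below reserves virtual address space up front, and the kernel supplies zeroed physical pages only when the bytes are first touched:

```python
import mmap

SIZE = 1 << 20  # 1 MiB of anonymous, private memory (like a large malloc)

buf = mmap.mmap(-1, SIZE)   # fd = -1 -> anonymous mapping, not file-backed

# Virtual memory exists now, but physical pages are faulted in on first access:
buf[0] = 0x41           # touching page 0 triggers a page fault; page committed
buf[SIZE - 1] = 0x42    # touching the last page commits that page too

print(buf[0], buf[SIZE - 1])   # 65 66
buf.close()
```

Untouched pages in the middle of the mapping consume no physical memory, which is why a process's virtual size (VIRT) can far exceed its resident size (RES).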
Release
Cache reclamation (LRU of page cache).
Swap out rarely used pages.
OOM killer terminates memory-hogging processes (adjustable via /proc/*/oom_adj):
echo -16 > /proc/$(pidof XXX)/oom_adj
Viewing Memory Usage
free: overall system memory.
top/ps: per‑process memory (VIRT, RES, SHR, %MEM).
What are Buffer and Cache? Buffers cache raw block-device data, including filesystem metadata; Cache is the page cache, which holds file data read from or written to the filesystem.
Cache Hit Rate
Cache hit rate is the proportion of requests served directly from cache. Higher hit rates improve performance.
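The hit rate itself is just hits over total lookups. A tiny sketch with hypothetical counter values (tools like cachestat report the real numbers):

```python
def cache_hit_rate(hits: int, misses: int) -> float:
    """Fraction of lookups served from cache, as a percentage."""
    total = hits + misses
    if total == 0:
        return 0.0   # no lookups yet; avoid dividing by zero
    return hits / total * 100.0

# e.g. 75 hits and 25 misses -> 75% of reads never touched the disk.
print(cache_hit_rate(75, 25))   # 75.0
```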
Tools such as cachestat and cachetop (from the bcc toolkit) monitor cache behavior.
Memory Leak Detection
Use memleak from bcc to trace allocations that are never freed:
/usr/share/bcc/tools/memleak -a -p $(pidof app)
Identify the leaking function (e.g., a faulty fibonacci implementation) and add proper deallocation.
Swap Growth
When memory is tight, the kernel swaps out anonymous pages. Swap usage can be examined with free and swapon, and monitored over time via sar -r -S.
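Swap usage can also be read directly from /proc/meminfo. This sketch parses SwapTotal and SwapFree from a sample snippet (the values are hypothetical):

```python
# Hypothetical excerpt of /proc/meminfo; only the swap lines shown.
SAMPLE_MEMINFO = """\
SwapTotal:       2097148 kB
SwapFree:        1048574 kB
"""

def swap_used_kb(meminfo_text: str) -> int:
    """Swap in use = SwapTotal - SwapFree, both reported in kB."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, rest = line.split(":")
        fields[key] = int(rest.split()[0])   # drop the trailing 'kB' unit
    return fields["SwapTotal"] - fields["SwapFree"]

print(swap_used_kb(SAMPLE_MEMINFO))   # 1048574
```

Sampling this value periodically is an easy way to catch the gradual swap growth the text describes before it hurts latency.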
NUMA architectures may cause swap to grow even when total memory is abundant; analyze per-node memory with numactl --hardware and adjust /proc/sys/vm/swappiness accordingly.
Memory Performance Tools
Common tools include free, top, vmstat, pidstat, dstat, perf, and bcc utilities (cachestat, memleak, pcstat).
Quick Memory Bottleneck Analysis
Start with free and top for a high-level view.
Use vmstat and pidstat over time to spot trends.
Drill down with detailed tools (allocation tracing, cache analysis, per-process inspection).
Typical optimization ideas: disable swap or lower swappiness, use memory pools or HugePages, increase cache usage, apply cgroup limits, and adjust OOM scores for critical services.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.