
Master Linux Performance: Key Metrics, Tools, and Optimization Strategies

This comprehensive guide explains Linux performance optimization by covering core metrics such as throughput and latency, the meaning of average load, CPU context switching, memory management, common bottlenecks, and a suite of diagnostic tools like vmstat, pidstat, perf, and strace, plus practical tuning techniques for both applications and the operating system.

MaGe Linux Operations

Performance Optimization

High concurrency and fast response are measured by two core performance indicators: throughput and latency.

From the application load perspective, performance directly impacts the end‑user experience; from the system resource perspective, it concerns resource utilization and saturation.

At its core, a performance problem means that some system resource has hit a bottleneck, so requests cannot be processed fast enough to take on more load. Performance analysis aims to locate these bottlenecks and relieve them.

A typical optimization workflow:

Choose metrics to evaluate application and system performance.

Set performance targets for applications and the system.

Conduct performance benchmark tests.

Analyze performance to locate bottlenecks.

Implement performance monitoring and alerts.

Different performance issues require different analysis tools. The sections below introduce the common Linux performance tools and the types of problems each one addresses.

Understanding "Average Load"

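Average load (often called the load average): the average number of processes in the runnable or uninterruptible state per unit time, i.e., the average number of active processes. uptime, top, and /proc/loadavg report it over the last 1, 5, and 15 minutes. It is not directly related to CPU utilization as traditionally understood.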

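Uninterruptible processes (state D in ps and top) are in kernel mode waiting on something, typically an I/O response from hardware. The uninterruptible state is a protection mechanism: it keeps the process from being interrupted by signals until the I/O completes, so the process and the hardware stay in a consistent state.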
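To make the definition concrete, the sketch below counts how many processes are runnable (ps state R) or uninterruptible (state D) in a single snapshot. It is only an approximation: the kernel counts tasks (threads) and smooths that count into 1-, 5-, and 15-minute averages, so a one-off count will not match uptime exactly. The helper name count_active is hypothetical.

```shell
# count_active (hypothetical helper): read ps state codes on stdin, one per
# line, and print how many start with R (runnable) or D (uninterruptible).
# This is a single snapshot, not the kernel's smoothed average.
count_active() {
    awk '$1 ~ /^[RD]/ { n++ } END { print n + 0 }'
}

# Example (Linux, procps ps):
#   ps -e -o stat= | count_active        # one line per process
#   ps -e -L -o stat= | count_active     # one line per thread, closer to the kernel's count
# List the uninterruptible processes themselves:
#   ps -e -o stat=,pid=,comm= | awk '$1 ~ /^D/'
```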

When Is Average Load Reasonable?

In production, monitor average load over time and compare it against historical trends; if load rises sharply, investigate promptly. A common rule of thumb is to start investigating once average load exceeds 70% of the number of CPUs.
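A minimal sketch of the 70% check, assuming the 1-minute figure from /proc/loadavg and the CPU count from nproc. The function load_ratio and its WARN marker are hypothetical, and the threshold is the rule of thumb above rather than a hard limit.

```shell
# load_ratio LOAD CPUS (hypothetical helper): print load divided by the CPU
# count with two decimals, followed by " WARN" when it exceeds 0.7.
# On Linux you would typically call:
#   load_ratio "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
load_ratio() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        ratio = load / cpus
        printf "%.2f", ratio
        if (ratio > 0.7) printf " WARN"   # 70% rule of thumb
        printf "\n"
    }'
}
```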

CPU‑intensive processes raise load and CPU usage simultaneously.

I/O‑intensive processes raise load without necessarily increasing CPU usage.

Many processes waiting to be scheduled on the CPU also raise both load and CPU usage.

High load may stem from CPU‑bound or I/O‑bound workloads; tools like mpstat and pidstat help identify the source.

CPU

CPU Context Switch (Part 1)

A CPU context switch saves the previous task's CPU context (CPU registers and program counter), loads the new task's context, and then jumps to the address in the new program counter to run the new task. The saved context is kept in the kernel until the task is scheduled again.

Context switches are categorized by task type:

Process context switch

Thread context switch

Interrupt context switch

Process Context Switch

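Linux separates kernel space and user space, and a process moves from user space to kernel space through a system call. A single system call actually performs two CPU context switches: the CPU saves the user-mode registers and loads the kernel-mode ones to run kernel code, and when the call returns it restores the saved user-mode registers and goes back to user space.

Unlike a process switch, a system call does not switch virtual memory or other user-space resources and stays within the same process, so it is usually called a privilege mode switch rather than a context switch.

Processes are managed and scheduled by the kernel, so process switches happen only in kernel mode. A process switch therefore has one more step than a system call: before saving the current process's kernel state and CPU registers, the kernel must save its user-space resources (virtual memory, stack, and so on), and after loading the next process's kernel state it must also refresh that process's virtual memory and user stack.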

Thread Context Switch

Thread switches come in two forms:

Threads within the same process share virtual memory; only thread‑specific data and registers need to be switched.

Threads belonging to different processes require a full process‑level switch.

Switching between threads of the same process consumes fewer resources, which is a key advantage of multithreading.

Interrupt Context Switch

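An interrupt context switch does not touch the interrupted process's user-space state; it involves only the kernel-mode state the interrupt handler needs (CPU registers, kernel stack, hardware interrupt parameters). On a given CPU, interrupt handling has higher priority than any process, so an interrupt context switch never happens at the same time as a process context switch.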

CPU Context Switch (Part 2)

Use vmstat to view overall context‑switch statistics:

vmstat 5          # output every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
 ...

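Key fields: cs is context switches per second; in is interrupts per second; r is the run-queue length (processes running or waiting for a CPU); b is the number of processes in uninterruptible sleep. The first line of output reports averages since boot; each later line covers one interval.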
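The cs column is a rate derived from the cumulative ctxt counter in /proc/stat. The sketch below samples that counter twice to compute the same system-wide rate; ctxt_total and ctxt_rate are hypothetical helper names, and whole-second intervals are assumed.

```shell
# ctxt_total [FILE] (hypothetical helper): print the cumulative number of
# context switches since boot from the "ctxt" line of a /proc/stat-style file.
ctxt_total() {
    awk '$1 == "ctxt" { print $2 }' "${1:-/proc/stat}"
}

# ctxt_rate [SECONDS] (hypothetical helper): sample the counter twice and
# print the average context switches per second over the interval.
ctxt_rate() {
    local interval=${1:-1} before after
    before=$(ctxt_total)
    sleep "$interval"
    after=$(ctxt_total)
    echo $(( (after - before) / interval ))
}
```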

To see per‑process details, use pidstat -w:

pidstat -w 5
14:51:16   UID   PID  cswch/s nvcswch/s  Command
14:51:21     0     1   0.80      0.00  systemd
 ...

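cswch/s counts voluntary context switches, which happen when a process cannot get a resource it needs (for example, it is waiting for I/O or memory); a high rate points to resource waits. nvcswch/s counts involuntary context switches, where the scheduler forcibly takes the CPU away, for example when the time slice expires; a high rate points to many processes competing for the CPU.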
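pidstat -w reports rates of the two per-process counters the kernel exposes in /proc/PID/status: voluntary_ctxt_switches and nonvoluntary_ctxt_switches. This sketch reads the raw cumulative values for one PID; ctxt_switches is a hypothetical name, and its optional second argument exists only to make it testable.

```shell
# ctxt_switches PID [FILE] (hypothetical helper): print
# "voluntary nonvoluntary" cumulative context-switch counts for a process.
ctxt_switches() {
    awk '/^voluntary_ctxt_switches:/    { v = $2 }
         /^nonvoluntary_ctxt_switches:/ { nv = $2 }
         END { printf "%.0f %.0f\n", v, nv }' "${2:-/proc/$1/status}"
}

# Per-thread counts live in /proc/PID/task/TID/status
# (pidstat -wt shows the same breakdown as rates).
```

Sampling it twice and dividing the difference by the interval gives the same per-second figures pidstat prints.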

CPU Performance Indicators

CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).

Average load (ideally equal to the number of logical CPUs, meaning every CPU is fully used with no excess queuing).

Process context switches (voluntary vs. involuntary).

CPU cache hit rate (higher is better).
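CPU usage percentages such as those in top are computed from the cumulative time counters on the cpu line of /proc/stat, sampled twice. The sketch below shows that calculation under the simplifying assumption that idle plus iowait counts as "not busy"; cpu_busy_pct is a hypothetical helper.

```shell
# cpu_busy_pct LINE1 LINE2 (hypothetical helper): percentage of non-idle CPU
# time between two "cpu ..." lines sampled from /proc/stat.
# Fields 2-9 are user, nice, system, idle, iowait, irq, softirq, steal.
# Guest time is already included in user/nice, so it is left out of the total.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9; i++) total += $i
            t[NR] = total
            idle[NR] = $5 + $6          # idle + iowait count as not busy
        }
        END {
            dt = t[2] - t[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (dt - (idle[2] - idle[1])) / dt
        }'
}

# Example (Linux):
#   a=$(grep '^cpu ' /proc/stat); sleep 1; b=$(grep '^cpu ' /proc/stat)
#   cpu_busy_pct "$a" "$b"
```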

Performance Tools

Average load: uptime, then mpstat and pidstat to pinpoint processes.

Context switch analysis: vmstat → pidstat → thread-level pidstat.

High process CPU usage: top → perf top → identify hot functions.

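High system CPU usage without an obvious culprit: re-examine top and focus on processes in the Running state, which may be short-lived processes that exit before they can be observed; then use perf record/report or execsnoop (which traces newly started processes).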

Uninterruptible and zombie processes: top → iowait spikes → strace (often fails) → perf → locate direct I/O calls.

Soft-IRQ spikes: top → /proc/softirqs → sar → tcpdump to identify attacks.

Choose tools based on the metric you need to investigate.
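For the soft-IRQ path, the counters in /proc/softirqs are cumulative per CPU, so what matters is how fast a row grows. This sketch sums one row across all CPUs so two samples can be subtracted; softirq_total is a hypothetical helper, and NET_RX is just an example row name.

```shell
# softirq_total NAME [FILE] (hypothetical helper): sum the per-CPU counts of
# one softirq type (e.g. NET_RX) from a /proc/softirqs-style file.
softirq_total() {
    awk -v name="$1:" '$1 == name {
        s = 0
        for (i = 2; i <= NF; i++) s += $i
        printf "%.0f\n", s
    }' "${2:-/proc/softirqs}"
}

# Rate over one second (Linux):
#   a=$(softirq_total NET_RX); sleep 1; b=$(softirq_total NET_RX)
#   echo "NET_RX per second: $((b - a))"
# `watch -d cat /proc/softirqs` highlights the rows that change fastest.
```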

CPU Optimization

Application level: compiler optimization flags (e.g., gcc -O2), better algorithms, asynchronous processing, threads instead of processes, and caching.

System level: CPU binding (affinity), adjusting process priority with nice, resource limits with cgroups, NUMA-aware placement, and interrupt load balancing (irqbalance).
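CPU binding is usually done with taskset from util-linux, which accepts either a CPU list (-c) or a hexadecimal mask in which bit N stands for CPU N. The hypothetical helper below builds such a mask; it assumes CPU numbers below 63 so the mask fits in shell arithmetic.

```shell
# cpus_to_mask CPU... (hypothetical helper): print the hex affinity mask
# with one bit set per listed CPU (bit N = CPU N), as taskset expects.
cpus_to_mask() {
    local mask=0 cpu
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '0x%x\n' "$mask"
}

# Usage (Linux, util-linux taskset; PID and ./app are placeholders):
#   taskset -p "$(cpus_to_mask 0 2)" PID   # pin a running process to CPUs 0 and 2
#   taskset -c 0,2 ./app                    # same binding, using a CPU list
#   nice -n 10 ./app                        # start with a lower scheduling priority
```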

QPS and TPS: QPS (queries per second) = concurrency / average response time; TPS counts transactions per second.
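As a worked example of the QPS formula (Little's law rearranged), 100 in-flight requests with an average response time of 50 ms give 100 / 0.05 = 2,000 QPS. The helper qps below is hypothetical and assumes the response time is given in seconds.

```shell
# qps CONCURRENCY AVG_RESPONSE_SECONDS (hypothetical helper):
# throughput estimate = concurrency / average response time, rounded.
qps() {
    awk -v c="$1" -v rt="$2" 'BEGIN { printf "%.0f\n", c / rt }'
}

# qps 100 0.05   -> 2000
```

The relation also works in reverse: to raise QPS at fixed concurrency, response time has to fall.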

Memory

How Linux Memory Works

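Only the kernel can access physical memory directly. Each process gets its own isolated, contiguous virtual address space, which the kernel maps to physical pages through page tables. The page tables live in memory; the CPU's memory management unit (MMU) walks them to translate addresses and caches recent translations in the TLB.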

When a virtual address has no entry in the page table, a page fault occurs: the kernel allocates a physical page, updates the page table, and then resumes the process.

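To reduce this overhead, Linux uses multi-level page tables (which only need entries for regions that are actually mapped) and HugePages (fewer, larger pages, so fewer page-table entries and TLB misses).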
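To illustrate the multi-level lookup, the sketch below splits a virtual address the way x86-64 4-level paging with 4 KiB pages does: four 9-bit table indices (the kernel calls the levels PGD, PUD, PMD, and PTE) plus a 12-bit page offset. This is specific to that configuration; other architectures, 5-level paging, and huge pages decode differently (a 2 MiB huge page, for example, is mapped at the PMD level and uses the low 21 bits as its offset). split_vaddr is a hypothetical name.

```shell
# split_vaddr ADDR (hypothetical helper): decode an x86-64 virtual address
# under 4-level paging with 4 KiB pages into per-level table indices.
split_vaddr() {
    local addr=$(( $1 ))
    printf 'pgd=%d pud=%d pmd=%d pte=%d offset=0x%x\n' \
        $(( (addr >> 39) & 0x1ff )) \
        $(( (addr >> 30) & 0x1ff )) \
        $(( (addr >> 21) & 0x1ff )) \
        $(( (addr >> 12) & 0x1ff )) \
        $(( addr & 0xfff ))
}

# Example: split_vaddr 0x7ffd12345678
```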

Virtual Memory Layout

Read‑only segment (code, constants)

Data segment (global variables)

Heap (dynamic allocations, grows upward)

Memory‑mapped region (shared libraries, mmap‑ed files, grows downward)

Stack (local variables and call context; its maximum size is fixed, typically 8 MiB, as shown by ulimit -s)
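You can inspect these regions for a live process in /proc/PID/maps (or with pmap PID). The hypothetical helper below extracts the [heap] and [stack] ranges and prints their current sizes; the optional file argument exists only for testing.

```shell
# region_sizes PID [FILE] (hypothetical helper): print the current size in KiB
# of the [heap] and [stack] regions listed in /proc/PID/maps.
region_sizes() {
    awk '$6 == "[heap]" || $6 == "[stack]" {
        split($1, r, "-")          # "start-end" hex address range
        print $6, r[1], r[2]
    }' "${2:-/proc/$1/maps}" |
    while read -r name start end; do
        echo "$name $(( (0x$end - 0x$start) / 1024 )) KiB"
    done
}

# Example (Linux): region_sizes $$
```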

Memory Allocation & Reclamation

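Allocation: with glibc's malloc(), small requests (below 128 KiB by default) are served by brk(), which moves the top of the heap; freed memory is kept in the allocator's cache for reuse rather than returned to the system. Large requests (128 KiB or more) are served by mmap(), which creates an anonymous mapping in the memory-mapped region; freed memory is returned to the system immediately.

Caching brk memory avoids repeated page faults, but because it is not returned right away it can fragment when memory is busy. mmap memory is released on free(), so every new mmap allocation triggers fresh page faults, which is why mmap is used only for large requests.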

Reclamation when memory is tight:

Cache reclamation: reclaim the least recently used (LRU) pages.

Swap out rarely used pages.

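OOM killer terminates the process with the highest oom_score.

To make a critical process less likely to be killed, lower its OOM adjustment (XXX is a placeholder for the process name; requires root):

echo -16 > /proc/$(pidof -s XXX)/oom_adj    # legacy interface, range -17 to 15; -17 disables OOM killing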
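oom_adj is the legacy interface; newer kernels expose oom_score_adj (range -1000 to 1000) and, per the kernel's proc documentation, scale oom_adj writes linearly onto it. The sketch below reproduces that scaling as we understand the kernel code (15 maps to 1000; every other value is multiplied by 1000/17 with C-style truncation), which shows, for instance, that oom_adj -16 corresponds to oom_score_adj -941. oom_adj_to_score_adj is a hypothetical name.

```shell
# oom_adj_to_score_adj VALUE (hypothetical helper): convert a legacy oom_adj
# value (-17..15) to the equivalent oom_score_adj (-1000..1000).
oom_adj_to_score_adj() {
    if [ "$1" -eq 15 ]; then
        echo 1000                      # maximum stays attainable
    else
        echo $(( $1 * 1000 / 17 ))     # shell division truncates toward zero, like C
    fi
}

# Apply through the current interface (XXX is a placeholder; requires root):
#   echo "$(oom_adj_to_score_adj -16)" > /proc/$(pidof -s XXX)/oom_score_adj
# Check the resulting badness score:
#   cat /proc/$(pidof -s XXX)/oom_score
```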

Viewing Memory Usage

free: overall system memory.

top / ps: per-process memory (VIRT, RES, SHR, %MEM).

Understanding Buffers vs. Cache

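Buffers are a cache of raw disk blocks (data read from or written to the block device directly, including filesystem metadata); Cache holds file data read or written through the filesystem. Both serve read and write requests.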
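Both values come from /proc/meminfo, which free also reads. The sketch below prints total, available, buffers, and page cache in MiB; mem_summary is a hypothetical helper. In recent procps-ng versions, free's buff/cache column also includes reclaimable slab (SReclaimable), so it can be larger than Buffers plus Cached.

```shell
# mem_summary [FILE] (hypothetical helper): print key /proc/meminfo fields
# (reported there in kB) converted to MiB.
mem_summary() {
    awk '$1 == "MemTotal:"     { t = $2 }
         $1 == "MemAvailable:" { a = $2 }
         $1 == "Buffers:"      { b = $2 }
         $1 == "Cached:"       { c = $2 }   # exact match, so SwapCached is ignored
         END { printf "total=%dMiB available=%dMiB buffers=%dMiB cached=%dMiB\n",
                      t / 1024, a / 1024, b / 1024, c / 1024 }' "${1:-/proc/meminfo}"
}
```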

Memory Leak Detection

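Run the workload while watching vmstat: free memory keeps decreasing while buffers and cache stay roughly stable. Then use BCC's memleak to trace allocations that are never freed (app is a placeholder for the process name):

/usr/share/bcc/tools/memleak -a -p $(pidof -s app)

Identify the leaking call stacks (in the original example, a Fibonacci function) and add the missing free() calls.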
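Before reaching for memleak, a cheap first check is to sample the suspect process's resident set size over time. The hypothetical helper watch_rss below prints VmRSS (in kB) from /proc/PID/status at a fixed interval; steady growth under a constant workload is a hint of a leak, not proof.

```shell
# watch_rss PID [INTERVAL] [COUNT] (hypothetical helper): print the
# process's VmRSS in kB COUNT times, INTERVAL seconds apart.
watch_rss() {
    local pid=$1 interval=${2:-5} count=${3:-3} i=0
    while [ "$i" -lt "$count" ]; do
        awk '$1 == "VmRSS:" { print $2 }' "/proc/$pid/status"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example (Linux; app is a placeholder): watch_rss "$(pidof -s app)" 10 6
```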

Why Swap Increases

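When RAM is scarce, the kernel swaps out anonymous pages, which have no backing file; file-backed pages can simply be dropped from the cache (after being written back if they are dirty). Swap usage can also rise on NUMA systems when one node runs out of local memory, even if other nodes still have free memory.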

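Reclaim, and with it swapping, is governed by three per-zone watermarks: pages_min, pages_low, and pages_high (listed as min, low, and high in /proc/zoneinfo). When free memory falls below pages_low, kswapd starts reclaiming in the background until free memory rises above pages_high; below pages_min, the remaining memory is reserved for the kernel and ordinary allocations must reclaim memory directly. The swappiness parameter (/proc/sys/vm/swappiness, default 60, range 0-100; newer kernels accept up to 200) sets how aggressively the kernel reclaims anonymous pages through swap relative to dropping file cache.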
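To see the watermarks that drive reclaim on a given machine, the sketch below prints free pages and the min, low, and high watermarks for each zone in /proc/zoneinfo. zone_watermarks is a hypothetical helper, and all values are in pages, not bytes.

```shell
# zone_watermarks [FILE] (hypothetical helper): one line per zone with
# free pages and the min/low/high watermarks from /proc/zoneinfo.
zone_watermarks() {
    awk '$1 == "Node" { n = $2; sub(/,/, "", n); zone = "node" n "/" $4 }
         $1 == "pages" && $2 == "free" { free = $3 }
         $1 == "min"  { min = $2 }
         $1 == "low"  { low = $2 }
         $1 == "high" { printf "%s free=%s min=%s low=%s high=%s\n", zone, free, min, low, $2 }' \
        "${1:-/proc/zoneinfo}"
}

# Check or lower swappiness for the running kernel (root required):
#   cat /proc/sys/vm/swappiness
#   sysctl -w vm.swappiness=10
```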

Analyzing Swap Growth

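# create and enable an 8 GiB swap file (requires root)
fallocate -l 8G /mnt/swapfile
chmod 600 /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile

# simulate heavy disk reads that fill the buffer cache (run in a separate terminal)
dd if=/dev/sda1 of=/dev/null bs=1G count=2048
# report memory (-r) and swap (-S) usage every second
sar -r -S 1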

Watch %memused, kbbuffers, and kbcached (from -r) alongside kbswpused (from -S) to see whether buffer growth from the I/O is pushing pages out to swap.

Memory Performance Tools

Common tools include free, top, vmstat, pidstat, BCC utilities (memleak, cachetop), dstat, and perf. Choose based on the metric you need.

Quick Memory Bottleneck Analysis

Start with free and top for a high‑level view.

Use vmstat and pidstat over time to spot trends.

Drill down with allocation tracing, cache analysis, or per‑process inspection.

Typical optimizations: disable swap if possible, lower swappiness, use memory pools or HugePages, increase cache usage, apply cgroup limits, and adjust oom_adj for critical services.

For detailed CPU and memory metrics, see the vmstat and pidstat sections above.

Source: https://www.ctq6.cn/linux性能优化/

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by MaGe Linux Operations

Founded in 2009, MaGe Education is a top Chinese high‑end IT training brand. Its graduates earn 12K+ RMB salaries, and the school has trained tens of thousands of students. It offers high‑pay courses in Linux cloud operations, Python full‑stack, automation, data analysis, AI, and Go high‑concurrency architecture. Thanks to quality courses and a solid reputation, it has talent partnerships with numerous internet firms.
