
Unlock Linux Performance: Master Load, CPU, Memory & Optimization Tools

This guide explains Linux performance optimization by defining key metrics such as throughput and latency, interpreting average load, exploring CPU context switches, and providing practical step‑by‑step instructions for using tools like vmstat, pidstat, perf, strace, and memory analysis utilities to diagnose and resolve CPU, I/O, and memory bottlenecks.

Liangxu Linux

Feb 14, 2022


Linux Performance Overview

High concurrency and fast response correspond to the two core performance metrics: throughput and latency. Performance problems arise when a system resource reaches its limit while request handling is still not fast enough.

Key Metrics and Workflow

Application load – directly impacts end‑user experience.

System resources – CPU, memory, I/O, and their utilization/saturation.

A typical analysis workflow:

Select metrics and set performance goals.

Run benchmarks.

Locate bottlenecks.

Configure alerts and monitor.

Understanding Average Load

Average load is the average number of processes in the runnable or uninterruptible state over a time interval (uptime reports it for the last 1, 5, and 15 minutes). It does not map directly to CPU utilization.

CPU‑bound workloads raise both load and CPU usage.

I/O‑bound workloads increase load without a proportional CPU rise.

Heavy scheduling (many processes waiting for CPU) also raises load.

When the load exceeds the number of logical CPUs, the system is likely under stress.
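The comparison between load and CPU count can be sketched in a few lines. This is a minimal illustration, assuming the standard /proc/loadavg layout; the function names are made up for this example.

```python
import os


def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(value) for value in fields[:3])


def load_per_cpu(load, cpu_count):
    """Normalize a load average by the number of logical CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load / cpu_count


def read_load_per_cpu(path="/proc/loadavg"):
    """Read the current 1-minute load and normalize it (Linux only)."""
    with open(path) as handle:
        one_minute, _, _ = parse_loadavg(handle.read())
    return load_per_cpu(one_minute, os.cpu_count() or 1)
```

A result above 1.0 from read_load_per_cpu() means more tasks are runnable or blocked in uninterruptible sleep than there are logical CPUs. Treat that as a prompt to look closer rather than as proof of a problem.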

CPU Analysis

Context Switch Types

Process context switch – switches between different processes.

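Thread context switch – switches between threads; when both threads belong to the same process, shared resources such as virtual memory stay in place and only thread-private state (stack, registers) is switched.

Interrupt context switch – saves the interrupted task's state so the kernel can run the interrupt service routine; interrupts take priority over process scheduling.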

Monitoring Context Switches

Use vmstat 5 to view system‑wide context switches and interrupts.

vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
 ...

Key fields:

cs – context switches per second.

in – interrupts per second.

r – length of the runnable queue (processes running or waiting for a CPU).

b – processes in uninterruptible sleep.
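To work with these fields programmatically, one option is to pair the column header with a data row. The sketch below assumes the default vmstat layout shown above; the helper names are hypothetical.

```python
def parse_vmstat_row(header_line, data_line):
    """Map vmstat column names (r, b, in, cs, ...) to the integers of one data row."""
    names = header_line.split()
    values = [int(value) for value in data_line.split()]
    if len(names) != len(values):
        raise ValueError("header and data row have different column counts")
    return dict(zip(names, values))


def run_queue_exceeds_cpus(row, cpu_count):
    """True when more tasks are runnable than there are logical CPUs."""
    return row["r"] > cpu_count
```

Pass the second header line (the one starting with r and b) and any subsequent data line. The first data row from vmstat reports averages since boot, so later rows are usually the more useful input.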

For per-process details, use pidstat -w 5 to see voluntary (cswch/s) and involuntary (nvcswch/s) context switches per second.

pidstat -w 5
14:51:16 UID   PID   cswch/s nvcswch/s Command
...   1   0.80   0.00   systemd
...   9  32.67   0.00   rcu_sched
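The kernel exposes the same per-process counters as cumulative totals in /proc/<pid>/status, so a rate similar to pidstat's can be derived from two samples. This is a rough sketch, not pidstat's actual implementation; the function names are assumptions.

```python
def parse_ctxt_switches(status_text):
    """Extract cumulative context-switch counters from /proc/<pid>/status text."""
    wanted = ("voluntary_ctxt_switches", "nonvoluntary_ctxt_switches")
    counters = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in wanted:
            counters[key] = int(value.strip())
    return counters


def switch_rates(before, after, interval_seconds):
    """Convert two cumulative samples into per-second rates."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {key: (after[key] - before[key]) / interval_seconds for key in after}
```

A high voluntary rate usually points at a process waiting for resources such as I/O. A high involuntary rate suggests it is being preempted because of CPU contention.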

Diagnosing High CPU Usage

Typical steps:

Run top to spot high‑CPU processes.

Use perf top -g -p <pid> to locate hot functions.

Inspect source code for inefficient loops or unnecessary calculations.

If no single process dominates CPU, examine the runnable queue, interrupt rates, and short‑lived processes (e.g., stress tests) with pstree and perf record.

Memory Management

Virtual Memory Layout

Read‑only segment – code and constants.

Data segment – global variables.

Heap – dynamic allocations, grows upward.

Memory‑mapped region – shared libraries and mmap files, grows downward.

Stack – function call frames; the default size limit is typically 8 MiB (check with ulimit -s).

Allocation Strategies

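malloc is backed by two system calls (the 128 KiB threshold is the glibc default):

brk() for allocations < 128 KiB – moves the program break; freed memory is cached for reuse rather than returned to the kernel immediately.

mmap() for allocations ≥ 128 KiB – creates a new anonymous mapping; the memory is returned to the kernel by munmap when it is freed.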

Both allocate lazily; physical pages are assigned only on first access.
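Lazy allocation can be observed from user space by watching resident memory before and after touching a mapping. The sketch below uses Python's mmap module as a stand-in for malloc's large-allocation path and reads VmRSS from /proc/self/status, so it is Linux only. The function names and the 64 MiB size are arbitrary choices for illustration.

```python
import mmap


def parse_vm_rss_kib(status_text):
    """Return the VmRSS value (in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    raise ValueError("VmRSS not found")


def current_rss_kib():
    """Resident set size of the current process in KiB (Linux only)."""
    with open("/proc/self/status") as handle:
        return parse_vm_rss_kib(handle.read())


def rss_growth_after_touch(size_bytes=64 * 1024 * 1024):
    """Map anonymous memory, then write to every page.

    Returns (growth after mapping, growth after touching) in KiB.
    """
    baseline = current_rss_kib()
    region = mmap.mmap(-1, size_bytes)  # anonymous mapping, no pages yet
    after_map = current_rss_kib()
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        region[offset] = 1  # first write faults the page in
    after_touch = current_rss_kib()
    region.close()
    return after_map - baseline, after_touch - baseline
```

On a typical Linux system the first number should stay near zero while the second approaches the mapping size, although exact values depend on the kernel and the Python runtime.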

Reclaiming Memory

LRU cache eviction.

Swapping out rarely used pages.

OOM killer (adjust /proc/<pid>/oom_score_adj, range -1000 to 1000, to protect critical processes; oom_adj is the deprecated older interface).

Inspecting Memory Usage

Use free for overall memory, top / ps for per‑process details. Important columns in top:

VIRT – virtual size.

RES – resident (physical) size.

SHR – shared memory.

%MEM – proportion of total RAM.

Buffers vs. Cache

Buffers cache raw disk blocks; Cache stores file data. Both improve I/O performance but consume memory that can be reclaimed.
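Both values come from /proc/meminfo, which is also where free gets its numbers. A minimal sketch for reading them follows; the helper names are hypothetical, and the sum mirrors the buff/cache column of free rather than a guarantee of what can actually be reclaimed.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo text into a dict of integer values (KiB for sized fields)."""
    values = {}
    for line in text.splitlines():
        key, separator, rest = line.partition(":")
        parts = rest.split()
        if separator and parts:
            values[key.strip()] = int(parts[0])
    return values


def buff_cache_kib(meminfo):
    """Approximate the buff/cache figure: Buffers + Cached + SReclaimable."""
    return (
        meminfo.get("Buffers", 0)
        + meminfo.get("Cached", 0)
        + meminfo.get("SReclaimable", 0)
    )
```

Cached also includes shared memory and tmpfs pages, which cannot simply be dropped, so MemAvailable is usually the better estimate of how much memory new workloads can obtain.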

Cache Performance Tools

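cachestat and cachetop from the BCC suite report page cache hit rates (system-wide and per process). pcstat, a separate tool, shows how much of a given file is currently held in the page cache.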

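# Create a 512 MiB test file (reading a raw partition requires root;
# replace /dev/sda1 with a device that exists on your system)
dd if=/dev/sda1 of=file bs=1M count=512
# Flush dirty pages, then drop the page cache, dentries, and inodes (requires root)
sync
echo 3 > /proc/sys/vm/drop_caches
# Verify that the file is no longer cached
pcstat file
# Measure read speed; run it a second time to compare cold and warm cache
dd if=file of=/dev/null bs=1M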

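If a file is opened with O_DIRECT, the kernel bypasses the page cache, so repeated reads never benefit from cached data and throughput is often lower. Verify the flag by attaching strace -p $(pgrep app) and checking the flags of the open or openat calls (app is a placeholder for the process name).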
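Scanning captured strace output for the flag can be automated. The sketch below assumes strace's usual one-call-per-line format; the regular expression and function name are illustrative and will not cover every output variant.

```python
import re

# Matches open()/openat() calls and captures the path and the flag list.
OPEN_CALL = re.compile(r'\bopen(?:at)?\(.*?"([^"]*)",\s*([A-Z_|]+)')


def direct_io_paths(strace_output):
    """Return paths that were opened with the O_DIRECT flag."""
    paths = []
    for line in strace_output.splitlines():
        match = OPEN_CALL.search(line)
        if match is None:
            continue
        path, flags = match.group(1), match.group(2).split("|")
        if "O_DIRECT" in flags:  # exact match, so O_DIRECTORY is not counted
            paths.append(path)
    return paths
```

Splitting the flag list before comparing matters here, because a plain substring search for O_DIRECT would also match O_DIRECTORY.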

Detecting Memory Leaks

Run the target application in a container, monitor free for decreasing available memory, then use BCC’s memleak:

/usr/share/bcc/tools/memleak -a -p $(pidof app)

The output lists allocations that have not yet been freed, grouped by call stack, which points to the leaking code path (e.g., a fibonacci function).

Swap Behavior and NUMA

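When RAM is scarce, the kernel swaps out anonymous pages. Inspect swap activity with free, sar -r -S, and watch -d cat /proc/zoneinfo. The swappiness parameter (/proc/sys/vm/swappiness, 0-100) controls how aggressively the system uses swap: higher values swap more readily, and even at 0 swapping can still occur under severe memory pressure.

On NUMA systems each node has its own local memory. Use numactl --hardware to view per-node memory size and free memory, and /proc/zoneinfo to see each zone's watermarks (pages_min, pages_low, pages_high, listed there as min, low, and high). Adjust /proc/sys/vm/zone_reclaim_mode to choose between reclaiming local memory and allocating from a remote node.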
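Reading the watermarks by eye from /proc/zoneinfo is tedious, so a small parser can help. This sketch assumes the common layout in which each zone begins with a "Node N, zone NAME" line followed by indented "pages free", "min", "low", and "high" lines. The names are hypothetical, and field layout can differ between kernel versions.

```python
import re

ZONE_HEADER = re.compile(r"^Node\s+(\d+),\s+zone\s+(\S+)")
FREE_PAGES = re.compile(r"^\s+pages free\s+(\d+)")
# Per-CPU pageset lines use "high:" with a colon, so they do not match here.
WATERMARK = re.compile(r"^\s+(min|low|high)\s+(\d+)\s*$")


def parse_zone_watermarks(text):
    """Return {(node, zone): {"free", "min", "low", "high"}} in pages."""
    zones = {}
    current = None
    for line in text.splitlines():
        header = ZONE_HEADER.match(line)
        if header:
            current = (int(header.group(1)), header.group(2))
            zones[current] = {}
            continue
        if current is None:
            continue
        free = FREE_PAGES.match(line)
        if free:
            zones[current]["free"] = int(free.group(1))
            continue
        mark = WATERMARK.match(line)
        if mark and mark.group(1) not in zones[current]:
            zones[current][mark.group(1)] = int(mark.group(2))
    return zones
```

When a zone's free count falls below low, background reclaim (kswapd) is expected to start. Below min, allocations may have to reclaim memory directly.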

Practical Performance Workflow

Start with high‑level tools: top, free, vmstat, pidstat to identify the symptom (high load, CPU, I/O, or memory).

Drill down with specialized tools:

perf for CPU hotspots.

strace for system-call patterns.

cachetop for cache hit rates.

memleak for memory leaks.

Correlate metrics (e.g., rising iowait with increased D-state processes) to pinpoint the root cause.

Apply targeted optimizations:

Code refactoring, compiler flags (e.g., gcc -O2).

Thread‑pool sizing, CPU affinity, cgroup limits.

Kernel parameter tuning (e.g., swappiness, zone_reclaim_mode).

Validate improvements by re‑running the same measurement suite.

CPU Performance Indicators

User CPU (%usr) – time spent in user space.

System CPU (%sys) – time spent in kernel space.

I/O wait (%wa) – time spent waiting for I/O.

Soft/Hard interrupt CPU (%irq/%soft) – time handling interrupts.

Steal/Guest – relevant for virtualized environments.

Average load – ideally no higher than the number of logical CPUs; sustained higher values indicate contention.

Context switches – excessive switches waste CPU cycles.

CPU cache hit rate – higher rates improve performance; monitor with perf stat -e cache-references,cache-misses or BCC's llcstat.
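The percentage indicators above are derived from cumulative counters in the first line of /proc/stat. A minimal sketch of that calculation, assuming the standard field order and using made-up function names, might look like this:

```python
CPU_FIELDS = (
    "user", "nice", "system", "idle", "iowait",
    "irq", "softirq", "steal", "guest", "guest_nice",
)


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into named tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(CPU_FIELDS, (int(value) for value in parts[1:])))


def cpu_percentages(before, after):
    """Turn two cumulative samples into percentages per CPU state."""
    # guest and guest_nice are already included in user and nice,
    # so they are left out to avoid double counting.
    counted = CPU_FIELDS[:8]
    deltas = {name: after.get(name, 0) - before.get(name, 0) for name in counted}
    total = sum(deltas.values())
    if total <= 0:
        raise ValueError("samples must be taken at different times")
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

Take two samples a few seconds apart. A single reading only shows totals since boot and hides current behavior.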

CPU Optimization Techniques

Compile with optimization flags (e.g., gcc -O2).

Replace polling with event‑driven asynchronous I/O.

Prefer multithreading over multiprocess to reduce context‑switch overhead.

Bind processes/threads to specific CPUs (CPU affinity) to improve cache locality.

Adjust niceness or cgroup CPU quotas for less critical workloads.

On NUMA, ensure threads access local memory to reduce remote latency.

Balance interrupt handling across CPUs (e.g., irqbalance).
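As one illustration of the affinity item, Python exposes the Linux scheduler affinity calls directly. This sketch is Linux only, and the function name is an assumption for this example.

```python
import os


def pin_to_cpus(cpus, pid=0):
    """Restrict a process (0 means the caller) to the given CPUs.

    Returns the affinity set actually in effect afterwards.
    """
    os.sched_setaffinity(pid, set(cpus))
    return os.sched_getaffinity(pid)
```

Pick CPUs from os.sched_getaffinity(0) rather than assuming numbering from zero, since containers and cpusets may restrict the allowed set. From the shell, taskset offers the same control.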

Memory Performance Indicators

Used/Free memory – from free.

Buffers and Cache – reclaimable memory used for I/O.

Swap usage – indicates memory pressure; controlled by swappiness.

Page faults – minor (minflt/s) and major (majflt/s) rates from pidstat -r.

Per-process VSZ, RSS, %MEM – from ps or pidstat -r (top shows the first two as VIRT and RES).
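The fault rates reported by pidstat -r can be approximated from /proc/<pid>/stat, where minflt and majflt are the 10th and 12th fields. The sketch below uses hypothetical helper names and splits after the closing parenthesis of the command name, because that name may itself contain spaces.

```python
def parse_fault_counters(stat_text):
    """Extract cumulative minor and major fault counts from /proc/<pid>/stat text."""
    # Everything after the last ')' starts at field 3 (state).
    _, _, rest = stat_text.rpartition(")")
    fields = rest.split()
    return {"minflt": int(fields[7]), "majflt": int(fields[9])}


def fault_rates(before, after, interval_seconds):
    """Convert two cumulative samples into faults per second."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return {key: (after[key] - before[key]) / interval_seconds for key in after}
```

A sustained non-zero major fault rate means pages are being read back from disk, which is a common sign of memory pressure.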

Memory Optimization Strategies

Use memory pools or slab allocators to reduce fragmentation.

Leverage HugePages for large, long‑lived allocations.

Cache frequently accessed data in user space or external caches (e.g., Redis).

Limit process memory with cgroups to prevent a single application from exhausting RAM.

Adjust /proc/sys/vm/swappiness (lower values favor keeping data in RAM).

On NUMA, pin threads to the node that holds the memory they use.

Monitor and tune /proc/sys/vm/zone_reclaim_mode for local reclamation.

Key Tools and Commands

free, top, vmstat – system-wide overview.

pidstat – per-process CPU, memory, I/O, and context-switch statistics.

perf – profiling CPU hotspots and call graphs.

strace – tracing system calls (useful for O_DIRECT detection).

cachestat, cachetop – page cache hit/miss analysis.

pcstat – page cache residency of individual files.

memleak (BCC) – locate memory leaks.

numactl --hardware – view NUMA node distribution.

sar -r -S, watch -d cat /proc/zoneinfo – monitor swap and zone thresholds.

Example: Diagnosing a High‑Load Situation

Run uptime or cat /proc/loadavg to see the load average.

If load is high, use mpstat -P ALL 5 to check per‑CPU utilization.

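Check the run queue length (the r column of vmstat 5), then use pidstat -w 5 to find processes with high involuntary context-switch rates, a sign that they are competing for CPU.

For I/O-bound load, inspect iostat -xz 5 and pidstat -d 5 to find the processes doing the most disk I/O.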

If memory pressure is suspected, check free -m and pidstat -r 5 for major page faults.

Apply targeted fixes (e.g., tune query limits, add indexes, increase cache size, adjust swappiness, bind threads).

Following this systematic approach enables developers to quickly locate and resolve performance bottlenecks in Linux environments.
