
Unlock Linux Performance: Understanding Load, CPU Context Switches, and Memory Optimization

This guide explains Linux performance optimization by covering key metrics such as throughput, latency, load average, CPU context switching, and memory management, and demonstrates how to use built‑in tools like vmstat, pidstat, perf, and cachetop to diagnose and resolve bottlenecks.


Part 1 Linux Performance Optimization

Performance Indicators

High concurrency and fast response correspond to the two core metrics of performance optimization: throughput and latency.

Performance metrics diagram

Application load: directly impacts end‑user experience.

System resources: resource utilization and saturation.

The essence of a performance problem is that system resources have hit a bottleneck while requests are still not processed fast enough to handle more load. Performance analysis means finding that bottleneck in the application or system and mitigating it. A typical optimization workflow:

Select metrics to evaluate application and system performance.

Set performance goals for the application and system.

Conduct performance baseline testing.

Analyze performance to locate bottlenecks.

Monitor performance and set alerts.

Different performance problems require different analysis tools. Below is a list of common Linux performance tools and the types of issues they address.

Linux performance tools

Understanding "Average Load"

Average load is the average number of processes in runnable or uninterruptible states during a time interval; it is not directly related to CPU utilization.

Uninterruptible processes are those in kernel‑mode critical paths (e.g., waiting for I/O). This state is a protection mechanism for processes and hardware devices.

When Is Average Load Reasonable?

In production, monitor the average load over time. If the load shows a clear upward trend, investigate immediately. A common rule of thumb is to set a threshold at about 70% of the number of CPU cores.
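That rule of thumb is easy to script. The sketch below compares the 1‑minute load average from /proc/loadavg against 70% of the CPU count; the 0.7 factor is the article's heuristic, not a kernel setting.

```shell
# Compare the 1-minute load average against ~70% of the CPU count.
cpus=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
threshold=$(awk -v c="$cpus" 'BEGIN { printf "%.2f", c * 0.7 }')
over=$(awk -v l="$load1" -v t="$threshold" 'BEGIN { print (l+0 > t+0 ? "yes" : "no") }')
echo "load1=$load1 cpus=$cpus threshold=$threshold over=$over"
```

Wiring this into a cron job or monitoring agent gives the "investigate immediately" alert the text recommends.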

CPU‑intensive workloads raise both load and CPU usage.

I/O‑intensive workloads raise load while CPU usage may stay low.

Heavy process scheduling raises both load and CPU usage.

High load often indicates either CPU‑intensive processes or I/O‑busy processes. Use mpstat or pidstat to pinpoint the source.
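A quick triage session might look like the following sketch, assuming the sysstat package (which provides mpstat and pidstat) is installed; the block falls through gracefully if it is not.

```shell
# Per-CPU view: high %iowait suggests I/O-bound load, high %usr CPU-bound.
if command -v mpstat >/dev/null 2>&1; then
    mpstat -P ALL 5 1          # all CPUs, one 5-second sample
    pidstat -u 5 1             # per-process CPU usage over the same window
    status=sampled
else
    status=sysstat-missing     # install sysstat to get mpstat/pidstat
fi
echo "status=$status"
```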

Part 2 CPU

CPU Context Switch (Upper)

CPU context switch saves the CPU registers and program counter of the current task, loads the registers of the next task, and jumps to the new task. The saved context resides in the kernel until the task is scheduled again.

Context switches are classified by task type:

Process context switch

Thread context switch

Interrupt context switch

Process Context Switch

Linux separates kernel space and user space. Switching from user to kernel mode occurs via a system call, which actually performs two context switches:

Save user‑mode instruction pointer, load kernel‑mode pointer, and jump to kernel code.

After the system call, restore the saved user registers and return to user space.

A system call is usually called a "privileged mode switch" rather than a true context switch, because it stays within the same process and does not switch user‑space resources.

Process context switches happen when the scheduler allocates a CPU time slice, a process is blocked, a process voluntarily sleeps, a high‑priority process preempts, or a hardware interrupt occurs.

Thread Context Switch

Thread switches come in two forms:

Switching between threads of the same process – only thread‑local data and registers change; virtual memory stays the same.

Switching between threads of different processes – same cost as a process switch.

In‑process thread switches consume fewer resources, which is a key advantage of multithreading.

Interrupt Context Switch

Interrupt context switches involve only kernel‑mode interrupt handlers; they do not include user‑space state.

Interrupt handling has higher priority than processes, so interrupt and process context switches never occur simultaneously.

CPU Context Switch (Lower)

Use vmstat to view overall context‑switch and interrupt statistics:

vmstat 5   # output every 5 seconds

Key columns:

cs – context switches per second.

in – interrupts per second.

r – length of the runnable queue (processes ready or running).

b – number of processes in uninterruptible sleep.

To see per‑process details, use pidstat -w:

pidstat -w 5

Key fields:

cswch/s – voluntary switches (blocked on resources).

nvcswch/s – involuntary switches (forced by the scheduler).
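The same two counters pidstat reports are exposed per task in procfs as raw totals rather than rates; for example, for the current shell:

```shell
# voluntary_ctxt_switches / nonvoluntary_ctxt_switches in /proc/<pid>/status
# mirror pidstat's cswch/s and nvcswch/s, but as cumulative counts.
pid=$$
vol=$(awk '/^voluntary_ctxt_switches/ {print $2}' /proc/$pid/status)
nonvol=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$pid/status)
echo "pid=$pid voluntary=$vol nonvoluntary=$nonvol"
```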

Example workflow: capture a baseline with vmstat 1 1, run a load test with sysbench --threads=10 --max-time=300 threads run, then observe vmstat 1 1 again. A spike in cs together with high r and sy indicates heavy kernel activity.

Identify the offending process with pidstat -w -p <pid> or ps. If the culprit is a short‑lived process, strace may miss it; use perf record -d and perf report instead.
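A minimal perf session for that situation might look like the sketch below. The -a (all CPUs) and -g (call graphs) flags are common choices rather than the article's exact invocation, and the block guards against perf being absent or lacking permission (perf_event_paranoid).

```shell
# Record system-wide samples for 2 seconds, then summarize hot functions.
if command -v perf >/dev/null 2>&1 \
   && perf record -a -g -o /tmp/perf.data -- sleep 2 >/dev/null 2>&1; then
    perf report -i /tmp/perf.data --stdio 2>/dev/null | head -n 20
    status=profiled
else
    status=perf-unavailable
fi
echo "status=$status"
```

Because perf samples the whole system rather than attaching to one PID, it catches processes that exit before strace could attach.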

CPU Performance Metrics

CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).

Average load – ideal value equals the number of logical CPUs.

Process context switches – voluntary vs. involuntary.

CPU cache hit rate (L1/L2/L3).

Performance Tools

Average‑load case: uptime → mpstat / pidstat to locate high‑load processes.

Context‑switch case: vmstat → pidstat (voluntary/involuntary) → thread‑level pidstat.

High‑CPU‑process case: top → perf top to pinpoint hot functions.

High‑system‑CPU case: examine top, then focus on running processes; use perf record/report for short‑lived processes.

Uninterruptible/zombie case: use top, pstree, strace, and perf to trace I/O.

Soft‑IRQ case: top → /proc/softirqs → sar → tcpdump to identify SYN‑flood attacks.

In production, developers often cannot install new tools, so they must make the most of built‑in utilities such as top, vmstat, and pidstat.

Part 3 Memory

How Linux Memory Works

Memory Mapping

Only the kernel can access physical DRAM. Linux gives each process an isolated, contiguous virtual address space, split into kernel space and user space. The kernel maintains a page table for each process that maps virtual pages to physical frames; the MMU hardware performs the translation.

When a process accesses a virtual address not present in the page table, a page‑fault exception occurs, the kernel allocates a physical page, updates the page table, and resumes the process.

Linux uses multi‑level page tables and HugePages to reduce the overhead of managing many 4 KB pages.
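Both knobs are visible from user space; for example, a quick check of the base page size and the HugePages configuration:

```shell
# Base page size (typically 4096 bytes) and HugePages setup.
page=$(getconf PAGESIZE)
echo "page_size=$page"
grep -i '^Huge' /proc/meminfo    # HugePages_Total, Hugepagesize, etc.
```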

Virtual‑Memory Layout

Read‑only segment – code and constants.

Data segment – global variables.

Heap – dynamically allocated memory, grows upward.

Memory‑mapped region – shared libraries, mmap files, grows downward.

Stack – local variables and call context, fixed size (typically 8 MB).
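This layout can be observed directly through /proc/<pid>/maps; the bracketed pseudo‑names match the segments described above. Note that /proc/self resolves to the process doing the reading (here, grep itself):

```shell
# Show the [heap] and [stack] entries of the grep process's own address space.
grep -E '\[(heap|stack)\]' /proc/self/maps
```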

Allocation and Release

Allocation

malloc is implemented via two system calls:

brk() – for small allocations (<128 KB) by moving the program break; freed memory is cached.

mmap() – for large allocations (>128 KB) by mapping anonymous memory; released memory is returned to the kernel immediately.

Both mechanisms allocate physical memory lazily – only on first access (page fault).
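The brk/mmap split can be seen with strace, if installed: small allocations grow the program break while large ones appear as anonymous mmap calls. Exact call counts vary by libc version, and the 128 KB boundary is glibc's default M_MMAP_THRESHOLD, which is tunable.

```shell
# Trace only the memory-management syscalls of a short command.
if command -v strace >/dev/null 2>&1 \
   && strace -e trace=brk,mmap,munmap -o /tmp/alloc.trace ls >/dev/null 2>&1; then
    echo "brk_calls=$(grep -c '^brk' /tmp/alloc.trace)"
    echo "mmap_calls=$(grep -c '^mmap' /tmp/alloc.trace)"
    status=traced
else
    status=strace-unavailable
fi
echo "status=$status"
```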

Reclamation

When memory is scarce, the kernel reclaims memory by:

Reclaiming cache pages using LRU.

Swapping out rarely used pages.

Invoking the OOM killer (processes with a higher oom_score are terminated first). A critical process can be protected by lowering its score, e.g.:

echo -16 > /proc/$(pidof XXX)/oom_adj

How to View Memory Usage

System‑wide: free. Per‑process: top / ps. Important fields:

VIRT – virtual memory size.

RES – resident (physical) memory.

SHR – shared memory (libraries, shared segments).

%MEM – proportion of total RAM used.

Buffers vs. Cache

The buffer cache holds raw disk blocks; the page cache holds file data. Both accelerate reads and writes.
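free(1) derives its buff/cache column from two /proc/meminfo fields, which can be read directly:

```shell
# Buffers = raw block-device cache, Cached = page cache (file data), in kB.
buffers=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)
cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "buffers_kb=$buffers cached_kb=$cached"
```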

Optimizing with System Cache

Cache Hit Rate

Cache hit rate = cache hits / total requests. Higher hit rate means better performance.

Install the BCC tools and use cachestat or cachetop to monitor it.

dd Cache Acceleration

# Create a 512 MB file from the disk
dd if=/dev/sda1 of=file bs=1M count=512
# Drop caches
echo 3 > /proc/sys/vm/drop_caches
# Verify the file is not cached
pcstat file
# First read: not cached, limited by disk speed
dd if=file of=/dev/null bs=1M
# Second read: served from the page cache, often >4 GB/s
dd if=file of=/dev/null bs=1M
pcstat file

O_DIRECT Bypassing Cache

cachetop 5
sudo docker run --privileged --name=app -itd feisky/app:io-direct
# The app opens the device with O_RDONLY|O_DIRECT, causing slow reads.
strace -p $(pgrep app)

Memory Leaks: Detection and Fixing

Memory leaks occur when allocated memory is never freed; out‑of‑bounds accesses are a related class of bug that can corrupt memory or crash the process.

Use BCC's memleak tool:
/usr/share/bcc/tools/memleak -a -p $(pidof app)

The output shows the leaking allocation site (e.g., a fibonacci function) so the code can be corrected.

Why Swap Grows

When RAM is tight, the kernel reclaims memory by swapping out anonymous pages, flushing dirty pages, or dropping caches. Swap activity is governed by /proc/sys/vm/swappiness (0‑100): a higher value makes the kernel swap more aggressively.
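The current value can be read directly from procfs (writing it requires root, so that part is shown only as a comment):

```shell
# Read the current swap policy; 0-100 on older kernels, up to 200 on newer ones.
sw=$(cat /proc/sys/vm/swappiness)
echo "swappiness=$sw"
# To make the kernel prefer dropping caches over swapping (as root):
#   sysctl -w vm.swappiness=10
```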

NUMA and Swap

On NUMA systems each node has local memory. Swap may increase even when overall memory is abundant because a node runs out of local pages.

numactl --hardware   # show node memory distribution

Analyzing Swap Usage

# Create and enable swap
fallocate -l 8G /mnt/swapfile
chmod 600 /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile
# Simulate heavy I/O
dd if=/dev/sda1 of=/dev/null bs=1G count=2048
sar -r -S 1   # monitor memory and swap
watch -d grep -A15 'Normal' /proc/zoneinfo

Observe the interplay between free memory, buffers, cache, and swap. Adjust swappiness or /proc/sys/vm/zone_reclaim_mode to control how aggressively the kernel swaps.

Quick Memory‑Performance Diagnosis

Start with free and top for a global view.

Use vmstat and pidstat over time to spot trends.

Drill down with memleak, cachetop, or perf for detailed analysis.

Common optimization ideas:

Disable swap or set a low swappiness.

Reduce dynamic allocations (use memory pools, HugePages).

Leverage caches (in‑process buffers, external caches like Redis).

Apply cgroups to limit per‑process memory usage.

Adjust /proc/pid/oom_adj to protect critical services.
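For the cgroup idea, here is a hedged sketch using cgroup v2. The "demo" group name and 512M limit are illustrative only, and the commands assume a unified hierarchy mounted at /sys/fs/cgroup plus root privileges; the block reports rather than fails when those assumptions do not hold.

```shell
# Cap every process placed in the "demo" group at 512 MB of memory.
cg=/sys/fs/cgroup/demo
if mkdir -p "$cg" 2>/dev/null \
   && echo 512M > "$cg/memory.max" 2>/dev/null; then
    # PIDs written to $cg/cgroup.procs are now subject to the limit;
    # allocations beyond it trigger reclaim and then the OOM killer.
    status=limit-set
else
    status=need-root-or-cgroup2   # typical outcome in a non-root shell
fi
echo "status=$status"
```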

vmstat Detailed Usage

vmstat 2
# Columns:
# r – runnable processes (run queue length)
# b – blocked processes
# swpd – used swap (sustained growth signals RAM shortage)
# free – free memory
# buff – buffer cache (raw disk blocks)
# cache – page cache (file data)
# si/so – swap in/out per second
# bi/bo – block I/O per second
# in – interrupts per second
# cs – context switches per second
# us – user CPU time
# sy – system CPU time
# id – idle CPU time
# wa – I/O wait time
# st – stolen time (virtualized environments)

pidstat Detailed Usage

# CPU usage per process
pidstat -u 1 10
# Memory usage per process
pidstat -r 1 10
# I/O usage per process
pidstat -d 1 10
# Context switches per process
pidstat -w 1 10
# Specific PID
pidstat -p 20955 -r 1 10

Key fields include %usr, %system, %CPU, VSZ, RSS, %MEM, kB_rd/s, kB_wr/s, cswch/s, nvcswch/s, etc.

Memory‑Performance Tools Overview

Memory tool matrix

Tools such as free, top, vmstat, pidstat, perf, cachetop, memleak, and the BCC utilities together cover most memory‑related performance questions.

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
