
Unlock Linux Performance: Understanding Load, CPU Context Switches, and Memory Optimization

This guide explains Linux performance optimization by covering key metrics such as throughput, latency, load average, CPU context switching, and memory management, and demonstrates how to use built‑in tools like vmstat, pidstat, perf, and cachetop to diagnose and resolve bottlenecks.


Part 1 Linux Performance Optimization

Performance Indicators

High concurrency and fast response correspond to the two core metrics of performance optimization: throughput and latency.

Performance metrics diagram

Application load: directly impacts end‑user experience.

System resources: resource utilization and saturation.

The essence of a performance problem is that system resources have hit a bottleneck while requests are still not processed fast enough to handle more load. Performance analysis means finding that bottleneck in the application or system and mitigating it. A typical optimization workflow:

Select metrics to evaluate application and system performance.

Set performance goals for the application and system.

Conduct performance baseline testing.

Analyze performance to locate bottlenecks.

Monitor performance and set alerts.

Different performance problems require different analysis tools. Below is a list of common Linux performance tools and the types of issues they address.

Linux performance tools

Understanding "Average Load"

Average load is the average number of processes in runnable or uninterruptible states during a time interval; it is not directly related to CPU utilization.

Uninterruptible processes are those in kernel‑mode critical paths (e.g., waiting for I/O). This state is a protection mechanism for processes and hardware devices.

When Is Average Load Reasonable?

In production, monitor the average load over time. If the load shows a clear upward trend, investigate immediately. A common rule of thumb is to set a threshold at about 70% of the number of CPU cores.
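That rule of thumb is easy to script. The sketch below compares the 1‑minute load average from /proc/loadavg against 70% of the CPU count; the 0.7 factor is the article's heuristic, not a kernel setting.

```shell
# Compare the 1-minute load average against ~70% of the CPU count.
cpus=$(nproc)
load1=$(cut -d ' ' -f1 /proc/loadavg)
threshold=$(awk -v c="$cpus" 'BEGIN { printf "%.2f", c * 0.7 }')
over=$(awk -v l="$load1" -v t="$threshold" 'BEGIN { print (l+0 > t+0 ? "yes" : "no") }')
echo "load1=$load1 cpus=$cpus threshold=$threshold over=$over"
```

Wiring this into a cron job or monitoring agent gives the "investigate immediately" alert the text recommends.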

CPU‑intensive workloads raise both load and CPU usage.

I/O‑intensive workloads raise load while CPU usage may stay low.

Heavy process scheduling raises both load and CPU usage.

High load often indicates either CPU‑intensive processes or I/O‑busy processes. Use mpstat or pidstat to pinpoint the source.
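A quick triage session might look like the following sketch, assuming the sysstat package (which provides mpstat and pidstat) is installed; the block falls through gracefully if it is not.

```shell
# Per-CPU view: high %iowait suggests I/O-bound load, high %usr CPU-bound.
if command -v mpstat >/dev/null 2>&1; then
    mpstat -P ALL 5 1          # all CPUs, one 5-second sample
    pidstat -u 5 1             # per-process CPU usage over the same window
    status=sampled
else
    status=sysstat-missing     # install sysstat to get mpstat/pidstat
fi
echo "status=$status"
```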

Part 2 CPU

CPU Context Switch (Upper)

CPU context switch saves the CPU registers and program counter of the current task, loads the registers of the next task, and jumps to the new task. The saved context resides in the kernel until the task is scheduled again.

Context switches are classified by task type:

Process context switch

Thread context switch

Interrupt context switch

Process Context Switch

Linux separates kernel space and user space. Switching from user to kernel mode occurs via a system call, which actually performs two context switches:

Save user‑mode instruction pointer, load kernel‑mode pointer, and jump to kernel code.

After the system call, restore the saved user registers and return to user space.

A system call is usually called a "privileged mode switch" rather than a true context switch, because it stays within the same process and does not switch user‑space resources.

Process context switches happen when the scheduler allocates a CPU time slice, a process is blocked, a process voluntarily sleeps, a high‑priority process preempts, or a hardware interrupt occurs.

Thread Context Switch

Thread switches come in two forms:

Switching between threads of the same process – only thread‑local data and registers change; virtual memory stays the same.

Switching between threads of different processes – same cost as a process switch.

In‑process thread switches consume fewer resources, which is a key advantage of multithreading.

Interrupt Context Switch

Interrupt context switches involve only kernel‑mode interrupt handlers; they do not include user‑space state.

Interrupt handling has higher priority than processes, so interrupt and process context switches never occur simultaneously.

CPU Context Switch (Lower)

Use vmstat to view overall context‑switch and interrupt statistics:

vmstat 5   # output every 5 seconds

Key columns:

cs – context switches per second.

in – interrupts per second.

r – length of the runnable queue (processes ready or running).

b – number of processes in uninterruptible sleep.

To see per‑process details, use pidstat -w:

pidstat -w 5

Key fields:

cswch/s – voluntary switches (blocked on resources).

nvcswch/s – involuntary switches (forced by the scheduler).
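The same two counters pidstat reports are exposed per task in procfs as raw totals rather than rates; for example, for the current shell:

```shell
# voluntary_ctxt_switches / nonvoluntary_ctxt_switches in /proc/<pid>/status
# mirror pidstat's cswch/s and nvcswch/s, but as cumulative counts.
pid=$$
vol=$(awk '/^voluntary_ctxt_switches/ {print $2}' /proc/$pid/status)
nonvol=$(awk '/^nonvoluntary_ctxt_switches/ {print $2}' /proc/$pid/status)
echo "pid=$pid voluntary=$vol nonvoluntary=$nonvol"
```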

Example workflow: capture a baseline with vmstat 1 1, run a load test with sysbench --threads=10 --max-time=300 threads run, then observe vmstat 1 1 again. A spike in cs together with high r and sy indicates heavy kernel activity.

Identify the offending process with pidstat -w -p <pid> or ps. If the culprit is a short‑lived process, strace may miss it; use perf record -d and perf report instead.
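A minimal perf session for that situation might look like the sketch below. The -a (all CPUs) and -g (call graphs) flags are common choices rather than the article's exact invocation, and the block guards against perf being absent or lacking permission (perf_event_paranoid).

```shell
# Record system-wide samples for 2 seconds, then summarize hot functions.
if command -v perf >/dev/null 2>&1 \
   && perf record -a -g -o /tmp/perf.data -- sleep 2 >/dev/null 2>&1; then
    perf report -i /tmp/perf.data --stdio 2>/dev/null | head -n 20
    status=profiled
else
    status=perf-unavailable
fi
echo "status=$status"
```

Because perf samples the whole system rather than attaching to one PID, it catches processes that exit before strace could attach.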

CPU Performance Metrics

CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).

Average load – ideal value equals the number of logical CPUs.

Process context switches – voluntary vs. involuntary.

CPU cache hit rate (L1/L2/L3).

Performance Tools

Average‑load case: uptime → mpstat / pidstat to locate high‑load processes.

Context‑switch case: vmstat → pidstat (voluntary/involuntary) → thread‑level pidstat.

High‑CPU‑process case: top → perf top to pinpoint hot functions.

High‑system‑CPU case: examine top, then focus on running processes; use perf record/report for short‑lived processes.

Uninterruptible/zombie case: use top, pstree, strace, and perf to trace I/O.

Soft‑IRQ case: top → /proc/softirqs → sar → tcpdump to identify SYN‑flood attacks.

In production, developers often cannot install new tools, so they must make the most of built‑in utilities such as top, vmstat, and pidstat.

Part 3 Memory

How Linux Memory Works

Memory Mapping

Only the kernel can access physical DRAM. Linux gives each process an isolated, contiguous virtual address space, split into kernel space and user space. The kernel maintains a page table for each process that maps virtual pages to physical frames; the MMU hardware performs the translation.

When a process accesses a virtual address not present in the page table, a page‑fault exception occurs, the kernel allocates a physical page, updates the page table, and resumes the process.

Linux uses multi‑level page tables and HugePages to reduce the overhead of managing many 4 KB pages.
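Both knobs are visible from user space; for example, a quick check of the base page size and the HugePages configuration:

```shell
# Base page size (typically 4096 bytes) and HugePages setup.
page=$(getconf PAGESIZE)
echo "page_size=$page"
grep -i '^Huge' /proc/meminfo    # HugePages_Total, Hugepagesize, etc.
```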

Virtual‑Memory Layout

Read‑only segment – code and constants.

Data segment – global variables.

Heap – dynamically allocated memory, grows upward.

Memory‑mapped region – shared libraries, mmap files, grows downward.

Stack – local variables and call context, fixed size (typically 8 MB).
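This layout can be observed directly through /proc/<pid>/maps; the bracketed pseudo‑names match the segments described above. Note that /proc/self resolves to the process doing the reading (here, grep itself):

```shell
# Show the [heap] and [stack] entries of the grep process's own address space.
grep -E '\[(heap|stack)\]' /proc/self/maps
```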

Allocation and Release

Allocation

malloc is implemented via two system calls:

brk() – for small allocations (<128 KB) by moving the program break; freed memory is cached.

mmap() – for large allocations (>128 KB) by mapping anonymous memory; released memory is returned to the kernel immediately.

Both mechanisms allocate physical memory lazily – only on first access (page fault).
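The brk/mmap split can be seen with strace, if installed: small allocations grow the program break while large ones appear as anonymous mmap calls. Exact call counts vary by libc version, and the 128 KB boundary is glibc's default M_MMAP_THRESHOLD, which is tunable.

```shell
# Trace only the memory-management syscalls of a short command.
if command -v strace >/dev/null 2>&1 \
   && strace -e trace=brk,mmap,munmap -o /tmp/alloc.trace ls >/dev/null 2>&1; then
    echo "brk_calls=$(grep -c '^brk' /tmp/alloc.trace)"
    echo "mmap_calls=$(grep -c '^mmap' /tmp/alloc.trace)"
    status=traced
else
    status=strace-unavailable
fi
echo "status=$status"
```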

Reclamation

When memory is scarce, the kernel reclaims memory by:

Reclaiming cache pages using LRU.

Swapping out rarely used pages.

Invoking the OOM killer (processes with a higher oom_score are terminated first). A critical process can be protected by lowering its score, e.g.:

echo -16 > /proc/$(pidof XXX)/oom_adj

How to View Memory Usage

System‑wide: free. Per‑process: top / ps. Important fields:

VIRT – virtual memory size.

RES – resident (physical) memory.

SHR – shared memory (libraries, shared segments).

%MEM – proportion of total RAM used.

Buffers vs. Cache

The buffer cache holds raw disk blocks; the page cache holds file data. Both accelerate reads and writes.
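free(1) derives its buff/cache column from two /proc/meminfo fields, which can be read directly:

```shell
# Buffers = raw block-device cache, Cached = page cache (file data), in kB.
buffers=$(awk '/^Buffers:/ {print $2}' /proc/meminfo)
cached=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
echo "buffers_kb=$buffers cached_kb=$cached"
```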

Optimizing with System Cache

Cache Hit Rate

Cache hit rate = cache hits / total requests. Higher hit rate means better performance.

Install the BCC tools and use cachestat or cachetop to monitor it.

dd Cache Acceleration

# Create a 512 MB file from the disk
dd if=/dev/sda1 of=file bs=1M count=512
# Drop caches
echo 3 > /proc/sys/vm/drop_caches
# Verify the file is not cached
pcstat file
# First read: not cached, limited by disk speed
dd if=file of=/dev/null bs=1M
# Second read: served from the page cache, often >4 GB/s
dd if=file of=/dev/null bs=1M
pcstat file

O_DIRECT Bypassing Cache

cachetop 5
sudo docker run --privileged --name=app -itd feisky/app:io-direct
# The app opens the device with O_RDONLY|O_DIRECT, causing slow reads.
strace -p $(pgrep app)

Memory Leaks: Detection and Fixing

Memory leaks occur when allocated memory is never freed; out‑of‑bounds accesses are a related class of bug that can corrupt memory or crash the process.

Use BCC's memleak tool:
/usr/share/bcc/tools/memleak -a -p $(pidof app)

The output shows the leaking allocation site (e.g., a fibonacci function) so the code can be corrected.

Why Swap Grows

When RAM is tight, the kernel reclaims memory by swapping out anonymous pages, flushing dirty pages, or dropping caches. Swap activity is governed by /proc/sys/vm/swappiness (0‑100): a higher value makes the kernel swap more aggressively.
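The current value can be read directly from procfs (writing it requires root, so that part is shown only as a comment):

```shell
# Read the current swap policy; 0-100 on older kernels, up to 200 on newer ones.
sw=$(cat /proc/sys/vm/swappiness)
echo "swappiness=$sw"
# To make the kernel prefer dropping caches over swapping (as root):
#   sysctl -w vm.swappiness=10
```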

NUMA and Swap

On NUMA systems each node has local memory. Swap may increase even when overall memory is abundant because a node runs out of local pages.

numactl --hardware   # show node memory distribution

Analyzing Swap Usage

# Create and enable swap
fallocate -l 8G /mnt/swapfile
chmod 600 /mnt/swapfile
mkswap /mnt/swapfile
swapon /mnt/swapfile
# Simulate heavy I/O
dd if=/dev/sda1 of=/dev/null bs=1G count=2048
sar -r -S 1   # monitor memory and swap
watch -d grep -A15 'Normal' /proc/zoneinfo

Observe the interplay between free memory, buffers, cache, and swap. Adjust swappiness or /proc/sys/vm/zone_reclaim_mode to control how aggressively the kernel swaps.

Quick Memory‑Performance Diagnosis

Start with free and top for a global view.

Use vmstat and pidstat over time to spot trends.

Drill down with memleak, cachetop, or perf for detailed analysis.

Common optimization ideas:

Disable swap or set a low swappiness.

Reduce dynamic allocations (use memory pools, HugePages).

Leverage caches (in‑process buffers, external caches like Redis).

Apply cgroups to limit per‑process memory usage.

Adjust /proc/pid/oom_adj to protect critical services.
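For the cgroup idea, here is a hedged sketch using cgroup v2. The "demo" group name and 512M limit are illustrative only, and the commands assume a unified hierarchy mounted at /sys/fs/cgroup plus root privileges; the block reports rather than fails when those assumptions do not hold.

```shell
# Cap every process placed in the "demo" group at 512 MB of memory.
cg=/sys/fs/cgroup/demo
if mkdir -p "$cg" 2>/dev/null \
   && echo 512M > "$cg/memory.max" 2>/dev/null; then
    # PIDs written to $cg/cgroup.procs are now subject to the limit;
    # allocations beyond it trigger reclaim and then the OOM killer.
    status=limit-set
else
    status=need-root-or-cgroup2   # typical outcome in a non-root shell
fi
echo "status=$status"
```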

vmstat Detailed Usage

vmstat 2
# Columns:
# r – runnable processes (run queue length)
# b – blocked processes
# swpd – used swap (sustained growth signals RAM shortage)
# free – free memory
# buff – buffer cache (raw disk blocks)
# cache – page cache (file data)
# si/so – swap in/out per second
# bi/bo – block I/O per second
# in – interrupts per second
# cs – context switches per second
# us – user CPU time
# sy – system CPU time
# id – idle CPU time
# wa – I/O wait time
# st – stolen time (virtualized environments)

pidstat Detailed Usage

# CPU usage per process
pidstat -u 1 10
# Memory usage per process
pidstat -r 1 10
# I/O usage per process
pidstat -d 1 10
# Context switches per process
pidstat -w 1 10
# Specific PID
pidstat -p 20955 -r 1 10

Key fields include %usr, %system, %CPU, VSZ, RSS, %MEM, kB_rd/s, kB_wr/s, cswch/s, nvcswch/s, etc.

Memory‑Performance Tools Overview

Memory tool matrix

Tools such as free, top, vmstat, pidstat, perf, cachetop, memleak, and the BCC utilities together cover most memory‑related performance questions.

Written by

Open Source Linux

Focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation/operations, cloud computing, and related professional knowledge.
