Master Linux Performance: Boost Throughput, Cut Latency, and Optimize CPU & Memory
This guide explains how high concurrency and fast response depend on throughput and latency, defines the key performance metrics, and shows how to interpret average load, CPU context switches, and memory usage. It also provides practical Linux tools and command-line examples for diagnosing and tuning system performance.
Linux Performance Optimization
Performance Metrics
High concurrency and fast response are measured by two core indicators: throughput and latency.
Performance problems arise when a system resource becomes a bottleneck, or when request handling is too slow to sustain more traffic. Performance analysis aims to locate these bottlenecks and mitigate them.
From the application perspective: directly impacts end‑user experience.
From the system resource perspective: resource utilization, saturation, etc.
Key Steps
Select metrics to evaluate application and system performance.
Set performance targets for applications and the system.
Conduct performance baseline testing.
Analyze performance to locate bottlenecks.
Implement performance monitoring and alerts.
Understanding "Average Load"
Average load is the average number of runnable and uninterruptible processes over a time interval (uptime reports 1-, 5-, and 15-minute averages); it is not directly comparable to CPU utilization.
Uninterruptible processes (state D) are in kernel-mode critical paths, such as waiting for disk I/O to complete. Uninterruptible sleep protects processes and hardware: interrupting an in-flight operation could leave data inconsistent.
When Is Average Load Reasonable?
Monitor average load in production and compare it with historical trends. If the load rises sharply, investigate promptly. A common rule of thumb is to keep average load below the number of CPU cores (or around 70% of that value).
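As a quick check, the rule of thumb above can be scripted against /proc/loadavg. This is a sketch assuming a Linux system with nproc and awk available:

```shell
# Compare the 1-minute load average against 70% of the CPU count,
# the rule of thumb mentioned above.
load1=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
cpus=$(nproc)                           # logical CPU count
awk -v l="$load1" -v c="$cpus" 'BEGIN {
  t = 0.7 * c
  if (l > t) printf "WARNING: load %.2f > %.2f (70%% of %d CPUs)\n", l, t, c
  else       printf "OK: load %.2f <= %.2f (70%% of %d CPUs)\n", l, t, c
}'
```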
Average load is often confused with CPU utilization, but they are not equivalent:
CPU‑intensive workloads raise both load and CPU usage.
I/O‑intensive workloads raise load while CPU usage may stay low.
Heavy process scheduling raises both load and CPU usage.
CPU
CPU Context Switch (Part 1)
A CPU context switch saves the previous task’s registers and program counter, then loads the new task’s context before jumping to its entry point. The saved context resides in the kernel until the task is scheduled again.
Context switches are categorized by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. A system call triggers two context switches: user → kernel (saving user registers, loading kernel registers) and kernel → user (restoring user registers).
System calls are technically privilege‑mode switches, not full process switches.
Process switches occur only when the scheduler decides to run a different process on the CPU: when a time slice expires, when the process blocks on a resource, when it sleeps explicitly, when a higher-priority task preempts it, or when a hardware interrupt intervenes.
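The kernel exposes how often each of these cases hit a given process. A sketch reading the per-process counters from /proc (assuming Linux; $$ is the current shell's PID):

```shell
# Per-process context-switch counters live in /proc/<pid>/status.
# voluntary_ctxt_switches: the process blocked (e.g. waiting for I/O);
# nonvoluntary_ctxt_switches: the scheduler preempted it.
grep ctxt_switches /proc/$$/status
```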
Thread Context Switch
Two cases exist:
Threads belong to the same process – only thread‑local data and registers change; virtual memory stays the same.
Threads belong to different processes – same cost as a process switch.
Intra‑process thread switches consume fewer resources, which is why multithreading can be advantageous.
Interrupt Context Switch
Interrupt context switches involve only kernel-mode state (CPU registers, the kernel stack, and hardware interrupt parameters) and do not touch a process's user-space resources. Because interrupts have higher priority than processes, interrupt handling preempts the normal scheduling and execution of processes.
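Interrupt activity is also visible in /proc. A sketch, assuming a Linux system:

```shell
# Total interrupts serviced since boot (second field of the "intr"
# line in /proc/stat), plus a peek at per-IRQ, per-CPU counts:
awk '/^intr/ {print "interrupts since boot:", $2; exit}' /proc/stat
head -n 5 /proc/interrupts
```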
CPU Context Switch (Part 2)
Use vmstat to view overall context-switch statistics:
<code>vmstat 5 # output every 5 seconds</code>
<code>procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----</code>
<code> r b swpd free buff cache si so bi bo in cs us sy id wa st</code>
<code> 1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0</code>
Key columns:
cs : context switches per second.
in : interrupts per second.
r : length of the run queue (processes ready or running).
b : processes in uninterruptible sleep.
To inspect per-process switches, use pidstat -w:
<code>pidstat -w 5</code>
<code>14:51:16 UID PID cswch/s nvcswch/s Command</code>
<code>... (sample output) ...</code>
cswch counts voluntary switches, which occur when a process blocks because a resource it needs (e.g., I/O or memory) is unavailable; nvcswch counts involuntary switches, where the scheduler preempts the process, for example when its time slice expires.
CPU Performance Indicators
CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).
Average load (ideal ≈ number of logical CPUs).
Process context switches (voluntary vs. involuntary).
CPU cache hit rate (L1/L2/L3).
Performance Tools
Use uptime to view average load.
Combine mpstat and pidstat to pinpoint high-load processes.
Use top for a quick CPU usage overview.
Apply perf top / perf record / perf report to drill into hot functions.
For interrupt- and I/O-related issues, examine /proc/softirqs and sar -r -S (memory and swap statistics).
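To see what these tools read under the hood, overall CPU usage can be computed by hand from two /proc/stat samples (busy time = total time minus idle and iowait). A sketch assuming the standard Linux /proc/stat field layout:

```shell
# Sample /proc/stat twice, one second apart, and compute the busy
# fraction: (total - idle - iowait) / total over the interval.
# Fields on the "cpu" line: user nice system idle iowait irq softirq ...
read_stat() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $5+$6; exit}' /proc/stat; }
set -- $(read_stat); t1=$1; i1=$2
sleep 1
set -- $(read_stat); t2=$1; i2=$2
awk -v dt=$((t2 - t1)) -v di=$((i2 - i1)) \
  'BEGIN { if (dt > 0) printf "cpu busy: %.1f%%\n", 100 * (dt - di) / dt }'
```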
Memory
How Linux Memory Works
Only the kernel can access physical RAM directly. Each process receives an isolated virtual address space that appears contiguous; the kernel maps it to physical pages via page tables, which the MMU walks on each access (recently used translations are cached in the TLB).
When a virtual address is not present in the page table, a page‑fault occurs; the kernel allocates a physical page, updates the page table, and resumes the process.
Linux uses multi‑level page tables and HugePages to reduce page‑table overhead.
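Whether huge pages are available on a given machine can be checked directly. A sketch assuming a Linux system:

```shell
# Page tables are walked on every TLB miss; huge pages (2 MiB or
# 1 GiB instead of 4 KiB) shrink the tables and reduce misses.
# Check the huge-page size and pool configured on this machine:
grep -E '^(HugePages_Total|HugePages_Free|Hugepagesize)' /proc/meminfo
```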
Virtual Memory Layout
Read‑only segment : code and constants.
Data segment : global variables.
Heap : dynamically allocated memory, grows upward.
Memory‑mapped segment : shared libraries, mmap’ed files, grows downward.
Stack : local variables and call frames, typically 8 MiB.
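The layout above is directly observable in /proc. A sketch inspecting the current shell's own mappings (assuming Linux; $$ expands to the shell's PID):

```shell
# The segments above are visible in /proc/<pid>/maps; [heap] and
# [stack] mark the heap and stack regions of the current shell:
grep -E '\[(heap|stack)\]' /proc/$$/maps
```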
Allocation & Reclamation
Allocation
brk() for small allocations (< 128 KiB, the glibc default threshold) – moves the program break at the top of the heap.
mmap() for large allocations (> 128 KiB) – reserves address space in the memory-mapped region.
Both allocate virtual memory only; physical pages are committed on first access, which triggers a minor page fault.
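Those first-touch commits show up as page-fault counters. A sketch reading them for the current shell (field positions assumed from the standard Linux /proc/[pid]/stat layout):

```shell
# Fields 10 and 12 of /proc/<pid>/stat are minflt (page committed on
# first touch, no disk I/O) and majflt (page had to be read from disk).
awk '{print "minor faults:", $10, "  major faults:", $12}' /proc/$$/stat
```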
Reclamation
Cache reclamation via LRU.
Swap out rarely used pages.
OOM killer terminates memory-hogging processes (adjustable via /proc/<pid>/oom_adj; modern kernels use /proc/<pid>/oom_score_adj, range -1000 to 1000).
Example to lower a process's OOM score:
<code>echo -16 > /proc/$(pidof myapp)/oom_adj</code>
Viewing Memory Usage
free – overall system memory.
top / ps – per-process memory (VIRT, RES, SHR, %MEM).
Buffers vs. Cache
Buffers cache raw block-device data; the page cache (Cached) stores file contents read through the filesystem. Both accelerate I/O but consume RAM that the kernel can reclaim when needed.
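Both values come straight from /proc/meminfo. A sketch, assuming a Linux system:

```shell
# Buffers (raw block-device pages) and Cached (file page cache) are
# reported separately in /proc/meminfo; both count toward MemAvailable
# because the kernel can reclaim them under memory pressure.
grep -E '^(Buffers|Cached|MemAvailable)' /proc/meminfo
```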
Cache Hit Rate
Higher cache hit rates mean more requests are served from RAM, improving performance. Tools such as cachestat, cachetop, and pcstat can measure hit rates.
Direct I/O (O_DIRECT)
When a program opens a file with O_DIRECT, the kernel bypasses the page cache, so reads go straight to disk and can be much slower when the data is not already in memory.
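The difference can be exercised with dd. A sketch, assuming a Linux system; note that some filesystems (tmpfs, for one) reject O_DIRECT, so the example falls back to a buffered write:

```shell
# Write the same data with O_DIRECT (oflag=direct) or, if the
# filesystem rejects it, through the page cache.
f="${TMPDIR:-/tmp}/direct_io_demo.bin"
dd if=/dev/zero of="$f" bs=4096 count=16 oflag=direct 2>/dev/null \
  || dd if=/dev/zero of="$f" bs=4096 count=16 2>/dev/null
wc -c < "$f"    # 16 x 4096 = 65536 bytes either way
rm -f "$f"
```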
To confirm whether a running process opens files with O_DIRECT, trace its system calls:
<code>strace -p $(pgrep app)</code>
Memory Leaks
Leaks occur when allocated heap memory is never freed; related bugs such as out-of-bounds accesses can corrupt memory or crash the process. Use BCC's memleak tool to trace outstanding allocations and identify leaking call stacks.
<code>/usr/share/bcc/tools/memleak -a -p $(pidof app)</code>
Swap
When RAM is scarce, Linux swaps out anonymous pages to disk. Swap aggressiveness is tuned via /proc/sys/vm/swappiness (0–100). On NUMA systems, a node can exhaust its local memory and trigger swapping even when the machine as a whole appears to have memory free.
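The current setting can be read (and, with root, changed) through procfs. A sketch assuming a Linux system:

```shell
# swappiness controls how eagerly the kernel swaps anonymous pages
# instead of reclaiming file cache; lower values favor keeping
# process memory in RAM.
cat /proc/sys/vm/swappiness
# To lower it (requires root):
# echo 10 > /proc/sys/vm/swappiness
```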
Check swap usage with free and monitor it over time with sar -r -S or vmstat.
Quick Memory Bottleneck Analysis
Start with free and top for a high-level view.
Use vmstat and pidstat over time to spot trends.
Drill down with allocation analysis, cache inspection, and per-process diagnostics.
Common recommendations: avoid swap when possible, lower swappiness, use memory pools or HugePages, leverage caches, apply cgroup memory limits, and adjust OOM scores for critical services.
Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/