Master Linux Performance: Boost Throughput, Cut Latency, and Optimize CPU & Memory
This guide explains how high concurrency and fast response depend on throughput and latency, defines the key performance metrics, and shows how to interpret average load, CPU context switches, and memory usage. It also provides practical Linux tools and command-line examples for diagnosing and tuning system performance.
Linux Performance Optimization
Performance Metrics
High concurrency and fast response are measured by two core indicators: throughput and latency.
Performance problems arise when a system resource becomes a bottleneck, or when request handling is too slow to sustain more traffic. Performance analysis aims to locate these bottlenecks and mitigate them.
From the application perspective: directly impacts end‑user experience.
From the system resource perspective: resource utilization, saturation, etc.
Key Steps
Select metrics to evaluate application and system performance.
Set performance targets for applications and the system.
Conduct performance baseline testing.
Analyze performance to locate bottlenecks.
Implement performance monitoring and alerts.
Understanding "Average Load"
Average load is the average number of runnable and uninterruptible processes over a time interval (uptime reports 1-, 5-, and 15-minute averages); it is not directly comparable to CPU utilization.
Uninterruptible processes (state D) are in kernel-mode critical paths, such as waiting for disk I/O to complete. Uninterruptible sleep protects processes and hardware: interrupting an in-flight operation could leave data inconsistent.
When Is Average Load Reasonable?
Monitor average load in production and compare it with historical trends. If the load rises sharply, investigate promptly. A common rule of thumb is to keep average load below the number of CPU cores (or around 70% of that value).
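As a quick check, the rule of thumb above can be scripted against /proc/loadavg. This is a sketch assuming a Linux system with nproc and awk available:

```shell
# Compare the 1-minute load average against 70% of the CPU count,
# the rule of thumb mentioned above.
load1=$(cut -d ' ' -f1 /proc/loadavg)   # 1-minute load average
cpus=$(nproc)                           # logical CPU count
awk -v l="$load1" -v c="$cpus" 'BEGIN {
  t = 0.7 * c
  if (l > t) printf "WARNING: load %.2f > %.2f (70%% of %d CPUs)\n", l, t, c
  else       printf "OK: load %.2f <= %.2f (70%% of %d CPUs)\n", l, t, c
}'
```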
Average load is often confused with CPU utilization, but they are not equivalent:
CPU‑intensive workloads raise both load and CPU usage.
I/O‑intensive workloads raise load while CPU usage may stay low.
Heavy process scheduling raises both load and CPU usage.
CPU
CPU Context Switch (Part 1)
A CPU context switch saves the previous task’s registers and program counter, then loads the new task’s context before jumping to its entry point. The saved context resides in the kernel until the task is scheduled again.
Context switches are categorized by task type:
Process context switch
Thread context switch
Interrupt context switch
Process Context Switch
Linux separates kernel space and user space. A system call triggers two context switches: user → kernel (saving user registers, loading kernel registers) and kernel → user (restoring user registers).
System calls are technically privilege‑mode switches, not full process switches.
Process switches occur only when the scheduler decides to run a different process on the CPU: when a time slice expires, when the process blocks on a resource, when it sleeps explicitly, when a higher-priority task preempts it, or when a hardware interrupt intervenes.
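The kernel exposes how often each of these cases hit a given process. A sketch reading the per-process counters from /proc (assuming Linux; $$ is the current shell's PID):

```shell
# Per-process context-switch counters live in /proc/<pid>/status.
# voluntary_ctxt_switches: the process blocked (e.g. waiting for I/O);
# nonvoluntary_ctxt_switches: the scheduler preempted it.
grep ctxt_switches /proc/$$/status
```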
Thread Context Switch
Two cases exist:
Threads belong to the same process – only thread‑local data and registers change; virtual memory stays the same.
Threads belong to different processes – same cost as a process switch.
Intra‑process thread switches consume fewer resources, which is why multithreading can be advantageous.
Interrupt Context Switch
Interrupt context switches involve only kernel-mode state (CPU registers, the kernel stack, and hardware interrupt parameters) and do not touch a process's user-space resources. Because interrupts have higher priority than processes, interrupt handling preempts the normal scheduling and execution of processes.
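Interrupt activity is also visible in /proc. A sketch, assuming a Linux system:

```shell
# Total interrupts serviced since boot (second field of the "intr"
# line in /proc/stat), plus a peek at per-IRQ, per-CPU counts:
awk '/^intr/ {print "interrupts since boot:", $2; exit}' /proc/stat
head -n 5 /proc/interrupts
```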
CPU Context Switch (Part 2)
Use vmstat to view overall context-switch statistics:
<code>vmstat 5 # output every 5 seconds</code>
<code>procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----</code>
<code> r b swpd free buff cache si so bi bo in cs us sy id wa st</code>
<code> 1 0 0 103388 145412 511056 0 0 18 60 1 1 2 1 96 0 0</code>
Key columns:
cs : context switches per second.
in : interrupts per second.
r : length of the run queue (processes ready or running).
b : processes in uninterruptible sleep.
To inspect per-process switches, use pidstat -w:
<code>pidstat -w 5</code>
<code>14:51:16 UID PID cswch/s nvcswch/s Command</code>
<code>... (sample output) ...</code>
cswch counts voluntary switches, which occur when a process blocks because a resource it needs (e.g., I/O or memory) is unavailable; nvcswch counts involuntary switches, where the scheduler preempts the process, for example when its time slice expires.
CPU Performance Indicators
CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).
Average load (ideal ≈ number of logical CPUs).
Process context switches (voluntary vs. involuntary).
CPU cache hit rate (L1/L2/L3).
Performance Tools
Use uptime to view average load.
Combine mpstat and pidstat to pinpoint high-load processes.
Use top for a quick CPU usage overview.
Apply perf top / perf record / perf report to drill into hot functions.
For interrupt- and I/O-related issues, examine /proc/softirqs and sar -r -S (memory and swap statistics).
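To see what these tools read under the hood, overall CPU usage can be computed by hand from two /proc/stat samples (busy time = total time minus idle and iowait). A sketch assuming the standard Linux /proc/stat field layout:

```shell
# Sample /proc/stat twice, one second apart, and compute the busy
# fraction: (total - idle - iowait) / total over the interval.
# Fields on the "cpu" line: user nice system idle iowait irq softirq ...
read_stat() { awk '/^cpu /{print $2+$3+$4+$5+$6+$7+$8, $5+$6; exit}' /proc/stat; }
set -- $(read_stat); t1=$1; i1=$2
sleep 1
set -- $(read_stat); t2=$1; i2=$2
awk -v dt=$((t2 - t1)) -v di=$((i2 - i1)) \
  'BEGIN { if (dt > 0) printf "cpu busy: %.1f%%\n", 100 * (dt - di) / dt }'
```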
Memory
How Linux Memory Works
Only the kernel can access physical RAM directly. Each process receives an isolated virtual address space that appears contiguous; the kernel maps it to physical pages via page tables, which the MMU walks on each access (recently used translations are cached in the TLB).
When a virtual address is not present in the page table, a page‑fault occurs; the kernel allocates a physical page, updates the page table, and resumes the process.
Linux uses multi‑level page tables and HugePages to reduce page‑table overhead.
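Whether huge pages are available on a given machine can be checked directly. A sketch assuming a Linux system:

```shell
# Page tables are walked on every TLB miss; huge pages (2 MiB or
# 1 GiB instead of 4 KiB) shrink the tables and reduce misses.
# Check the huge-page size and pool configured on this machine:
grep -E '^(HugePages_Total|HugePages_Free|Hugepagesize)' /proc/meminfo
```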
Virtual Memory Layout
Read‑only segment : code and constants.
Data segment : global variables.
Heap : dynamically allocated memory, grows upward.
Memory‑mapped segment : shared libraries, mmap’ed files, grows downward.
Stack : local variables and call frames, typically 8 MiB.
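The layout above is directly observable in /proc. A sketch inspecting the current shell's own mappings (assuming Linux; $$ expands to the shell's PID):

```shell
# The segments above are visible in /proc/<pid>/maps; [heap] and
# [stack] mark the heap and stack regions of the current shell:
grep -E '\[(heap|stack)\]' /proc/$$/maps
```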
Allocation & Reclamation
Allocation
brk() for small allocations (< 128 KiB, the glibc default threshold) – moves the program break at the top of the heap.
mmap() for large allocations (> 128 KiB) – reserves address space in the memory-mapped region.
Both allocate virtual memory only; physical pages are committed on first access, which triggers a minor page fault.
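Those first-touch commits show up as page-fault counters. A sketch reading them for the current shell (field positions assumed from the standard Linux /proc/[pid]/stat layout):

```shell
# Fields 10 and 12 of /proc/<pid>/stat are minflt (page committed on
# first touch, no disk I/O) and majflt (page had to be read from disk).
awk '{print "minor faults:", $10, "  major faults:", $12}' /proc/$$/stat
```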
Reclamation
Cache reclamation via LRU.
Swap out rarely used pages.
OOM killer terminates memory-hogging processes (adjustable via /proc/<pid>/oom_adj; modern kernels use /proc/<pid>/oom_score_adj, range -1000 to 1000).
Example to lower a process's OOM score:
<code>echo -16 > /proc/$(pidof myapp)/oom_adj</code>
Viewing Memory Usage
free – overall system memory.
top / ps – per-process memory (VIRT, RES, SHR, %MEM).
Buffers vs. Cache
Buffers cache raw block-device data; the page cache (Cached) stores file contents read through the filesystem. Both accelerate I/O but consume RAM that the kernel can reclaim when needed.
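Both values come straight from /proc/meminfo. A sketch, assuming a Linux system:

```shell
# Buffers (raw block-device pages) and Cached (file page cache) are
# reported separately in /proc/meminfo; both count toward MemAvailable
# because the kernel can reclaim them under memory pressure.
grep -E '^(Buffers|Cached|MemAvailable)' /proc/meminfo
```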
Cache Hit Rate
Higher cache hit rates mean more requests are served from RAM, improving performance. Tools such as cachestat, cachetop, and pcstat can measure hit rates.
Direct I/O (O_DIRECT)
When a program opens a file with O_DIRECT, the kernel bypasses the page cache, so reads go straight to disk and can be much slower when the data is not already in memory.
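The difference can be exercised with dd. A sketch, assuming a Linux system; note that some filesystems (tmpfs, for one) reject O_DIRECT, so the example falls back to a buffered write:

```shell
# Write the same data with O_DIRECT (oflag=direct) or, if the
# filesystem rejects it, through the page cache.
f="${TMPDIR:-/tmp}/direct_io_demo.bin"
dd if=/dev/zero of="$f" bs=4096 count=16 oflag=direct 2>/dev/null \
  || dd if=/dev/zero of="$f" bs=4096 count=16 2>/dev/null
wc -c < "$f"    # 16 x 4096 = 65536 bytes either way
rm -f "$f"
```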
To confirm whether a running process opens files with O_DIRECT, trace its system calls:
<code>strace -p $(pgrep app)</code>
Memory Leaks
Leaks occur when allocated heap memory is never freed; related bugs such as out-of-bounds accesses can corrupt memory or crash the process. Use BCC's memleak tool to trace outstanding allocations and identify leaking call stacks.
<code>/usr/share/bcc/tools/memleak -a -p $(pidof app)</code>
Swap
When RAM is scarce, Linux swaps out anonymous pages to disk. Swap aggressiveness is tuned via /proc/sys/vm/swappiness (0–100). On NUMA systems, a node can exhaust its local memory and trigger swapping even when the machine as a whole appears to have memory free.
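The current setting can be read (and, with root, changed) through procfs. A sketch assuming a Linux system:

```shell
# swappiness controls how eagerly the kernel swaps anonymous pages
# instead of reclaiming file cache; lower values favor keeping
# process memory in RAM.
cat /proc/sys/vm/swappiness
# To lower it (requires root):
# echo 10 > /proc/sys/vm/swappiness
```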
Check swap usage with free and monitor it over time with sar -r -S or vmstat.
Quick Memory Bottleneck Analysis
Start with free and top for a high-level view.
Use vmstat and pidstat over time to spot trends.
Drill down with allocation analysis, cache inspection, and per-process diagnostics.
Common recommendations: avoid swap when possible, lower swappiness, use memory pools or HugePages, leverage caches, apply cgroup memory limits, and adjust OOM scores for critical services.
Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/