
Unlock Linux Performance: Master Metrics, Tools, and Optimization Techniques

This guide explains Linux performance optimization by defining key metrics such as throughput, latency, average load, and CPU usage, describing how to select and interpret tools like vmstat, pidstat, perf, and dstat, and offering concrete steps to diagnose and fix CPU, memory, I/O, and context‑switch bottlenecks.

ITPUB

Linux Performance Optimization Overview

High concurrency and low latency, the usual performance goals, are measured by throughput and latency respectively. A performance problem exists when a system resource has hit a bottleneck while requests are still being handled too slowly.

Key Performance Indicators

Application perspective: throughput and latency, which directly shape user experience

System perspective: resource usage (utilization, saturation)

Typical Analysis Workflow

Select metrics and set performance goals.

Run benchmarks (e.g., ab -c 10 -n 100 http://host:port/).

Locate bottlenecks with broad‑scope tools (top, vmstat, pidstat).

Drill down using per‑process/thread tools (pidstat -w, pidstat -u, perf).

Apply targeted fixes (code changes, configuration, resource limits) and re‑measure.

Understanding Average Load

Average load is the average number of runnable or uninterruptible processes over a time interval. It is not directly equivalent to CPU utilization. Uninterruptible (D) processes are typically waiting for I/O.

A practical rule of thumb: keep average load below 70% of the number of CPU cores. Sudden spikes indicate a potential bottleneck.
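The rule of thumb can be checked mechanically. The sketch below is a minimal illustration, assuming the standard /proc/loadavg layout in which the first field is the 1‑minute load average; the function names are made up for this example.

```python
import os


def load_per_cpu(loadavg_text: str, cpu_count: int) -> float:
    """Return the 1-minute load average divided by the number of CPUs."""
    one_minute = float(loadavg_text.split()[0])  # first field of /proc/loadavg
    return one_minute / cpu_count


def exceeds_rule_of_thumb(loadavg_text: str, cpu_count: int,
                          threshold: float = 0.7) -> bool:
    """True when load per CPU is above the 70% rule of thumb."""
    return load_per_cpu(loadavg_text, cpu_count) > threshold


def check_local_host() -> bool:
    """Apply the check to the machine this runs on (Linux only)."""
    with open("/proc/loadavg") as f:
        return exceeds_rule_of_thumb(f.read(), os.cpu_count() or 1)
```

Calling check_local_host() on a Linux machine applies the same test to live data. Because load also counts uninterruptible processes, a True result is a prompt to investigate rather than proof that the CPUs are busy.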

CPU Metrics and Context Switching

Use vmstat 5 to view overall context‑switch and interrupt rates:

vmstat 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
 ...

Key fields:

cs: context switches per second, system‑wide (vmstat does not split voluntary from involuntary switches; pidstat -w does)

in: interrupts per second

r: length of the run queue (processes running or waiting for a CPU)

b: processes blocked in uninterruptible sleep
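For scripted checks, the vmstat columns can be turned into named values. This is a rough sketch that assumes the default column layout shown in the sample output above; parse_vmstat is a hypothetical helper, not part of any tool.

```python
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
"""


def parse_vmstat(output: str) -> list[dict[str, int]]:
    """Turn vmstat text into one dict per sample row, keyed by column name."""
    lines = [ln for ln in output.splitlines() if ln.strip()]
    # The column-name row is the one whose first two fields are "r" and "b".
    header = next(i for i, ln in enumerate(lines) if ln.split()[:2] == ["r", "b"])
    names = lines[header].split()
    rows = []
    for ln in lines[header + 1:]:
        fields = ln.split()
        if len(fields) != len(names) or not all(f.isdigit() for f in fields):
            continue  # skip repeated headers and truncated lines
        rows.append(dict(zip(names, map(int, fields))))
    return rows


rows = parse_vmstat(SAMPLE)
print(rows[0]["r"], rows[0]["b"], rows[0]["in"], rows[0]["cs"])
```

Keeping the rows as dictionaries makes it easy to alert on, say, r staying above the CPU count across several samples.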

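For per‑process details, use pidstat -w 5 (context switches) or pidstat -w -u 1 (context switches alongside CPU usage); add -t, as in pidstat -wt 1, for thread‑level figures. In the output, cswch/s counts voluntary switches and nvcswch/s counts involuntary ones. Example output: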

pidstat -w 5
14:51:16   UID   PID  cswch/s  nvcswch/s  Command
14:51:21     0     1     0.80        0.00  systemd
14:51:21     0     6     1.40        0.00  ksoftirqd/0
14:51:21     0     9    32.67        0.00  rcu_sched

High cs together with a large r value often means excessive scheduling overhead.
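The kernel also exposes cumulative voluntary and involuntary switch counters per process in /proc/<pid>/status. The following sketch reads them; it assumes the field names voluntary_ctxt_switches and nonvoluntary_ctxt_switches used by current kernels, and the helper names are illustrative.

```python
def ctxt_switches(status_text: str) -> tuple[int, int]:
    """Return (voluntary, involuntary) switch counts from /proc/<pid>/status text."""
    voluntary = involuntary = 0
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key == "voluntary_ctxt_switches":
            voluntary = int(value)
        elif key == "nonvoluntary_ctxt_switches":
            involuntary = int(value)
    return voluntary, involuntary


def read_ctxt_switches(pid: int) -> tuple[int, int]:
    """Read the counters for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return ctxt_switches(f.read())
```

These are totals since the process started, so sampling twice and subtracting gives a rate comparable to pidstat. Many voluntary switches usually point to waiting on I/O or locks; many involuntary switches point to competition for CPU time.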

CPU Usage Analysis

Quick view with top or ps. For deeper insight, use perf:

perf top -g -p <PID>
perf record -g -p <PID>
perf report -g

Example: a sysbench benchmark showed high CPU usage; perf top revealed the hot functions sqrt and add_function. Removing the stray test loop that called sqrt dramatically improved Nginx throughput.

Diagnosing High System CPU When No Process Stands Out

When top shows high overall CPU but no single process dominates, examine the run‑queue length (r) and the number of processes in Running state. Short‑lived or repeatedly crashing processes can be missed. Use pstree -aps to trace parent‑child relationships and uncover hidden load sources.

Memory Management Fundamentals

Linux gives each process an isolated virtual address space split into kernel and user regions. Physical memory is allocated on demand via page faults, and the MMU translates virtual to physical addresses using multi‑level page tables.
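On‑demand allocation can be observed from user space. The sketch below maps anonymous memory and writes one byte to each page, then reports how many minor page faults the process incurred while doing so. It relies only on Python's standard mmap and resource modules; the exact count depends on the kernel and its configuration, so no particular number is implied.

```python
import mmap
import resource


def minor_faults() -> int:
    """Minor page faults charged to this process so far."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_pages(size_bytes: int) -> int:
    """Map anonymous memory, write to every page, return minor faults incurred."""
    before = minor_faults()
    buf = mmap.mmap(-1, size_bytes)      # anonymous mapping, not yet backed by RAM
    for offset in range(0, size_bytes, mmap.PAGESIZE):
        buf[offset] = 1                  # first write to a page faults it in
    after = minor_faults()
    buf.close()
    return after - before


print("minor faults while touching 16 MiB:", touch_pages(16 * 1024 * 1024))
```

Creating the mapping is cheap; the cost appears when each page is first touched, which is why the fault counter moves during the loop rather than at the mmap call.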

Typical virtual‑memory layout (low to high addresses):

Read‑only segment (code, constants)

Data segment (globals)

Heap (dynamic allocations; glibc malloc uses brk() for requests under 128 KB by default and mmap() for larger blocks)

Memory‑mapped region (shared libraries, mmap)

Stack (local variables, call frames)
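The layout of a real process is visible in /proc/<pid>/maps. As a small sketch, the function below totals the mapped bytes per region label; the sample text is invented for illustration (the path /usr/bin/app and all addresses are hypothetical), and the parsing assumes the usual six‑column maps format.

```python
SAMPLE_MAPS = """\
00400000-00452000 r-xp 00000000 08:02 173521 /usr/bin/app
00651000-00652000 rw-p 00051000 08:02 173521 /usr/bin/app
00e03000-00e24000 rw-p 00000000 00:00 0 [heap]
7f2c8a000000-7f2c8a021000 rw-p 00000000 00:00 0
7ffd3c1f0000-7ffd3c211000 rw-p 00000000 00:00 0 [stack]
"""


def summarize_maps(maps_text: str) -> dict[str, int]:
    """Sum mapped bytes per label: [heap], [stack], a file path, or 'anonymous'."""
    totals: dict[str, int] = {}
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue  # not a mapping line
        start, end = (int(x, 16) for x in fields[0].split("-"))
        label = fields[5].strip() if len(fields) == 6 else "anonymous"
        totals[label] = totals.get(label, 0) + (end - start)
    return totals


for label, size in summarize_maps(SAMPLE_MAPS).items():
    print(f"{size:>10} bytes  {label}")
```

To inspect a live process, pass the contents of /proc/<pid>/maps (or /proc/self/maps) instead of the sample string.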

Memory Allocation & Reclamation

Allocation: brk() expands the heap; freed memory stays cached for reuse instead of going straight back to the system. mmap() creates a new mapping; freed memory is returned to the system immediately, so the next allocation triggers page faults again.

Reclamation: LRU cache eviction, swapping out rarely used pages, and the OOM killer for runaway processes. Adjust /proc/sys/vm/swappiness (0‑100; newer kernels accept values up to 200) to control swap aggressiveness.

Monitoring Memory Usage

Common commands:

free – overall memory and swap.

top / ps – per‑process VIRT, RES, SHR, %MEM.

pidstat -r 1 10 – per‑process page‑fault rates and RSS.

pidstat -r 1 10
UID   PID  minflt/s  majflt/s   VSZ    RSS   %MEM  Command
0     1      0.20      0.00 191256   3064   0.01  systemd
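The system‑wide figures that free reports come from /proc/meminfo. The sketch below parses that file and derives a rough "used" figure and the swap in use. The sample values are invented, and the used formula (total minus free, buffers, and cache) is one common approximation; different versions of free compute it differently.

```python
SAMPLE_MEMINFO = """\
MemTotal:        8000000 kB
MemFree:         1000000 kB
MemAvailable:    5000000 kB
Buffers:          200000 kB
Cached:          3000000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""


def parse_meminfo(text: str) -> dict[str, int]:
    """Return /proc/meminfo values keyed by field name (kB where a unit applies)."""
    info = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            info[key.strip()] = int(parts[0])
    return info


def memory_summary(info: dict[str, int]) -> dict[str, int]:
    """Approximate used memory and swap in use, in kB."""
    used = info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]
    return {"used_kb": used, "swap_used_kb": info["SwapTotal"] - info["SwapFree"]}


print(memory_summary(parse_meminfo(SAMPLE_MEMINFO)))
```

On a live system, read /proc/meminfo and pass its contents to parse_meminfo in place of the sample.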

Detecting Memory Leaks

Use BCC's memleak to trace allocations that are never freed:

/usr/share/bcc/tools/memleak -a -p $(pidof app)

Identify the leaking function (e.g., a Fibonacci routine) and add the missing free() calls.

Swap and NUMA Considerations

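When swap usage rises despite free memory, NUMA effects may be involved. View per‑node memory with numactl --hardware and control local vs. remote reclamation via /proc/sys/vm/zone_reclaim_mode. The default, 0, lets the kernel take free memory from other nodes as well as reclaim locally; 1, 2, and 4 restrict reclamation to the local node, with 2 additionally allowing dirty pages to be written back and 4 allowing pages to be swapped out.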

Swap aggressiveness is tuned with /proc/sys/vm/swappiness. Even with swappiness=0, swap can occur if free memory + file pages fall below the high watermark.
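The zone reclaim setting is a bitmask, so values such as 3 or 5 combine behaviours. A minimal decoder, assuming the bit meanings given in the kernel's vm sysctl documentation and the path /proc/sys/vm/zone_reclaim_mode, might look like this:

```python
# Bit meanings as described in the kernel's vm sysctl documentation.
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}


def describe_zone_reclaim_mode(value: int) -> list[str]:
    """Translate a zone_reclaim_mode value into the behaviours it enables."""
    if value == 0:
        return ["zone reclaim off"]
    return [text for bit, text in ZONE_RECLAIM_BITS.items() if value & bit]


def read_zone_reclaim_mode(path: str = "/proc/sys/vm/zone_reclaim_mode") -> int:
    """Read the current setting (Linux only)."""
    with open(path) as f:
        return int(f.read())


print(describe_zone_reclaim_mode(5))
```

With the mode off, allocations can fall back to other nodes before local memory is reclaimed, which is usually the better default for general workloads.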

Performance Tool Matrix

Figure: Performance tool matrix

The matrix maps metrics (CPU, memory, I/O, latency, etc.) to appropriate tools such as top, vmstat, pidstat, perf, dstat, BCC utilities (memleak, cachetop), and others.

Practical Optimization Steps

Compile with gcc -O2 (or higher) for better code generation.

Prefer asynchronous I/O to avoid blocking.

Use multithreading instead of multiprocess to reduce context‑switch overhead.

Bind high‑priority services to specific CPUs (CPU affinity) and raise the nice value of less critical workloads, which lowers their scheduling priority.

Apply cgroups to cap memory and CPU usage.

Enable NUMA‑aware memory allocation (e.g., numactl --cpunodebind=0 --membind=0 keeps a process's CPUs and memory on node 0; --cpunodebind alone only pins the CPUs).

Balance interrupt handling across CPUs with irqbalance.
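As one illustration of the cgroups step, the sketch below computes the strings that the cgroup v2 interface files cpu.max and memory.max expect. It assumes a cgroup v2 hierarchy mounted at /sys/fs/cgroup with the cpu and memory controllers enabled for child groups; the group name and helper names are hypothetical, and writing to the real hierarchy requires root.

```python
from pathlib import Path


def cgroup_v2_limits(cpus: float, memory_mib: int,
                     period_us: int = 100_000) -> dict[str, str]:
    """Return the values to write into cpu.max and memory.max."""
    quota_us = round(cpus * period_us)   # CPU time allowed per period
    return {
        "cpu.max": f"{quota_us} {period_us}",
        "memory.max": str(memory_mib * 1024 * 1024),  # bytes
    }


def apply_limits(group: str, limits: dict[str, str],
                 root: str = "/sys/fs/cgroup") -> None:
    """Create the group directory if needed and write each limit file."""
    group_dir = Path(root) / group
    group_dir.mkdir(exist_ok=True)
    for name, value in limits.items():
        (group_dir / name).write_text(value)


print(cgroup_v2_limits(cpus=0.5, memory_mib=512))
```

A process is then placed under the limits by writing its PID to the group's cgroup.procs file. Separating the calculation from the write keeps the arithmetic testable without touching the live hierarchy.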

Typical Analysis Workflow

Run broad tools (top, vmstat, pidstat) to locate the symptom.

Drill down with pidstat -d, pidstat -w, or perf to pinpoint the offending process or thread.

If I/O is the bottleneck, examine /proc/interrupts, then use strace to trace system calls or perf record -g to capture call stacks.

Apply targeted fixes (code changes, configuration tweaks, resource limits) and re‑measure.

CPU Performance Metrics

CPU usage: user, system, iowait, soft/hard IRQ, steal/guest.

Average load – ideal value equals number of logical CPUs.

Context switches – voluntary vs. involuntary; excessive switches waste CPU cycles.

CPU cache hit rate – higher is better (L1/L2 per core, L3 shared).
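The usage categories above are derived from the cumulative counters on the first line of /proc/stat. The sketch below computes percentages from two samples of that line. It assumes the standard field order; the sample numbers in the usage are invented.

```python
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest", "guest_nice")


def parse_cpu_line(line: str) -> dict[str, int]:
    """Parse the aggregate 'cpu' line of /proc/stat into named counters."""
    return dict(zip(FIELDS, map(int, line.split()[1:])))


def cpu_percentages(before: str, after: str) -> dict[str, float]:
    """Percentage of time spent in each category between two samples."""
    a, b = parse_cpu_line(before), parse_cpu_line(after)
    delta = {k: b[k] - a[k] for k in a}
    # Guest time is already counted inside user/nice, so leave it out of the total.
    total = sum(v for k, v in delta.items() if k not in ("guest", "guest_nice"))
    if total == 0:
        return {}
    return {k: 100.0 * v / total for k, v in delta.items()}


first = "cpu  100 0 50 800 50 0 0 0 0 0"
second = "cpu  160 0 70 900 70 0 0 0 0 0"
print(cpu_percentages(first, second))
```

In practice the two strings would be the first line of /proc/stat read a second or more apart, which is what top and vmstat do.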

Memory Performance Metrics

Used / free memory, buffers, cache, swap.

Per‑process VIRT, RES, SHR, %MEM.

Page‑fault rates (minor vs. major).

Cache and buffer hit ratios (use BCC tools cachetop, cachestat).

Key Tools Overview

top / htop – real‑time CPU & memory.

vmstat – system‑wide CPU, memory, I/O, and interrupt statistics.

pidstat – per‑process/thread CPU, memory, I/O, and context‑switch metrics.

perf – hardware‑level profiling, hot‑function identification.

dstat – combined CPU, disk, network, and other resources.

BCC utilities (memleak, cachetop, cachestat) – deep kernel‑space insight.

Example: High System CPU Without Visible Culprit

When top shows high CPU but no process dominates, check the run‑queue length (r) and look for many processes in Running state. Short‑lived or repeatedly crashing processes may be invisible to top. Use pstree -aps to find parent processes (e.g., a stress test launched by php‑fpm) and then profile the offending binary with perf.

Example: Direct I/O Causing High iowait

In a container running an app with O_DIRECT, iowait spikes while CPU usage stays low. strace -p shows openat(..., O_RDONLY|O_DIRECT). Removing the O_DIRECT flag restores cache usage and reduces iowait.

Example: Memory Leak Detection

Run the leaking application in a Docker container, then execute:

/usr/share/bcc/tools/memleak -a -p $(pidof app)

The output points to the leaking function; adding the missing free() eliminates the leak.

Example: Swap Increase Diagnosis

Monitor swap with free and sar -r -S 1. If swap rises while buffers dominate memory, use cachetop to check cache hit rate. Low hit rate indicates heavy I/O; investigate the responsible process with pidstat -d and strace or perf record.

Quick Reference Commands

vmstat 2 – system snapshot every 2 seconds.

pidstat -w 5 – context switches per second.

pidstat -u 1 10 – CPU usage per process.

pidstat -r 1 10 – memory usage per process.

perf top -g -p <PID> – live hot‑function view.

strace -p <PID> – trace system calls.

numactl --hardware – NUMA node layout.

cat /proc/interrupts – interrupt distribution.

pstree -aps <PID> – process ancestry.

Optimization Checklist

Compile with optimization flags (e.g., -O2).

Use asynchronous I/O and event‑driven designs.

Prefer threads over processes to reduce context switches.

Set CPU affinity and appropriate nice values.

Apply cgroups limits for CPU and memory.

Enable NUMA‑aware allocations.

Balance interrupts with irqbalance.

Disable or minimize swap; tune swappiness if swap is required.

Monitor cache hit rates and avoid O_DIRECT unless necessary.

Detect and fix memory leaks early with BCC tools.
