
How to Diagnose Linux Memory Issues: Metrics, Tools, and Step-by-Step Strategies

This guide explains essential Linux memory metrics, demonstrates how to use tools like free, top, vmstat, sar, and ps, and outlines a systematic, layered approach to pinpointing and resolving memory problems, including cache analysis and leak detection.

Tech Stroll Journey

1. Introduction

This article presents a practical, step‑by‑step workflow for diagnosing memory‑related issues on Linux systems. It builds on basic concepts such as memory fundamentals, swap usage, and cache types, and then shows how to translate metric observations into concrete troubleshooting actions.

2. Key Memory Metrics

System‑level metrics that should be monitored include:

Total memory, free memory, used memory, and buffer/cache.

Cache hit rate and page‑fault counters.

Reclaimable slab memory (SReclaimable) and non‑reclaimable slab memory (SUnreclaim).

Process‑level metrics:

Virtual memory (VIRT) – address space allocated.

Resident memory (RES or RSS) – physical pages actually in RAM.

Shared memory (SHR).

Memory‑usage percentage (%MEM).

Exclusive memory ≈ RES − SHR.

Virtual size (VSZ) in kilobytes.

Swap metrics:

Total swap, used swap, free swap.

Swap‑in and swap‑out rates.
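The system‑level metrics above can all be read directly from /proc/meminfo; a minimal shell sketch (field names are as they appear in the kernel interface, values in kB):

```shell
#!/bin/sh
# Read key system-level metrics straight from /proc/meminfo (values in kB).
mem_metric() {
    # awk concatenates key with ":" to match lines such as "MemTotal:"
    awk -v key="$1" '$1 == key ":" { print $2 }' /proc/meminfo
}

total_kb=$(mem_metric MemTotal)
free_kb=$(mem_metric MemFree)
avail_kb=$(mem_metric MemAvailable)
sreclaim_kb=$(mem_metric SReclaimable)
sunreclaim_kb=$(mem_metric SUnreclaim)

echo "total=${total_kb} free=${free_kb} available=${avail_kb} (kB)"
echo "slab: reclaimable=${sreclaim_kb} unreclaimable=${sunreclaim_kb} (kB)"
```

The same helper extends to any other /proc/meminfo field, which is handy when a monitoring agent is not available.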

3. Essential Tools

Below are the most common commands for retrieving the metrics described above.

root@test:~# free -wh
               total        used        free      shared     buffers       cache   available
Mem:          1.0Ti        48Gi       803Gi       3.0Mi        2.0Gi       152Gi       954Gi
Swap:         39Gi          0B        39Gi

total – total physical memory. used – memory in use (total − free − buffers − cache). free – unused memory. shared – memory shared between processes. buffers – kernel buffers. cache – page cache + slab. available – estimate of memory that can be allocated to new applications without swapping.
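In scripts it is the available column, not free, that should drive low‑memory alerts, since the page cache is reclaimable. A minimal sketch (the 512 MiB threshold is an arbitrary example, and the awk field index assumes the default non‑wide procps‑ng layout where available is the seventh field):

```shell
#!/bin/sh
# Alert on low 'available' memory rather than low 'free'.
avail_kb=$(free -k | awk '/^Mem:/ { print $7 }')   # 7th field: available
threshold_kb=$((512 * 1024))                       # 512 MiB, example value
if [ "$avail_kb" -lt "$threshold_kb" ]; then
    echo "LOW MEMORY: ${avail_kb} kB available"
else
    echo "OK: ${avail_kb} kB available"
fi
```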
root@test:~# top
top - 11:34:44 up 526 days, 20:11,  1 user,  load average: 0.84, 0.92, 1.04
Tasks: 1343 total,   1 running, 1342 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.4 us,  0.0 sy,  0.0 ni, 99.6 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem : 1031349.6 total, 823056.8 free,  50146.1 used, 158146.7 buff/cache
MiB Swap:  40960.0 total,  40960.0 free,      0.0 used. 976896.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   4095 mysql     20   0 6189264   1.6g 550648 S  21.5   0.2 204441:48 mysqld
   2732 root      20   0  141524  34936   9324 S  17.2   0.0 194545:07 node_exporter
1129876 root      20   0 7095304   1.1g  54208 S   5.6   0.1   1770:30 prometheus
2936909 root      20   0 5411276 231104  56112 S   2.0   0.0  19108:45 wsssr_defence_s

PID – process identifier. USER – process owner. PR – priority. NI – nice value. VIRT – virtual memory (KiB). RES – resident memory (KiB). SHR – shared memory (KiB). S – process state (S=sleep, R=run, etc.). %CPU – CPU usage. %MEM – memory usage. TIME+ – cumulative CPU time. COMMAND – command name.
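For scripting or cron‑driven monitoring, top can emit the same table non‑interactively; a sketch using batch mode sorted by memory (flags per procps‑ng top):

```shell
#!/bin/sh
# One snapshot (-n 1) in batch mode (-b), sorted by %MEM (-o), top rows only.
top -b -n 1 -o %MEM | head -n 15
```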
root@test:~# ps aux | head -5
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root           1  0.1  0.0 171128 13972 ?        Ss   2024   1505:02 /sbin/init nopti
root           2  0.0  0.0      0     0 ?        S    2024    8:41 [kthreadd]
root           3  0.0  0.0      0     0 ?        I<   2024    0:00 [rcu_gp]
root           4  0.0  0.0      0     0 ?        I<   2024    0:00 [rcu_par_gp]

USER – process owner. PID – process ID. %CPU – CPU share. %MEM – memory share. VSZ – virtual size (KB). RSS – resident size (KB). TTY – controlling terminal. STAT – process state. START – start time. TIME – total CPU time. COMMAND – command line.
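ps can also rank processes by memory directly, which is often the fastest way to spot the largest resident‑set consumers:

```shell
#!/bin/sh
# Five largest processes by resident memory (RSS, in KB), header included.
ps aux --sort=-rss | head -n 6
# Or select only the fields of interest:
ps -eo pid,user,rss,vsz,comm --sort=-rss | head -n 6
```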
root@test:~# cat /proc/meminfo
...
Slab:            24081144 kB
SReclaimable:    18618148 kB
SUnreclaim:       5462996 kB
...
AnonHugePages:   3555328 kB
...
HugePages_Total:        0
HugePages_Free:         0
HugePages_Rsvd:         0
HugePages_Surp:         0
Hugepagesize:        2048 kB

SReclaimable – reclaimable slab memory. SUnreclaim – non‑reclaimable slab memory. AnonHugePages – memory used by transparent huge pages. HugePages_Total / HugePages_Free – configured huge‑page pool. Hugepagesize – size of each huge page (2 MiB).
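A quick way to judge slab pressure is the reclaimable share of Slab: a high SReclaimable percentage means the kernel can give most of it back under memory pressure. A sketch over /proc/meminfo:

```shell
#!/bin/sh
# What fraction of slab memory could the kernel reclaim under pressure?
awk '/^Slab:/         { slab = $2 }
     /^SReclaimable:/ { rec  = $2 }
     END { printf "slab=%d kB reclaimable=%d kB (%.0f%%)\n",
                  slab, rec, slab ? 100 * rec / slab : 0 }' /proc/meminfo
```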
root@test:~# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so   bi   bo   in   cs us sy id wa st
 0  0      0 842544896 2081680 160133632    0    0     0     4   0   0  1  0 99  0  0
 2  0      0 842544192 2081680 160133744    0    0     0    92 3923 5803  2  0 98  0  0

r – runnable processes. b – blocked processes. swpd – used swap (KB). free – free memory (KB). buff – buffer memory (KB). cache – cache memory (KB). si/so – swap in/out (KB/s). bi/bo – block I/O (blocks/s). in – interrupts per second. cs – context switches per second. us/sy/id/wa/st – CPU usage breakdown.
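Sustained nonzero si/so, rather than a one‑off spike, is what signals real swapping; a small sketch that averages a few samples (si and so are the 7th and 8th vmstat columns):

```shell
#!/bin/sh
# Average swap-in/out over three one-second vmstat samples.
vmstat 1 3 | awk 'NR > 2 { si += $7; so += $8; n++ }
                  END    { printf "avg si=%.1f so=%.1f KB/s\n", si/n, so/n }'
```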
root@test:~# sar -r 1 3
Linux 5.4.0-59-generic (jnai1asan01)   11/05/2025      _x86_64_       (128 CPU)

02:36:42 PM kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
02:36:43 PM 842489592 1000321788  45882776      4.34   2081692 141547176  53245784      4.85 142730504  44375344      1200
02:36:44 PM 842489340 1000321652  45883056      4.34   2081692 141547236  53245784      4.85 142730656  44375460      1260
02:36:45 PM 842489772 1000322360  45882228      4.34   2081692 141547508  53245784      4.85 142730568  44375676      1536

kbmemfree – free physical memory (KB). kbavail – estimated available memory (KB). kbmemused – used physical memory (KB). %memused – usage percent. kbbuffers – kernel buffers (KB). kbcached – cached data (KB). kbcommit – memory required by current workload (KB). %commit – commit ratio. kbactive – recently used memory (KB). kbinact – inactive memory (KB). kbdirty – dirty pages awaiting writeback (KB).

Additional utilities that complement the above commands are cachetop (cache‑hit rates), slabtop (slab usage), pmap (per‑process memory map), and leak‑detection tools memleak‑bpfcc and valgrind.
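For instance, pmap's extended mode gives per‑mapping RSS and dirty counts; here the current shell ($$) stands in for a real suspect PID:

```shell
#!/bin/sh
# Extended map (-x) of a process; the final 'total' line sums Kbytes/RSS/Dirty.
pmap -x $$ | tail -n 3
```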

4. Systematic Memory‑Problem Troubleshooting Flow

Initial assessment: Run free or top to obtain a high‑level view. If overall free memory is low, determine whether the drop is caused by cache consumption. Use vmstat or sar to observe cache trends over time.

Cache investigation: When cache grows unexpectedly, employ cachetop or slabtop and examine /proc/meminfo to identify which slabs or processes dominate the cache. Then run pmap <pid> on the offending process for a detailed breakdown.

Swap analysis: Verify swap activity with free and vmstat. Persistent swap‑in/out indicates either insufficient RAM or a process whose memory usage is rapidly increasing. Correlate swap metrics with the process list from top or ps to pinpoint the culprit.
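To attribute swap usage per process, each /proc/&lt;pid&gt;/status exposes a VmSwap line (kernel threads report none); a sketch ranking the top swap consumers:

```shell
#!/bin/sh
# Rank processes by swap usage (kB, descending); processes may exit mid-scan,
# so read errors are silenced.
for f in /proc/[0-9]*/status; do
    awk '/^Name:/   { name = $2 }
         /^VmSwap:/ { if ($2 > 0) print $2, name }' "$f"
done 2>/dev/null | sort -rn | head -n 10
```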

Leak detection: If memory usage continues to rise and OOM events occur, suspect a memory leak. Use memleak‑bpfcc (eBPF‑based) or valgrind --leak-check=full on the suspect process to locate leaking allocations and call stacks.
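Typical invocations for both tools look as follows; this sketch only prints the command lines, and &lt;pid&gt; and ./my_app are placeholders for the suspect process and binary:

```shell
#!/bin/sh
# Sketch: <pid> and ./my_app are placeholders for the suspect process/binary.
memleak_cmd="memleak-bpfcc -p <pid> 10"   # print outstanding allocations every 10 s
valgrind_cmd="valgrind --leak-check=full ./my_app"
echo "live tracing of a running process: $memleak_cmd"
echo "full-run analysis from startup:    $valgrind_cmd"
```

memleak‑bpfcc attaches to an already‑running process with little overhead, whereas valgrind requires restarting the program under its supervision but reports full call stacks for every leaked allocation.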

Hardware or external factors: Occasionally memory pressure originates from hardware anomalies (e.g., a faulty NIC generating excessive buffers) or disk‑I/O problems that force processes to allocate more memory. Cross‑check memory symptoms with CPU, disk, and network metrics before concluding the root cause.

This layered methodology enables rapid narrowing of the root cause and selection of the appropriate remediation technique.

Metrics · Linux · Troubleshooting · Tools
Written by Tech Stroll Journey

The philosophy behind "Stroll": continuous learning, curiosity‑driven, and practice‑focused.