How to Diagnose Linux Memory Issues: Metrics, Tools, and Step-by-Step Strategies
This guide explains essential Linux memory metrics, demonstrates how to use tools like free, top, vmstat, sar, and ps, and outlines a systematic, layered approach to pinpointing and resolving memory problems, including cache analysis and leak detection.
1. Introduction
This article presents a practical, step‑by‑step workflow for diagnosing memory‑related issues on Linux systems. It builds on basic concepts such as memory fundamentals, swap usage, and cache types, and then shows how to translate metric observations into concrete troubleshooting actions.
2. Key Memory Metrics
System‑level metrics that should be monitored include:
Total memory, free memory, used memory, and buffer/cache.
Cache hit rate and page‑fault counters.
Reclaimable slab memory (SReclaimable) and non‑reclaimable slab memory (SUnreclaim).
Process‑level metrics:
Virtual memory (VIRT) – address space allocated.
Resident memory (RES or RSS) – physical pages actually in RAM.
Shared memory (SHR).
Memory‑usage percentage (%MEM).
Exclusive memory ≈ RES − SHR.
Virtual size (VSZ) in kilobytes.
Swap metrics:
Total swap, used swap, free swap.
Swap‑in and swap‑out rates.
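The "exclusive ≈ RES − SHR" rule of thumb can be cross‑checked against the kernel's own accounting. A minimal sketch, assuming a kernel new enough (4.14+) to expose /proc/&lt;pid&gt;/smaps_rollup, using the current shell's PID:

```shell
# Approximate a process's exclusive (non-shared) memory by summing the
# Private_Clean and Private_Dirty fields of /proc/<pid>/smaps_rollup.
# Sketch only -- assumes kernel >= 4.14; uses the current shell's own PID.
pid=$$
exclusive_kb=$(awk '/^Private_(Clean|Dirty):/ { sum += $2 } END { print sum + 0 }' \
  "/proc/$pid/smaps_rollup")
echo "PID $pid exclusive memory ~ ${exclusive_kb} kB"
```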
3. Essential Tools
Below are the most common commands for retrieving the metrics described above.
root@test:~# free -wh
total used free shared buffers cache available
Mem: 1.0Ti 48Gi 803Gi 3.0Mi 2.0Gi 152Gi 954Gi
Swap: 39Gi 0B 39Gi

total – total physical memory.
used – memory in use (total − free − buffers − cache).
free – unused memory.
shared – memory shared between processes.
buffers – kernel buffers.
cache – page cache + slab.
available – estimate of memory that can be allocated to new applications without swapping.
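All of these figures ultimately come from /proc/meminfo, which can be queried directly when scripting checks. A minimal sketch reading the headline numbers (field names as exposed by the kernel):

```shell
# Read the headline memory numbers straight from /proc/meminfo (values in kB).
total_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
free_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
avail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
echo "total=${total_kb} kB free=${free_kb} kB available=${avail_kb} kB"
```

This is handy in monitoring scripts where parsing the human-readable columns of free -wh would be brittle.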
root@test:~# top
top - 11:34:44 up 526 days, 20:11, 1 user, load average: 0.84, 0.92, 1.04
Tasks: 1343 total, 1 running, 1342 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.4 us, 0.0 sy, 0.0 ni, 99.6 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 1031349.+total, 823056.8 free, 50146.1 used, 158146.7 buff/cache
MiB Swap: 40960.0 total, 40960.0 free, 0.0 used. 976896.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4095 mysql 20 0 6189264 1.6g 550648 S 21.5 0.2 204441:48 mysqld
2732 root 20 0 141524 34936 9324 S 17.2 0.0 194545:07 node_exporter
1129876 root 20 0 7095304 1.1g 54208 S 5.6 0.1 1770:30 prometheus
2936909 root 20 0 5411276 231104 56112 S 2.0 0.0 19108:45 wsssr_defence_s

PID – process identifier.
USER – process owner.
PR – priority.
NI – nice value.
VIRT – virtual memory (KiB unless suffixed).
RES – resident memory (KiB unless suffixed).
SHR – shared memory (KiB unless suffixed).
S – process state (S = sleeping, R = running, etc.).
%CPU – CPU usage.
%MEM – memory usage.
TIME+ – cumulative CPU time.
COMMAND – command name.
root@test:~# ps -aux | head -5
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.1 0.0 171128 13972 ? Ss 2024 1505:02 /sbin/init nopti
root 2 0.0 0.0 0 0 ? S 2024 8:41 [kthreadd]
root 3 0.0 0.0 0 0 ? I< 2024 0:00 [rcu_gp]
root 4 0.0 0.0 0 0 ? I< 2024 0:00 [rcu_par_gp]

USER – process owner.
PID – process ID.
%CPU – CPU share.
%MEM – memory share.
VSZ – virtual size (KB).
RSS – resident size (KB).
TTY – controlling terminal.
STAT – process state.
START – start time.
TIME – total CPU time.
COMMAND – command line.
root@test:~# cat /proc/meminfo
...
Slab: 24081144 kB
SReclaimable: 18618148 kB
SUnreclaim: 5462996 kB
...
AnonHugePages: 3555328 kB
...
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

SReclaimable – reclaimable slab memory.
SUnreclaim – non‑reclaimable slab memory.
AnonHugePages – memory used by transparent huge pages.
HugePages_Total / HugePages_Free – configured huge‑page pool.
Hugepagesize – size of each huge page (2 MiB here).
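Since Slab is simply SReclaimable + SUnreclaim, a quick ratio shows how much slab memory the kernel could hand back under pressure. A small sketch:

```shell
# Compare reclaimable slab with total slab, both from /proc/meminfo (kB).
slab_kb=$(awk '/^Slab:/ {print $2}' /proc/meminfo)
srecl_kb=$(awk '/^SReclaimable:/ {print $2}' /proc/meminfo)
echo "reclaimable slab: ${srecl_kb} of ${slab_kb} kB"
```

A large SUnreclaim that keeps growing is the more worrying signal, since the kernel cannot free it on demand.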
root@test:~# vmstat 1 2
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 0 0 842544896 2081680 160133632 0 0 0 4 0 0 1 0 99 0 0
2 0 0 842544192 2081680 160133744 0 0 0 92 3923 5803 2 0 98 0 0

r – runnable processes.
b – processes blocked in uninterruptible sleep (usually I/O).
swpd – used swap (KB).
free – free memory (KB).
buff – buffer memory (KB).
cache – cache memory (KB).
si/so – swap in/out (KB/s).
bi/bo – block I/O (blocks/s).
in – interrupts per second.
cs – context switches per second.
us/sy/id/wa/st – CPU usage breakdown (user, system, idle, I/O wait, steal).
root@test:~# sar -r 1 3
Linux 5.4.0-59-generic (jnai1asan01) 11/05/2025 _x86_64_ (128 CPU)
02:36:42 PM kbmemfree kbavail kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
02:36:43 PM 842489592 1000321788 45882776 4.34 2081692 141547176 53245784 4.85 142730504 44375344 1200
02:36:44 PM 842489340 1000321652 45883056 4.34 2081692 141547236 53245784 4.85 142730656 44375460 1260
02:36:45 PM 842489772 1000322360 45882228 4.34 2081692 141547508 53245784 4.85 142730568 44375676 1536

kbmemfree – free physical memory (KB).
kbavail – estimated available memory (KB).
kbmemused – used physical memory (KB).
%memused – usage percent.
kbbuffers – kernel buffers (KB).
kbcached – cached data (KB).
kbcommit – memory required by current workload (KB).
%commit – commit ratio.
kbactive – recently used memory (KB).
kbinact – inactive memory (KB).
kbdirty – dirty pages awaiting writeback (KB).
Additional utilities that complement the above commands are cachetop (cache‑hit rates), slabtop (slab usage), pmap (per‑process memory map), and leak‑detection tools memleak‑bpfcc and valgrind.
4. Systematic Memory‑Problem Troubleshooting Flow
Initial assessment : Run free or top to obtain a high‑level view. If overall free memory is low, determine whether the drop is caused by cache consumption. Use vmstat or sar to observe cache trends over time.
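One quick check at this stage: when MemAvailable is much larger than MemFree, the "missing" memory is mostly reclaimable cache rather than real pressure. A sketch of that triage:

```shell
# Distinguish "low free" from "low available": the gap between the two is
# roughly the cache the kernel could reclaim without swapping.
memfree_kb=$(awk '/^MemFree:/ {print $2}' /proc/meminfo)
memavail_kb=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
if [ "$memavail_kb" -gt "$memfree_kb" ]; then
  echo "~$((memavail_kb - memfree_kb)) kB is reclaimable cache, not real pressure"
else
  echo "available <= free: little reclaimable cache, watch for pressure"
fi
```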
Cache investigation : When cache grows unexpectedly, employ cachetop or slabtop and examine /proc/meminfo to identify which slabs or processes dominate the cache. Then run pmap <pid> on the offending process for a detailed breakdown.
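Before reaching for cachetop or slabtop, it can help to split the cache into its two main components, page cache (file data) and slab (kernel objects), to see which one to investigate. A sketch using /proc/meminfo:

```shell
# Split "cache" into its two main components (values in kB): Cached is the
# page cache, Slab is kernel-object memory.
pagecache_kb=$(awk '/^Cached:/ {print $2}' /proc/meminfo)
slab_kb=$(awk '/^Slab:/ {print $2}' /proc/meminfo)
echo "page cache: ${pagecache_kb} kB, slab: ${slab_kb} kB"
```

If page cache dominates, look at file I/O patterns with cachetop; if slab dominates, slabtop will show which caches are responsible.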
Swap analysis : Verify swap activity with free and vmstat. Persistent swap‑in/out indicates either insufficient RAM or a process whose memory usage is rapidly increasing. Correlate swap metrics with the process list from top or ps to pinpoint the culprit.
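Used swap can be computed from /proc/meminfo, and each process's share read from the VmSwap line of /proc/&lt;pid&gt;/status. A sketch (VmSwap is simply 0 when nothing is swapped out):

```shell
# Total used swap, in kB.
swap_total_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
swap_free_kb=$(awk '/^SwapFree:/ {print $2}' /proc/meminfo)
swap_used_kb=$((swap_total_kb - swap_free_kb))
echo "swap used: ${swap_used_kb} of ${swap_total_kb} kB"
# Top swap consumers, from VmSwap in each /proc/<pid>/status; processes may
# vanish mid-scan, hence the stderr redirect.
for d in /proc/[0-9]*; do
  awk -v pid="${d#/proc/}" '/^VmSwap:/ && $2 > 0 {print pid, $2, "kB"}' \
    "$d/status" 2>/dev/null
done | sort -k2 -rn | head -5
```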
Leak detection : If memory usage continues to rise and OOM events occur, suspect a memory leak. Use memleak‑bpfcc (eBPF‑based) or valgrind --leak-check=full on the suspect process to locate leaking allocations and call stacks.
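Typical invocations look like the following sketch; the binary name ./myserver and PID 4242 are placeholders, and both tools must be installed separately. Note the difference in approach: valgrind re-executes the program under instrumentation, while memleak-bpfcc attaches to an already-running process.

```shell
# Sketch: leak-hunting command lines (./myserver and 4242 are placeholders).
# valgrind re-runs the program and reports definitely-lost blocks with stacks:
valgrind_cmd='valgrind --leak-check=full ./myserver'
# memleak-bpfcc samples outstanding (unfreed) allocations of a live PID,
# printing a summary every 10 seconds:
memleak_cmd='memleak-bpfcc -p 4242 10'
echo "$valgrind_cmd"
echo "$memleak_cmd"
```

valgrind's overhead makes it best suited to test environments; the eBPF-based memleak-bpfcc is lighter and usable against production processes.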
Hardware or external factors : Occasionally memory pressure originates from hardware anomalies (e.g., a faulty NIC generating excessive buffers) or disk‑I/O problems that force processes to allocate more memory. Cross‑check memory symptoms with CPU, disk, and network metrics before concluding the root cause.
This layered methodology enables rapid narrowing of the root cause and selection of the appropriate remediation technique.