Memory Leak Postmortem: Combining free, smem, pmap, and perf for Effective Diagnosis

This postmortem follows a thumbnail service that hit sudden latency spikes and OOM kills shortly after a new release. It walks through a systematic investigation with free, smem, pmap, and perf to distinguish a true memory leak from page‑cache or shared‑page artifacts, pinpoint a leaked buffer in the native decoder, and outline remediation steps.

Raymond Ops

Jul 1, 2026

Overview

The incident occurred in a high‑traffic thumbnail generation service less than 90 minutes after a new version was rolled out. P99 latency became jittery first, followed by a rapid drop in MemAvailable, OOM‑killed containers, and noisy sidecar logs. The common mistakes are to restart the pod immediately, which destroys the evidence, or to look only at free -h, which shows that memory is shrinking but not which process or mapping is responsible.

Tool Responsibilities

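free: shows whether the machine as a whole is losing memory or whether only the page cache is fluctuating.

smem: separates shared pages from private pages and reports per‑process PSS (proportional set size) and USS (unique set size).

pmap: breaks a suspect process's address space down by mapping, so growth can be attributed to heap, anonymous, or file‑backed regions.

perf: samples call stacks so that hot code paths, here the allocation path, can be traced back to specific functions.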

Investigation Chain

Verify that available keeps decreasing while buff/cache stays stable. This points to real memory loss rather than page‑cache churn.
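The check above can be made less eyeball-dependent by diffing two saved copies of /proc/meminfo. The following is a minimal sketch, not the tooling used in the incident: the function names meminfo_kib and meminfo_delta are hypothetical, and buff/cache is approximated as Buffers + Cached + SReclaimable, which is how procps free documents that column.

```shell
# Print one /proc/meminfo field in KiB. The optional second argument is a
# file in the same format (defaults to /proc/meminfo).
meminfo_kib() {
    awk -v key="$1:" '$1 == key { print $2; found = 1 } END { exit !found }' \
        "${2:-/proc/meminfo}"
}

# Compare two saved meminfo snapshots (older first) and print the change in
# MemAvailable and in buff/cache (Buffers + Cached + SReclaimable), in KiB.
meminfo_delta() {
    awk '
        FNR == 1 { n++ }                      # n = index of the current file
        $1 == "MemAvailable:" { avail[n] = $2 }
        $1 == "Buffers:" || $1 == "Cached:" || $1 == "SReclaimable:" { cache[n] += $2 }
        END {
            printf "available_delta_kib=%d cache_delta_kib=%d\n",
                   avail[2] - avail[1], cache[2] - cache[1]
        }
    ' "$1" "$2"
}
```

A possible usage: cat /proc/meminfo > /tmp/mem.t0, wait a few minutes, cat /proc/meminfo > /tmp/mem.t1, then run meminfo_delta /tmp/mem.t0 /tmp/mem.t1. A large negative available delta next to a near-zero cache delta is the pattern described above.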

Use smem -rs pss -k to identify which processes have abnormal USS/PSS growth. The two thumbnail workers showed USS rising above 4 GiB, indicating private memory consumption.
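As a rough cross-check that does not depend on smem being installed, PSS and USS for a single process can be derived from /proc/<pid>/smaps_rollup (available on reasonably recent kernels) or from /proc/<pid>/smaps. This is a sketch under the usual definition USS = Private_Clean + Private_Dirty; the function name pss_uss_kib is hypothetical.

```shell
# Print "pss_kib uss_kib" for an smaps or smaps_rollup formatted file.
# PSS is the sum of the Pss fields; USS is Private_Clean + Private_Dirty.
pss_uss_kib() {
    awk '
        $1 == "Pss:" { pss += $2 }
        $1 == "Private_Clean:" || $1 == "Private_Dirty:" { uss += $2 }
        END { printf "%d %d\n", pss, uss }
    ' "$1"
}
```

For example, pss_uss_kib /proc/$PID/smaps_rollup, run as root or as the owner of the process. A USS that stays close to PSS and keeps rising suggests the growth is private to the process rather than shared.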

Run pmap -x <pid> (or read /proc/<pid>/smaps) to see what the growth is made of. The [ anon ] mappings and their Private_Dirty values were increasing sharply.
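To see which mappings hold the private dirty pages, the smaps file can be grouped by mapping name. The sketch below is illustrative only: private_dirty_by_mapping is a hypothetical helper, and mappings without a name in smaps (the ones pmap prints as [ anon ]) are reported as [anon].

```shell
# Sum Private_Dirty (KiB) per mapping name from an smaps-formatted file and
# print the largest consumers first. Unnamed mappings are grouped as [anon].
private_dirty_by_mapping() {
    awk '
        # Mapping header line: "start-end perms offset dev inode [name]"
        $1 ~ /^[0-9a-f]+-[0-9a-f]+$/ { name = (NF >= 6) ? $6 : "[anon]"; next }
        $1 == "Private_Dirty:" { dirty[name] += $2 }
        END { for (m in dirty) printf "%d %s\n", dirty[m], m }
    ' "$1" | sort -rn
}
```

Running private_dirty_by_mapping /proc/$PID/smaps | head a few minutes apart shows whether [anon] or [heap] is the line that keeps growing.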

Correlate the memory growth with business logs. Large TIFF requests surged at the same time, and the logs contained "decode image slow path" messages.
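One way to line the log signal up with the memory samples is to count the slow-path message per minute. The article does not show the log format, so this sketch assumes each line begins with an ISO-8601 timestamp such as 2026-01-01T20:00:01Z; the function name slow_path_per_minute is also hypothetical.

```shell
# Count log lines containing a message, grouped by minute.
# ASSUMPTION: every line starts with an ISO-8601 timestamp, so the first
# 16 characters (YYYY-MM-DDTHH:MM) identify the minute.
# $1 = log file, $2 = message to match (defaults to the slow-path message).
slow_path_per_minute() {
    grep -F "${2:-decode image slow path}" "$1" |
        cut -c1-16 | sort | uniq -c | awk '{ print $2, $1 }'
}
```

The output is one "minute count" pair per line, which can be placed next to the USS samples for the same time window.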

Capture the hot call stacks with perf top -p <pid> --call-graph dwarf and perf record (--call-graph already enables call‑graph collection, so a separate -g is redundant). The dominant symbols were allocate_decode_buffer and tiff_decode_tiles inside libimgdecode.so.

Root Cause

The new libimgdecode.so version introduced an error path in TIFF tile decoding that allocated intermediate buffers without releasing them. Under normal traffic the leak was invisible, but during the evening peak with many large images each worker accumulated anonymous private dirty pages, eventually exhausting node memory and triggering OOM kills.

Remediation and Stop‑the‑Bleeding Actions

Roll back to the previous image version to stop the leak.

Temporarily limit the thumbnail service concurrency and cap the maximum image dimensions.

Keep at least one affected pod running to preserve evidence (perf data, smaps, pmap).

After rollback, verify that MemAvailable stabilises, USS/PSS no longer shows monotonic growth, and OOMKilled events drop to zero.

Best Practices & Checklist

Never restart a pod before taking a full memory snapshot (free + vmstat + smem + pmap + perf).

Distinguish true leaks (continuously falling available plus rising USS, anonymous memory, and Private_Dirty) from page‑cache or shared‑library artifacts.

Align container PIDs with host PIDs using crictl and nsenter to avoid analysing the wrong process.
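The PID alignment can also be cross-checked from /proc alone, because the NSpid line in /proc/<pid>/status lists a process's PID in each nested PID namespace, outermost first. The helpers below are a sketch with hypothetical names (innermost_pid, find_host_pid) and are meant to be run on the host.

```shell
# Print the PID a process has inside its innermost PID namespace, taken
# from the NSpid line of a /proc/<pid>/status formatted file.
innermost_pid() {
    awk '$1 == "NSpid:" { print $NF; found = 1 } END { exit !found }' "$1"
}

# Print "host_pid container_pid name" for every process visible on the
# host that runs in a nested PID namespace with the given in-container PID.
find_host_pid() {
    for status in /proc/[0-9]*/status; do
        awk -v want="$1" '
            $1 == "Name:" { comm = $2 }
            $1 == "Pid:"  { host = $2 }
            $1 == "NSpid:" && NF > 2 && $NF == want { print host, $NF, comm }
        ' "$status" 2>/dev/null
    done
}
```

For example, find_host_pid 17 lists every host process that appears as PID 17 inside some container, and the name column helps pick the right one. A commonly used alternative for the container's init process is crictl inspect --output go-template --template '{{.info.pid}}' <container-id>; check that form against the crictl version in use.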

Include business‑level signals (large‑image request rate, APM P99) in the evidence chain.

Automate periodic smem sampling to catch slow leaks before they become critical.
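A periodic sampler does not need much machinery. The sketch below reads /proc/<pid>/smaps_rollup directly instead of calling smem, which is a substitution made here for simplicity, and the names sample_uss and uss_trend are hypothetical. It records USS over time and flags a series in which every sample is higher than the previous one.

```shell
# Print "epoch_seconds uss_kib" for one process. $1 = PID; the optional $2
# is an smaps_rollup formatted file (defaults to /proc/<pid>/smaps_rollup).
sample_uss() {
    awk -v now="$(date +%s)" '
        $1 == "Private_Clean:" || $1 == "Private_Dirty:" { uss += $2 }
        END { printf "%d %d\n", now, uss }
    ' "${2:-/proc/$1/smaps_rollup}"
}

# Read "epoch_seconds uss_kib" lines from a file and report whether USS rose
# in every interval, plus the average growth rate in KiB per minute.
uss_trend() {
    awk '
        NR == 1 { t0 = $1; u0 = $2 + 0 }
        NR > 1 && $2 + 0 <= prev { not_rising++ }
        { prev = $2 + 0; t1 = $1; u1 = $2 + 0 }
        END {
            if (NR < 2 || t1 == t0) { print "insufficient data"; exit 1 }
            printf "%s rate_kib_per_min=%.1f\n",
                   (not_rising ? "not-monotonic" : "monotonic-growth"),
                   (u1 - u0) * 60 / (t1 - t0)
        }
    ' "$1"
}
```

A cron job or loop could append sample_uss "$PID" >> /var/tmp/uss.log once a minute and run uss_trend /var/tmp/uss.log when investigating. Strictly monotonic growth over a long window is a hint, not proof, since a warming cache inside the process can look the same for a while.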

Monitoring Recommendations

Node‑level: MemAvailable ratio, PSI memory pressure, major page faults.

Pod‑level: working_set, OOMKilled count, restart count.

Process‑level: periodic USS/PSS and anonymous page metrics.

Business‑level: P99 latency and large‑image request proportion.
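The node-level signals in the list above can be read straight from /proc if no exporter is in place yet. This is a sketch with hypothetical function names (mem_available_pct, psi_memory); it assumes a kernel with PSI enabled so that /proc/pressure/memory exists.

```shell
# Print MemAvailable as a percentage of MemTotal. The optional argument is a
# meminfo-formatted file (defaults to /proc/meminfo).
mem_available_pct() {
    awk '
        $1 == "MemTotal:"     { total = $2 }
        $1 == "MemAvailable:" { avail = $2 }
        END {
            if (total > 0) {
                printf "%.1f\n", avail * 100 / total
            } else {
                exit 1
            }
        }
    ' "${1:-/proc/meminfo}"
}

# Print one PSI memory value. $1 = some|full, $2 = avg10|avg60|avg300|total
# (defaults to avg10), $3 = file (defaults to /proc/pressure/memory).
psi_memory() {
    awk -v kind="$1" -v window="${2:-avg10}" '
        $1 == kind {
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == window) { print kv[2]; found = 1 }
            }
        }
        END { exit !found }
    ' "${3:-/proc/pressure/memory}"
}
```

For example, mem_available_pct and psi_memory some avg10 give two numbers that are cheap to sample and alert on. Any alert thresholds are left to the operator, since the article does not state which ones were used.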

Reference Commands

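# Quick snapshot (set PID to the host PID of the suspect process first)
free -h                                          # machine-wide totals, including available and buff/cache
vmstat 1 5                                       # five one-second samples of memory, swap, and I/O activity
smem -rs pss -k | head -20                       # processes sorted by PSS, with abbreviated units
pmap -x "$PID" | tail -20                        # last mappings plus the totals line
journalctl -k --since '20 min ago' | tail -100   # recent kernel messages, including OOM-killer output

# Perf capture
perf record -F 99 -g -p "$PID" -- sleep 30       # sample call stacks at 99 Hz for 30 seconds
perf report --stdio | head -80                   # text summary of the hottest stacks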

Following this disciplined workflow helps keep the root cause reproducible and verifiable, and makes it easier to communicate clearly to developers for a permanent fix.


