How to Diagnose and Fix Memory Leaks in a Containerized Image Thumbnail Service
This guide walks through a systematic, step‑by‑step process for identifying, analyzing, and resolving memory‑related incidents in a high‑traffic thumbnail generation service running in Kubernetes, covering everything from initial symptom checks with free and vmstat to deep dives using smem, pmap, smaps, perf, and post‑mortem verification.
Incident Overview
A new version of an image thumbnail service caused a rapid increase in P99 latency, a drop in MemAvailable, and intermittent OOMKilled events within 90 minutes of deployment. The post‑mortem focuses on a systematic memory‑diagnostic workflow using four key tools:
free – assess overall system memory health.
smem – separate private (USS/PSS) from shared memory.
pmap / smaps – identify which mapping types (anonymous, file‑backed, shared) are growing.
perf – locate allocation hotspots in the call stack.
Step‑by‑Step Diagnostic Procedure
1. Confirm System‑Wide Memory Pressure
Run free -h and verify that available continuously declines while buff/cache stays stable. If available is steady, the issue is likely cache‑related, not a leak.
Check kernel pressure signals:
cat /proc/pressure/memory
Rising some and full averages indicate genuine memory pressure.
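The PSI check above is easy to script. This is a minimal sketch, and the helper name `psi_some_avg10` is a hypothetical illustration, not part of the original tooling:

```shell
#!/bin/sh
# Hypothetical helper: print the avg10 value from the "some" line of a
# PSI file such as /proc/pressure/memory. A sustained non-zero value
# here corroborates genuine memory pressure, per the check above.
psi_some_avg10() {
  awk '/^some/ {
    for (i = 1; i <= NF; i++)
      if ($i ~ /^avg10=/) { sub("avg10=", "", $i); print $i }
  }' "$1"
}

# Usage sketch: sample every 10 s during the incident window.
# psi_some_avg10 /proc/pressure/memory
```

Parsing only the `avg10` field keeps the sample cheap enough to run in a tight polling loop during an incident.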
2. Align Node and Pod Views
kubectl top pod -n media
kubectl get events -n media --sort-by=.lastTimestamp
Correlate pod‑level OOMKilled events with node‑wide MemAvailable trends.
3. Distinguish Page‑Cache from Real Leak
If used is high but available remains stable, suspect page cache.
If available falls and buff/cache does not grow, proceed to process‑level analysis.
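The two rules above can be captured as a small decision helper. The function name `classify_trend` and its thresholds are illustrative assumptions, sketching the heuristic rather than a fixed rule:

```shell
#!/bin/sh
# Hypothetical heuristic for step 3: given MemAvailable and buff/cache
# readings (kB) from two points in time, suggest where to look next.
classify_trend() {
  avail_then=$1; avail_now=$2; cache_then=$3; cache_now=$4
  avail_drop=$((avail_then - avail_now))
  cache_growth=$((cache_now - cache_then))
  if [ "$avail_drop" -le 0 ]; then
    echo "stable: available is not falling; no leak indicated"
  elif [ "$cache_growth" -ge "$avail_drop" ]; then
    echo "cache: available fell but buff/cache grew comparably"
  else
    echo "leak-suspect: available fell without matching cache growth"
  fi
}
```

Feeding it two samples taken a few minutes apart (e.g., parsed from `grep MemAvailable /proc/meminfo`) mechanizes the "cache vs. real leak" call before moving on to process-level analysis.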
4. Identify the Real Memory Consumer with smem
smem -P thumbnail-svc -c "pid pss uss rss" | head -20
Focus on processes where USS (or PSS) is large and increasing. Sample repeatedly (e.g., every 30 s for several minutes) to ensure the growth is sustained, not a transient spike.
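The repeated-sampling step lends itself to a snapshot-and-diff pattern. The helper below (`uss_delta` is a hypothetical name) joins two saved "pid uss" snapshots and prints per-PID USS growth; the commented smem invocation shows how such snapshots might be captured:

```shell
#!/bin/sh
# Sketch of the repeated-sampling idea from step 4. Snapshots would be
# taken with something like:
#   smem -P thumbnail-svc -c "pid uss" > snap_$(date +%s).txt
# uss_delta then reports PIDs whose USS grew between two snapshots.
uss_delta() {
  # $1 = earlier snapshot, $2 = later snapshot (columns: pid uss)
  awk 'NR==FNR { before[$1] = $2; next }
       ($1 in before) && ($2 > before[$1]) {
         printf "pid=%s uss_growth_kb=%d\n", $1, $2 - before[$1]
       }' "$1" "$2"
}
```

Running the diff over snapshots 30 s apart for several minutes distinguishes sustained growth from a transient spike, as the step above requires.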
5. Drill Down into Mapping Types
For the suspect PID, inspect anonymous and dirty pages:
pmap -x $PID | tail -20
Look for lines with [ anon ] and high Dirty values – these indicate private anonymous pages.
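To quantify whether the [ anon ] regions are actually growing, two pmap snapshots can be compared. This hypothetical helper (`anon_rss_kb` is an assumed name, in the spirit of the pmap_growth_diff.sh script mentioned later) sums the RSS of anonymous mappings in a saved `pmap -x` dump:

```shell
#!/bin/sh
# Hypothetical helper for step 5: sum the RSS (kB, third column) of
# anonymous mappings in a saved `pmap -x $PID` snapshot.
anon_rss_kb() {
  # pmap -x columns: Address Kbytes RSS Dirty Mode Mapping
  awk '/\[ anon \]/ { sum += $3 } END { print sum + 0 }' "$1"
}

# Usage sketch: snapshot twice, ~60 s apart, and compare the sums.
# pmap -x "$PID" > before.txt; sleep 60; pmap -x "$PID" > after.txt
# echo "anon RSS growth: $(( $(anon_rss_kb after.txt) - $(anon_rss_kb before.txt) )) kB"
```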
Aggregate smaps fields to confirm:
grep -E '^(Size|Rss|Pss|Private_Dirty|Anonymous):' /proc/$PID/smaps | \
awk '{sum[$1]+=$2} END {for (k in sum) printf "%s=%dKB\n", k, sum[k]}'
Dominant Anonymous and Private_Dirty values point to a true leak in user‑space allocations.
6. Pinpoint Allocation Hotspots with perf
# Quick view
perf top -p $PID -g --call-graph dwarf
# Record a short session (e.g., 30 s)
perf record -F 99 -g -p $PID -- sleep 30
perf report --stdio | head -80
Typical hotspots such as allocate_decode_buffer and tiff_decode_tiles in libimgdecode.so indicate the native decoder is allocating memory without releasing it on error paths.
7. Root‑Cause Confirmation
The leak originates from a new libimgdecode.so version that fails to free intermediate buffers when decoding large TIFF images. Each worker accumulates anonymous private dirty pages, causing per‑container OOMKilled events and node‑wide memory exhaustion.
8. Immediate Mitigation
Roll back the deployment to the previous image.
Throttle large‑image requests (reduce concurrency, limit dimensions).
Drain the most affected node while keeping one instance for continued evidence collection.
Preserve a failing pod to keep perf and smaps data.
9. Post‑Mitigation Validation
Verify free -h shows stable MemAvailable for >1 hour.
Confirm smem USS/PSS no longer shows monotonic growth.
Run the high‑resolution image workload and ensure no OOMKilled events.
Check that perf no longer reports allocation hotspots.
Automation Scripts
Several helper scripts are provided to standardize evidence collection (e.g., mem_scene_collect.sh, smaps_rollup.py, pmap_growth_diff.sh, oom_evidence_pack.sh, cgroup_mem_snapshot.sh, perf_capture_prepare.sh). They capture free, vmstat, smem, pmap, smaps, kernel logs, cgroup metrics, and short perf recordings.
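The collectors themselves are not reproduced here; the following is a minimal sketch of what something like mem_scene_collect.sh could capture (the script name comes from the list above, but this body is an assumption, not the original script):

```shell
#!/bin/sh
# Minimal sketch of an evidence collector in the spirit of
# mem_scene_collect.sh: snapshot point-in-time memory state from /proc
# into a timestamped directory for later post-mortem analysis.
collect_mem_evidence() {
  outdir="${1:-/tmp/mem-evidence-$(date +%Y%m%d-%H%M%S)}"
  mkdir -p "$outdir"
  cp /proc/meminfo "$outdir/meminfo"
  cp /proc/vmstat  "$outdir/vmstat"
  # PSI is absent on older kernels; tolerate the failure.
  cp /proc/pressure/memory "$outdir/psi_memory" 2>/dev/null || true
  # Per-process detail (smem, pmap, smaps, perf) would be added here.
  echo "$outdir"
}
```

Emitting the output directory on stdout lets a wrapper script chain the snapshot into an evidence pack such as oom_evidence_pack.sh.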
Best Practices & Pitfalls
Never rely solely on RSS; always examine PSS / USS for true private usage.
High buff/cache does not guarantee safety – check available and pressure metrics.
In container environments, map container PID to host PID before using pmap or perf.
When cgroup v2 is enabled, monitor memory.current and memory.events alongside pod metrics.
Limit perf recording duration in production to avoid overhead.
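On the container-to-host PID mapping pitfall: one kernel-level source of truth is the NSpid line in /proc/&lt;host_pid&gt;/status, which lists the PID in every namespace, host first and innermost container last. The helper name below is a hypothetical illustration:

```shell
#!/bin/sh
# Hypothetical helper: print "host_pid container_pid" from the NSpid
# line of a /proc/<pid>/status file. The fields are tab-separated;
# field 2 is the host-namespace PID, the last field is the PID as
# seen inside the innermost (container) namespace.
nspid_mapping() {
  awk -F'\t' '/^NSpid:/ { print $2, $NF }' "$1"
}

# On the host, find the host PID whose in-container PID is e.g. 1:
# for s in /proc/[0-9]*/status; do nspid_mapping "$s"; done | awk '$2 == 1'
```

With the host PID in hand, pmap and perf can be run from the node as shown in steps 5 and 6.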
Monitoring Recommendations
Node layer: alert if MemAvailable < 8 % of total for 10 min.
Pod layer: alert on any OOMKilled event within 15 min.
Process layer: periodically sample USS/PSS; flag monotonic increase.
Business layer: correlate P99 latency spikes with large‑image request ratio.
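The process-layer check above hinges on spotting a monotonic trend in sampled USS values; a sketch of that test, with a hypothetical helper name:

```shell
#!/bin/sh
# Sketch of the process-layer monitor: given a whitespace-separated
# series of USS samples (kB), report whether the series is strictly
# increasing, i.e. a leak-shaped trend worth flagging.
is_monotonic_increase() {
  echo "$1" | awk '{
    for (i = 2; i <= NF; i++)
      if ($i <= $(i-1)) { print "no"; exit }
    print "yes"
  }'
}
```

In practice the sample series would come from periodic smem runs, and a few flat or declining samples should reset the alert to avoid paging on transient spikes.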
Summary
The workflow—starting with system‑wide health (free, vmstat), narrowing to process‑level private memory (smem), dissecting mapping types (pmap / smaps), and finally pinpointing allocation hotspots (perf)—enables rapid discrimination between true memory leaks, cache effects, and shared‑page artifacts. Applying the mitigation steps, validating with repeat measurements, and embedding the recommended alerts creates a repeatable on‑call process that prevents recurrence.
MaGe Linux Operations