
How to Uncover Hidden Java Memory Leaks in Kubernetes Pods with Alibaba Cloud OS Console

When automotive workloads migrate to cloud-native containers, pods that are unexpectedly OOMKilled often conceal substantial Java process memory outside the heap, consumed by JNI, libc, and Transparent Huge Pages. The Alibaba Cloud OS Console's memory panorama analysis and hotspot tracing features can identify this memory so that it can be reduced.

Alibaba Cloud Developer

Background

When migrating Java workloads from traditional on‑premise IDC clusters to cloud‑native Kubernetes (ACK) clusters, many pods are terminated with OOMKilled events even though JVM heap usage reported by standard metrics appears modest.

Why Pod Memory Exceeds JVM Metrics

Container RSS (resident set size) includes not only the JVM heap but also off‑heap structures, native allocations, and OS‑level overhead.

A portion of that memory cannot be attributed to any visible Java component; this portion is often called “missing” memory.

The discrepancy typically appears after changing the operating system or container runtime, even when the JDK version remains unchanged.

Java Process Memory Composition

JVM Heap : Size controlled by -Xms / -Xmx; observable via MemoryMXBean or JMX tools.

JVM Off‑Heap : Includes Metaspace, compressed class space, code cache, direct buffers, and thread stacks. These can be limited with flags such as -XX:MaxMetaspaceSize, -XX:CompressedClassSpaceSize, -XX:ReservedCodeCacheSize, -XX:MaxDirectMemorySize, and -Xss.

JNI Native Memory : Allocated by native libraries invoked through the Java Native Interface (e.g., zlib compression). The memory is obtained with C functions such as malloc, or directly with the brk / mmap system calls, and is invisible to most JVM monitoring tools.
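The gap between what the JVM reports and what the kernel charges to the process can be estimated from inside the pod. The following is a minimal sketch, assuming a Linux container where /proc/self/status is readable; the class and method names are illustrative, not part of any tool mentioned in this article. It sums committed heap, committed non-heap pools, and NIO buffer pools, then compares the total with VmRSS. The remainder includes thread stacks, GC structures, JNI allocations, and allocator overhead, so it is an upper bound on the "missing" memory, not a precise measurement.

```java
import java.io.IOException;
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class JvmVsRss {
    /** Returns the VmRSS value in KiB from /proc/<pid>/status lines, or -1 if absent. */
    static long parseVmRssKiB(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                // Line format: "VmRSS:     123456 kB"
                String[] parts = line.trim().split("\\s+");
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        MemoryMXBean mem = ManagementFactory.getMemoryMXBean();
        long heap = mem.getHeapMemoryUsage().getCommitted();
        // Non-heap covers Metaspace, compressed class space, and code cache pools.
        long nonHeap = mem.getNonHeapMemoryUsage().getCommitted();

        // Direct and mapped NIO buffers are reported by separate buffer pool beans.
        long buffers = 0;
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            buffers += Math.max(0, pool.getMemoryUsed());
        }

        long visibleKiB = (heap + nonHeap + buffers) / 1024;
        System.out.println("JVM-visible memory: " + visibleKiB + " KiB");

        Path status = Paths.get("/proc/self/status"); // Linux only
        if (Files.isReadable(status)) {
            long rssKiB = parseVmRssKiB(Files.readAllLines(status));
            System.out.println("Process RSS:        " + rssKiB + " KiB");
            System.out.println("Unattributed:       " + (rssKiB - visibleKiB) + " KiB");
        }
    }
}
```

Committed memory is not always resident, so the unattributed figure can even be negative on a freshly started JVM; the trend over time is more informative than a single reading.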

Hidden Memory Black Holes

JNI Memory

JNI allocations can consume hundreds of megabytes. Common culprits include native libraries such as zlib that leak memory when used improperly, for example java.util.zip Inflater or Deflater instances that are never released with end().

glibc (LIBC) Overhead

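glibc's ptmalloc allocator spreads threads across multiple arenas to reduce lock contention. On 64-bit systems the default limit is 8 arenas per CPU core, and each non-main arena grows in heaps that reserve up to 64 MiB of address space apiece. Large arenas, top-chunk fragmentation, bin caching, and delayed release of memory back to the OS can cause significant RSS growth that is not reflected in JVM-level metrics.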
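To get a sense of scale, the arithmetic can be worked through for a given core count. This is a rough sketch that assumes glibc's documented 64-bit defaults (an arena cap of 8 per core and a 64 MiB heap size) and one heap per arena; the class name is illustrative. The result is reserved address space, not resident memory, and glibc may count CPUs differently from the container-aware JVM.

```java
class ArenaEstimate {
    // glibc default on 64-bit: each non-main arena heap reserves up to 64 MiB.
    static final long HEAP_MAX_MIB = 64;

    /** Default arena cap on 64-bit glibc when MALLOC_ARENA_MAX is not set. */
    static long maxArenas(int cores) {
        return 8L * cores;
    }

    /** Address space reserved if every arena holds exactly one full-size heap. */
    static long reservedMiB(int cores) {
        return maxArenas(cores) * HEAP_MAX_MIB;
    }

    public static void main(String[] args) {
        // availableProcessors() honors container CPU limits on recent JDKs;
        // glibc may see the node's core count instead, giving a larger cap.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("cores=" + cores
                + " arenas<=" + maxArenas(cores)
                + " reserved~" + reservedMiB(cores) + " MiB");
    }
}
```

On a 16-core node this gives 128 arenas and about 8 GiB of reservable address space, which explains why even a small resident fraction per arena can add up to hundreds of megabytes of RSS.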

Transparent Huge Pages (THP)

Linux THP backs memory with 2 MiB huge pages instead of 4 KiB pages to reduce TLB misses. If an application touches only a few kilobytes of a 2 MiB-aligned region, the kernel may still back the whole region with one huge page, so the full 2 MiB counts toward the process RSS.
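Whether THP is active, and how much of the process is currently backed by huge pages, can be read from standard Linux files. The sketch below assumes /sys/kernel/mm/transparent_hugepage/enabled and /proc/self/smaps are readable inside the container; the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class ThpCheck {
    /** Extracts the bracketed mode from content such as "always [madvise] never". */
    static String activeMode(String content) {
        int open = content.indexOf('[');
        int close = content.indexOf(']', open + 1);
        if (open < 0 || close < 0) {
            return content.trim(); // unexpected format: return it unchanged
        }
        return content.substring(open + 1, close);
    }

    /** Sums every "AnonHugePages:" entry (KiB) across the mappings in smaps. */
    static long anonHugePagesKiB(List<String> smapsLines) {
        long total = 0;
        for (String line : smapsLines) {
            if (line.startsWith("AnonHugePages:")) {
                total += Long.parseLong(line.trim().split("\\s+")[1]);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        Path enabled = Paths.get("/sys/kernel/mm/transparent_hugepage/enabled");
        if (Files.isReadable(enabled)) {
            String text = new String(Files.readAllBytes(enabled), StandardCharsets.UTF_8);
            System.out.println("THP mode: " + activeMode(text));
        }
        Path smaps = Paths.get("/proc/self/smaps");
        if (Files.isReadable(smaps)) {
            System.out.println("AnonHugePages: "
                    + anonHugePagesKiB(Files.readAllLines(smaps)) + " KiB");
        }
    }
}
```

A large AnonHugePages total relative to the memory the application actually uses is a hint that THP is inflating RSS.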

Diagnostic Workflow

When a pod approaches its memory limit, trigger a memory‑panorama analysis. The console displays RSS, WorkingSet, JVM memory, and a breakdown of process‑level memory usage.

Inspect the Java memory analysis report to identify the contribution of JNI memory, which often dominates the excess.

Enable JNI memory profiling to generate a flame‑graph of native allocation call stacks.

Correlate the flame‑graph with Java CPU hotspot traces to pinpoint which JIT‑compiled code paths (e.g., the C2 compiler) trigger the native allocations.
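The first step depends on noticing that a pod is close to its limit. The console's own trigger mechanism is not described here, so the following is only a sketch of how the condition could be detected from inside the container using the standard cgroup files. The 90% threshold and all names are assumptions chosen for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PodMemoryWatch {
    // Hypothetical threshold; pick a value that leaves time to run an analysis.
    static final double TRIGGER_RATIO = 0.90;

    /** Returns usage/limit, or -1 when the cgroup v2 limit is "max" (unlimited). */
    static double usageRatio(String usageText, String limitText) {
        String limit = limitText.trim();
        if (limit.equals("max")) {
            return -1.0;
        }
        return Double.parseDouble(usageText.trim()) / Double.parseDouble(limit);
    }

    static String read(Path p) throws IOException {
        return new String(Files.readAllBytes(p), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // cgroup v2 paths first, then the cgroup v1 equivalents.
        Path usage = Paths.get("/sys/fs/cgroup/memory.current");
        Path limit = Paths.get("/sys/fs/cgroup/memory.max");
        if (!Files.isReadable(usage)) {
            usage = Paths.get("/sys/fs/cgroup/memory/memory.usage_in_bytes");
            limit = Paths.get("/sys/fs/cgroup/memory/memory.limit_in_bytes");
        }
        if (!Files.isReadable(usage) || !Files.isReadable(limit)) {
            System.out.println("No cgroup memory files found");
            return;
        }
        double ratio = usageRatio(read(usage), read(limit));
        if (ratio >= TRIGGER_RATIO) {
            System.out.printf("Usage at %.0f%% of limit: time to run a memory analysis%n",
                    ratio * 100);
        } else {
            System.out.printf("Usage ratio: %.2f%n", ratio);
        }
    }
}
```

Raw cgroup usage includes page cache, so it can read higher than the WorkingSet figure Kubernetes uses for eviction decisions; treat the ratio as an early warning, not an exact match for the console's numbers.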

Findings

In the case examined here, roughly 570 MiB of extra process memory was traced to JNI allocations originating from the C2 compiler JIT phase. glibc arena fragmentation and THP further amplified the memory footprint.

Mitigation Strategies

Tune C2 compiler parameters (e.g., -XX:CompileThreshold, -XX:InlineSmallCode) to adopt a more conservative compilation strategy, reducing JIT‑induced native allocations.

Set the glibc environment variable MALLOC_TRIM_THRESHOLD_ (the trailing underscore is part of the name) to encourage timely return of freed memory to the OS. A related tunable, MALLOC_ARENA_MAX, caps the number of arenas glibc creates.
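After changing flags or environment variables in a pod spec, it helps to confirm that the running JVM actually received them. The sketch below uses the HotSpot-specific HotSpotDiagnosticMXBean to read flag values and System.getenv for the glibc variables. The class name is illustrative, MALLOC_ARENA_MAX is included as a commonly used related tunable, and the check is meaningful only on HotSpot-based JDKs.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import com.sun.management.VMOption;
import java.lang.management.ManagementFactory;

class MitigationCheck {
    /** Returns the current value and origin of a HotSpot flag, if the JVM has it. */
    static String vmOption(String name) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        if (bean == null) {
            return "not available in this JVM";
        }
        try {
            VMOption opt = bean.getVMOption(name);
            return opt.getValue() + " (origin: " + opt.getOrigin() + ")";
        } catch (IllegalArgumentException e) {
            // Thrown when the flag does not exist, e.g. C2 flags on a JVM without C2.
            return "not available in this JVM";
        }
    }

    /** Returns the environment variable's value, or "unset" when it is absent. */
    static String envOrUnset(String name) {
        String value = System.getenv(name);
        return value == null ? "unset" : value;
    }

    public static void main(String[] args) {
        for (String flag : new String[] {"CompileThreshold", "InlineSmallCode"}) {
            System.out.println("-XX:" + flag + " = " + vmOption(flag));
        }
        for (String env : new String[] {"MALLOC_TRIM_THRESHOLD_", "MALLOC_ARENA_MAX"}) {
            System.out.println(env + " = " + envOrUnset(env));
        }
    }
}
```

An origin of DEFAULT for a flag that was supposed to be tuned usually means the option never reached the JVM command line, for instance because the container entrypoint overrides JAVA_TOOL_OPTIONS.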

References

Memory Panorama Analysis (Alibaba Cloud OS Console): https://help.aliyun.com/zh/alinux/user-guide/memory-panorama-analysis-function-instructions

JNI memory leak example (ZLIB): https://bugs.openjdk.org/browse/JDK-8257032

glibc 64 MiB arena waste: https://bugs.openjdk.org/browse/JDK-8193521

glibc top‑chunk / fast‑bin retention: https://wenfh2020.com/2021/04/08/glibc-memory-leak/

THP‑induced memory bloat in Go (relevant to native allocation): https://github.com/golang/go/issues/64332
