
Analysis and Resolution of a FileSystem‑Induced Memory Leak Causing OOM in Production

The article details how repeatedly calling FileSystem.get(uri, conf, user) created distinct UserGroupInformation objects, inflating the static FileSystem cache and causing a heap‑memory leak that triggered an Out‑Of‑Memory error, and explains that using the two‑argument get method or explicitly closing instances resolves the issue.

vivo Internet Technology

This article presents a complete case study of a memory‑leak problem triggered by the FileSystem class, which eventually led to an Out‑Of‑Memory (OOM) error in a production service.

Memory‑leak and OOM definitions: A memory leak occurs when an object that is no longer used remains reachable, so the JVM cannot reclaim its space. As leaks accumulate, the heap fills up until the JVM can no longer allocate memory and throws an OutOfMemoryError (OOM).

Background: Over a weekend, the service raised CPU usage alerts (>80%) and frequent Full GC warnings. Monitoring showed CPU usage and Full GC frequency spiking at the same time, suggesting that GC activity was driving the CPU alarm.

Problem discovery:

Monitoring revealed that both CPU usage and Full GC counts rose sharply at the same time.

Heap analysis showed an old‑generation region that kept growing and was never reclaimed, indicating a leak.

The OOM log confirmed that the root cause was a memory leak.
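The old‑generation trend described above is normally read from monitoring dashboards or GC logs. As a minimal sketch, the same figures can also be sampled in‑process through the standard java.lang.management API. The class and method names here are hypothetical, and the pool‑name matching is an assumption that only covers common HotSpot collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: samples old-generation occupancy and GC counts via JMX. */
class OldGenProbe {

    /** Bytes still used in the old generation after the most recent collection, or -1 if unknown. */
    static long oldGenUsedAfterLastGc() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            String name = pool.getName();
            // Assumption: pool names such as "G1 Old Gen", "PS Old Gen" or "Tenured Gen".
            boolean oldGen = name.contains("Old Gen") || name.contains("Tenured");
            if (pool.getType() == MemoryType.HEAP && oldGen) {
                MemoryUsage afterGc = pool.getCollectionUsage(); // null if unsupported by the pool
                if (afterGc != null) {
                    return afterGc.getUsed();
                }
            }
        }
        return -1L;
    }

    /** Total number of collections (young and old) reported by all collectors so far. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 if undefined for this collector
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("old gen used after last GC: " + oldGenUsedAfterLastGc() + " bytes");
        System.out.println("collections so far: " + totalGcCount());
    }
}
```

If the value returned by oldGenUsedAfterLastGc() keeps rising across samples while the collection count also climbs, that matches the leak pattern described in this section.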

Investigation steps:

Dumped the heap and loaded it into Eclipse Memory Analyzer (MAT). The Leak Suspects view highlighted an org.apache.hadoop.conf.Configuration object occupying ~1.8 GB (≈78 % of the heap).

Further analysis traced the large object back to a HashMap inside FileSystem.Cache, which stores FileSystem instances.

Examined the source of FileSystem. The class provides two overloaded get methods:

public static FileSystem get(final URI uri, final Configuration conf, final String user)
public static FileSystem get(URI uri, Configuration conf)

The three‑argument version creates a new UserGroupInformation and Subject on every call, so each call produces a unique Cache.Key: the key's hashCode depends on UserGroupInformation.hashCode(), which in turn uses System.identityHashCode(subject). No lookup ever matches an existing entry, so every call adds another FileSystem instance to the static cache, and because the cache holds strong references, none of those instances can be garbage‑collected.

In contrast, the two‑argument version reuses the static login user, so the cache key remains constant and the cache works as intended.
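The difference between the two overloads can be reproduced with a simplified model. This is a sketch, not Hadoop's actual code: Ugi and Key are hypothetical stand‑ins for UserGroupInformation and FileSystem.Cache.Key, the authority "nn:8020" is a placeholder, and the identity‑based equals is an assumption that mirrors the identity‑based hashCode described above.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of the cache-key behaviour; not Hadoop's actual code. */
class FsCacheKeyDemo {

    /** Stand-in for UserGroupInformation: its identity is the wrapped subject object. */
    static final class Ugi {
        final Object subject;
        Ugi(Object subject) { this.subject = subject; }
        @Override public int hashCode() { return System.identityHashCode(subject); }
        @Override public boolean equals(Object o) {
            return o instanceof Ugi && ((Ugi) o).subject == subject; // assumed identity comparison
        }
    }

    /** Stand-in for the cache key: scheme, authority and user identity. */
    static final class Key {
        final String scheme;
        final String authority;
        final Ugi ugi;
        Key(String scheme, String authority, Ugi ugi) {
            this.scheme = scheme;
            this.authority = authority;
            this.ugi = ugi;
        }
        @Override public int hashCode() { return (scheme + authority).hashCode() + ugi.hashCode(); }
        @Override public boolean equals(Object o) {
            if (!(o instanceof Key)) return false;
            Key k = (Key) o;
            return scheme.equals(k.scheme) && authority.equals(k.authority) && ugi.equals(k.ugi);
        }
    }

    /** Static cache: entries stay strongly reachable until removed explicitly. */
    static final Map<Key, Object> CACHE = new HashMap<>();

    static Object get(String scheme, String authority, Ugi ugi) {
        return CACHE.computeIfAbsent(new Key(scheme, authority, ugi), k -> new Object());
    }

    /** Models the three-argument overload: a fresh user object on every call. */
    static int entriesWithFreshUserPerCall(int calls) {
        CACHE.clear();
        for (int i = 0; i < calls; i++) {
            get("hdfs", "nn:8020", new Ugi(new Object()));
        }
        return CACHE.size();
    }

    /** Models the two-argument overload: one shared login user for all calls. */
    static int entriesWithSharedUser(int calls) {
        CACHE.clear();
        Ugi loginUser = new Ugi(new Object());
        for (int i = 0; i < calls; i++) {
            get("hdfs", "nn:8020", loginUser);
        }
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println("fresh user per call: " + entriesWithFreshUserPerCall(1000) + " entries");
        System.out.println("shared login user:   " + entriesWithSharedUser(1000) + " entries");
    }
}
```

With 1000 calls, the first variant leaves 1000 entries in the map and the second leaves 1, which is the growth pattern seen in the heap dump.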

Root‑cause summary:

Each call to FileSystem.get(uri, conf, user) creates new UserGroupInformation and Subject objects.

Because every cache key is therefore distinct, the cache never returns an existing instance and accumulates one FileSystem object per call.

The growing cache eventually exhausts heap memory, resulting in OOM.

Solutions:

Prefer the two‑argument FileSystem.get(uri, conf) method, which reuses the cached instance. If a specific user is required, set it with System.setProperty("HADOOP_USER_NAME", "hive") before the first FileSystem is created. With the default fs.automatic.close=true, cached instances are closed by a JVM shutdown hook.

If the three‑argument method must be used, ensure that only one FileSystem instance exists per HDFS URI, or explicitly call close() after each use to remove the instance from the cache.

Both approaches were tested; the team chose the second (explicit close()) because it required the fewest code changes.
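The effect of the chosen fix can be sketched with a small model in which close() removes the instance from a static cache. CachedFs, open() and cacheSizeAfter() are hypothetical names used only for illustration; the model assumes, as the three‑argument case above does, that every call produces a new key.

```java
import java.io.Closeable;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of "close() removes the instance from the cache"; not Hadoop's code. */
class CloseEvictsDemo {

    static final Map<Object, CachedFs> CACHE = new ConcurrentHashMap<>();

    /** Stand-in for a cached FileSystem instance. */
    static final class CachedFs implements Closeable {
        private final Object key;
        CachedFs(Object key) { this.key = key; }
        @Override public void close() {
            CACHE.remove(key, this); // evict only if this exact instance is still mapped
        }
    }

    /** Models a get(...) call whose cache key is unique on every invocation. */
    static CachedFs open() {
        Object key = new Object();
        CachedFs fs = new CachedFs(key);
        CACHE.put(key, fs);
        return fs;
    }

    /** Runs the given number of calls and reports how many entries remain cached. */
    static int cacheSizeAfter(int calls, boolean closeAfterUse) {
        CACHE.clear();
        for (int i = 0; i < calls; i++) {
            if (closeAfterUse) {
                try (CachedFs fs = open()) {
                    // ... read or write through fs here ...
                }
            } else {
                open(); // never closed: the entry stays in the cache
            }
        }
        return CACHE.size();
    }

    public static void main(String[] args) {
        System.out.println("without close(): " + cacheSizeAfter(500, false) + " entries");
        System.out.println("with close():    " + cacheSizeAfter(500, true) + " entries");
    }
}
```

Hadoop's FileSystem implements java.io.Closeable, so the same try‑with‑resources pattern applies to real instances. One caveat: an instance obtained from the shared cache should only be closed when no other code is still using it.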

Results: After the fix was deployed, old‑generation memory was reclaimed normally and the OOM issue disappeared, as shown by the post‑fix monitoring graphs.

Conclusion: Memory leaks are a common cause of OOM in Java applications. The article outlines a systematic workflow: enable heap dumps on OOM (-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/usr/local/base), analyze the dump with a tool such as Eclipse MAT or VisualVM, locate the leaking code, fix it, and redeploy. It also notes other OOM triggers, such as oversized objects, an undersized heap, or infinite loops.
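As a complement to the JVM flags above, the following sketch checks at runtime whether heap dumps on OOM are enabled and shows how a dump could be requested on demand. It assumes a HotSpot‑based JVM, where com.sun.management.HotSpotDiagnosticMXBean is available; the class and method names are hypothetical.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;

/** Hypothetical helper around the HotSpot diagnostic MXBean (HotSpot JVMs only). */
class HeapDumpSupport {

    private static HotSpotDiagnosticMXBean bean() {
        return ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    }

    /** Returns the current value of a VM option as a string, e.g. "true" or "false". */
    static String vmOption(String name) {
        return bean().getVMOption(name).getValue();
    }

    /** True if the JVM was started with -XX:+HeapDumpOnOutOfMemoryError. */
    static boolean dumpOnOomEnabled() {
        return Boolean.parseBoolean(vmOption("HeapDumpOnOutOfMemoryError"));
    }

    /** Writes a heap dump of live objects to the given .hprof path; fails if the file already exists. */
    static void dumpNow(String hprofPath) throws IOException {
        bean().dumpHeap(hprofPath, true);
    }

    public static void main(String[] args) {
        System.out.println("HeapDumpOnOutOfMemoryError enabled: " + dumpOnOomEnabled());
        System.out.println("HeapDumpPath: " + vmOption("HeapDumpPath"));
    }
}
```

A startup check like dumpOnOomEnabled() can catch a service that was deployed without the flag, before an OOM occurs and leaves no dump to analyze.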
