
How We Solved Repeated OOM Crashes in a Paimon‑RocksDB Service: A Deep Dive

The article recounts a series of three OOM incidents in a production service that combines Paimon data lake and RocksDB via an SDK, detailing the step‑by‑step investigation, the discovery of excessive bucket‑driven thread creation and off‑heap memory leaks, and the final mitigation measures that restored stability.

Alibaba Cloud Developer

Our service integrates the Paimon data lake with RocksDB and uses an SDK for data queries and writes. Recently the system suffered three consecutive OOM (Out‑Of‑Memory) failures in production. The investigation was far from straightforward: we tried many approaches and took several detours before pinpointing the root cause and resolving it.

This article distills that winding troubleshooting experience and shows how we closed in on the root cause step by step, in the hope that it helps others running a similar tech stack.

1. Problem Discovery & Resolution

1.1 First OOM

Phenomenon

One morning an alarm reported that a large number of RPC requests were failing. When we logged in to the service platform, it showed that all external RPC services were offline.

Following standard practice, we first stopped the service to preserve its state for analysis, then restarted one machine and kept the other under observation. The monitoring metrics showed an abnormal spike.

Thread count spike

The Java thread count surged at a specific time (the chart shows only a fragment; in reality the count spikes at several fixed times).

Investigation

The fixed times coincided with the hourly scheduled SDK writes to Paimon tables. We consulted the SDK team and discovered that the Paimon table, when no bucket number is specified, defaults to 100 buckets. The SDK creates a thread for each bucket during writes, resulting in table count × 100 threads, matching the observed Java thread count.
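
The table count × bucket count relationship described above can be checked with a few lines of Java. This is a minimal sketch, not the SDK's code: the class name, the method name, and the table count of 30 are hypothetical, and the only assumption carried over from the text is one writer thread per bucket per table.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;

class BucketThreadEstimate {
    /** Threads created in one write round if every bucket of every table gets its own writer thread. */
    static long estimateWriterThreads(int tableCount, int bucketsPerTable) {
        if (tableCount < 0 || bucketsPerTable < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return (long) tableCount * bucketsPerTable;
    }

    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        int tables = 30; // hypothetical number of tables written each hour

        // Live and peak counts come from the running JVM; compare them with the estimates.
        System.out.println("live threads now:        " + threads.getThreadCount());
        System.out.println("peak threads so far:     " + threads.getPeakThreadCount());
        System.out.println("estimate at 100 buckets: " + estimateWriterThreads(tables, 100));
        System.out.println("estimate at 8 buckets:   " + estimateWriterThreads(tables, 8));
    }
}
```

If the peak thread count reported by the JVM tracks the estimate at the scheduled write times, the per-bucket threads are the likely source of the spike.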

Resolution

After discussion we decided to reduce the bucket number. According to documentation the bucket count should be set as follows:

Small‑size (OLTP) scenarios: set a small bucket number, typically 4‑16, which can also improve query efficiency.

Large‑scale (high‑concurrency streaming) scenarios: set 64, 128, or 256 buckets.

In general, use a power‑of‑two bucket count:

bucketCount ≈ (expected max write parallelism) × N (where N is usually 1‑4)

After adjusting the bucket count and redeploying, the thread count fell into the desired range and the first OOM was resolved.
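
The sizing rule above (parallelism × N, rounded to a power of two) can be sketched as follows. The helper name and the rounding direction (up to the next power of two) are assumptions made for illustration; the source only states that a power-of-two count near parallelism × N is preferred.

```java
class BucketSizing {
    /**
     * Returns the smallest power of two that is >= maxWriteParallelism * factor.
     * factor corresponds to N in the rule above (usually 1 to 4).
     */
    static int recommendBucketCount(int maxWriteParallelism, int factor) {
        if (maxWriteParallelism < 1 || factor < 1) {
            throw new IllegalArgumentException("parallelism and factor must be >= 1");
        }
        long target = (long) maxWriteParallelism * factor;
        if (target > (1L << 30)) {
            throw new IllegalArgumentException("target too large for an int bucket count");
        }
        int highest = Integer.highestOneBit((int) target);
        // If target is already a power of two keep it, otherwise round up.
        return highest == target ? highest : highest << 1;
    }

    public static void main(String[] args) {
        System.out.println(recommendBucketCount(6, 2));  // parallelism 6, N = 2
        System.out.println(recommendBucketCount(16, 4)); // parallelism 16, N = 4
    }
}
```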

1.2 Second OOM

Phenomenon

After fixing the first OOM we strengthened alerts, but more than 20 days later the service alarmed again with the same symptom: all external RPC services went offline.

JVM metrics showed normal thread count but memory usage exceeded 95%.

Running dmesg | grep -i "killed process" revealed that the Java process had been killed.

Extending the memory‑usage query interval showed a slow, steady increase in memory utilization since the last restart.

Memory usage trend

Because the increase was gradual, we started a half‑month off‑heap memory leak investigation.

Heap Investigation

We first examined JVM heap metrics and confirmed that the heap was not leaking; the observed fluctuations were normal GC behavior. The heap max was 4 GB on an 8 GB machine, and the old generation usage stayed near zero.

Thus the root cause must be off‑heap memory.

Off‑Heap Investigation

Thread analysis after the bucket adjustment showed stable thread numbers, ruling out thread‑induced OOM.

We examined DirectMemory and JNIMemory usage and found that DirectMemory accounted for about 312 MB.
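
For readers who want to read the same DirectMemory figure from inside the process, the JDK exposes it through BufferPoolMXBean. The sketch below is an illustration rather than the monitoring we used; the class name is hypothetical. Note that this counter only covers buffers created through ByteBuffer.allocateDirect, so memory obtained by native libraries through their own malloc calls does not appear in it.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectMemorySnapshot {
    /** Bytes currently held by the JVM's "direct" buffer pool, or -1 if the pool is not reported. */
    static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
        long before = directBytesUsed();

        // Allocate 16 MB off-heap to show the counter moving.
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024);
        long after = directBytesUsed();

        System.out.println("heap used (bytes):     " + heapUsed);
        System.out.println("direct before (bytes): " + before);
        System.out.println("direct after (bytes):  " + after);
        System.out.println("buffer capacity:       " + buffer.capacity());
    }
}
```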

Using the internal MAT tool to analyze dump files, we discovered that the off‑heap memory was filled with java.nio.DirectByteBuffer objects.

DirectByteBuffer usage

Further analysis indicated that the RPC framework (Netty) could allocate off‑heap memory that is not visible to the monitoring system, causing a slow rise in memory usage.

We used jcmd <PID> VM.native_memory summary and jcmd <PID> VM.native_memory detail to inspect native allocations. The committed memory grew by about 57 MB over a day, which did not match the observed increase.

Using async‑profiler we captured a flame graph; the majority of CPU time was still spent in RocksDB.

Flame graph

We consulted the SDK team again; they confirmed that the current SDK version had a RocksDB JNI memory leak: native memory allocated through JNI was never released.

1.3 Third OOM

Phenomenon

Even after applying several mitigation measures, memory utilization on one of the two production machines (Machine A) kept climbing in a stair‑step fashion, while Machine B remained stable.

Machine A memory
Machine B memory

The only difference between the machines was that most Paimon writes via RocksDB were executed on Machine A. The stair‑step increases coincided with the Paimon write times, leading us to again suspect the SDK.
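
The "steps coincide with write times" observation can also be checked mechanically rather than by eye. The sketch below is a hypothetical helper, not something from our tooling: it assumes memory samples as (minute, RSS in MB) pairs, an hourly write schedule at minute 0, and made-up sample values.

```java
import java.util.ArrayList;
import java.util.List;

class StairStepDetector {
    /** Returns the sample times (in minutes) at which memory jumped by at least minJumpMb. */
    static List<Long> findSteps(long[] minute, long[] rssMb, long minJumpMb) {
        if (minute.length != rssMb.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        List<Long> steps = new ArrayList<>();
        for (int i = 1; i < rssMb.length; i++) {
            if (rssMb[i] - rssMb[i - 1] >= minJumpMb) {
                steps.add(minute[i]);
            }
        }
        return steps;
    }

    /** Counts steps that fall within toleranceMinutes after a scheduled run (every periodMinutes). */
    static int countNearSchedule(List<Long> stepMinutes, long periodMinutes, long toleranceMinutes) {
        int hits = 0;
        for (long t : stepMinutes) {
            if (t % periodMinutes <= toleranceMinutes) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Made-up samples: flat between writes, a jump shortly after each full hour.
        long[] minute = {50, 60, 70, 110, 120, 130};
        long[] rssMb = {1000, 1000, 1060, 1062, 1063, 1120};

        List<Long> steps = findSteps(minute, rssMb, 50);
        int near = countNearSchedule(steps, 60, 10);
        System.out.println(near + " of " + steps.size() + " steps fall within 10 minutes of an hourly write");
    }
}
```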

2. Troubleshooting Tools

2.1 MAT (Memory Analyzer Tool)

MAT is a high‑performance Java heap analyzer that can quickly locate memory‑leak roots in .hprof dumps. Its main features include leak‑suspect reports, dominator trees, histograms, OQL queries, and GC‑root path analysis.

2.2 NMT (Native Memory Tracking)

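NMT tracks native memory allocated by the JVM itself, such as Metaspace, thread stacks, the code cache, and internal JVM structures. It does not track memory allocated by third‑party native code, for example libraries called through JNI, which is why its totals can lag behind the growth seen at the process level. Enable it with -XX:NativeMemoryTracking=detail (adds ~5‑10% overhead). Common commands: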

# Summary view
jcmd <PID> VM.native_memory summary
# Detailed view (requires detail flag)
jcmd <PID> VM.native_memory detail
# Create baseline
jcmd <PID> VM.native_memory baseline
# Compare with baseline
jcmd <PID> VM.native_memory summary.diff
# Shut down NMT
jcmd <PID> VM.native_memory shutdown
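
To compare two NMT summaries taken some time apart, the "Total" line can be parsed and subtracted. This is a minimal sketch with a hypothetical class name; it assumes the default KB scale of the summary output, and the two sample lines are made-up values chosen to reproduce a 57 MB difference, not real output from our service.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NmtSummaryParser {
    // Matches the total line of "jcmd <PID> VM.native_memory summary" at the default KB scale.
    private static final Pattern TOTAL =
            Pattern.compile("^Total: reserved=(\\d+)KB, committed=(\\d+)KB", Pattern.MULTILINE);

    /** Extracts the committed total in KB from an NMT summary. */
    static long committedKb(String nmtSummary) {
        Matcher m = TOTAL.matcher(nmtSummary);
        if (!m.find()) {
            throw new IllegalArgumentException("no 'Total:' line found in NMT summary");
        }
        return Long.parseLong(m.group(2));
    }

    public static void main(String[] args) {
        // Illustrative snapshots only; real summaries contain many more lines.
        String day1 = "Native Memory Tracking:\n\nTotal: reserved=5641876KB, committed=4440512KB\n";
        String day2 = "Native Memory Tracking:\n\nTotal: reserved=5700244KB, committed=4498880KB\n";

        long growthKb = committedKb(day2) - committedKb(day1);
        System.out.println("committed growth: " + (growthKb / 1024) + " MB");
    }
}
```

When the growth computed this way is much smaller than the growth in process memory, the difference points at allocations NMT cannot see.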

2.3 Arthas

Arthas provides real‑time JVM introspection, including dashboard, jvm, memory, OGNL queries, and forced Full GC.

# View JVM memory, threads, GC
dashboard
# Show JVM parameters and memory pools
jvm
# Show memory pool usage
memory
# Execute OGNL expression (static members use the @Class@member form; CacheManager is a placeholder)
ognl '@com.example.CacheManager@cache.size()'
# Trigger Full GC
ognl '#runtime=@java.lang.Runtime@getRuntime(), #runtime.gc()'

2.4 async‑profiler

async‑profiler is a low‑overhead sampling profiler for CPU, memory allocation, and lock contention. It can also profile native memory.

# Start native‑memory profiling
asprof start -e nativemem -f app.jfr <PID>
# Stop profiling; the recording is written to app.jfr
asprof stop <PID>
# Convert the recording into a leak flame graph
jfrconv --total --nativemem --leak app.jfr app-leak.html

2.5 Linux Commands

Standard Linux tools such as top, pmap, and ps help observe process memory layout and identify suspicious regions.
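
The resident set size that top and ps report can also be read from /proc, which makes it easy to log next to JVM metrics. The sketch below is an illustration with a hypothetical class name; it assumes a Linux host where /proc/self/status contains a VmRSS line and prints nothing useful elsewhere.

```java
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

class RssReader {
    /** Parses the VmRSS value (in kB) from the lines of /proc/<pid>/status, or returns -1 if absent. */
    static long parseVmRssKb(List<String> statusLines) {
        for (String line : statusLines) {
            if (line.startsWith("VmRSS:")) {
                String[] parts = line.trim().split("\\s+"); // e.g. ["VmRSS:", "6291456", "kB"]
                return Long.parseLong(parts[1]);
            }
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        Path status = Paths.get("/proc/self/status");
        if (!Files.isReadable(status)) {
            System.out.println("/proc/self/status not available on this platform");
            return;
        }
        long rssKb = parseVmRssKb(Files.readAllLines(status));
        long heapCommittedKb =
                ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getCommitted() / 1024;

        // The gap between RSS and committed heap is a rough upper bound on non-heap memory.
        System.out.println("RSS (kB):            " + rssKb);
        System.out.println("heap committed (kB): " + heapCommittedKb);
        System.out.println("RSS - heap (kB):     " + (rssKb - heapCommittedKb));
    }
}
```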

3. Troubleshooting Thought Process

3.1 Preserve the Incident

When a problem occurs, keep the affected machine alive for analysis; do not restart immediately, as losing the state discards valuable clues.

3.2 Examine System Metrics

Use monitoring dashboards to compare heap, off‑heap, and overall memory trends, which helps narrow down the problem area.

3.3 Apply the Right Tools

If the issue is heap‑related, use MAT; for off‑heap problems, rely on NMT, async‑profiler, or gperftools.

3.4 Seek Expertise

Search the web for similar cases, consult internal experts, or use large language models to accelerate root‑cause identification.

3.5 Document and Share

Record the investigation steps and lessons learned for personal growth and to help teammates facing similar issues.

4. Final Solution

Although we identified the root cause as an off‑heap memory leak in the RocksDB‑based SDK, fixing it required collaboration with the JVM experts. The mitigation steps included:

Lower the JVM heap limit (-Xmx) to reserve more physical memory for off‑heap usage.

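Enable -XX:+AlwaysPreTouch so the JVM touches every heap page at startup. The heap is then fully resident from the beginning, so later growth in process memory can be attributed to off‑heap usage rather than to the heap gradually filling in.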

Increase the machine’s physical memory.

Upgrade the Netty shared SAR package to reduce Netty’s off‑heap footprint.
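
The effect of the first and third measures can be reasoned about with simple arithmetic: whatever physical memory is not claimed by the heap or by other fixed reservations is the headroom a leak can consume before the kernel kills the process. The sketch below is a back-of-the-envelope model; the 1024 MB reservation and the 128 MB per day leak rate are hypothetical numbers, and only the 8 GB machine and 4 GB heap come from the incident.

```java
class OffHeapBudget {
    /** Memory (MB) left for off-heap use after the heap and other fixed reservations. */
    static long headroomMb(long physicalMb, long xmxMb, long otherReservedMb) {
        return Math.max(physicalMb - xmxMb - otherReservedMb, 0);
    }

    /** Days until the headroom is consumed at a constant leak rate. */
    static double daysUntilExhausted(long headroomMb, double leakMbPerDay) {
        if (leakMbPerDay <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return headroomMb / leakMbPerDay;
    }

    public static void main(String[] args) {
        long other = 1024;   // hypothetical: OS, Metaspace, thread stacks, code cache
        double leak = 128.0; // hypothetical leak rate in MB per day

        long before = headroomMb(8192, 4096, other); // 8 GB machine, -Xmx4g
        long after = headroomMb(8192, 3072, other);  // same machine, -Xmx3g

        System.out.println("headroom with -Xmx4g: " + before + " MB, days: " + daysUntilExhausted(before, leak));
        System.out.println("headroom with -Xmx3g: " + after + " MB, days: " + daysUntilExhausted(after, leak));
    }
}
```

These measures only buy time: as long as the leak rate stays above zero, the headroom is eventually used up.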

Ultimately we switched the data‑ingestion architecture: instead of writing to Paimon directly from the application (old architecture), we now send messages to Flink, which writes to Paimon (new architecture). Flink provides mature resource management, back‑pressure, state handling with exactly‑once semantics, and better scalability for lake storage.

References

Native Memory Tracking (NMT) documentation: https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html

Arthas command list: https://arthas.aliyun.com/doc/commands.html#jvm-%E7%9B%B8%E5%85%B3
