How We Solved Repeated OOM Crashes in a Paimon‑RocksDB Service: A Deep Dive
The article recounts a series of three OOM incidents in a production service that combines Paimon data lake and RocksDB via an SDK, detailing the step‑by‑step investigation, the discovery of excessive bucket‑driven thread creation and off‑heap memory leaks, and the final mitigation measures that restored stability.
Our service integrates the Paimon data lake with RocksDB, using an SDK for data queries and writes. Recently the system suffered three consecutive OOM (Out-Of-Memory) failures in production. The investigation was far from straightforward: we tried many approaches and took several detours before we pinpointed the root cause and resolved it.
This article distills that troubleshooting experience, showing how we closed in on the cause step by step, in the hope that it helps others running a similar tech stack.
1. Problem Discovery & Resolution
1.1 First OOM
Phenomenon
One morning an alarm reported that a large number of RPC requests were failing; after logging in to the service platform, we saw that all external RPC services were down.
Following standard practice, we first stopped the service to preserve the state for analysis, then restarted one machine and observed the other. In the monitoring metrics we noticed an abnormal spike.
The Java thread count surged at a specific time (the chart shows only a fragment; in reality the count spikes at several fixed times).
Investigation
The fixed times coincided with the hourly scheduled SDK writes to Paimon tables. We consulted the SDK team and discovered that the Paimon table, when no bucket number is specified, defaults to 100 buckets. The SDK creates a thread for each bucket during writes, resulting in table count × 100 threads, matching the observed Java thread count.
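The arithmetic behind that match can be sketched as follows. This is a minimal illustration, not the SDK's code: the class name, the method names, and the table count of 30 are hypothetical, and only the one-thread-per-bucket behaviour and the default of 100 buckets come from the description above.

```java
import java.lang.management.ManagementFactory;

class WriterThreadEstimate {
    /** Threads created in one write round if every bucket of every table gets its own thread. */
    public static long estimateWriterThreads(int tableCount, int bucketsPerTable) {
        if (tableCount < 0 || bucketsPerTable < 0) {
            throw new IllegalArgumentException("counts must be non-negative");
        }
        return (long) tableCount * bucketsPerTable;
    }

    /** Live thread count of the current JVM, for comparison with the estimate. */
    public static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }

    public static void main(String[] args) {
        int tableCount = 30;        // hypothetical number of Paimon tables
        int bucketsPerTable = 100;  // default bucket count described above
        System.out.println("estimated writer threads: "
                + estimateWriterThreads(tableCount, bucketsPerTable));
        System.out.println("live JVM threads right now: " + liveThreadCount());
    }
}
```

Comparing the estimate with the live count sampled during a scheduled write is one way to confirm or rule out the one-thread-per-bucket explanation.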
Resolution
After discussion we decided to reduce the bucket number. According to documentation the bucket count should be set as follows:
Small‑size (OLTP) scenarios: set a small bucket number, typically 4‑16, which can also improve query efficiency.
Large‑scale (high‑concurrency streaming) scenarios: set 64, 128, or 256 buckets.
In general, use a power‑of‑two bucket count:
bucketCount ≈ (expected max write parallelism) × N (where N is usually 1‑4)
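The rule of thumb above can be sketched in a few lines. Rounding the product up to the next power of two is our assumption about how to combine the two hints (the formula and the power-of-two advice); the class and method names are hypothetical.

```java
class BucketCountSketch {
    /** Smallest power of two that is >= maxWriteParallelism * n. */
    public static int suggestBucketCount(int maxWriteParallelism, int n) {
        if (maxWriteParallelism < 1 || n < 1) {
            throw new IllegalArgumentException("both arguments must be >= 1");
        }
        long target = (long) maxWriteParallelism * n;
        if (target > (1L << 30)) {
            throw new IllegalArgumentException("target too large for an int power of two");
        }
        int t = (int) target;
        int floor = Integer.highestOneBit(t);   // largest power of two <= t
        return floor == t ? t : floor << 1;     // round up unless already a power of two
    }

    public static void main(String[] args) {
        System.out.println(suggestBucketCount(4, 2)); // 8
        System.out.println(suggestBucketCount(6, 2)); // 16
    }
}
```

With a write parallelism of 4 and N = 2 the sketch suggests 8 buckets, which falls inside the 4-16 range recommended for small scenarios.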
After adjusting the bucket count and redeploying, the thread count fell into the desired range and the first OOM was resolved.
1.2 Second OOM
Phenomenon
After fixing the first OOM we strengthened alerts, but more than 20 days later the service alarmed again with the same symptom: all external RPC services went offline.
JVM metrics showed normal thread count but memory usage exceeded 95%.
Running dmesg | grep -i "killed process" showed that the kernel's OOM killer had killed the Java process.
Extending the memory‑usage query interval showed a slow, steady increase in memory utilization since the last restart.
Because the increase was gradual, we started a half‑month off‑heap memory leak investigation.
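A slow, steady climb like this can be quantified with a simple least-squares fit over the monitoring samples. The sketch below is illustrative only: the class name, method names, and sample values are hypothetical and are not taken from our monitoring data.

```java
class MemoryGrowthSketch {
    /** Least-squares slope of memory usage (percent) against time (hours). */
    public static double slopePerHour(double[] hours, double[] usagePercent) {
        int n = hours.length;
        if (n < 2 || n != usagePercent.length) {
            throw new IllegalArgumentException("need at least two paired samples");
        }
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += hours[i];
            meanY += usagePercent[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0, sxx = 0;
        for (int i = 0; i < n; i++) {
            double dx = hours[i] - meanX;
            sxy += dx * (usagePercent[i] - meanY);
            sxx += dx * dx;
        }
        if (sxx == 0) {
            throw new IllegalArgumentException("all samples share the same timestamp");
        }
        return sxy / sxx;
    }

    /** Hours until usage reaches the limit at the fitted rate; infinite if usage is not growing. */
    public static double hoursUntil(double currentPercent, double limitPercent, double slopePerHour) {
        if (slopePerHour <= 0) {
            return Double.POSITIVE_INFINITY;
        }
        return Math.max(0, (limitPercent - currentPercent) / slopePerHour);
    }

    public static void main(String[] args) {
        double[] hours = {0, 24, 48, 72};   // hypothetical sample times
        double[] usage = {50, 52, 54, 56};  // hypothetical memory usage in percent
        double slope = slopePerHour(hours, usage);
        System.out.printf("growth: %.3f %%/hour, hours until 95%%: %.0f%n",
                slope, hoursUntil(56, 95, slope));
    }
}
```

An estimate of this kind tells you roughly how long the service can run between restarts while the leak is still being investigated.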
Heap Investigation
We first examined JVM heap metrics and confirmed that the heap was not leaking; the observed fluctuations were normal GC behavior. The heap max was 4 GB on an 8 GB machine, and the old generation usage stayed near zero.
Thus the root cause must be off‑heap memory.
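For readers without a metrics dashboard, the same heap check can be approximated from inside the JVM with the standard java.lang.management beans. This is a minimal sketch with a hypothetical class name; pool names differ between garbage collectors, so the output is illustrative rather than a reproduction of our charts.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

class HeapPoolReport {
    /** Bytes currently used on the Java heap. */
    public static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
    }

    /** One line per heap pool (for example eden, survivor, old generation). */
    public static String describeHeapPools() {
        StringBuilder sb = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip Metaspace, code cache and other non-heap pools
            }
            MemoryUsage u = pool.getUsage();
            if (u == null) {
                continue; // pool is no longer valid
            }
            sb.append(String.format("%-28s used=%d KB committed=%d KB%n",
                    pool.getName(), u.getUsed() / 1024, u.getCommitted() / 1024));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println("heap used: " + heapUsedBytes() / 1024 + " KB");
        System.out.print(describeHeapPools());
    }
}
```

If the old-generation line stays flat across several samples while the process's resident memory keeps growing, the growth is coming from outside the heap.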
Off‑Heap Investigation
Thread analysis after the bucket adjustment showed stable thread numbers, ruling out thread‑induced OOM.
We examined DirectMemory and JNIMemory usage and found that DirectMemory accounted for about 312 MB.
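The DirectMemory figure can also be read programmatically from the JVM's "direct" buffer pool. The sketch below uses the standard BufferPoolMXBean; the class name is hypothetical. Note that this pool only accounts for buffers created through ByteBuffer.allocateDirect, so native memory allocated directly by JNI code does not appear in it.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.nio.ByteBuffer;

class DirectMemoryReport {
    /** Bytes held by the JVM's "direct" buffer pool, or -1 if the pool is not exposed. */
    public static long directBytesUsed() {
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            if ("direct".equals(pool.getName())) {
                return pool.getMemoryUsed();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        long before = directBytesUsed();
        ByteBuffer buffer = ByteBuffer.allocateDirect(16 * 1024 * 1024); // 16 MB off-heap
        long after = directBytesUsed();
        System.out.println("direct pool before: " + before + " bytes");
        System.out.println("direct pool after:  " + after + " bytes");
        System.out.println("buffer capacity:    " + buffer.capacity() + " bytes");
    }
}
```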
Using the internal MAT tool to analyze heap dumps, we found a large number of java.nio.DirectByteBuffer objects, the on-heap handles through which off-heap (direct) memory is held.
Further analysis indicated that the RPC framework (Netty) could allocate off‑heap memory that is not visible to the monitoring system, causing a slow rise in memory usage.
We used jcmd <PID> VM.native_memory summary and jcmd <PID> VM.native_memory detail to inspect native allocations. The committed memory grew by about 57 MB over a day, which did not match the observed increase.
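Comparing two NMT summaries by hand is error-prone, so a small parser can help. The sketch below assumes the "Total: reserved=...KB, committed=...KB" line format that NMT prints when the scale is KB; the class name and the sample numbers are hypothetical, chosen only to reproduce a 57 MB difference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NmtTotalParser {
    // Matches the NMT summary line, e.g. "Total: reserved=5000000KB, committed=1000000KB"
    private static final Pattern TOTAL =
            Pattern.compile("Total:\\s*reserved=(\\d+)KB,\\s*committed=(\\d+)KB");

    /** Committed native memory in KB, taken from the "Total:" line of an NMT summary. */
    public static long committedKb(String nmtSummary) {
        Matcher m = TOTAL.matcher(nmtSummary);
        if (!m.find()) {
            throw new IllegalArgumentException("no 'Total:' line with KB units found");
        }
        return Long.parseLong(m.group(2));
    }

    /** Growth in committed memory between two summaries, in whole MB. */
    public static long committedGrowthMb(String earlier, String later) {
        return (committedKb(later) - committedKb(earlier)) / 1024;
    }

    public static void main(String[] args) {
        // Hypothetical snapshots taken one day apart.
        String day1 = "Total: reserved=5000000KB, committed=1000000KB";
        String day2 = "Total: reserved=5000000KB, committed=1058368KB";
        System.out.println("committed growth: " + committedGrowthMb(day1, day2) + " MB");
    }
}
```

When the growth reported by NMT is much smaller than the growth in the process's resident memory, as it was here, the missing memory is being allocated outside the JVM's own accounting, for example by native libraries called through JNI.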
Using async‑profiler we captured a flame graph; the majority of CPU time was still spent in RocksDB.
We consulted the SDK team again; they confirmed that the current SDK version had a RocksDB JNI memory leak: native memory allocated through JNI was never released.
1.3 Third OOM
Phenomenon
Even after applying several mitigation measures, memory utilization on one of the two production machines (Machine A) kept climbing in a stair‑step fashion, while Machine B remained stable.
The only difference between the machines was that most Paimon writes via RocksDB were executed on Machine A. The stair‑step increases coincided with the Paimon write times, leading us to again suspect the SDK.
2. Troubleshooting Tools
2.1 MAT (Memory Analyzer Tool)
MAT is a high‑performance Java heap analyzer that can quickly locate memory‑leak roots in .hprof dumps. Its main features include leak‑suspect reports, dominator trees, histograms, OQL queries, and GC‑root path analysis.
2.2 NMT (Native Memory Tracking)
NMT tracks native memory allocations such as Metaspace, thread stacks, JNI code, and internal JVM structures. Enable it with -XX:NativeMemoryTracking=detail (adds ~5‑10% overhead). Common commands:
# Summary view
jcmd <PID> VM.native_memory summary
# Detailed view (requires detail flag)
jcmd <PID> VM.native_memory detail
# Create baseline
jcmd <PID> VM.native_memory baseline
# Compare with baseline
jcmd <PID> VM.native_memory summary.diff
# Shut down NMT
jcmd <PID> VM.native_memory shutdown
2.3 Arthas
Arthas provides real-time JVM introspection, including the dashboard, jvm, and memory commands, OGNL expressions, and a forced Full GC.
# View JVM memory, threads, GC
dashboard
# Show JVM parameters and memory pools
jvm
# Show memory pool usage
memory
# Execute an OGNL expression
ognl 'com.example.CacheManager.cache.size()'
# Trigger Full GC
ognl '#runtime=@java.lang.Runtime@getRuntime(), #runtime.gc()'
2.4 async-profiler
async‑profiler is a low‑overhead sampling profiler for CPU, memory allocation, and lock contention. It can also profile native memory.
# Start native-memory profiling, recording to a JFR file
asprof start -e nativemem -f app.jfr <PID>
# Stop profiling
asprof stop <PID>
# Convert the recording into a leak flame graph
jfrconv --total --nativemem --leak app.jfr app-leak.html
2.5 Linux Commands
Standard Linux tools such as top, pmap, and ps help observe process memory layout and identify suspicious regions.
3. Troubleshooting Thought Process
3.1 Preserve the Incident
When a problem occurs, keep the affected machine alive for analysis; do not restart immediately, as losing the state discards valuable clues.
3.2 Examine System Metrics
Use monitoring dashboards to compare heap, off‑heap, and overall memory trends, which helps narrow down the problem area.
3.3 Apply the Right Tools
If the issue is heap‑related, use MAT; for off‑heap problems, rely on NMT, async‑profiler, or gperftools.
3.4 Seek Expertise
Search the web for similar cases, consult internal experts, or use large language models to accelerate root‑cause identification.
3.5 Document and Share
Record the investigation steps and lessons learned for personal growth and to help teammates facing similar issues.
4. Final Solution
Although we identified the root cause as an off‑heap memory leak in the RocksDB‑based SDK, fixing it required collaboration with the JVM experts. The mitigation steps included:
Lower the JVM heap limit (-Xmx) to reserve more physical memory for off-heap usage.
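Enable -XX:+AlwaysPreTouch so the JVM touches every heap page at startup. The heap is then backed by physical memory from the start, so later growth in resident memory can be attributed to off-heap usage rather than to the heap gradually filling in.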
Increase the machine’s physical memory.
Upgrade the Netty shared SAR package to reduce Netty’s off‑heap footprint.
Ultimately we switched the data‑ingestion architecture: instead of writing to Paimon directly from the application (old architecture), we now send messages to Flink, which writes to Paimon (new architecture). Flink provides mature resource management, back‑pressure, state handling with exactly‑once semantics, and better scalability for lake storage.
References
Native Memory Tracking (NMT) documentation: https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
Arthas command list: https://arthas.aliyun.com/doc/commands.html#jvm-%E7%9B%B8%E5%85%B3
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.
