Why a Tiny RPC Change Crashed Our Service: 4 GB OOM Bug Explained
A seemingly harmless bug in an RPC framework caused a 4 GB byte array allocation, which triggered repeated OutOfMemoryErrors in service B after service A was redeployed. This article walks through the diagnosis, the root‑cause analysis, and a simple fix.
Case Overview
Online systems often encounter OutOfMemory (OOM) errors not because of business code but due to bugs in underlying open‑source components.
System Architecture
Services communicate via an RPC framework built on a custom wrapper.
Incident
Service A was updated and redeployed; shortly afterwards, service B crashed with an OOM error, even though it had never had this problem before.
Log inspection on service B showed the exception java.lang.OutOfMemoryError: Java heap space.
Initial Diagnosis
Reviewing the logs revealed that the OOM originated from the self‑developed RPC framework during request handling.
Memory Snapshot Analysis
Using MAT, the largest object in the heap was a massive byte[] array occupying the entire 4 GB heap. The array was allocated inside the RPC framework.
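A heap snapshot has to exist before MAT can open it. The article does not say how the dump was captured; one common option on HotSpot JVMs is to start the service with -XX:+HeapDumpOnOutOfMemoryError. The sketch below shows another option, triggering a dump from code through the JDK's HotSpotDiagnosticMXBean. The class name HeapDumpSketch, the method dumpHeapToTempDir, and the file naming are assumptions made for illustration and are not part of the original incident.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: writes an .hprof file that a tool such as MAT can open.
class HeapDumpSketch {

    /** Dumps the current JVM's heap into a fresh temp directory and returns the file path. */
    static Path dumpHeapToTempDir() {
        try {
            Path dir = Files.createTempDirectory("heapdump");
            // The target file must not exist yet; newer JDKs also expect the .hprof suffix.
            Path target = dir.resolve("heap-" + System.nanoTime() + ".hprof");
            HotSpotDiagnosticMXBean bean =
                    ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
            // live = true dumps only objects that are still reachable.
            bean.dumpHeap(target.toString(), true);
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Heap dump written to " + dumpHeapToTempDir());
    }
}
```

In an OOM investigation the automatic dump on error is usually more practical, because the process may be unable to run extra code once the heap is exhausted.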
Source Code Analysis
Identify the component causing the OOM by checking the logs; it is often a framework such as Tomcat, Jetty, or a custom RPC library.
Use heap analysis tools (e.g., MAT) to locate the biggest memory consumer and trace its references.
Inspect the source of the offending framework to understand its request‑processing flow.
The RPC framework serializes request objects into a byte[] buffer. When deserialization fails (e.g., due to mismatched Request class definitions between services), the framework allocates a default 4 GB buffer to store the raw bytes, instantly exhausting heap memory.
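The failure mode described above can be sketched in a few lines. This is not the framework's actual code, which the article does not show; it is a minimal illustration that assumes a hypothetical decode step which rejects payloads whose first byte does not match an expected version marker. The point is the pattern in the catch block: the fallback buffer is sized from a fixed default rather than from the payload. The default is set to 64 bytes here so the sketch can run anywhere; in the incident the default was 4 GB.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of the problematic pattern; all names are illustrative.
class FallbackBufferSketch {

    // Stand-in for the framework's default fallback size (4 GB in the incident).
    static final int DEFAULT_FALLBACK_BYTES = 64;

    static class DecodeException extends Exception {
        DecodeException(String message) {
            super(message);
        }
    }

    // Stand-in for real deserialization: the first byte must match the expected version.
    static String decode(byte[] payload, byte expectedVersion) throws DecodeException {
        if (payload.length == 0 || payload[0] != expectedVersion) {
            throw new DecodeException("request definition mismatch");
        }
        return new String(payload, 1, payload.length - 1, StandardCharsets.UTF_8);
    }

    /** Returns an empty array on success, or the fallback buffer when decoding fails. */
    static byte[] handle(byte[] payload, byte expectedVersion) {
        try {
            decode(payload, expectedVersion);
            return new byte[0];
        } catch (DecodeException e) {
            // Problem: the size ignores payload.length, so a tiny request
            // still costs DEFAULT_FALLBACK_BYTES of heap.
            byte[] raw = new byte[DEFAULT_FALLBACK_BYTES];
            System.arraycopy(payload, 0, raw, 0, Math.min(payload.length, raw.length));
            return raw;
        }
    }

    public static void main(String[] args) {
        byte[] mismatched = {2, 'h', 'i'};
        System.out.println(handle(mismatched, (byte) 1).length);
    }
}
```

A three-byte request that fails to decode still allocates the full default buffer, which is why a single mismatched request can be enough to exhaust the heap when the default is very large.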
Root Cause
Service A’s engineers added new fields to the Request protobuf class without updating service B. During deserialization on service B, the mismatched definitions caused a failure, which triggered the fallback allocation of a 4 GB byte[] and led to the OOM.
Solution
Reduce the default buffer size in the RPC framework from 4 GB to a reasonable limit such as 4 MB.
Ensure that Request class definitions remain consistent across all services.
After applying these changes, the OOM issue disappeared and service stability was restored.
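The first part of the fix, capping the fallback buffer, might look like the sketch below. It assumes the buffer only needs to hold the undecodable bytes for diagnostics, so it is sized from the payload and bounded by a limit. The names BoundedFallbackBuffer, MAX_FALLBACK_BYTES, and captureRawBytes are hypothetical; the 4 MB value mirrors the limit suggested above.

```java
import java.util.Arrays;

// Hypothetical sketch of a bounded fallback buffer; names are illustrative.
class BoundedFallbackBuffer {

    // Upper bound for the raw-bytes buffer: 4 MB, as suggested in the solution.
    static final int MAX_FALLBACK_BYTES = 4 * 1024 * 1024;

    /** Keeps at most MAX_FALLBACK_BYTES of the payload that failed to decode. */
    static byte[] captureRawBytes(byte[] payload) {
        int size = Math.min(payload.length, MAX_FALLBACK_BYTES);
        // Arrays.copyOf truncates when size is smaller than payload.length.
        return Arrays.copyOf(payload, size);
    }

    public static void main(String[] args) {
        byte[] small = new byte[128];
        byte[] large = new byte[MAX_FALLBACK_BYTES + 1024];
        System.out.println(captureRawBytes(small).length);
        System.out.println(captureRawBytes(large).length);
    }
}
```

Sizing the buffer from the payload means a small bad request costs only a few bytes, while the cap protects against oversized ones. Whether truncated bytes are acceptable depends on what the framework does with the fallback buffer, which the article does not specify.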
