Memory Leak Investigation of Sentinel in Microservices: Root Cause and Fix
Activating a Sentinel degradation rule in production triggered a severe memory leak that caused slow requests, CPU spikes, frequent full GCs, and OOM errors. The leak was traced to a SkyWalking bug in handling exceptions with empty stack traces, and was resolved by updating SkyWalking and adjusting Sentinel's exception handling.
To implement circuit breaking, rate limiting, and degradation in a microservice architecture, the team introduced the Sentinel component with customizations such as persistent rule storage, monitoring aggregation, and unified login permissions.
After passing functional verification and performance testing, Sentinel was rolled out to production as a gray (canary) release. When a degradation rule was activated online, the monitoring system was flooded with alerts, and the service began to see a large number of slow requests that eventually led to a complete service freeze.
Investigation showed a large volume of slow requests, CPU spikes, frequent full GCs, and java.lang.OutOfMemoryError. The team removed the rule, restarted the service, and reproduced the issue on a pre‑release machine. They dumped the JVM heap using jmap -dump:format=b,file=dump.bin ${pid}, compressed the dump with tar, and analyzed it with Eclipse MAT (https://www.eclipse.org/mat/).
The heap analysis revealed that each of the 200 Dubbo threads consumed about 2% of memory; one thread held a 200 MB StringBuilder filled with repetitive strings. A thread dump taken with jstack ${pid} > jstack.txt showed a thread running SkyWalking code stuck in Arrays.copyOf, the call a StringBuilder makes when it grows its backing array. SkyWalking, a distributed tracing agent, appends exception information to a StringBuilder.
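With 200 Dubbo threads in the dump, picking out the ones sitting in Arrays.copyOf by eye is tedious. The following is a minimal sketch of a filter over the saved jstack.txt, assuming the usual jstack text layout in which each thread section starts with the quoted thread name and frames appear on lines beginning with "at". The class name JstackFrameFinder and its method are hypothetical helpers, not part of any tool mentioned in the article.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: lists the threads in a jstack dump whose stack
// contains a given frame, for example "java.util.Arrays.copyOf".
class JstackFrameFinder {

    static List<String> threadsWithFrame(List<String> lines, String frame) {
        List<String> matches = new ArrayList<>();
        String currentThread = null;
        boolean alreadyMatched = false;
        for (String line : lines) {
            if (line.startsWith("\"")) {
                // Assumed layout: a thread header begins with the quoted thread name.
                int end = line.indexOf('"', 1);
                currentThread = end > 0 ? line.substring(1, end) : line;
                alreadyMatched = false;
            } else if (currentThread != null && !alreadyMatched
                    && line.trim().startsWith("at ") && line.contains(frame)) {
                // Report each thread at most once, even if the frame repeats.
                matches.add(currentThread);
                alreadyMatched = true;
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        Path file = Paths.get(args.length > 0 ? args[0] : "jstack.txt");
        String frame = args.length > 1 ? args[1] : "java.util.Arrays.copyOf";
        if (!Files.exists(file)) {
            System.out.println("usage: java JstackFrameFinder <jstack file> [frame]");
            return;
        }
        threadsWithFrame(Files.readAllLines(file), frame).forEach(System.out::println);
    }
}
```

Run against the dump, it prints one thread name per line. Threads that show up repeatedly across several dumps taken a few seconds apart are the ones worth inspecting in the heap analysis.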
SkyWalking’s handling of exceptions thrown by Sentinel caused the problem. Sentinel’s BlockException overrides fillInStackTrace() to return this, so the exception carries an empty stack trace. When SkyWalking converts such an exception to text, it keeps appending to the StringBuilder because the stack trace is empty, eventually exhausting heap memory.
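The effect of overriding fillInStackTrace() is easy to check in isolation. The sketch below uses a hypothetical NoTraceException as a stand-in for Sentinel’s BlockException; only the override mirrors what is described above. On a standard JVM, getStackTrace() on such an exception returns a zero-length array rather than null, so any code that relies on iterating stack frames simply sees no frames.

```java
// Hypothetical stand-in for Sentinel's BlockException; only the
// fillInStackTrace() override mirrors the behavior described in the article.
class NoTraceException extends Exception {
    NoTraceException(String message) {
        super(message);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        // Skip the stack walk entirely; no frames are recorded.
        return this;
    }
}

class EmptyTraceDemo {

    /** Number of stack frames recorded for the given throwable. */
    static int frameCount(Throwable t) {
        return t.getStackTrace().length;
    }

    public static void main(String[] args) {
        Throwable normal = new Exception("normal");
        Throwable quiet = new NoTraceException("blocked");
        System.out.println("normal exception has frames: " + (frameCount(normal) > 0));
        System.out.println("no-trace exception frame count: " + frameCount(quiet));
    }
}
```

Skipping the stack walk makes throwing cheap, which matters for an exception raised on every blocked request, but it also means downstream tooling has to cope with a frameless exception.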
The bug was fixed in newer versions of SkyWalking (see https://github.com/apache/skywalking/pull/2931 ). The fix limits the number of recorded exception layers and prevents OOM when the stack trace is empty.
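As an illustration of that kind of guard, here is a minimal sketch of a throwable-to-string conversion that caps both the number of exception layers and the total output length. It is not the actual SkyWalking patch: BoundedThrowableFormatter, QuietException, maxDepth, and maxLength are hypothetical names chosen for this example.

```java
// Hypothetical exception without a stack trace, used to exercise the formatter.
class QuietException extends Exception {
    QuietException(String message) {
        super(message);
    }

    @Override
    public synchronized Throwable fillInStackTrace() {
        return this;
    }
}

// Hypothetical formatter: walks the cause chain with two independent limits,
// so it terminates even when every layer has an empty stack trace.
class BoundedThrowableFormatter {

    static String format(Throwable throwable, int maxDepth, int maxLength) {
        StringBuilder out = new StringBuilder();
        Throwable current = throwable;
        int depth = 0;
        // Both limits are checked per layer, not only per stack frame.
        while (current != null && depth < maxDepth && out.length() < maxLength) {
            out.append(current).append(System.lineSeparator());
            for (StackTraceElement frame : current.getStackTrace()) {
                if (out.length() >= maxLength) {
                    break;
                }
                out.append("at ").append(frame).append(System.lineSeparator());
            }
            current = current.getCause();
            depth++;
        }
        if (out.length() > maxLength) {
            out.setLength(maxLength); // hard cap on the final size
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Build a two-element cause cycle of frameless exceptions.
        QuietException a = new QuietException("a");
        QuietException b = new QuietException("b");
        a.initCause(b);
        b.initCause(a);
        System.out.print(format(a, 3, 4000));
    }
}
```

The design point is that termination never depends on the stack frames themselves: the depth counter and the length check sit on the outer loop, so a frameless exception or even a cyclic cause chain still produces bounded output.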
This incident highlights that performance testing must cover not only normal traffic but also exceptional scenarios; a system that handles 5,000 QPS under normal load may fail at 2,000 QPS once exceptions start to occur.
Summary:
Memory leaks are often accompanied by CPU spikes, high error rates, and frequent GC.
Obtaining a heap dump and analyzing it alongside the relevant source code is crucial for diagnosing leaks.
If dump analysis does not reveal the cause, additional tools such as jstack should be employed.
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.
