Why Did My Java Service Hit 100% CPU? A Deep Dive into a BouncyCastle Memory Leak
The article walks through a real‑world Java production incident where CPU spiked to 100%, detailing systematic troubleshooting steps, heap analysis with MAT, and the discovery that repeatedly creating BouncyCastleProvider objects caused a memory leak that was fixed by refactoring the code.
1. Problem Discovery
CPU usage on the production machines rose steadily from April 8th and eventually reached 100%, making the service unavailable. A restart restored it only temporarily.
2. Investigation Approach
The possible causes were divided into five directions:
System code issues
Downstream system problems causing an avalanche effect
Sudden surge in upstream calls
Third‑party HTTP request problems
Machine‑level issues
3. Investigation Steps
1. Checked logs – no concentrated errors, so code logic errors were initially ruled out.
2. Contacted downstream systems; their monitoring was normal, eliminating downstream impact.
3. Compared provider interface call volume over seven days – no spike, ruling out business‑side call volume.
4. Inspected TCP status – normal, so third‑party HTTP timeout was excluded.
5. Monitored six machines; all showed rising CPU, indicating no single machine fault.
These steps did not directly locate the root cause.
4. Solution
1. Restarted the five most affected machines to restore service, keeping one machine for analysis.
2. Found the PID of the Tomcat process (384).
3. Examined per-thread CPU usage inside that process with top -Hp 384.
4. Found threads 4430-4433 each consuming about 40% CPU.
5. Converted those thread IDs to hexadecimal (114e, 114f, 1150, 1151), because jstack reports native thread IDs (nid) in hexadecimal.
6. Dumped the Java thread stack: sudo -u tomcat jstack -l 384 > /1.txt.
7. Matched the hexadecimal thread IDs against the nid values in the thread dump and found that the high-CPU threads were GC threads.
8. Dumped the Java heap: sudo -u tomcat jmap -dump:live,format=b,file=/dump201612271310.dat 384.
9. Loaded the heap with MAT and discovered that a javax.crypto.JceSecurity object occupied 95% of memory, pinpointing the issue.
MAT download: http://www.eclipse.org/mat/
10. Examined the reference tree and saw that a very large number of BouncyCastleProvider objects were being retained, indicating misuse in the code.
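Steps 5 to 7 can be sketched in Java as follows. This is a minimal illustration, not part of the original incident tooling: the class name ThreadIdLookup and its methods are hypothetical, and the only thing it assumes about jstack output is that each thread header line carries the native thread ID as nid=0x followed by lowercase hex digits.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: maps decimal thread IDs from `top -Hp` to jstack thread lines.
final class ThreadIdLookup {

    // Converts a decimal thread ID to the hex form used by jstack, e.g. 4430 -> "0x114e".
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    // Returns the dump lines whose nid equals one of the given decimal thread IDs.
    static List<String> findThreadLines(List<String> dumpLines, int... tids) {
        List<String> matches = new ArrayList<>();
        for (int tid : tids) {
            // The trailing \b keeps 0x114e from also matching a longer ID such as 0x114ef.
            Pattern nid = Pattern.compile("\\bnid=" + toNid(tid) + "\\b");
            for (String line : dumpLines) {
                if (nid.matcher(line).find()) {
                    matches.add(line);
                }
            }
        }
        return matches;
    }
}
```

For the dump written in step 6, the lines could be loaded with Files.readAllLines and passed to findThreadLines together with 4430, 4431, 4432 and 4433.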
5. Code Analysis
The problematic code creates a new BouncyCastleProvider for each encryption/decryption operation and passes it to Cipher.getInstance().
Tracing Cipher.getInstance() leads to the JDK's JceSecurity implementation, which keeps two static maps keyed by provider. Entries in verifyingProviders are added and then removed again for each verification, but entries in verificationResults are only ever added.
Because verificationResults is a static map in JceSecurity, every new provider instance adds an entry that stays strongly reachable and is never garbage-collected. Creating a provider per encryption therefore leaks one provider per call.
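The mechanism can be modelled without any crypto library. The sketch below is a simplified stand-in, not the JDK source: the class ProviderCacheLeakSketch, its methods, and the use of plain Object keys are all assumptions made for illustration. It only shows why a static, identity-keyed cache that is never cleared grows by one entry for every new key instance.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Simplified, hypothetical model of a static result cache keyed by provider identity.
final class ProviderCacheLeakSketch {

    // Static and never cleared: every distinct key instance stays reachable.
    private static final Map<Object, Object> VERIFICATION_RESULTS = new IdentityHashMap<>();
    private static final Object VERIFIED = new Object();

    // Models "verify this provider once and remember the result".
    static synchronized void verify(Object provider) {
        if (!VERIFICATION_RESULTS.containsKey(provider)) {
            VERIFICATION_RESULTS.put(provider, VERIFIED);
        }
    }

    static synchronized int cachedEntries() {
        return VERIFICATION_RESULTS.size();
    }

    // Leaky pattern: a fresh "provider" for every operation.
    static int growthWithNewProviderPerCall(int operations) {
        int before = cachedEntries();
        for (int i = 0; i < operations; i++) {
            verify(new Object());
        }
        return cachedEntries() - before;
    }

    // Safe pattern: one "provider" reused for every operation.
    static int growthWithSharedProvider(int operations) {
        int before = cachedEntries();
        Object shared = new Object();
        for (int i = 0; i < operations; i++) {
            verify(shared);
        }
        return cachedEntries() - before;
    }
}
```

In this model, growthWithNewProviderPerCall(n) adds n entries while growthWithSharedProvider(n) adds exactly one, which mirrors the difference between creating a provider per call and reusing a single instance.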
6. Code Improvement
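Store the BouncyCastleProvider in a static field so that the class creates it once and reuses the same instance for every encryption and decryption call, instead of constructing a new provider per operation.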
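A minimal sketch of that pattern follows. The original code is not shown in the article, so the class name SharedProviderCipher and its methods are hypothetical. To keep the example runnable without extra dependencies it uses the JDK's built-in SunJCE provider as an assumed stand-in; in the article's case the static field would instead hold a single new BouncyCastleProvider().

```java
import java.security.GeneralSecurityException;
import java.security.Provider;
import java.security.Security;
import javax.crypto.Cipher;

// Hypothetical holder that creates the provider once and reuses it for every Cipher.
final class SharedProviderCipher {

    // Assumption: "SunJCE" stands in for BouncyCastle here. With BouncyCastle on the
    // classpath this would be: private static final Provider PROVIDER = new BouncyCastleProvider();
    private static final Provider PROVIDER = Security.getProvider("SunJCE");

    static Provider provider() {
        return PROVIDER;
    }

    // Every call passes the same provider instance, so no new provider is created per operation.
    static Cipher newCipher(String transformation) {
        try {
            return Cipher.getInstance(transformation, PROVIDER);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Cipher not available: " + transformation, e);
        }
    }
}
```

Callers would obtain a cipher with, for example, SharedProviderCipher.newCipher("AES/CBC/PKCS5Padding") and initialise it as before. The Cipher objects are still created per call; only the provider is shared.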
7. Summary
When a production issue occurs, investigate systematically:
Check logs.
Check CPU usage.
Check TCP status.
Inspect Java threads with jstack.
Inspect Java heap with jmap.
Analyze the heap with MAT to find non‑collectable objects.
Source: https://www.cnblogs.com/kingszelda/p/9034191.html
ITFLY8 Architecture Home