How I Rescued a Critical Java Service from 100% CPU: A Step‑by‑Step Debugging Guide
When a midnight CPU alarm threatened a core Java service, I raced through system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and Prometheus alerts, ultimately restoring performance and highlighting the importance of proactive monitoring and technical debt management.
1. Initial Diagnosis: Quick System Check
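The check in this step boils down to comparing the load average with the number of cores. A minimal Java sketch of that comparison follows. The class `LoadCheck` and the helper `isOverloaded` are illustrative names, not part of the original incident tooling, and `getSystemLoadAverage()` returns a negative value on platforms that do not report a load average.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class LoadCheck {
    /** True when a reported load average exceeds the number of cores. */
    static boolean isOverloaded(double loadAverage, int cores) {
        // A negative load average means "not available" and is never treated as overload.
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // 1-minute average, negative if unavailable
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, isOverloaded(load, cores));
    }
}
```

This only mirrors the first glance at top. It says nothing about which process is responsible.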
I logged into the server and ran top, which showed CPU usage near 100% and a load average far exceeding the number of cores.
$ top
Next, I used htop to see detailed process information and identified several Java processes consuming most of the CPU.
$ htop
2. JVM‑Level Analysis: Finding Hot Methods
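One pattern this section relies on is spotting many RUNNABLE threads that share the same call stack. As a rough in-process analogue of reading a jstack dump, the sketch below groups RUNNABLE threads by their top stack frame. The class and method names are hypothetical. The JVM also reports threads blocked in native I/O as RUNNABLE, so treat the output as a hint rather than a profile.

```java
import java.util.Map;
import java.util.TreeMap;

final class RunnableThreadSummary {
    /** Counts RUNNABLE threads by the class and method at the top of their stack. */
    static Map<String, Integer> countRunnableByTopFrame(Map<Thread, StackTraceElement[]> stacks) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : stacks.entrySet()) {
            StackTraceElement[] frames = entry.getValue();
            // Skip waiting or blocked threads, and threads with no Java frames.
            if (entry.getKey().getState() != Thread.State.RUNNABLE || frames.length == 0) {
                continue;
            }
            String top = frames[0].getClassName() + "." + frames[0].getMethodName();
            counts.merge(top, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print "<count>  <top frame>" for every group of RUNNABLE threads.
        countRunnableByTopFrame(Thread.getAllStackTraces())
                .forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

A frame that many threads share is a candidate for closer inspection with a profiler.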
After confirming the issue was in the Java application, I inspected the JVM. The jstat command revealed frequent Full GC cycles.
$ jstat -gcutil [PID] 1000 10
I generated a thread dump with jstack and saw many threads in RUNNABLE state executing similar call stacks.
$ jstack [PID] > thread_dump.txt
To pinpoint the hot method, I ran async-profiler and produced a flame graph, which highlighted a custom sorting algorithm as the main CPU consumer.
$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
3. Application‑Layer Optimization: Refactoring the Algorithm
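The custom sort itself is not shown in this article. The sketch below therefore uses a hypothetical insertion sort as a stand-in for an algorithm that is fine on small inputs but scales quadratically, and puts it next to the parallel-stream replacement described in this section. `Data` here is a minimal assumed type with an integer key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

final class SortComparison {
    /** Minimal stand-in for the article's Data type (assumed shape). */
    static final class Data {
        private final int key;
        Data(int key) { this.key = key; }
        int getKey() { return key; }
    }

    /** Hypothetical hand-rolled sort: O(n^2) element moves in the worst case. */
    static List<Data> insertionSort(List<Data> input) {
        List<Data> out = new ArrayList<>(input);
        for (int i = 1; i < out.size(); i++) {
            Data current = out.get(i);
            int j = i - 1;
            // Shift larger elements one slot to the right.
            while (j >= 0 && out.get(j).getKey() > current.getKey()) {
                out.set(j + 1, out.get(j));
                j--;
            }
            out.set(j + 1, current);
        }
        return out;
    }

    /** Parallel-stream sort, as in the refactored code. */
    static List<Data> parallelSort(List<Data> input) {
        return input.parallelStream()
                .sorted(Comparator.comparingInt(Data::getKey))
                .collect(Collectors.toList());
    }

    /** Extracts the keys so two orderings can be compared. */
    static List<Integer> keys(List<Data> rows) {
        return rows.stream().map(Data::getKey).collect(Collectors.toList());
    }
}
```

Comparing `keys(insertionSort(rows))` with `keys(parallelSort(rows))` on sample data is a cheap way to confirm that a replacement sort preserves the ordering before it ships.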
Discovering that the custom sort was designed for small data sets, I rewrote it using Java 8 parallel streams.
List<Data> sortedData = data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
I also added a cache to avoid repeated calculations.
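@Cacheable("sortedData") // Spring cache annotation; takes effect only when caching is enabled (e.g. @EnableCaching)
public List<Data> getSortedData() {
    // Optimized sorting logic: the same parallel-stream sort as above,
    // assuming `data` is a field of this class.
    return data.parallelStream()
            .sorted(Comparator.comparing(Data::getKey))
            .collect(Collectors.toList());
}
4. Database Optimization: Index and Query Improvements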
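To illustrate why the index in this section removes the full table scan, here is a toy in-memory analogy. It is not how a database stores indexes (those are typically B-trees), and `Row`, `scan`, `buildIndex`, and `lookup` are hypothetical names chosen for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StatusIndexDemo {
    /** Assumed minimal row shape: an id plus the status column. */
    static final class Row {
        final int id;
        final String status;
        Row(int id, String status) { this.id = id; this.status = status; }
    }

    /** Full scan: every row is inspected, like a query with no usable index. */
    static List<Integer> scan(List<Row> table, String status) {
        List<Integer> ids = new ArrayList<>();
        for (Row row : table) {
            if (row.status.equals(status)) {
                ids.add(row.id);
            }
        }
        return ids;
    }

    /** Built once up front: status -> ids, loosely analogous to CREATE INDEX on status. */
    static Map<String, List<Integer>> buildIndex(List<Row> table) {
        Map<String, List<Integer>> index = new HashMap<>();
        for (Row row : table) {
            index.computeIfAbsent(row.status, s -> new ArrayList<>()).add(row.id);
        }
        return index;
    }

    /** Indexed lookup: goes straight to the matching ids without touching other rows. */
    static List<Integer> lookup(Map<String, List<Integer>> index, String status) {
        return index.getOrDefault(status, Collections.emptyList());
    }
}
```

Both paths return the same ids. The difference is that `scan` pays for every row on every query, while `lookup` pays the full pass only once when the index is built.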
While investigating, I found an inefficient query that caused a full table scan. Using EXPLAIN confirmed the problem.
EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';
I created an index on the status column and switched part of the ORM code to native SQL for better performance.
CREATE INDEX idx_status ON large_table(status);
@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
5. Deployment Optimization: Resource Isolation with Docker
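When applying the limits described in this section, it helps to confirm what the JVM actually sees from inside the container. The sketch below prints the processor count and maximum heap reported by the runtime. It assumes a container-aware JDK (container support is on by default on Linux since JDK 10), and `ContainerView` is a hypothetical helper name.

```java
final class ContainerView {
    /** Summarizes the CPU count and maximum heap size visible to this JVM. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long maxHeapMiB = rt.maxMemory() / (1024 * 1024);
        return "cpus=" + rt.availableProcessors() + " maxHeapMiB=" + maxHeapMiB;
    }

    public static void main(String[] args) {
        // Run inside the container and compare with the configured limits.
        System.out.println(describe());
    }
}
```

If the reported heap is larger than the container's memory limit, the heap flags and the limit disagree, and the container can be killed for exceeding its memory before the JVM ever reports an out-of-memory error.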
To prevent a single service from affecting the whole system, I containerized the application.
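FROM openjdk:11-jre-slim
COPY target/myapp.jar /app.jar
# Size the heap from the container's memory limit; a fixed -Xmx2g would exceed the 512M limit set below.
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", "-jar", "/app.jar"]
Using Docker Compose, I limited CPU and memory for the container.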
version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
6. Monitoring & Alerting: Prevent Future Outages
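The alert rule in this section turns an idle-time counter into a usage percentage. As a simplified worked example, the sketch below approximates Prometheus `rate()` as counter growth divided by window length and then applies the same arithmetic as the rule's expression. `CpuAlertMath` and its methods are hypothetical names, and the real `rate()` also handles counter resets and extrapolation.

```java
final class CpuAlertMath {
    /**
     * Mirrors 100 - (avg(rate(idle[window])) * 100).
     * idleSecondsDelta[i] is how much core i's idle counter grew during the window.
     */
    static double cpuUsagePercent(double[] idleSecondsDelta, double windowSeconds) {
        double sumOfIdleRates = 0.0;
        for (double delta : idleSecondsDelta) {
            sumOfIdleRates += delta / windowSeconds; // fraction of the window this core was idle
        }
        double averageIdle = sumOfIdleRates / idleSecondsDelta.length;
        return 100.0 - averageIdle * 100.0;
    }

    /** True when usage is strictly above the threshold, as in "> 80". */
    static boolean breaches(double usagePercent, double thresholdPercent) {
        return usagePercent > thresholdPercent;
    }
}
```

For four cores that were idle for 30, 60, 45, and 45 seconds of a 300-second window, the average idle fraction is 0.15, so usage is about 85% and the condition holds. The rule's `for: 5m` clause then requires the condition to stay true for five minutes before the alert fires.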
I upgraded the monitoring stack with Prometheus and Grafana, adding a smarter alert rule for high CPU usage.
- alert: HighCPUUsage
  expr: 100 - (avg by(instance)(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
Conclusion
After nearly four hours of intensive work, the system stabilized with CPU usage below 30% and response times returning to millisecond levels. The incident underscored the need for regular code reviews, performance testing, and robust monitoring to avoid technical debt pitfalls.
ITPUB
