How I Rescued a Critical Java Service from 100% CPU: A Step‑by‑Step Postmortem
When a midnight CPU alarm threatened a core Java service, I walked through rapid system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring CPU usage back below 30% and restore service stability.
Introduction
At 4 am, an "online CPU alert" woke me up. A core system was approaching 100% CPU usage, which risked a degraded user experience, data loss, and performance‑related KPI penalties.
1. Initial Diagnosis – Quick Problem Location
I logged into the server and ran top, which showed CPU usage close to 100% and a load average far exceeding the number of CPU cores.
$ top
Next, I used htop to view detailed process information and discovered several Java processes consuming most of the CPU.
$ htop
2. JVM‑Level Analysis – Finding Hot Methods
To investigate the Java side, I first checked garbage‑collection activity with jstat, sampling every 1000 ms for 10 samples:
$ jstat -gcutil [PID] 1000 10
The output revealed frequent Full GCs, a possible cause of the high CPU load.
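When reading jstat -gcutil output, the useful signal is how quickly the FGC column (the cumulative Full GC count) grows between samples. The following is a minimal sketch, not the tooling used during the incident: a small Java helper that parses captured jstat output and reports the increase per interval. The class name JstatFullGcDelta and the idea of piping jstat into it are assumptions made for illustration. The FGC column is located by its header name because the set of columns differs between JDK versions.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper: name and usage are illustrative, not from the original article.
class JstatFullGcDelta {

    /**
     * Takes the lines printed by `jstat -gcutil` (header line first) and returns
     * how much the FGC column grew between each pair of consecutive samples.
     */
    static List<Long> fullGcDeltas(List<String> jstatLines) {
        List<Long> deltas = new ArrayList<>();
        if (jstatLines.isEmpty()) {
            return deltas;
        }
        // Find the FGC column by name instead of hard-coding its position.
        List<String> header = Arrays.asList(jstatLines.get(0).trim().split("\\s+"));
        int fgc = header.indexOf("FGC");
        if (fgc < 0) {
            throw new IllegalArgumentException("no FGC column in header");
        }
        long previous = -1;
        for (String line : jstatLines.subList(1, jstatLines.size())) {
            if (line.trim().isEmpty()) {
                continue;
            }
            long current = Long.parseLong(line.trim().split("\\s+")[fgc]);
            if (previous >= 0) {
                deltas.add(current - previous);
            }
            previous = current;
        }
        return deltas;
    }

    public static void main(String[] args) throws IOException {
        // Reads jstat output from standard input.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            List<String> lines = in.lines().collect(Collectors.toList());
            System.out.println("Full GCs per interval: " + fullGcDeltas(lines));
        }
    }
}
```

On Java 11 or later this could be run as a single source file, for example by piping the jstat command above into `java JstatFullGcDelta.java`. A run of non-zero values means a Full GC is happening in almost every sampling interval.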
I then generated a thread dump using jstack:
$ jstack [PID] > thread_dump.txt
The dump showed many threads in RUNNABLE state executing similar call stacks.
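Spotting "many RUNNABLE threads with similar call stacks" by eye gets tedious in a large dump. As a rough sketch of how that check could be automated, the helper below counts the top stack frame of every RUNNABLE thread in a jstack dump. The class name RunnableFrameCounter is hypothetical, and the parsing assumes the usual jstack layout: a `java.lang.Thread.State:` line followed by frames that start with `at`.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: name and usage are illustrative, not from the original article.
class RunnableFrameCounter {

    /** Counts the top stack frame of every RUNNABLE thread in a jstack dump. */
    static Map<String, Integer> countTopFrames(List<String> dumpLines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean expectTopFrame = false;
        for (String raw : dumpLines) {
            String line = raw.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                // Only the frame right after a RUNNABLE state line is of interest.
                expectTopFrame = line.contains("RUNNABLE");
            } else if (expectTopFrame && line.startsWith("at ")) {
                counts.merge(line.substring(3), 1, Integer::sum);
                expectTopFrame = false;
            } else if (line.isEmpty()) {
                // A blank line ends the current thread's entry.
                expectTopFrame = false;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to the dump, e.g. thread_dump.txt
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        countTopFrames(lines).entrySet().stream()
                .sorted((a, b) -> Integer.compare(b.getValue(), a.getValue()))
                .limit(10)
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

Running it against thread_dump.txt prints the ten most common top frames among RUNNABLE threads. A single method dominating that list is a strong hint about where to point the profiler next.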
For deeper insight, I ran async‑profiler to produce a flame graph:
$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
The flame graph highlighted a custom sorting algorithm that was hogging CPU cycles.
3. Application‑Level Optimization – Refactoring the Algorithm
I rewrote the inefficient sorting logic using Java 8 parallel streams:
List<Data> sortedData = data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
Additionally, I added a cache to avoid repeated computation:
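@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic (the parallel-stream sort shown above);
    // `data` is assumed to be a field of the enclosing Spring bean
    return data.parallelStream()
            .sorted(Comparator.comparing(Data::getKey))
            .collect(Collectors.toList());
}
4. Database Optimization – Index and Query Improvements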
While investigating, I found a slow query that caused a full table scan. Using EXPLAIN I confirmed the issue:
EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';
I created an index on the status column and switched to a more efficient native SQL query:
CREATE INDEX idx_status ON large_table(status);
@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
5. Deployment Optimization – Resource Isolation
To prevent a single service from affecting the whole system, I containerized the application with Docker:
FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]
Using Docker Compose, I limited CPU and memory resources:
version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          # the container limit must exceed the JVM heap; -Xmx2g in the Dockerfile does not fit in 512M
          memory: 512M
6. Monitoring & Alerting – Preventing Future Incidents
I upgraded the monitoring stack to Prometheus and Grafana and added a smarter alert rule for high CPU usage:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
Conclusion
After nearly four hours of investigation and remediation, CPU usage dropped below 30% and response times returned to milliseconds. The incident underscored the importance of regular code reviews, performance testing, and robust monitoring to catch and resolve performance bottlenecks before they impact business outcomes.
dbaplus Community
