How I Rescued a Critical Java Service from 100% CPU: A Step‑by‑Step Postmortem

When a 4 a.m. CPU alarm threatened a core Java service, I walked through rapid system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring CPU usage back below 30% and restore service stability.

dbaplus Community

Introduction

At 4 a.m. an "online CPU alert" woke me up: a core system was nearing 100% CPU usage, which threatened a degraded user experience, possible data loss, and performance-related KPI penalties.

1. Initial Diagnosis – Quick Problem Location

I logged into the server and ran top, which showed CPU usage close to 100% and a load average far exceeding the number of CPU cores.

$ top

Next, I used htop to view detailed process information and discovered several Java processes consuming most of the CPU.

$ htop
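The "load average versus core count" comparison that top makes visible can also be read from inside a JVM. The following is a minimal sketch, not part of the original investigation; the class name LoadCheck and its threshold (load greater than cores) are assumptions chosen for illustration.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

final class LoadCheck {
    /** True when the reported load average exceeds the number of cores. */
    static boolean isOverloaded(double loadAverage, int cores) {
        // A negative load average means the platform does not report one.
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // last-minute average, negative if unavailable
        int cores = os.getAvailableProcessors();
        System.out.println("load=" + load + " cores=" + cores
                + " overloaded=" + isOverloaded(load, cores));
    }
}
```

A load average well above the core count means runnable work is queuing for CPU time, which matches the symptom described above.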

2. JVM‑Level Analysis – Finding Hot Methods

To investigate the Java side, I first checked garbage-collection activity with jstat, sampling once per second for ten samples:

$ jstat -gcutil [PID] 1000 10

The output revealed frequent Full GCs, a possible cause of the high CPU load.
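For a rough in-process view of the same signal, the JVM exposes accumulated GC time through its management beans. The sketch below is an assumption-laden illustration rather than what was run during the incident: the class name GcOverhead is hypothetical, and the reported time is only an approximation of GC cost, especially with concurrent collectors.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcOverhead {
    /** Percentage of a sampling interval spent in GC, from two readings. */
    static double gcTimePercent(long gcMillisBefore, long gcMillisAfter, long intervalMillis) {
        if (intervalMillis <= 0) return 0.0;
        return 100.0 * (gcMillisAfter - gcMillisBefore) / intervalMillis;
    }

    /** Accumulated collection time across all collectors, in milliseconds. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();   // -1 when undefined for this collector
            if (t > 0) total += t;
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long before = totalGcMillis();
        Thread.sleep(1000);                    // same 1000 ms interval as the jstat call
        long after = totalGcMillis();
        System.out.println("GC share of last second: "
                + gcTimePercent(before, after, 1000) + "%");
    }
}
```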

I then generated a thread dump using jstack:

$ jstack [PID] > thread_dump.txt

The dump showed many threads in RUNNABLE state executing similar call stacks.
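Spotting "many RUNNABLE threads with similar stacks" by eye is tedious in a large dump. As a hedged sketch of the same idea, the snippet below groups the current JVM's RUNNABLE threads by their top stack frame; the class name RunnableHotspots is an assumption, and grouping on only the top frame is a simplification.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.util.Map;
import java.util.TreeMap;

final class RunnableHotspots {
    /** Counts RUNNABLE threads by the class and method at the top of their stack. */
    static Map<String, Integer> countByTopFrame(ThreadInfo[] threads) {
        Map<String, Integer> counts = new TreeMap<>();
        for (ThreadInfo t : threads) {
            if (t == null || t.getThreadState() != Thread.State.RUNNABLE) continue;
            StackTraceElement[] stack = t.getStackTrace();
            if (stack.length == 0) continue;
            String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(frame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        ThreadInfo[] dump = ManagementFactory.getThreadMXBean().dumpAllThreads(false, false);
        countByTopFrame(dump).forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

A frame with a high count is a candidate hot spot, which is then worth confirming with a profiler.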

For deeper insight, I ran async-profiler for 30 seconds to produce a flame graph:

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]

The flame graph highlighted a custom sorting algorithm that was hogging CPU cycles.

3. Application‑Level Optimization – Refactoring the Algorithm

I rewrote the inefficient sorting logic using Java 8 parallel streams:

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
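To make the snippet above easier to try in isolation, here is a self-contained sketch. The Data class is a hypothetical stand-in with a single integer key, since the article does not show the real type, and the isSorted helper exists only to check the result.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

final class ParallelSortDemo {
    /** Hypothetical stand-in for the article's Data type. */
    static final class Data {
        private final int key;
        Data(int key) { this.key = key; }
        int getKey() { return key; }
    }

    /** Sorts by key using a parallel stream, as in the article's snippet. */
    static List<Data> sortParallel(List<Data> data) {
        return data.parallelStream()
                .sorted(Comparator.comparing(Data::getKey))
                .collect(Collectors.toList());
    }

    /** True when keys are in non-decreasing order. */
    static boolean isSorted(List<Data> data) {
        for (int i = 1; i < data.size(); i++) {
            if (data.get(i - 1).getKey() > data.get(i).getKey()) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        List<Data> input = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) input.add(new Data(random.nextInt()));
        System.out.println("sorted=" + isSorted(sortParallel(input)));
    }
}
```

Parallel streams spread the work across the common fork-join pool, so they shorten wall-clock time rather than total CPU consumed; on an already saturated host it is worth measuring before and after.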

Additionally, I added a cache to avoid repeated computation:

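@Cacheable("sortedData")
public List<Data> getSortedData() {
    // Runs only on a cache miss; later calls return the cached list.
    return data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}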
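Spring's @Cacheable only takes effect when caching is enabled and the method is called through the Spring proxy. To show the underlying idea without the framework, here is a minimal memoization sketch; the CachedValue class and its invalidate method are hypothetical names, not part of Spring or of the original service.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class CachedValue<T> implements Supplier<T> {
    private final Supplier<T> loader;
    private volatile T value;          // null means "not computed yet"

    CachedValue(Supplier<T> loader) { this.loader = loader; }

    /** Computes the value on first use and reuses it afterwards. */
    @Override
    public T get() {
        T v = value;
        if (v == null) {
            synchronized (this) {
                v = value;
                if (v == null) {       // re-check under the lock
                    v = loader.get();  // a null result is not cached
                    value = v;
                }
            }
        }
        return v;
    }

    /** Must be called when the source data changes, or readers see stale results. */
    void invalidate() { value = null; }

    public static void main(String[] args) {
        AtomicInteger loads = new AtomicInteger();
        CachedValue<String> cache = new CachedValue<>(() -> "result-" + loads.incrementAndGet());
        cache.get();
        cache.get();
        System.out.println("loader calls: " + loads.get());
    }
}
```

Whichever mechanism is used, a cached sort result needs an eviction or invalidation path tied to updates of the underlying data.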

4. Database Optimization – Index and Query Improvements

While investigating, I found a slow query that caused a full table scan. Using EXPLAIN I confirmed the issue:

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

I created an index on the status column and switched to a more efficient native SQL query:

CREATE INDEX idx_status ON large_table(status);

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
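As a loose in-memory analogy for why the index helps, the sketch below contrasts checking every row with a precomputed lookup by status. It is only an illustration: the Row class and method names are hypothetical, a real database index is typically a B-tree rather than a hash map, and on a column with few distinct values the optimizer may still prefer a scan, so re-running EXPLAIN after creating the index is the real check.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class StatusIndexDemo {
    /** Hypothetical row with just the columns needed here. */
    static final class Row {
        final long id;
        final String status;
        Row(long id, String status) { this.id = id; this.status = status; }
    }

    /** Full scan: inspects every row, like a query with no usable index. */
    static List<Row> fullScan(List<Row> table, String status) {
        List<Row> hits = new ArrayList<>();
        for (Row r : table) {
            if (r.status.equals(status)) hits.add(r);
        }
        return hits;
    }

    /** Builds a status -> rows lookup once, loosely analogous to CREATE INDEX. */
    static Map<String, List<Row>> buildIndex(List<Row> table) {
        Map<String, List<Row>> index = new HashMap<>();
        for (Row r : table) {
            index.computeIfAbsent(r.status, k -> new ArrayList<>()).add(r);
        }
        return index;
    }

    /** Indexed lookup: jumps straight to the matching rows. */
    static List<Row> indexedLookup(Map<String, List<Row>> index, String status) {
        return index.getOrDefault(status, Collections.emptyList());
    }
}
```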

5. Deployment Optimization – Resource Isolation

To prevent a single service from affecting the whole system, I containerized the application with Docker:

FROM openjdk:11-jre-slim
COPY target/myapp.jar /app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]

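Using Docker Compose, I limited CPU and memory resources. The memory limit has to be larger than the JVM's maximum heap, so the 512M below and the -Xmx2g in the Dockerfile need to be reconciled before deploying: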

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
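Once limits are applied, it helps to confirm what the JVM inside the container actually sees. The following is a small sketch under the assumption of a container-aware JVM (Java 10 or later, or a recent Java 8 update), which derives these values from the container's limits; the class and method names are hypothetical.

```java
final class ContainerResources {
    private static final long MIB = 1024L * 1024L;

    /** Formats the CPU count and maximum heap the JVM reports. */
    static String describe(int cpus, long maxHeapBytes) {
        return "cpus=" + cpus + " maxHeapMiB=" + (maxHeapBytes / MIB);
    }

    /**
     * The heap must be smaller than the container memory limit, because
     * metaspace, thread stacks, and other native memory also count against it.
     */
    static boolean heapFitsLimit(long maxHeapBytes, long limitBytes) {
        return maxHeapBytes < limitBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.println(describe(rt.availableProcessors(), rt.maxMemory()));
        // Values taken from the Dockerfile (-Xmx2g) and the Compose file (512M).
        System.out.println("2g heap fits 512M limit: "
                + heapFitsLimit(2048 * MIB, 512 * MIB));
    }
}
```

Running this inside the container shows whether the configured limits and heap settings agree with each other.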

6. Monitoring & Alerting – Preventing Future Incidents

I upgraded the monitoring stack to Prometheus and Grafana and added a smarter alert rule for high CPU usage:

- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
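The arithmetic inside the alert expression can be worked through by hand. The sketch below mirrors it under simplified assumptions: rate(...) is approximated as the idle-seconds increase divided by the window length, avg by(instance) as a plain mean across cores, and the class name CpuAlertMath is hypothetical.

```java
final class CpuAlertMath {
    /**
     * usage% = 100 - (average idle rate * 100), where the idle rate of a core
     * is the idle seconds it accumulated per second of the window.
     */
    static double usagePercent(double[] idleStart, double[] idleEnd, double windowSeconds) {
        double sum = 0.0;
        for (int i = 0; i < idleStart.length; i++) {
            sum += (idleEnd[i] - idleStart[i]) / windowSeconds;
        }
        double avgIdleRate = sum / idleStart.length;
        return 100.0 - avgIdleRate * 100.0;
    }

    /** Same threshold as the rule: strictly above 80%. */
    static boolean exceedsThreshold(double usagePercent) {
        return usagePercent > 80.0;
    }

    public static void main(String[] args) {
        // Two cores over a 5-minute (300 s) window, each idle for 30 s.
        double usage = usagePercent(new double[] {1000, 2000}, new double[] {1030, 2030}, 300);
        System.out.println("usage=" + usage + " exceeds=" + exceedsThreshold(usage));
    }
}
```

The for: 5m clause adds a second condition on top of this: the expression must stay above the threshold for five minutes before the alert fires, which filters out short spikes.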

Conclusion

After nearly four hours of investigation and remediation, CPU usage dropped below 30% and response times returned to milliseconds. The incident underscored the importance of regular code reviews, performance testing, and robust monitoring to catch and resolve performance bottlenecks before they impact business outcomes.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
