How I Rescued a Critical Java Service from 100% CPU: A Step‑by‑Step Debugging Guide

When a midnight CPU alarm threatened a core Java service, I raced through system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and Prometheus alerts, ultimately restoring performance and highlighting the importance of proactive monitoring and technical debt management.


1. Initial Diagnosis: Quick System Check

I logged into the server and ran top, which showed CPU usage near 100% and a load average far exceeding the number of cores.

$ top

Next, I used htop to see detailed process information and identified several Java processes consuming most of the CPU.

$ htop
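
Before digging into the JVM, the same numbers can be sanity-checked from inside the process. The sketch below is my own addition rather than part of the original investigation; it uses the HotSpot-specific com.sun.management.OperatingSystemMXBean (the CpuCheck class name is hypothetical) to print the host load average alongside this JVM's own CPU share, mirroring what top and htop show from the outside.

import java.lang.management.ManagementFactory;

import com.sun.management.OperatingSystemMXBean;

public class CpuCheck {
    public static void main(String[] args) {
        // HotSpot's OperatingSystemMXBean subinterface exposes per-process CPU load
        OperatingSystemMXBean os =
                (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();
        System.out.printf("system load average: %.2f%n", os.getSystemLoadAverage());
        System.out.printf("this JVM's CPU load: %.1f%%%n", os.getProcessCpuLoad() * 100);
        System.out.printf("available cores: %d%n", Runtime.getRuntime().availableProcessors());
    }
}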

2. JVM‑Level Analysis: Finding Hot Methods

After confirming the issue was in the Java application, I inspected the JVM. The jstat command revealed frequent Full GC cycles.

$ jstat -gcutil [PID] 1000 10

I generated a thread dump with jstack and saw many threads in RUNNABLE state executing similar call stacks.

$ jstack [PID] > thread_dump.txt

To pinpoint the hot method, I ran async-profiler and produced a flame graph, which highlighted a custom sorting algorithm as the main CPU consumer.

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
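
If a profiler cannot be attached, the JDK itself can produce a rough ranking of CPU-hungry threads. The class below is an assumed sketch, not part of the original debugging session, and must run inside the target JVM (for example behind an admin endpoint); it lists the threads with the most accumulated CPU time so they can be matched against the stacks in thread_dump.txt.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

public class TopCpuThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Rank live threads by total CPU time accumulated so far (nanoseconds)
        Arrays.stream(mx.dumpAllThreads(false, false))
              .sorted(Comparator.comparingLong(
                      (ThreadInfo t) -> mx.getThreadCpuTime(t.getThreadId())).reversed())
              .limit(5)
              .forEach(t -> System.out.printf("%-40s %d ms cpu%n",
                      t.getThreadName(),
                      mx.getThreadCpuTime(t.getThreadId()) / 1_000_000));
    }
}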

3. Application‑Layer Optimization: Refactoring the Algorithm

Discovering that the custom sort was designed for small data sets, I rewrote it using Java 8 parallel streams.

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
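
One caveat with this change: parallel streams execute on the JVM-wide common ForkJoinPool, so a large sort can starve every other parallel operation in the process. The fragment below is a variant I am adding for illustration, not the fix from the article; it submits the sort to a dedicated java.util.concurrent.ForkJoinPool, relying on the widely used (though not formally specified) behavior that a parallel stream runs in the pool it is invoked from.

private final ForkJoinPool sortPool = new ForkJoinPool(4);

public List<Data> sortWithDedicatedPool(List<Data> data) {
    // Invoking the terminal operation from inside sortPool keeps the work off the common pool
    return sortPool.submit(() ->
            data.parallelStream()
                .sorted(Comparator.comparing(Data::getKey))
                .collect(Collectors.toList()))
        .join();
}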

I also added a cache to avoid repeated calculations.

@Cacheable("sortedData")
public List<Data> getSortedData() {
    // Cache the result of the optimized sort so repeated calls skip the computation
    return data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}
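
Worth noting: @Cacheable is a no-op unless caching is switched on and a CacheManager bean is present. The article does not show that wiring, so here is a minimal, assumed configuration using Spring's simple in-memory ConcurrentMapCacheManager (the CacheConfig class name is mine).

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.concurrent.ConcurrentMapCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class CacheConfig {

    @Bean
    public CacheManager cacheManager() {
        // In-memory cache backing the "sortedData" cache used by @Cacheable above
        return new ConcurrentMapCacheManager("sortedData");
    }
}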

4. Database Optimization: Index and Query Improvements

While investigating, I found an inefficient query that caused a full table scan. Using EXPLAIN confirmed the problem.

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

I created an index on the status column and switched part of the ORM code to native SQL for better performance.

CREATE INDEX idx_status ON large_table(status);

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
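
The query still uses SELECT *, which drags every column of large_table across the wire even when callers need only a couple of fields. A possible follow-up, not taken from the original article, is a Spring Data interface projection combined with a derived query; the getters below are hypothetical and would need to match the real entity's columns.

public interface LargeTableSummary {
    Long getId();       // hypothetical column
    String getStatus();
}

// Derived query that fetches only the projected columns
List<LargeTableSummary> findSummariesByStatus(String status);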

5. Deployment Optimization: Resource Isolation with Docker

To prevent a single service from affecting the whole system, I containerized the application.

FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]

Using Docker Compose, I limited CPU and memory for the container. (One thing to keep consistent: the memory limit has to leave room for the -Xmx heap set in the image, otherwise the kernel's OOM killer will terminate the JVM.)

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M

6. Monitoring & Alerting: Prevent Future Outages

I upgraded the monitoring stack with Prometheus and Grafana, adding a smarter alert rule for high CPU usage.

- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
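
The rule above alerts on host-level CPU from node_exporter. To alert on the Java service itself, JVM and process metrics can be scraped as well; the sketch below assumes Micrometer with its Prometheus registry on the classpath, which the original article does not mention.

import io.micrometer.core.instrument.binder.jvm.JvmGcMetrics;
import io.micrometer.core.instrument.binder.system.ProcessorMetrics;
import io.micrometer.prometheus.PrometheusConfig;
import io.micrometer.prometheus.PrometheusMeterRegistry;

public class MetricsBootstrap {
    public static void main(String[] args) {
        PrometheusMeterRegistry registry = new PrometheusMeterRegistry(PrometheusConfig.DEFAULT);
        new JvmGcMetrics().bindTo(registry);      // GC pause and allocation metrics
        new ProcessorMetrics().bindTo(registry);  // process and system CPU usage
        // Text exposition format, normally served on an HTTP endpoint such as /metrics
        System.out.println(registry.scrape());
    }
}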

Conclusion

After nearly four hours of intensive work, the system stabilized with CPU usage below 30% and response times returning to millisecond levels. The incident underscored the need for regular code reviews, performance testing, and robust monitoring to avoid technical debt pitfalls.
