How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Debugging Guide

When a midnight CPU alarm triggered, I logged into the server, identified runaway Java processes, profiled the JVM, refactored a costly sorting algorithm, added database indexes, containerized the service, and set up Prometheus alerts, ultimately reducing CPU usage below 30% and restoring millisecond response times.

Liangxu Linux

May 18, 2025

1. Initial Diagnosis: Quickly Locate the Issue

At 4 a.m. an "online CPU alarm" alert fired. After logging into the server, top showed CPU usage near 100% and a load average far above the number of cores. htop then showed that several Java processes were consuming most of the CPU.
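
The same load-versus-cores comparison can be made from inside a JVM. The following is a minimal sketch using the standard java.lang.management API; the class name LoadCheck and the rule "load average greater than core count" are illustrative assumptions, not part of the original incident tooling.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

class LoadCheck {
    // True when the 1-minute load average exceeds the number of cores.
    // A negative load average means the platform does not report one.
    static boolean isOverloaded(double loadAverage, int cores) {
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative if unavailable on this platform
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, isOverloaded(load, cores));
    }
}
```

A load average well above the core count means runnable work is queuing for CPU time, which matches what top reported here.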

2. JVM‑Level Analysis: Find Hot Methods

Sampling GC statistics once per second, ten times, showed frequent Full GCs, a sign of GC pressure:

$ jstat -gcutil [PID] 1000 10

A thread dump contained many RUNNABLE threads executing similar call stacks:

$ jstack [PID] > thread_dump.txt

To pinpoint the hot method, async‑profiler was run for 30 seconds:

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]

The flame graph highlighted a custom sorting algorithm that dominated CPU time.
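
Spotting "many RUNNABLE threads with similar call stacks" by eye gets tedious on large dumps. The sketch below counts the topmost frame of every RUNNABLE thread in a jstack-style text dump. HotFrames is a hypothetical helper, and the parsing assumes the usual layout of a "java.lang.Thread.State:" line followed by "at ..." frames.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class HotFrames {
    // Counts the top stack frame of every RUNNABLE thread in a jstack-style dump.
    static Map<String, Integer> countTopFrames(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean expectFrame = false;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                // Only RUNNABLE threads are burning CPU.
                expectFrame = line.contains("RUNNABLE");
            } else if (expectFrame && line.startsWith("at ")) {
                counts.merge(line.substring(3), 1, Integer::sum);
                expectFrame = false; // keep only the topmost frame of each thread
            } else if (line.isEmpty()) {
                expectFrame = false; // blank line ends the current thread's block
            }
        }
        return counts;
    }
}
```

A frame that appears many times in the result is a candidate hot spot. The profiler's flame graph remains the more reliable evidence, since a thread dump is a single snapshot.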

3. Application‑Level Optimization: Refactor the Algorithm

The identified sorting routine was designed for small data sets and became a bottleneck at scale. It was rewritten using Java 8 parallel streams:

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
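
For readers who want to run the snippet, here is a self-contained version. The Data class is a hypothetical stand-in with a single String key, since the article does not show the real type.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

class SortDemo {
    // Hypothetical stand-in for the article's Data type.
    static final class Data {
        private final String key;
        Data(String key) { this.key = key; }
        String getKey() { return key; }
    }

    // Returns a new list sorted by key; the input list is left unmodified.
    static List<Data> sortByKey(List<Data> data) {
        return data.parallelStream()
                .sorted(Comparator.comparing(Data::getKey))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Data> input = Arrays.asList(new Data("c"), new Data("a"), new Data("b"));
        sortByKey(input).forEach(d -> System.out.println(d.getKey()));
    }
}
```

Parallel streams run on the shared common ForkJoinPool and tend to pay off only for large lists. On a host that is already CPU-saturated, it is worth measuring against a plain stream() before committing to the parallel version.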

A cache was added to avoid repeated work:

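@Cacheable("sortedData")
public List<Data> getSortedData() {
    // Runs only on a cache miss; later calls are served from the "sortedData" cache
    return data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}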
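
To make the effect of the cache concrete, the following plain-JDK sketch shows the idea behind it: compute once, reuse until invalidated. It is not how Spring implements @Cacheable, and SortedCache is a hypothetical class introduced only for illustration.

```java
import java.util.List;
import java.util.function.Supplier;

class SortedCache<T> {
    private final Supplier<List<T>> loader; // the expensive sort
    private List<T> cached;                 // guarded by "this"

    SortedCache(Supplier<List<T>> loader) {
        this.loader = loader;
    }

    // Returns the cached list, computing it on the first call after an invalidation.
    synchronized List<T> get() {
        if (cached == null) {
            cached = loader.get();
        }
        return cached;
    }

    // Must be called whenever the underlying data changes.
    synchronized void invalidate() {
        cached = null;
    }
}
```

The invalidation step matters as much as the caching: without it, callers keep receiving a stale ordering after the data changes. In Spring the equivalent is typically an eviction annotation such as @CacheEvict on the methods that modify the data.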

4. Database Optimization: Indexes and Query Tuning

Running EXPLAIN on a slow query showed that it performed a full table scan:

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

An index was created on the filtered column:

CREATE INDEX idx_status ON large_table(status);

ORM queries were replaced by native SQL for better performance:

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
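
Why the index helps can be shown with an in-memory analogy. This is only an analogy in plain Java, not how a database engine stores or searches an index; Row and StatusIndexDemo are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StatusIndexDemo {
    static final class Row {
        final int id;
        final String status;
        Row(int id, String status) { this.id = id; this.status = status; }
    }

    // Full scan: every row is inspected, like the query before the index existed.
    static List<Row> scan(List<Row> table, String status) {
        List<Row> hits = new ArrayList<>();
        for (Row r : table) {
            if (r.status.equals(status)) hits.add(r);
        }
        return hits;
    }

    // Index: built once, after which a lookup goes straight to the matching rows.
    static Map<String, List<Row>> buildIndex(List<Row> table) {
        Map<String, List<Row>> index = new HashMap<>();
        for (Row r : table) {
            index.computeIfAbsent(r.status, k -> new ArrayList<>()).add(r);
        }
        return index;
    }
}
```

Both paths return the same rows; the difference is how many rows are touched per lookup. If most rows share the same status, an index on that column saves little, so it is worth re-running EXPLAIN after creating the index to confirm it is actually used.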

5. Deployment Optimization: Resource Isolation

Docker was used to isolate the service. A minimal Dockerfile runs the application with a 2 GB heap:

FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java","-Xmx2g","-jar","/app.jar"]

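Docker Compose resource limits were added to cap CPU and memory. The memory limit has to be larger than the JVM's maximum heap; the 512M shown below is smaller than the 2 GB heap set in the Dockerfile, so one of the two values needs adjusting before the container can use its full heap: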

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
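
A quick way to check that the heap setting and the container limit agree is to compare them and to print what the JVM actually sees when run inside the container. This is a minimal sketch; ContainerFit and the non-heap allowance parameter (for metaspace, thread stacks, and similar) are assumptions made for illustration.

```java
class ContainerFit {
    // True when the heap plus a non-heap allowance fits under the container memory limit.
    static boolean heapFits(long maxHeapBytes, long containerLimitBytes, long nonHeapAllowanceBytes) {
        return maxHeapBytes + nonHeapAllowanceBytes <= containerLimitBytes;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // Values as seen by this JVM; run inside the container to check the effective settings.
        System.out.println("max heap (MiB): " + rt.maxMemory() / (1024 * 1024));
        System.out.println("available processors: " + rt.availableProcessors());
    }
}
```

With the values from this section, a 2 GB heap against a 512M limit does not fit, so the container would be killed once the heap grew past the limit.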

6. Monitoring & Alerting: Prevent Recurrence

Prometheus and Grafana were deployed to provide full‑stack visibility. A new alert rule detects sustained high CPU usage:

- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
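
The arithmetic behind the alert expression can be sketched in a few lines. This is a simplified model: it ignores counter resets and the extrapolation that Prometheus's rate() performs, and it does not model the "for: 5m" hold period. CpuUsage and its method names are hypothetical.

```java
class CpuUsage {
    // Mirrors 100 - avg(rate(idle[window])) * 100 for a single instance.
    // idleStart/idleEnd hold the per-core idle-seconds counters at the window edges.
    static double usagePercent(double[] idleStart, double[] idleEnd, double windowSeconds) {
        double sumIdleRate = 0;
        for (int i = 0; i < idleStart.length; i++) {
            // Fraction of the window this core spent idle.
            sumIdleRate += (idleEnd[i] - idleStart[i]) / windowSeconds;
        }
        double avgIdleRate = sumIdleRate / idleStart.length;
        return 100 - avgIdleRate * 100;
    }

    // Same threshold as the alert rule above.
    static boolean exceedsThreshold(double usagePercent) {
        return usagePercent > 80;
    }
}
```

For example, two cores that were each idle for 30 of the last 300 seconds average a 10% idle rate, which gives 90% usage and exceeds the threshold. The rule only fires if that stays true for the full five minutes.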

After applying these changes, CPU usage dropped below 30 %, response times returned to millisecond levels, and the incident was resolved within four hours.
