
How I Rescued a Critical Service: A Step‑by‑Step CPU Overload Debugging Guide

When a midnight CPU alarm threatened service availability, I walked through rapid system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and Prometheus alerting to bring CPU usage back below 30% and restore millisecond‑level response times.

Architect

May 15, 2025


Initial Diagnosis

Log into the host and run top to verify that CPU usage is close to 100 % and that the load average exceeds the number of CPU cores. Then run htop to list processes and identify the Java processes that consume the most CPU.
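
The same comparison of load average against core count can also be made from inside a JVM. The following is a minimal sketch using the standard `OperatingSystemMXBean`; the class name `LoadCheck` and the helper `isOverloaded` are hypothetical names chosen for illustration, not part of the original service.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: mirrors the "load average > core count" check done with top.
final class LoadCheck {

    /** Returns true when the load average exceeds the number of available cores. */
    static boolean isOverloaded(double loadAverage, int cores) {
        // A negative value means the platform does not report a load average.
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute load average, or negative if unavailable
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, isOverloaded(load, cores));
    }
}
```

A load average above the core count only says that work is queuing; `top` or `htop` is still needed to see which process is responsible.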

JVM‑Level Analysis

Monitor garbage‑collection activity with:

jstat -gcutil <PID> 1000 10

Frequent Full GCs often correlate with high CPU load. Capture a thread dump:

jstack <PID> > thread_dump.txt

Inspect the dump for many threads in the RUNNABLE state executing similar call stacks. Profile the JVM using async‑profiler to pinpoint the hotspot:

./profiler.sh -d 30 -f cpu_profile.svg <PID>

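The generated flame graph highlights a custom sorting algorithm as the dominant CPU consumer. Note that async‑profiler 2.x and later write flame graphs as HTML rather than SVG, so on those versions give the output file an .html name.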
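
When attaching external tools is not possible, a rough in-process equivalent of "look for many RUNNABLE threads with similar stacks" can be sketched with `Thread.getAllStackTraces()`. This is only an illustration under that assumption: the class `RunnableThreadSummary` is a hypothetical name, and grouping by the top frame alone is a simplification of what a real thread-dump analysis would do.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: groups RUNNABLE threads by their top stack frame.
final class RunnableThreadSummary {

    /** Counts RUNNABLE threads per "Class.method" at the top of their stack. */
    static Map<String, Integer> countRunnableByTopFrame(Map<Thread, StackTraceElement[]> traces) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<Thread, StackTraceElement[]> entry : traces.entrySet()) {
            StackTraceElement[] stack = entry.getValue();
            // Skip threads that are not running or have no frames to report.
            if (entry.getKey().getState() != Thread.State.RUNNABLE || stack.length == 0) {
                continue;
            }
            String topFrame = stack[0].getClassName() + "." + stack[0].getMethodName();
            counts.merge(topFrame, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Many threads sharing one top frame is a hint about where CPU time is going.
        countRunnableByTopFrame(Thread.getAllStackTraces())
                .forEach((frame, n) -> System.out.println(n + "  " + frame));
    }
}
```

A single snapshot can mislead, since RUNNABLE also covers threads blocked in native I/O; a sampling profiler remains the more reliable source.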

Application‑Level Refactor

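Replace the custom sort with the JDK's built‑in sort, run through a parallel stream to spread the work across cores. Parallelism shortens wall‑clock time on large lists but does not reduce the total CPU work, so on an already saturated host the gain is likely to come mainly from dropping the inefficient custom algorithm: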

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
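
Before switching to a parallel stream it is worth checking that both variants give the same ordering and measuring whether the parallel one is actually faster for the list sizes involved. The sketch below does that with a stand-in `Data` class; `SortComparison`, `randomData`, and the list size are assumptions made for illustration, since the original `Data` type is not shown.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical harness comparing sequential and parallel stream sorting.
final class SortComparison {

    /** Stand-in for the article's Data type, which is not shown in the source. */
    static final class Data {
        private final int key;
        Data(int key) { this.key = key; }
        int getKey() { return key; }
    }

    static List<Data> sortSequential(List<Data> data) {
        return data.stream()
                .sorted(Comparator.comparingInt(Data::getKey))
                .collect(Collectors.toList());
    }

    static List<Data> sortParallel(List<Data> data) {
        // Runs on the common ForkJoinPool, shared with every other parallel stream in the JVM.
        return data.parallelStream()
                .sorted(Comparator.comparingInt(Data::getKey))
                .collect(Collectors.toList());
    }

    static List<Data> randomData(int n, long seed) {
        Random rnd = new Random(seed);
        List<Data> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(new Data(rnd.nextInt()));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Data> data = randomData(1_000_000, 42L);

        long t0 = System.nanoTime();
        sortSequential(data);
        long t1 = System.nanoTime();
        sortParallel(data);
        long t2 = System.nanoTime();

        // Single-shot timings are noisy; use a benchmark harness such as JMH for real decisions.
        System.out.printf("sequential=%d ms parallel=%d ms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
    }
}
```

For small lists the fork/join overhead can outweigh the benefit, so the sequential version may be the better default.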

Introduce caching to avoid repeated sorting of the same data set:

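// Requires Spring caching to be enabled (for example with @EnableCaching)
@Cacheable("sortedData")
public List<Data> getSortedData() {
    // loadData() is a hypothetical accessor for the unsorted data set
    return loadData().stream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}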
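
To make the effect of the cache concrete, here is a framework-free sketch of the same idea: sort once per data set, serve later calls from memory, and evict when the underlying data changes. It is not how Spring implements `@Cacheable`; `SortedCache`, its `loader` function, and the `dataSetId` key are hypothetical names used only for this illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.stream.Collectors;

// Hypothetical memoizing wrapper around an expensive sort.
final class SortedCache {

    private final Map<String, List<Integer>> cache = new ConcurrentHashMap<>();
    private final AtomicInteger sortCount = new AtomicInteger();
    private final Function<String, List<Integer>> loader;

    SortedCache(Function<String, List<Integer>> loader) {
        this.loader = loader;
    }

    /** Returns the sorted data set, sorting only on the first request per id. */
    List<Integer> getSorted(String dataSetId) {
        return cache.computeIfAbsent(dataSetId, id -> {
            sortCount.incrementAndGet();
            return loader.apply(id).stream()
                    .sorted()
                    .collect(Collectors.toUnmodifiableList());
        });
    }

    /** Drops a cached result; call this whenever the underlying data changes. */
    void evict(String dataSetId) {
        cache.remove(dataSetId);
    }

    /** Number of times the sort has actually run. */
    int sortCount() {
        return sortCount.get();
    }
}
```

The eviction step matters as much as the lookup: a cache that is never invalidated trades the CPU problem for stale results.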

Database Optimisation

Analyse the slow query with EXPLAIN:

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

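The plan shows a full table scan. Create an index on the status column to enable index‑based lookups:

CREATE INDEX idx_status ON large_table(status);

Re‑run EXPLAIN afterwards to confirm that the planner uses the index; on a low‑cardinality column such as status it may still choose a scan when most rows match. The repository method can be declared as a native query, although a derived or JPQL query that filters on status produces equivalent SQL and benefits from the index in the same way: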

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
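
As an in-memory analogy for why the index helps, the sketch below contrasts filtering every row with looking rows up through a prebuilt map keyed on status. A real database index is typically a B-tree rather than a hash map, so this only illustrates the access pattern; `StatusIndexDemo` and `Row` are hypothetical names.

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical illustration of full scan versus index lookup.
final class StatusIndexDemo {

    /** Stand-in for a row of large_table. */
    static final class Row {
        final long id;
        final String status;
        Row(long id, String status) {
            this.id = id;
            this.status = status;
        }
    }

    /** Full scan: every row is inspected on every query. */
    static List<Row> fullScan(List<Row> table, String status) {
        return table.stream()
                .filter(row -> row.status.equals(status))
                .collect(Collectors.toList());
    }

    /** Built once, like CREATE INDEX; must be maintained as rows change. */
    static Map<String, List<Row>> buildIndex(List<Row> table) {
        return table.stream().collect(Collectors.groupingBy(row -> row.status));
    }

    /** Index lookup: jumps straight to the matching rows. */
    static List<Row> indexLookup(Map<String, List<Row>> index, String status) {
        return index.getOrDefault(status, List.of());
    }
}
```

The analogy also shows the limit of the technique: if nearly every row is 'ACTIVE', the lookup still returns nearly the whole table and saves little.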

Containerised Deployment

Build a lightweight Docker image for the service:

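FROM openjdk:11-jre-slim
COPY target/myapp.jar /app.jar
# Size the heap relative to the container's memory limit instead of hard-coding 2 GB,
# which would exceed the 512M limit set in the Compose file
ENTRYPOINT ["java", "-XX:MaxRAMPercentage=75.0", "-jar", "/app.jar"]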

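Use Docker Compose to enforce resource limits, preventing a single container from monopolising the host. The deploy.resources.limits keys are applied by Docker Compose V2 and by Swarm; the legacy docker-compose v1 tool ignores them unless it is run with --compatibility: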

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
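
It is worth confirming that the JVM actually sees these limits. The sketch below prints the values the runtime reports; `ContainerLimitsCheck` is a hypothetical class name, and the expectation that a container-aware JVM such as JDK 11 reports one processor under a 0.50 CPU limit is an assumption to verify in your own environment.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical diagnostic: prints the resources the JVM believes it has.
final class ContainerLimitsCheck {

    static String describe(int processors, long maxHeapBytes, int commonPoolParallelism) {
        return String.format("processors=%d maxHeapMiB=%d commonPoolParallelism=%d",
                processors, maxHeapBytes / (1024 * 1024), commonPoolParallelism);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // The common pool backs parallel streams, so its size bounds their parallelism.
        System.out.println(describe(
                rt.availableProcessors(),
                rt.maxMemory(),
                ForkJoinPool.getCommonPoolParallelism()));
    }
}
```

If the reported processor count is 1, parallel streams inside the container gain little, which is a trade-off to weigh against the isolation the limit provides.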

Monitoring & Alerting

Upgrade the monitoring stack with Prometheus and Grafana and add an alert that fires when CPU idle time stays below 20 % for five minutes. The rule relies on the node_cpu_seconds_total metric exposed by node_exporter:

- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
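
The arithmetic behind the expression can be worked through by hand: take the per-core increase in idle seconds over the window, divide by the window length to get an idle fraction, average across cores, and subtract from 100. The sketch below reproduces that calculation; `CpuUsageExpr` and its method names are hypothetical, and the sample values are made up for illustration.

```java
// Hypothetical re-implementation of the alert expression's arithmetic.
final class CpuUsageExpr {

    /**
     * Mirrors 100 - avg(rate(idle[window])) * 100 for one instance.
     * Each array element is the growth of one core's idle counter, in seconds, over the window.
     */
    static double usagePercent(double[] idleSecondsDeltaPerCore, double windowSeconds) {
        double idleFractionSum = 0.0;
        for (double delta : idleSecondsDeltaPerCore) {
            idleFractionSum += delta / windowSeconds;   // per-core idle fraction
        }
        double avgIdleFraction = idleFractionSum / idleSecondsDeltaPerCore.length;
        return 100.0 - avgIdleFraction * 100.0;
    }

    /** The rule's threshold; the additional "for: 5m" hold period is not modelled here. */
    static boolean exceedsThreshold(double usagePercent) {
        return usagePercent > 80.0;
    }

    public static void main(String[] args) {
        // Two cores idle for 30 s and 60 s out of a 300 s window.
        double usage = usagePercent(new double[] {30.0, 60.0}, 300.0);
        System.out.printf("usage=%.1f%% alert=%b%n", usage, exceedsThreshold(usage));
    }
}
```

With those sample values the idle fractions are 0.10 and 0.20, the average is 0.15, and the computed usage is about 85 %, which is above the 80 % threshold.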

Result

After applying the diagnostics, code refactor, database tuning, container isolation, and improved alerting, the service’s CPU usage fell below 30 % and response latency returned to the millisecond range.
