
Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide

After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, optimized database queries, containerized the service, and set up Prometheus‑Grafana alerts to prevent future incidents.


1. Initial Diagnosis: Quickly Locate the Problem

I logged into the server and ran top to view system resource usage:

$ top

The output showed CPU usage near 100% and a load average far exceeding the number of cores. Next, I used htop for more detailed, per-process information and discovered several Java processes consuming most of the CPU:

$ htop

2. JVM‑Level Analysis: Finding Hot Methods

Confirming the issue was in the Java application, I inspected the JVM with jstat to check GC activity:

$ jstat -gcutil [PID] 1000 10

The output indicated frequent Full GC, a possible cause of the high CPU usage.
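
The same counters can also be read from inside the process through the standard JMX garbage-collector beans, as a complement to jstat. A minimal sketch (the class name is illustrative):

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Illustrative helper: prints cumulative collection counts and times per
// collector, mirroring what jstat -gcutil reports from outside the process.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%-25s count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}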

I generated a thread dump using jstack and saw many threads in the RUNNABLE state executing similar methods:

$ jstack [PID] > thread_dump.txt

To pinpoint the hot methods, I ran async‑profiler and produced a flame graph that highlighted a custom sorting algorithm as the main CPU consumer:

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
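
As a lighter-weight complement to the flame graph, ranking threads by accumulated CPU time from inside the process can confirm where the cycles go. A sketch using the standard ThreadMXBean API (the class name is illustrative, and per-thread CPU timing must be supported by the JVM):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Arrays;
import java.util.Comparator;

// Illustrative helper: prints the five threads with the most CPU time.
public class HotThreads {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            return; // no per-thread CPU accounting on this JVM
        }
        // Rank live threads by accumulated CPU time, descending
        Arrays.stream(mx.getAllThreadIds())
                .boxed()
                .sorted(Comparator.comparingLong((Long id) -> -mx.getThreadCpuTime(id)))
                .limit(5)
                .forEach(id -> {
                    ThreadInfo info = mx.getThreadInfo(id);
                    if (info != null) {
                        System.out.printf("%-40s %d ms%n",
                                info.getThreadName(), mx.getThreadCpuTime(id) / 1_000_000);
                    }
                });
    }
}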

3. Application‑Layer Optimization: Refactoring the Algorithm

The culprit was a custom sort designed for small data sets but now handling large volumes. I rewrote it using Java 8 parallel streams:

// Delegate to the standard library sort and let the common ForkJoinPool parallelize it
List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());

I also added a cache to avoid repeated calculations:

@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic
}
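
@Cacheable here relies on Spring's cache abstraction, which only takes effect once caching is enabled and a cache manager is registered. A minimal sketch using Spring's simple in-memory cache (the configuration class name is illustrative):

import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.concurrent.ConcurrentMapCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

// Illustrative configuration: backs the "sortedData" cache name used by
// @Cacheable with a simple ConcurrentHashMap-based cache.
@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public CacheManager cacheManager() {
        return new ConcurrentMapCacheManager("sortedData");
    }
}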

4. Database Optimization: Indexes and Query Improvements

During the investigation I found inefficient SQL queries. Using EXPLAIN revealed a full‑table scan on a large table.

EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';

I created an appropriate index and rewrote part of the ORM query to use native SQL:

CREATE INDEX idx_status ON large_table(status);

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
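
A minimal sketch of the surrounding Spring Data JPA repository interface (the interface name and Long id type are illustrative, not from the original):

import java.util.List;

import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;

// Illustrative repository wrapping the native query shown above
public interface LargeTableRepository extends JpaRepository<LargeTable, Long> {

    // The native query lets the database use idx_status directly,
    // bypassing the SQL the ORM generated before
    @Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
    List<LargeTable> findByStatus(@Param("status") String status);
}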

5. Deployment Optimization: Container Isolation

To prevent a single service from affecting the whole system, I containerized the application with Docker:

FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
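# Keep the JVM heap (-Xmx) below whatever memory limit the container is given at deploy time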
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]

Using Docker Compose I limited CPU and memory resources:

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
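
One way to confirm that these limits are actually visible to the application is a tiny check run inside the container. A sketch (the class name is illustrative; cgroup awareness requires JDK 10+ or 8u191+, which the openjdk:11 image above satisfies):

// Illustrative check: prints what the JVM believes its CPU and memory
// budget to be; on a cgroup-aware JDK these reflect the container limits.
public class LimitsCheck {
    public static void main(String[] args) {
        System.out.println("CPUs visible to the JVM: "
                + Runtime.getRuntime().availableProcessors());
        System.out.println("Max heap (MB): "
                + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}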

6. Monitoring & Alerting: Proactive Protection

Finally, I upgraded the monitoring stack with Prometheus and Grafana and added an alert rule for high CPU usage. The expression computes busy CPU as 100 minus the per-instance average idle rate, and the rule fires only after the threshold has held for five minutes, filtering out momentary spikes:

- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"

Conclusion: Crisis and Growth

After nearly four hours of intensive work, the system recovered: CPU usage dropped below 30% and response times returned to the millisecond range. The incident reinforced the importance of regular code reviews, performance and stress testing, and a robust monitoring and alerting system.

Tags: Monitoring · Docker · Prometheus · Java performance · CPU troubleshooting
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together along the way.
