Rescuing a Critical CPU Outage: My Step-by-Step Troubleshooting Guide
After a midnight CPU alarm threatened service stability, I walked through rapid diagnosis with top and htop, identified JVM bottlenecks using jstat and async‑profiler, refactored a Java sorting algorithm, added caching, and optimized database queries. I then containerized the service and set up Prometheus‑Grafana alerts to prevent future incidents.
1. Initial Diagnosis: Quickly Locate the Problem
I logged into the server and ran top to view system resource usage. The output showed CPU usage near 100% and a load average far exceeding the number of cores.

```shell
$ top
```

Next, I used htop for more detailed process information and discovered several Java processes consuming most of the CPU.

```shell
$ htop
```

2. JVM‑Level Analysis: Finding Hot Methods
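The load figures that top reports can also be read from inside the JVM through the standard management API, which is useful when shell access to the box is limited. A minimal sketch (the class name is my own):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Minimal sketch: read CPU-related metrics from inside the JVM.
public class LoadCheck {
    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // Number of cores visible to the JVM, for comparison against load average.
        System.out.println("cores: " + os.getAvailableProcessors());
        // System load average over the last minute (-1 if unavailable, e.g. on Windows).
        System.out.println("load1m: " + os.getSystemLoadAverage());
    }
}
```

A sustained load average well above the core count printed here is the same signal the top output showed.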
Confirming the issue was in the Java application, I inspected the JVM with jstat to check GC activity.

```shell
$ jstat -gcutil [PID] 1000 10
```

The output indicated frequent Full GC, a possible cause of high CPU usage.
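The same collection counters that jstat surfaces are also exposed in-process via GarbageCollectorMXBean, so an application can log its own GC pressure. A minimal sketch (the class name is my own):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Minimal sketch: print per-collector invocation counts and accumulated
// collection time, roughly the FGC/FGCT-style figures `jstat -gcutil` shows.
public class GcStats {
    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d, timeMs=%d%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
    }
}
```

A rapidly climbing count and time on the old-generation collector corresponds to the frequent Full GC seen in the jstat output.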
I generated a thread dump using jstack and saw many threads in the RUNNABLE state executing similar methods.

```shell
$ jstack [PID] > thread_dump.txt
```

To pinpoint hot methods, I ran async‑profiler and produced a flame graph that highlighted a custom sorting algorithm as the main CPU consumer.
```shell
$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
```

3. Application‑Layer Optimization: Refactoring the Algorithm
The culprit was a custom sort designed for small data sets but now handling large volumes. I rewrote it using Java 8 parallel streams:
```java
List<Data> sortedData = data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
```

I also added a cache to avoid repeated calculations:

```java
@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic
}
```

4. Database Optimization: Indexes and Query Improvements
During the investigation I found inefficient SQL queries. Using EXPLAIN revealed a full‑table scan on a large table.
```sql
EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';
```

I created an appropriate index and rewrote part of the ORM query to use native SQL:

```sql
CREATE INDEX idx_status ON large_table(status);
```

```java
@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
```

5. Deployment Optimization: Container Isolation
To prevent a single service from affecting the whole system, I containerized the application with Docker:
```dockerfile
FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]
```

Using Docker Compose I limited CPU and memory resources:

```yaml
version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
```

6. Monitoring & Alerting: Proactive Protection
Finally, I upgraded the monitoring stack with Prometheus and Grafana and added a smarter alert rule for high CPU usage:
```yaml
- alert: HighCPUUsage
  expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
```

Conclusion: Crisis and Growth
After nearly four hours of intensive work, the system recovered: CPU usage dropped below 30% and response times returned to milliseconds. The incident reinforced the importance of regular code reviews, performance and stress testing, and a robust monitoring and alerting system.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.