How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Debugging Guide
When a CPU alarm fired at 4 am, I logged into the server, traced the load to runaway Java processes, and profiled the JVM. The fix involved refactoring a costly sorting algorithm, adding database indexes, containerizing the service, and setting up Prometheus alerts, which brought CPU usage below 30% and restored millisecond response times.
1. Initial Diagnosis: Quickly Locate the Issue
At 4 am an alert reported an “online CPU alarm”. After logging into the server, $ top showed CPU usage near 100% and a load average far above the number of cores. $ htop showed several Java processes consuming most of the CPU.
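The "load average versus core count" comparison can also be made from inside a JVM. The following is a minimal sketch, assuming only the standard java.lang.management API; the class name LoadCheck and the helper isOverloaded are hypothetical names chosen for illustration, not part of the original service.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: compares the 1-minute load average with the core count.
class LoadCheck {
    /** Returns true when the load average is available and exceeds the number of cores. */
    static boolean isOverloaded(double loadAverage, int cores) {
        // getSystemLoadAverage() returns a negative value when the platform cannot report it
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage();   // 1-minute load average
        int cores = os.getAvailableProcessors();   // processors visible to this JVM
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, isOverloaded(load, cores));
    }
}
```

A load average well above the core count means runnable work is queuing for CPU time, which matches what top reported here.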
2. JVM‑Level Analysis: Find Hot Methods
Running $ jstat -gcutil [PID] 1000 10 (ten samples at one-second intervals) showed frequent Full GCs, suggesting GC pressure. A thread dump generated with $ jstack [PID] > thread_dump.txt contained many RUNNABLE threads executing similar call stacks. To pinpoint the hot method, async‑profiler was run for 30 seconds:
$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]
The flame graph highlighted a custom sorting algorithm that dominated CPU time.
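Spotting "many RUNNABLE threads executing similar call stacks" by eye gets tedious in a large dump. As a rough sketch, the code below groups RUNNABLE threads by their topmost stack frame. It assumes the usual jstack text layout (a "java.lang.Thread.State:" line followed by "at ..." frames, with a blank line between threads); the class ThreadDumpSummary and its method are hypothetical names.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: counts RUNNABLE threads per topmost stack frame in a jstack dump.
class ThreadDumpSummary {
    static Map<String, Integer> runnableTopFrames(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean awaitingTopFrame = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                // Only RUNNABLE threads are of interest for a CPU investigation
                awaitingTopFrame = line.endsWith("RUNNABLE");
            } else if (line.isEmpty()) {
                // A blank line ends the current thread's section
                awaitingTopFrame = false;
            } else if (awaitingTopFrame && line.startsWith("at ")) {
                // The first frame after the state line is the top of the stack
                counts.merge(line.substring(3), 1, Integer::sum);
                awaitingTopFrame = false;
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        String path = args.length > 0 ? args[0] : "thread_dump.txt";
        List<String> lines = Files.readAllLines(Paths.get(path));
        // Print the most common top frames first
        runnableTopFrames(lines).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

A single frame with a high count is only a hint. A sampling profiler such as async‑profiler is still needed to confirm where CPU time actually goes.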
3. Application‑Level Optimization: Refactor the Algorithm
The identified sorting routine was designed for small data sets and became a bottleneck at scale. It was rewritten using Java 8 parallel streams:
// requires java.util.Comparator, java.util.List, java.util.stream.Collectors
List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
A cache was added to avoid repeated work:
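// Spring's @Cacheable (org.springframework.cache.annotation); needs caching enabled, e.g. via @EnableCaching
@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic
}
4. Database Optimization: Indexes and Query Tuning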
Running EXPLAIN on a slow query revealed a full table scan:
EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';
An index was created on the filtered column:
CREATE INDEX idx_status ON large_table(status);
ORM queries were replaced by native SQL for better performance:
@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
5. Deployment Optimization: Resource Isolation
Docker was used to isolate the service. A minimal Dockerfile runs the application with a 2 GB heap:
FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java","-Xmx2g","-jar","/app.jar"]
Docker Compose limits were added to cap CPU and memory:
version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          # illustrative value: the limit must exceed the -Xmx2g heap plus JVM non-heap overhead
          memory: 3G
6. Monitoring & Alerting: Prevent Recurrence
Prometheus and Grafana were deployed to provide full‑stack visibility. A new alert rule detects sustained high CPU usage:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
After applying these changes, CPU usage dropped below 30%, response times returned to millisecond levels, and the incident was resolved within four hours.
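For readers unfamiliar with PromQL, the arithmetic behind the HighCPUUsage expression can be sketched in plain Java: take the per-core rate of idle seconds over the window, average it across cores, and subtract from 100. This is a simplified illustration, not how Prometheus is implemented. It ignores rate() extrapolation, counter resets, and the for: 5m hold period, and the names CpuUsage, usagePercent, and shouldAlert are hypothetical.

```java
// Hypothetical illustration of: 100 - avg(rate(idle_seconds[window])) * 100 > 80
class CpuUsage {
    /**
     * idleStart and idleEnd hold the cumulative idle seconds of each core
     * at the start and end of the window (same core order in both arrays).
     */
    static double usagePercent(double[] idleStart, double[] idleEnd, double windowSeconds) {
        double idleRateSum = 0.0;
        for (int i = 0; i < idleStart.length; i++) {
            // Idle seconds gained per wall-clock second: a fraction between 0 and 1
            idleRateSum += (idleEnd[i] - idleStart[i]) / windowSeconds;
        }
        double avgIdleFraction = idleRateSum / idleStart.length;
        return 100.0 - avgIdleFraction * 100.0;
    }

    /** Mirrors the "> 80" threshold of the alert rule. */
    static boolean shouldAlert(double usagePercent) {
        return usagePercent > 80.0;
    }
}
```

With two cores that gained 30 and 60 idle seconds over a 300-second window, the average idle fraction is 0.15, so usage comes out at roughly 85% and crosses the threshold.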
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge—fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc. (Reply “Linux” to receive essential resources.)
