How I Rescued a Critical Service: A Step‑by‑Step CPU Overload Debugging Guide
When a midnight CPU alarm threatened service availability, I walked through rapid system checks, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and Prometheus alerting to bring CPU usage back below 30% and restore millisecond‑level response times.
Initial Diagnosis
Log into the host and run top to verify that CPU usage is close to 100 % and that the load average exceeds the number of CPU cores. Then run htop to list processes and identify the Java processes that consume the most CPU.
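The same comparison that top gives at a glance (load average versus core count) can also be made from inside a JVM. The following is a minimal sketch, not part of the original incident tooling; the class name LoadCheck and the method isOverloaded are hypothetical names chosen for illustration, and it relies only on the standard java.lang.management API.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

// Hypothetical helper: compares the 1-minute load average with the core count.
class LoadCheck {

    /** Returns true when the load average exceeds the number of cores. */
    static boolean isOverloaded(double loadAverage, int cores) {
        // A negative value means the platform does not report a load average.
        return loadAverage >= 0 && loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative if unavailable (e.g. on Windows)
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, isOverloaded(load, cores));
    }
}
```

A load average above the core count only says that runnable work is queuing; it does not say which process is responsible, which is why the per-process view from htop is still needed.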
JVM‑Level Analysis
Monitor garbage‑collection activity (one sample per second, ten samples) with:
jstat -gcutil <PID> 1000 10
Frequent Full GCs often correlate with high CPU load. Capture a thread dump:
jstack <PID> > thread_dump.txt
Inspect the dump for many threads in the RUNNABLE state executing similar call stacks. Profile the JVM using async‑profiler to pinpoint the hotspot:
./profiler.sh -d 30 -f cpu_profile.svg <PID>
The generated flame graph highlights a custom sorting algorithm as the dominant CPU consumer.
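Scanning a large thread dump by eye for "many RUNNABLE threads with similar stacks" is error-prone, so a small summariser can help. The sketch below assumes the usual jstack text layout (a "java.lang.Thread.State:" line followed by "at ..." frames, with a blank line between threads); the class ThreadDumpSummary and its method are hypothetical names, not part of any JDK tool.

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: groups RUNNABLE threads in a jstack dump by top stack frame.
class ThreadDumpSummary {

    /** Counts RUNNABLE threads, keyed by the first "at ..." frame of each thread. */
    static Map<String, Integer> runnableByTopFrame(String dump) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        boolean awaitingFrame = false;
        for (String raw : dump.split("\\R")) {
            String line = raw.trim();
            if (line.startsWith("java.lang.Thread.State:")) {
                // Only threads reported as RUNNABLE are of interest here.
                awaitingFrame = line.contains("RUNNABLE");
            } else if (awaitingFrame && line.startsWith("at ")) {
                counts.merge(line.substring(3), 1, Integer::sum);
                awaitingFrame = false; // only the top frame is counted
            } else if (line.isEmpty()) {
                awaitingFrame = false; // a blank line ends the current thread entry
            }
        }
        return counts;
    }

    public static void main(String[] args) throws Exception {
        if (args.length == 0) {
            System.out.println("usage: java ThreadDumpSummary.java thread_dump.txt");
            return;
        }
        String dump = new String(Files.readAllBytes(Paths.get(args[0])));
        runnableByTopFrame(dump).entrySet().stream()
                .sorted((a, b) -> b.getValue() - a.getValue())
                .forEach(e -> System.out.println(e.getValue() + "  " + e.getKey()));
    }
}
```

If one frame accounts for most RUNNABLE threads, that frame is a reasonable first suspect, which the profiler's flame graph can then confirm or rule out.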
Application‑Level Refactor
Replace the sequential sort with a parallel stream so the work is spread across multiple cores. This shortens the wall‑clock time of each sort; it does not reduce the total CPU work:
List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
Introduce caching to avoid repeated sorting of the same data set:
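@Cacheable("sortedData")
public List<Data> getSortedData() {
    // Runs only on a cache miss; `data` is assumed to be the collection sorted above
    return data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());
}
Database Optimisation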
Analyse the slow query with EXPLAIN:
EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';
The plan shows a full table scan. Create an index on the status column to enable index‑based lookups:
CREATE INDEX idx_status ON large_table(status);
Rewrite the ORM method as a native query so that the SQL sent to the database is explicit and its WHERE clause matches the new index:
@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);
Containerised Deployment
Build a lightweight Docker image for the service:
FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]
Use Docker Compose to enforce resource limits, preventing a single container from monopolising the host:
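version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M
Note that the 512M memory limit is lower than the 2 GB heap allowed by -Xmx2g in the Dockerfile. The two values need to be reconciled (raise the limit or lower the heap), otherwise the container can be killed once the heap grows past the limit. Also, the deploy section is applied by Docker Swarm and by recent Docker Compose releases; older docker-compose versions may ignore it unless run with --compatibility.
Monitoring & Alerting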
Upgrade the monitoring stack with Prometheus and Grafana and add an alert that fires when CPU idle time stays below 20 % for five minutes:
- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
Result
After applying the diagnostics, code refactor, database tuning, container isolation, and improved alerting, the service’s CPU usage fell below 30 % and response latency returned to the millisecond range.
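As a sanity check on the HighCPUUsage rule above, the arithmetic behind its expression can be reproduced by hand. The sketch below covers a single core over one window and leaves out the per-instance averaging across cores; the class CpuUsageCalc, its methods, and the sample counter values are hypothetical and chosen purely for illustration.

```java
// Hypothetical helper mirroring: 100 - rate(node_cpu_seconds_total{mode="idle"}[window]) * 100
class CpuUsageCalc {

    /**
     * The idle counter is measured in seconds, so its per-second rate over the
     * window is the fraction of time the core spent idle.
     */
    static double usagePercent(double idleSecondsStart, double idleSecondsEnd, double windowSeconds) {
        double idleFraction = (idleSecondsEnd - idleSecondsStart) / windowSeconds;
        return 100.0 - idleFraction * 100.0;
    }

    /** Same threshold as the alert rule: usage above 80 %, i.e. idle below 20 %. */
    static boolean shouldAlert(double usagePercent) {
        return usagePercent > 80.0;
    }

    public static void main(String[] args) {
        // Illustrative values: 45 idle seconds accumulated in a 300-second window.
        double usage = usagePercent(1000.0, 1045.0, 300.0);
        System.out.printf("usage=%.1f%% alert=%b%n", usage, shouldAlert(usage));
    }
}
```

In the real rule the condition must also hold continuously for the duration given by the for: clause before the alert fires, which this sketch does not model.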
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.