How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Ops Playbook
After a midnight CPU alarm, I walked through rapid diagnosis, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring a high‑load Java service back to stability, illustrating a comprehensive incident‑response workflow for modern operations teams.
Preface
At 4 a.m. an "online CPU" alert jolted me awake, signaling a critical performance issue that could degrade the user experience, cause data loss, and jeopardize my year-end performance review.
1. Initial Diagnosis: Quickly Locate the Problem
I logged into the server and ran:
<code>$ top</code>
The output showed CPU usage near 100% and a load average far exceeding the number of cores.
Next, I used <code>htop</code> to get detailed per-process information and discovered several Java processes consuming most of the CPU.
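The rule of thumb behind that load-average check can be sketched in Java. The `isOverloaded` helper and its threshold are my own illustration, not part of the incident tooling:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    /** Rough saturation heuristic: the 1-minute load average
        exceeding the core count means work is queueing for CPU. */
    static boolean isOverloaded(double loadAverage, int cores) {
        return loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // -1.0 where unsupported (e.g. Windows)
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, load >= 0 && isOverloaded(load, cores));
    }
}
```

Note that on Linux the load average also counts tasks in uninterruptible I/O wait, so a high value is a prompt to look closer (as `htop` did here), not proof of CPU burn by itself.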
2. JVM‑Level Analysis: Find Hot Methods
Focusing on the Java application, I checked GC activity:
<code>$ jstat -gcutil [PID] 1000 10</code>
Frequent Full GC indicated a possible cause of the high CPU usage.
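Eyeballing FGC deltas across `jstat` samples is error-prone, so a small parser helps. This is a hypothetical sketch, assuming the JDK 8+ `jstat -gcutil` column layout (S0 S1 E O M CCS YGC YGCT FGC FGCT GCT); the sample rows below are invented for illustration:

```java
public class GcRate {
    /** FGC (cumulative Full GC count) is column index 8 in a
        jstat -gcutil sample line: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT. */
    static long fullGcCount(String sampleLine) {
        return Long.parseLong(sampleLine.trim().split("\\s+")[8]);
    }

    /** Full GCs per second between the first and last of `samples`
        rows taken `intervalMillis` apart. */
    static double fullGcPerSecond(String first, String last,
                                  int samples, long intervalMillis) {
        long delta = fullGcCount(last) - fullGcCount(first);
        return delta * 1000.0 / ((samples - 1) * intervalMillis);
    }

    public static void main(String[] args) {
        // Hypothetical first and last rows of `jstat -gcutil <PID> 1000 10`.
        String first = "0.00 97.02 66.31 85.62 95.12 90.01 120 4.56 35 12.10 16.66";
        String last  = "0.00 91.44 12.07 99.89 95.12 90.01 129 4.91 52 19.88 24.79";
        System.out.println(fullGcPerSecond(first, last, 10, 1000)); // ~1.9 Full GCs/s
    }
}
```

Anything close to one Full GC per second, with the old generation (O) staying near 100%, is a strong hint the CPU is being spent in the collector rather than in application code.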
I generated a thread dump:
<code>$ jstack [PID] > thread_dump.txt</code>
The dump revealed many threads in RUNNABLE state executing similar call stacks.
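Counting RUNNABLE threads in a large dump by eye is tedious; a helper like this hypothetical one (the dump excerpt is invented for illustration) does it mechanically:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpStats {
    // jstack reports each thread's state on a "java.lang.Thread.State: ..." line.
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Count threads reported in RUNNABLE state in a jstack dump. */
    static long countRunnable(String dump) {
        Matcher m = STATE.matcher(dump);
        long n = 0;
        while (m.find()) {
            if (m.group(1).equals("RUNNABLE")) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Trimmed, hypothetical excerpt of a jstack thread dump.
        String dump = String.join("\n",
            "\"worker-1\" #12 prio=5 nid=0x2f03 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "\"worker-2\" #13 prio=5 nid=0x2f04 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "\"scheduler\" #14 prio=5 nid=0x2f05 waiting on condition",
            "   java.lang.Thread.State: TIMED_WAITING (sleeping)");
        System.out.println(countRunnable(dump)); // prints 2
    }
}
```

A quick `grep -c` gives the same count; the point is to quantify "many RUNNABLE threads" before and after a fix rather than trusting an impression.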
To pinpoint hot methods, I ran async‑profiler:
<code>$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]</code>
The flame graph highlighted a custom sorting algorithm that was hogging CPU cycles.
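The article does not show the offending algorithm, so as a hypothetical stand-in: a hand-rolled O(n²) sort like the one below is exactly the shape that dominates a CPU flame graph as input sizes grow, while the O(n log n) library sort barely registers:

```java
import java.util.Arrays;
import java.util.Random;

public class HotSpotDemo {
    /** Hypothetical hand-rolled O(n^2) insertion sort — the kind of
        "custom sorting" a flame graph flags on large inputs. */
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(30_000, 0, 1_000_000).toArray();
        int[] copy = data.clone();

        long t0 = System.nanoTime();
        insertionSort(data);   // quadratic: the flame-graph hot spot
        long t1 = System.nanoTime();
        Arrays.sort(copy);     // O(n log n) library sort
        long t2 = System.nanoTime();

        System.out.printf("custom=%dms library=%dms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        System.out.println(Arrays.equals(data, copy)); // same result, very different cost
    }
}
```

Both produce identical output, which is what makes this class of bug invisible in functional tests and visible only under profiling.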
3. Application‑Level Optimization: Refactor the Algorithm
I rewrote the sorting logic using Java 8 parallel streams:
<code>List<Data> sortedData = data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());</code>
Additionally, I added caching to avoid repeated computation:
<code>@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic
}</code>
4. Database Optimization: Index and Query Improvements
I examined slow SQL with <code>EXPLAIN</code>:
<code>EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';</code>
The plan showed a full table scan, so I created an index:
<code>CREATE INDEX idx_status ON large_table(status);</code>
I also replaced some ORM queries with native SQL for better performance:
<code>@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);</code>
5. Deployment Optimization: Resource Isolation
To prevent a single service from affecting the whole system, I containerized the application with Docker:
<code>FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]</code>
Using Docker Compose, I limited CPU and memory:
<code>version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M</code>
6. Monitoring & Alerting: Prevent Recurrence
I upgraded the monitoring stack with Prometheus and Grafana and added a smarter alert rule:
<code>- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"</code>
Conclusion: Crisis and Growth
After nearly four hours of intensive work, CPU usage dropped below 30%, and response times returned to millisecond levels. The incident reinforced the importance of regular code reviews, performance testing, and robust monitoring to avoid technical debt and ensure system reliability.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.