How I Rescued a Critical Service from 100% CPU: A Step‑by‑Step Ops Playbook
After a midnight CPU alarm, I walked through rapid diagnosis, JVM profiling, algorithm refactoring, database indexing, Docker isolation, and enhanced monitoring to bring a high‑load Java service back to stability, illustrating a comprehensive incident‑response workflow for modern operations teams.
Preface
At 4 a.m. an "online CPU" alert jolted me awake, signaling a critical performance issue that could degrade the user experience, cause data loss, and jeopardize my year-end performance review.
1. Initial Diagnosis: Quickly Locate the Problem
I logged into the server and ran:
<code>$ top</code>
The output showed CPU usage near 100% and a load average far exceeding the number of cores.
Next, I used <code>htop</code> to get detailed per-process information and discovered several Java processes consuming most of the CPU.
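The rule of thumb behind that load-average check can be sketched in Java. The `isOverloaded` helper and its threshold are my own illustration, not part of the incident tooling:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    /** Rough saturation heuristic: the 1-minute load average
        exceeding the core count means work is queueing for CPU. */
    static boolean isOverloaded(double loadAverage, int cores) {
        return loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // -1.0 where unsupported (e.g. Windows)
        int cores = os.getAvailableProcessors();
        System.out.printf("load=%.2f cores=%d overloaded=%b%n",
                load, cores, load >= 0 && isOverloaded(load, cores));
    }
}
```

Note that on Linux the load average also counts tasks in uninterruptible I/O wait, so a high value is a prompt to look closer (as `htop` did here), not proof of CPU burn by itself.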
2. JVM‑Level Analysis: Find Hot Methods
Focusing on the Java application, I checked GC activity:
<code>$ jstat -gcutil [PID] 1000 10</code>
Frequent Full GC indicated a possible cause of the high CPU usage.
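Eyeballing FGC deltas across `jstat` samples is error-prone, so a small parser helps. This is a hypothetical sketch, assuming the JDK 8+ `jstat -gcutil` column layout (S0 S1 E O M CCS YGC YGCT FGC FGCT GCT); the sample rows below are invented for illustration:

```java
public class GcRate {
    /** FGC (cumulative Full GC count) is column index 8 in a
        jstat -gcutil sample line: S0 S1 E O M CCS YGC YGCT FGC FGCT GCT. */
    static long fullGcCount(String sampleLine) {
        return Long.parseLong(sampleLine.trim().split("\\s+")[8]);
    }

    /** Full GCs per second between the first and last of `samples`
        rows taken `intervalMillis` apart. */
    static double fullGcPerSecond(String first, String last,
                                  int samples, long intervalMillis) {
        long delta = fullGcCount(last) - fullGcCount(first);
        return delta * 1000.0 / ((samples - 1) * intervalMillis);
    }

    public static void main(String[] args) {
        // Hypothetical first and last rows of `jstat -gcutil <PID> 1000 10`.
        String first = "0.00 97.02 66.31 85.62 95.12 90.01 120 4.56 35 12.10 16.66";
        String last  = "0.00 91.44 12.07 99.89 95.12 90.01 129 4.91 52 19.88 24.79";
        System.out.println(fullGcPerSecond(first, last, 10, 1000)); // ~1.9 Full GCs/s
    }
}
```

Anything close to one Full GC per second, with the old generation (O) staying near 100%, is a strong hint the CPU is being spent in the collector rather than in application code.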
I generated a thread dump:
<code>$ jstack [PID] > thread_dump.txt</code>
The dump revealed many threads in RUNNABLE state executing similar call stacks.
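Counting RUNNABLE threads in a large dump by eye is tedious; a helper like this hypothetical one (the dump excerpt is invented for illustration) does it mechanically:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DumpStats {
    // jstack reports each thread's state on a "java.lang.Thread.State: ..." line.
    private static final Pattern STATE =
            Pattern.compile("java\\.lang\\.Thread\\.State: (\\w+)");

    /** Count threads reported in RUNNABLE state in a jstack dump. */
    static long countRunnable(String dump) {
        Matcher m = STATE.matcher(dump);
        long n = 0;
        while (m.find()) {
            if (m.group(1).equals("RUNNABLE")) n++;
        }
        return n;
    }

    public static void main(String[] args) {
        // Trimmed, hypothetical excerpt of a jstack thread dump.
        String dump = String.join("\n",
            "\"worker-1\" #12 prio=5 nid=0x2f03 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "\"worker-2\" #13 prio=5 nid=0x2f04 runnable",
            "   java.lang.Thread.State: RUNNABLE",
            "\"scheduler\" #14 prio=5 nid=0x2f05 waiting on condition",
            "   java.lang.Thread.State: TIMED_WAITING (sleeping)");
        System.out.println(countRunnable(dump)); // prints 2
    }
}
```

A quick `grep -c` gives the same count; the point is to quantify "many RUNNABLE threads" before and after a fix rather than trusting an impression.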
To pinpoint hot methods, I ran async‑profiler:
<code>$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]</code>
The flame graph highlighted a custom sorting algorithm that was hogging CPU cycles.
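The article does not show the offending algorithm, so as a hypothetical stand-in: a hand-rolled O(n²) sort like the one below is exactly the shape that dominates a CPU flame graph as input sizes grow, while the O(n log n) library sort barely registers:

```java
import java.util.Arrays;
import java.util.Random;

public class HotSpotDemo {
    /** Hypothetical hand-rolled O(n^2) insertion sort — the kind of
        "custom sorting" a flame graph flags on large inputs. */
    static void insertionSort(int[] a) {
        for (int i = 1; i < a.length; i++) {
            int key = a[i], j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
    }

    public static void main(String[] args) {
        int[] data = new Random(42).ints(30_000, 0, 1_000_000).toArray();
        int[] copy = data.clone();

        long t0 = System.nanoTime();
        insertionSort(data);   // quadratic: the flame-graph hot spot
        long t1 = System.nanoTime();
        Arrays.sort(copy);     // O(n log n) library sort
        long t2 = System.nanoTime();

        System.out.printf("custom=%dms library=%dms%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000);
        System.out.println(Arrays.equals(data, copy)); // same result, very different cost
    }
}
```

Both produce identical output, which is what makes this class of bug invisible in functional tests and visible only under profiling.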
3. Application‑Level Optimization: Refactor the Algorithm
I rewrote the sorting logic using Java 8 parallel streams:
<code>List<Data> sortedData = data.parallelStream()
        .sorted(Comparator.comparing(Data::getKey))
        .collect(Collectors.toList());</code>
Additionally, I added caching to avoid repeated computation:
<code>@Cacheable("sortedData")
public List<Data> getSortedData() {
    // optimized sorting logic
}</code>
4. Database Optimization: Index and Query Improvements
I examined slow SQL with <code>EXPLAIN</code>:
<code>EXPLAIN SELECT * FROM large_table WHERE status = 'ACTIVE';</code>
The plan showed a full table scan, so I created an index:
<code>CREATE INDEX idx_status ON large_table(status);</code>
I also replaced some ORM queries with native SQL for better performance:
<code>@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);</code>
5. Deployment Optimization: Resource Isolation
To prevent a single service from affecting the whole system, I containerized the application with Docker:
<code>FROM openjdk:11-jre-slim
COPY target/myapp.jar app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]</code>
Using Docker Compose, I limited CPU and memory:
<code>version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M</code>
6. Monitoring & Alerting: Prevent Recurrence
I upgraded the monitoring stack with Prometheus and Grafana and added a smarter alert rule:
<code>- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"</code>
Conclusion: Crisis and Growth
After nearly four hours of intensive work, CPU usage dropped below 30%, and response times returned to millisecond levels. The incident reinforced the importance of regular code reviews, performance testing, and robust monitoring to avoid technical debt and ensure system reliability.
Efficient Ops
This public account is maintained by Xiaotianguo and friends and regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.