Backend Development

From CPU Alert to Resolution: A Step‑by‑Step Backend Performance Debugging Guide

This article recounts a midnight CPU alert incident and walks through systematic backend troubleshooting—from initial system checks and JVM profiling to algorithm refactoring, database indexing, Docker‑based isolation, and proactive monitoring—demonstrating how to restore service performance and prevent future outages.


Preface

At 4 a.m. I was jolted awake by a phone alarm: an online CPU alert. As the person responsible for the core system, I immediately grasped what was at stake: degraded user experience, possible data loss, and my year‑end performance review.

1. Initial Diagnosis: Quick Problem Localization

I logged into the server and ran $ top to view system resource usage. The output showed CPU usage near 100 % and a load average far exceeding the number of CPU cores.

Next, I used $ htop for more detailed process information and discovered several Java processes consuming a large amount of CPU; these were our core services.
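Beyond top and htop, the same saturation signal can be read from inside the JVM via the standard java.lang.management API. A minimal sketch (the CpuCheck class name and the saturation heuristic are mine, for illustration):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class CpuCheck {
    // A load average well above the core count is the same signal top gave us
    public static boolean saturated(double loadAverage, int cores) {
        return loadAverage > cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        // 1-minute load average, comparable to the first figure in top's output (-1 if unsupported)
        double load = os.getSystemLoadAverage();
        int cores = os.getAvailableProcessors();
        System.out.println("load=" + load + " cores=" + cores
                + " saturated=" + saturated(load, cores));
    }
}
```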

2. JVM‑Level Analysis: Finding Hot Methods

After confirming the issue lay in the Java application, I inspected the JVM. Running $ jstat -gcutil [PID] 1000 10 revealed frequent Full GCs that could be driving the high CPU usage.
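The counters jstat prints are also exposed in-process through JMX, which is handy when shell access is restricted. A small sketch (the GcStats class and totalCollections helper are illustrative, not part of our service):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcStats {
    // Sum of GC events across all collectors, the same counters behind jstat's YGC/FGC columns
    public static long totalCollections() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // getCollectionCount() may return -1 if the collector doesn't track it
            total += Math.max(0, gc.getCollectionCount());
        }
        return total;
    }

    public static void main(String[] args) {
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: count=%d time=%dms%n",
                    gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        System.out.println("total collections: " + totalCollections());
    }
}
```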

I then generated a thread dump with $ jstack [PID] > thread_dump.txt. The dump showed many threads in RUNNABLE state executing similar method calls.
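The same view jstack provides can be queried programmatically with ThreadMXBean, useful for wiring thread-state counts into health checks. A sketch (the countInState helper is hypothetical):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

public class RunnableThreads {
    // Count live threads currently in the given state
    public static long countInState(Thread.State state) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        long count = 0;
        // false/false: skip lock and synchronizer details, we only need states
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == state) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("RUNNABLE threads: " + countInState(Thread.State.RUNNABLE));
    }
}
```

The calling thread itself is always RUNNABLE, so the count is at least one.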

To pinpoint the hot method, I used async‑profiler:

$ ./profiler.sh -d 30 -f cpu_profile.svg [PID]

The resulting flame graph clearly indicated a custom sorting algorithm consuming most of the CPU time.

3. Application‑Level Optimization: Refactoring the Algorithm

Having identified the culprit, I examined the custom sorting algorithm's code; it had been designed for small data sets and did not scale.

I refactored it to use Java 8 parallel streams:

List<Data> sortedData = data.parallelStream()
    .sorted(Comparator.comparing(Data::getKey))
    .collect(Collectors.toList());
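The real Data class isn't shown here; a self-contained version of the refactor, with a stand-in Data record, might look like:

```java
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

public class SortDemo {
    // Stand-in for the real domain class (the original Data type is not shown)
    public record Data(int key) {
        public int getKey() { return key; }
    }

    public static List<Data> sortByKey(List<Data> data) {
        // parallelStream() fans the sort out across the common ForkJoinPool
        return data.parallelStream()
                .sorted(Comparator.comparing(Data::getKey))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Data> sorted = sortByKey(List.of(new Data(3), new Data(1), new Data(2)));
        System.out.println(sorted);
    }
}
```

Parallel streams only pay off on large inputs; for small lists the fork/join overhead can outweigh the gain, which is worth benchmarking before committing.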

I also added a caching layer to avoid repeated calculations:

@Cacheable("sortedData")
public List<Data> getSortedData() {
    // Optimized sorting logic
}
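Note that @Cacheable is a no-op unless caching is enabled in the Spring context. A minimal configuration (assuming Spring's simple in-memory cache; the CacheConfig class name is mine) might be:

```java
import org.springframework.cache.CacheManager;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.concurrent.ConcurrentMapCacheManager;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
@EnableCaching
public class CacheConfig {
    @Bean
    public CacheManager cacheManager() {
        // Simple in-memory cache backing the "sortedData" cache name used above
        return new ConcurrentMapCacheManager("sortedData");
    }
}
```

In production you would typically back this with a real cache store and an eviction policy rather than an unbounded in-memory map.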

4. Database Optimization: Index and Query Improvements

During the investigation I also found inefficient database queries: EXPLAIN SELECT * FROM large_table WHERE status='ACTIVE'; showed a full table scan.

I created an appropriate index:

CREATE INDEX idx_status ON large_table(status);

and rewrote some ORM queries to use native SQL for better performance:

@Query(value = "SELECT * FROM large_table WHERE status = :status", nativeQuery = true)
List<LargeTable> findByStatus(@Param("status") String status);

5. Deployment Optimization: Resource Isolation

To prevent a single service from affecting the whole system, I containerized the application with Docker. The Dockerfile looks like:

FROM openjdk:11-jre-slim
COPY target/myapp.jar /app.jar
ENTRYPOINT ["java", "-Xmx2g", "-jar", "/app.jar"]

I then used Docker Compose to limit CPU and memory resources:

version: '3'
services:
  myapp:
    build: .
    deploy:
      resources:
        limits:
          cpus: '0.50'
          memory: 512M

6. Monitoring & Alerting: Proactive Prevention

Finally, I upgraded the monitoring stack with Prometheus and Grafana and added a smarter alert rule:

- alert: HighCPUUsage
  expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "High CPU usage detected"
    description: "CPU usage is above 80% for more than 5 minutes"
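For Prometheus to scrape application-level metrics, the service has to expose them in the text exposition format; real projects use a client library such as Micrometer, but purely to illustrate what a scrape endpoint returns, here is a stdlib-only sketch (the port, path, and metric name are arbitrary choices of mine):

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MetricsEndpoint {
    // Render a minimal Prometheus text-format payload (illustrative, not a full exporter)
    public static String scrape() {
        double load = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
        return "# TYPE app_load_average gauge\n"
                + "app_load_average " + load + "\n";
    }

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(9091), 0);
        server.createContext("/metrics", exchange -> {
            byte[] body = scrape().getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        System.out.println("metrics on :9091/metrics");
    }
}
```

Prometheus would then scrape this target and the alert rule above could be extended with application-level expressions.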

Conclusion: Crisis and Growth

After nearly four hours of effort, the system returned to normal: CPU usage below 30 % and response times back at the millisecond level. The incident underscored the importance of regular code reviews, performance and load testing, and robust monitoring to keep technical debt in check and ensure system reliability.

Java · JVM · Monitoring · Performance · Docker · Database
Written by

DevOps Operations Practice

We share professional insights on cloud-native, DevOps & operations, Kubernetes, observability & monitoring, and Linux systems.
