How to Resolve 100% CPU Outages in Under 3 Minutes: A Real‑World Emergency Guide

This article walks through a real‑world 100% CPU incident on an e‑commerce platform, showing how to detect the problem within seconds, analyze Java threads, apply quick emergency fixes, implement permanent refactoring, and set up long‑term monitoring to prevent future outages.

MaGe Linux Operations

🚨 Handling a 100% CPU Production Incident: The Ultimate Guide to Locating the Problem in 3 Minutes

Real‑case background: Shortly after 2:00 AM the monitoring alarm fired, the e‑commerce site slowed down, and CPU usage hit 100 %. You have three minutes to find the root cause or the business takes a heavy loss.

As an ops engineer with eight years of experience, I will walk through a typical troubleshooting process for 100% CPU incidents.

💡 Symptom: Sudden Drop in User Experience

Timeline:

02:15 – Monitoring alarm: CPU > 95 % continuously.

02:16 – User feedback: page load > 10 s.

02:17 – Operations notice: order volume plummets.

02:18 – Emergency investigation starts.

Key metric anomaly:

# System load abnormally high
load average: 8.5, 7.2, 6.8
# Normal should be below 2
# CPU usage
%Cpu(s): 98.2 us, 1.2 sy, 0.0 ni, 0.6 id
# Memory usage is normal
KiB Mem : 16GB total, 2GB free

🔍 Step 1 – Quickly Identify the CPU Consumer (within 30 seconds)

Use top for initial inspection

# Sort by CPU usage, refresh in real time
top -o %CPU

# Sample output
PID    USER   PR  NI   VIRT   RES   SHR S %CPU %MEM   TIME+ COMMAND
12847  www    20   0   2.2g   1.8g  12m R 89.5 11.2   145:32 java
8934   mysql  20   0   1.6g   800m  32m S  8.2  5.1    23:45 mysqld
3421   nginx  20   0   128m    45m   8m S  1.2  0.3     2:34 nginx

Key finding: the Java process (PID 12847) consumes 89.5 % CPU.

Drill into Java threads

# Show threads of the Java process
top -H -p 12847

# Important threads
PID    USER   PR  NI   VIRT   RES   SHR S %CPU %MEM   TIME+ COMMAND
12851  www    20   0   2.2g   1.8g  12m R 45.2 11.2   89:23 java
12856  www    20   0   2.2g   1.8g  12m R 44.3 11.2   78:45 java
12863  www    20   0   2.2g   1.8g  12m S  2.1 11.2    5:34 java

Important clue: threads 12851 and 12856 together consume almost 90 % CPU.
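The same per‑thread view can also be taken from inside the JVM. The following is a minimal sketch, assuming the standard java.lang.management.ThreadMXBean and a JVM that supports thread CPU time measurement. The class name HotThreads and the method topByCpu are illustrative and not part of the original incident tooling.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: ranks live JVM threads by accumulated CPU time.
class HotThreads {

    // One row per thread: name plus CPU time consumed so far, in milliseconds.
    static final class Usage {
        final String threadName;
        final long cpuMillis;

        Usage(String threadName, long cpuMillis) {
            this.threadName = threadName;
            this.cpuMillis = cpuMillis;
        }
    }

    // Returns at most `limit` threads, highest CPU time first.
    static List<Usage> topByCpu(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Usage> rows = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return rows; // CPU time is an optional JVM feature; report nothing rather than guess
        }
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id);   // -1 if the thread is no longer alive
            ThreadInfo info = mx.getThreadInfo(id); // null if the thread is no longer alive
            if (nanos >= 0 && info != null) {
                rows.add(new Usage(info.getThreadName(), nanos / 1_000_000));
            }
        }
        rows.sort(Comparator.comparingLong((Usage u) -> u.cpuMillis).reversed());
        return rows.subList(0, Math.min(limit, rows.size()));
    }

    public static void main(String[] args) {
        for (Usage u : topByCpu(5)) {
            System.out.println(u.cpuMillis + " ms  " + u.threadName);
        }
    }
}
```

The values are cumulative CPU time since each thread started, not a percentage like the %CPU column of top. Calling topByCpu twice a few seconds apart and comparing the results gives the current rate.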

💻 Step 2 – Pinpoint Problem Code (within 2 minutes)

Obtain Java thread stack traces

# Convert thread IDs to hex (jstack shows them as nid=0x...)
printf "0x%x\n" 12851   # → 0x3233
printf "0x%x\n" 12856   # → 0x3238

# Dump full Java stack
jstack 12847 > /tmp/java_stack.txt

# Search for the threads
grep -A 20 "0x3233" /tmp/java_stack.txt
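For scripted diagnosis the decimal‑to‑hex step can be done in code as well. This is a small sketch using only the JDK; the class name ThreadIds and its two methods are hypothetical names chosen for illustration.

```java
// Hypothetical helper: converts between the decimal thread ID shown by `top -H`
// and the hexadecimal nid shown in a jstack dump.
class ThreadIds {

    // Decimal thread ID to jstack-style nid, e.g. 12851 becomes "0x3233".
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    // jstack-style nid back to the decimal thread ID, e.g. "0x3233" becomes 12851.
    static int fromNid(String nid) {
        return Integer.parseInt(nid.substring(2), 16);
    }

    public static void main(String[] args) {
        System.out.println(toNid(12851)); // prints 0x3233
        System.out.println(toNid(12856)); // prints 0x3238
    }
}
```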

Stack analysis result

"pool-2-thread-1" #23 prio=5 os_prio=0 nid=0x3233 runnable
    at com.company.service.OrderService.calculateDiscount(OrderService.java:245)
    at com.company.service.OrderService.processOrder(OrderService.java:189)
    at com.company.controller.OrderController.submitOrder(OrderController.java:67)
    - locked <0x000000076ab62208> (a java.lang.Object)

"pool-2-thread-2" #24 prio=5 os_prio=0 nid=0x3238 runnable
    at com.company.service.OrderService.calculateDiscount(OrderService.java:245)
    - waiting to lock <0x000000076ab62208> (a java.lang.Object)

Problem located at OrderService.calculateDiscount line 245.

Lock contention: multiple threads compete for the same lock.

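The dump labels both threads as runnable, but the second one is actually waiting for the lock held by the first. In a real jstack dump a thread waiting to enter a monitor is normally reported as BLOCKED (on object monitor), so treat the state label with care and follow the "locked" and "waiting to lock" lines to see who holds what.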
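To see what monitor contention looks like from the JVM's point of view, the situation can be reproduced in a few lines. This is a self‑contained sketch, not the production code: the class MonitorContentionDemo and its thread names are made up, and it only shows that a thread waiting to enter a synchronized block is reported in the BLOCKED state.

```java
import java.util.concurrent.CountDownLatch;

// Hypothetical demo: one thread holds a monitor, a second one tries to enter it.
class MonitorContentionDemo {

    // Returns the state observed for the waiting thread while the lock is held.
    static Thread.State observeWaiterState() {
        final Object lock = new Object();
        CountDownLatch holderHasLock = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);

        Thread holder = new Thread(() -> {
            synchronized (lock) {
                holderHasLock.countDown();
                try {
                    release.await(); // keep the monitor until the observation is done
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        }, "holder");

        Thread waiter = new Thread(() -> {
            synchronized (lock) {
                // nothing to do: entering the monitor is the point
            }
        }, "waiter");

        try {
            holder.start();
            holderHasLock.await();
            waiter.start();

            // Poll until the waiter has reached the monitor (bounded to about 5 s).
            Thread.State state = waiter.getState();
            for (int i = 0; i < 500 && state != Thread.State.BLOCKED; i++) {
                Thread.sleep(10);
                state = waiter.getState();
            }

            release.countDown();
            holder.join();
            waiter.join();
            return state;
        } catch (InterruptedException e) {
            release.countDown();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while observing", e);
        }
    }

    public static void main(String[] args) {
        System.out.println("waiter state: " + observeWaiterState());
    }
}
```

A blocked thread does not burn CPU by itself, so when CPU is pinned at 100 % it is worth checking what the lock holder is doing inside the critical section.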

⚡ Step 3 – Emergency Fix (within 1 minute)

Temporary solution: restart + rate limiting + feature downgrade

# 1. Restart the application (only if a short downtime is acceptable)
systemctl restart your-app

# 2. Enable Nginx rate limiting
# In the http {} block: allow 10 requests/s per client IP
limit_req_zone $binary_remote_addr zone=order:10m rate=10r/s;

# In the matching server {} block
location /api/order {
    limit_req zone=order burst=20 nodelay;
    proxy_pass http://backend;
}

# Validate the configuration, then reload Nginx
nginx -t && nginx -s reload

# 3. Temporarily disable coupon validation (business downgrade)
curl -X PUT http://config-center/api/features/coupon-validation \
    -H 'Content-Type: application/json' \
    -d '{"enabled": false}'
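The Nginx documentation describes limit_req as a leaky‑bucket limiter. To make the effect of rate=10r/s with burst=20 and nodelay more concrete, here is a simplified in‑process analogue. It is a sketch under assumptions, not Nginx's implementation: the class BurstLimiter is hypothetical, and the clock is passed in as a parameter so the behaviour can be checked without sleeping.

```java
// Hypothetical, simplified analogue of "rate=10r/s burst=20 nodelay".
class BurstLimiter {
    private final double ratePerSecond; // steady rate at which the bucket drains
    private final int burst;            // how many requests may run ahead of that rate
    private double excess;              // requests currently ahead of the steady rate
    private long lastNanos;
    private boolean first = true;

    BurstLimiter(double ratePerSecond, int burst) {
        this.ratePerSecond = ratePerSecond;
        this.burst = burst;
    }

    // Returns true if the request arriving at nowNanos is admitted.
    synchronized boolean tryAcquire(long nowNanos) {
        if (first) {
            first = false;
            lastNanos = nowNanos;
            excess = 0.0;
            return true;
        }
        double elapsedSeconds = (nowNanos - lastNanos) / 1e9;
        // The bucket drains at the configured rate, but never below empty.
        double drained = Math.max(0.0, excess - ratePerSecond * elapsedSeconds);
        double next = drained + 1.0;
        if (next > burst) {
            return false; // above the rate and the burst allowance is used up
        }
        excess = next;
        lastNanos = nowNanos;
        return true;
    }

    public static void main(String[] args) {
        BurstLimiter limiter = new BurstLimiter(10.0, 20);
        int admitted = 0;
        for (int i = 0; i < 30; i++) {
            if (limiter.tryAcquire(0L)) {
                admitted++;
            }
        }
        System.out.println("admitted at t=0: " + admitted);
    }
}
```

In this model a sudden spike is admitted up to the burst allowance and everything beyond it is rejected immediately, which is what protects the backend while the root cause is still being investigated.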

🛠️ Step 4 – Permanent Fix

Refactor: async + fine‑grained lock

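import java.math.BigDecimal;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);

    private final RedisTemplate<String, Object> redisTemplate;
    private final CouponValidationService couponService;

    public OrderService(RedisTemplate<String, Object> redisTemplate,
                        CouponValidationService couponService) {
        this.redisTemplate = redisTemplate;
        this.couponService = couponService;
    }

    // The JVM-wide synchronized block is replaced by a per-user Redis lock
    public CompletableFuture<BigDecimal> calculateDiscountAsync(Order order) {
        return CompletableFuture.supplyAsync(() -> {
            String lockKey = "discount_calc_" + order.getUserId();
            // SET NX with a 5 s TTL in one atomic call, so a crash between
            // "set" and "expire" cannot leave a lock that never expires
            Boolean lockAcquired = redisTemplate.opsForValue()
                    .setIfAbsent(lockKey, "1", Duration.ofSeconds(5));
            if (!Boolean.TRUE.equals(lockAcquired)) {
                // Another request for this user holds the lock: degrade instead of waiting
                return getDefaultDiscount(order);
            }
            try {
                return doCalculateDiscount(order);
            } finally {
                // Release only on the path that actually acquired the lock
                redisTemplate.delete(lockKey);
            }
        });
    }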

    private BigDecimal doCalculateDiscount(Order order) {
        // 1. Check cache
        String cacheKey = "discount_" + order.getCouponCode();
        BigDecimal cached = (BigDecimal) redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;

        // 2. Async call to third‑party API with timeout
        CompletableFuture<CouponValidationResult> apiCall =
            couponService.validateCouponAsync(order.getCouponCode())
                .orTimeout(2, TimeUnit.SECONDS)
                .exceptionally(ex -> {
                    log.warn("Coupon validation timeout, using default", ex);
                    return CouponValidationResult.defaultResult();
                });
        try {
            CouponValidationResult result = apiCall.get();
            BigDecimal discount = calculateFinalDiscount(result, order);
            // 3. Cache result
            redisTemplate.opsForValue().set(cacheKey, discount, Duration.ofMinutes(10));
            return discount;
        } catch (Exception e) {
            log.error("Discount calculation error", e);
            return getDefaultDiscount(order);
        }
    }
}
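The timeout‑plus‑fallback pattern used for the coupon API can be tried in isolation with nothing but the JDK (Java 9 or later, because of orTimeout). This is a minimal sketch: TimeoutFallbackDemo, discountOrDefault and DEFAULT_DISCOUNT are hypothetical names, and the remote call is simulated by a future that never completes.

```java
import java.math.BigDecimal;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical demo of "wait a bounded time, then fall back to a default".
class TimeoutFallbackDemo {
    static final BigDecimal DEFAULT_DISCOUNT = BigDecimal.ZERO;

    // Waits at most timeoutMillis for the remote result, then returns the default.
    static BigDecimal discountOrDefault(CompletableFuture<BigDecimal> remoteCall, long timeoutMillis) {
        return remoteCall
                .orTimeout(timeoutMillis, TimeUnit.MILLISECONDS) // fails the future with TimeoutException
                .exceptionally(ex -> DEFAULT_DISCOUNT)           // covers timeouts and other failures
                .join();
    }

    public static void main(String[] args) {
        CompletableFuture<BigDecimal> fast = CompletableFuture.completedFuture(new BigDecimal("15.00"));
        CompletableFuture<BigDecimal> hung = new CompletableFuture<>(); // simulates an API that never answers

        System.out.println(discountOrDefault(fast, 200));
        System.out.println(discountOrDefault(hung, 200));
    }
}
```

The caller is held for at most the timeout, so a slow third party can no longer pin worker threads for the full duration of its outage.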

Performance monitoring improvement

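// Timed is the project's own marker annotation; use its fully qualified name
// in the pointcut if it lives in a different package than this aspect
@Around("@annotation(Timed)")
public Object logExecutionTime(ProceedingJoinPoint joinPoint) throws Throwable {
    long start = System.nanoTime();
    try {
        return joinPoint.proceed();
    } finally {
        // Runs on success and on exception, so slow failures are logged too
        long execMs = (System.nanoTime() - start) / 1_000_000;
        if (execMs > 1000) {
            log.warn("Method execution too long: {} ms, method: {}", execMs, joinPoint.getSignature());
        }
    }
}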

📊 Step 5 – Effect Verification & Long‑Term Monitoring

Before‑after comparison

Key indicators improved markedly after the fix: CPU usage fell from 98 % to 25 % (‑73 percentage points), response time fell from 8‑12 s to 200‑500 ms (about ‑95 %), throughput rose from 10 TPS to 200 TPS (+1900 %), and system load fell from 8.5 to 1.2 (‑86 %).
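The percentages above mix two conventions: the CPU figure is a difference in percentage points (98 minus 25), while the others are relative changes. A short sketch makes the arithmetic explicit; the class ChangeMath and its method are illustrative only.

```java
// Hypothetical helper for before/after comparisons.
class ChangeMath {

    // Relative change in percent; a negative result means the metric went down.
    static double percentChange(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        System.out.println("CPU usage:  " + Math.round(percentChange(98, 25)) + " %");   // prints -74 %
        System.out.println("Throughput: " + Math.round(percentChange(10, 200)) + " %");  // prints 1900 %
        System.out.println("Load:       " + Math.round(percentChange(8.5, 1.2)) + " %"); // prints -86 %
    }
}
```

Expressed as a relative change, the CPU drop is about ‑74 %, which is close to but not the same as the 73‑point difference.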

Establish alert rules

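# Prometheus alert rules
# Metric names depend on the exporter in use; adjust cpu_usage_percent and
# jvm_threads_blocked_count to the names your setup actually exposes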
groups:
- name: cpu_alerts
  rules:
  - alert: HighCPUUsage
    expr: cpu_usage_percent > 80
    for: 2m
    annotations:
      summary: "Server CPU usage too high"
      description: "CPU usage reached {{ $value }}% for more than 2 minutes"

  - alert: JavaThreadBlocked
    expr: jvm_threads_blocked_count > 10
    for: 1m
    annotations:
      summary: "Java thread blockage abnormal"
      description: "Blocked thread count: {{ $value }}"
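The for: clause in these rules means the condition has to hold continuously for the given duration before the alert fires, which filters out short spikes. The idea can be sketched in a few lines; SustainedThreshold is a hypothetical class for illustration and a simplification of how Prometheus evaluates rules.

```java
// Hypothetical evaluator mimicking the "value > threshold, for: 2m" pattern.
class SustainedThreshold {
    private final double threshold;
    private final long holdMillis;
    private long breachStart = -1; // -1 means the value is currently not above the threshold

    SustainedThreshold(double threshold, long holdMillis) {
        this.threshold = threshold;
        this.holdMillis = holdMillis;
    }

    // Feed one sample; returns true once the value has stayed above
    // the threshold for at least holdMillis without interruption.
    boolean observe(double value, long nowMillis) {
        if (value <= threshold) {
            breachStart = -1; // any sample at or below the threshold resets the timer
            return false;
        }
        if (breachStart < 0) {
            breachStart = nowMillis;
        }
        return nowMillis - breachStart >= holdMillis;
    }

    public static void main(String[] args) {
        SustainedThreshold cpuAlert = new SustainedThreshold(80.0, 120_000);
        System.out.println(cpuAlert.observe(95.0, 0));       // prints false
        System.out.println(cpuAlert.observe(96.0, 60_000));  // prints false
        System.out.println(cpuAlert.observe(97.0, 120_000)); // prints true
    }
}
```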

💰 Business Impact Summary

Incident handling time: reduced from ~30 minutes to 3 minutes.

User experience: page response time dropped from 10 s to 0.5 s.

Business loss avoided: an estimated 500k CNY of orders per hour.

🎯 Experience Summary: 5 Golden Rules for Ops Engineers

1. Build layered monitoring

# System layer: CPU/Memory/Disk/Network, load average, process state
# Application layer: JVM heap, GC, thread state, API latency, error rate, TPS
# Business layer: key business metrics, user behavior anomalies

2. Master rapid‑diagnosis toolchain

# CPU troubleshooting triad
top → jstack → code analysis

# Common commands
ps aux | grep java   # find Java process
top -H -p <pid>      # view threads
jstack <pid> | grep -A 10 <nid-hex>   # show the stack of one thread
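The last step of the triad, pulling one thread's section out of a saved dump, is easy to script. The following sketch works on the dump text only and assumes the usual jstack layout of a quoted thread name on the header line and a blank line between threads; JstackGrep and stackFor are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: extracts one thread's section from a jstack dump by nid.
class JstackGrep {

    // Returns the header line carrying nid=<nid> and the lines after it,
    // up to (not including) the next blank line. Empty list if not found.
    static List<String> stackFor(String dump, String nid) {
        Pattern header = Pattern.compile("\\bnid=" + Pattern.quote(nid) + "\\b");
        List<String> section = new ArrayList<>();
        boolean inside = false;
        for (String line : dump.split("\\R")) {
            if (!inside) {
                if (line.startsWith("\"") && header.matcher(line).find()) {
                    inside = true;
                    section.add(line);
                }
            } else if (line.trim().isEmpty()) {
                break; // blank line marks the end of this thread's section
            } else {
                section.add(line);
            }
        }
        return section;
    }

    public static void main(String[] args) {
        String dump =
            "\"pool-2-thread-1\" #23 prio=5 os_prio=0 nid=0x3233 runnable\n"
          + "    at com.company.service.OrderService.calculateDiscount(OrderService.java:245)\n"
          + "\n"
          + "\"pool-2-thread-2\" #24 prio=5 os_prio=0 nid=0x3238 runnable\n"
          + "    at com.company.service.OrderService.calculateDiscount(OrderService.java:245)\n";
        stackFor(dump, "0x3233").forEach(System.out::println);
    }
}
```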

3. Standardize emergency playbook

2 min – confirm the problem and localize it roughly

5 min – apply temporary solution

30 min – root‑cause analysis and permanent fix

1 h – post‑mortem and preventive measures

4. Emphasize code performance review

Lock usage principle: minimize granularity and hold time.

Asynchronous refactor: move time‑consuming operations out of locks.

Cache strategy: multi‑level cache to avoid repeated calculations.
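These three principles can be combined in a small in‑process cache that never holds a lock while the slow lookup runs. This is a sketch under assumptions: DiscountCache is a hypothetical class, and the slow lookup stands in for something like the third‑party coupon call.

```java
import java.math.BigDecimal;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

// Hypothetical cache: the slow lookup runs outside any lock,
// only the short map operations are synchronized internally.
class DiscountCache {
    private final ConcurrentMap<String, BigDecimal> cache = new ConcurrentHashMap<>();
    private final Function<String, BigDecimal> slowLookup;

    DiscountCache(Function<String, BigDecimal> slowLookup) {
        this.slowLookup = slowLookup;
    }

    BigDecimal discountFor(String couponCode) {
        BigDecimal cached = cache.get(couponCode);
        if (cached != null) {
            return cached; // fast path: no slow call at all
        }
        BigDecimal fresh = slowLookup.apply(couponCode); // slow part, no lock held
        BigDecimal raced = cache.putIfAbsent(couponCode, fresh);
        return raced != null ? raced : fresh; // keep whichever value was stored first
    }

    public static void main(String[] args) {
        DiscountCache cache = new DiscountCache(code -> new BigDecimal("5.00"));
        System.out.println(cache.discountFor("SPRING"));
        System.out.println(cache.discountFor("SPRING")); // served from the cache
    }
}
```

The trade‑off is that two concurrent misses for the same coupon may both perform the slow lookup once. That is usually cheaper than making every request queue behind one lock.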

5. Build knowledge base and toolbox

Maintain a fault‑case library with diagnosis steps.

Automate diagnostic and remediation scripts.

Provide visual monitoring dashboards.
