How to Resolve 100% CPU Outages in Under 3 Minutes: A Real‑World Emergency Guide
This article walks through a real‑world 100% CPU incident on an e‑commerce platform, showing how to detect the problem within seconds, analyze Java threads, apply quick emergency fixes, implement permanent refactoring, and set up long‑term monitoring to prevent future outages.
🚨 Production CPU at 100%: An Emergency Guide to Locating the Problem in 3 Minutes
Real‑case background: Shortly after 2 AM the monitoring alarm fired, the e‑commerce site slowed down, and CPU usage hit 100 %. You have three minutes to find the root cause before the business loss becomes serious.
As an ops engineer with eight years of experience, I will walk through a typical CPU‑100% troubleshooting process.
💡 Symptom: Sudden Drop in User Experience
Timeline:
02:15 – Monitoring alarm: CPU > 95 % continuously.
02:16 – User feedback: page load > 10 s.
02:17 – Operations notice: order volume plummets.
02:18 – Emergency investigation starts.
Key metric anomaly:
# System load abnormally high
load average: 8.5, 7.2, 6.8
# Normal should be below 2
# CPU usage
%Cpu(s): 98.2 us, 1.2 sy, 0.0 ni, 0.6 id
# Memory usage is normal
KiB Mem : 16GB total, 2GB free
🔍 Step 1 – Quickly Identify the CPU Consumer (within 30 seconds)
Use top for initial inspection
# Sort by CPU usage, refresh in real time
top -o %CPU
# Sample output
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12847 www 20 0 2.2g 1.8g 12m R 89.5 11.2 145:32 java
8934 mysql 20 0 1.6g 800m 32m S 8.2 5.1 23:45 mysqld
3421 nginx 20 0 128m 45m 8m S 1.2 0.3 2:34 nginx
Key finding: the Java process (PID 12847) consumes 89.5 % CPU.
Drill into Java threads
# Show threads of the Java process
top -H -p 12847
# Important threads
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
12851 www 20 0 2.2g 1.8g 12m R 45.2 11.2 89:23 java
12856 www 20 0 2.2g 1.8g 12m R 44.3 11.2 78:45 java
12863 www 20 0 2.2g 1.8g 12m S 2.1 11.2 5:34 java
Important clue: threads 12851 and 12856 together consume almost 90 % CPU.
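The hand-off from the thread view to the stack dump (decimal thread IDs from `top -H` become the hex `nid` values that jstack prints) can be sketched in Java. This is a minimal illustration, not part of the original incident tooling: the class `HotThreads`, its method names, and the 10 % threshold are assumptions, and the parser expects the default `top` column layout shown above.

```java
import java.util.ArrayList;
import java.util.List;

class HotThreads {
    /** Converts a decimal OS thread ID (PID column of `top -H`) to the hex nid form used by jstack. */
    static String toNid(int tid) {
        return "0x" + Integer.toHexString(tid);
    }

    /** Returns the nids of all threads whose %CPU column is at or above the threshold. */
    static List<String> hotNids(List<String> topRows, double minCpuPercent) {
        List<String> nids = new ArrayList<>();
        for (String row : topRows) {
            String[] cols = row.trim().split("\\s+");
            if (cols.length < 12) {
                continue; // not a full data row
            }
            try {
                int tid = Integer.parseInt(cols[0]);       // column 0: thread ID
                double cpu = Double.parseDouble(cols[8]);  // column 8: %CPU
                if (cpu >= minCpuPercent) {
                    nids.add(toNid(tid));
                }
            } catch (NumberFormatException e) {
                // header row or malformed line: skip it
            }
        }
        return nids;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND",
            "12851 www 20 0 2.2g 1.8g 12m R 45.2 11.2 89:23 java",
            "12856 www 20 0 2.2g 1.8g 12m R 44.3 11.2 78:45 java",
            "12863 www 20 0 2.2g 1.8g 12m S 2.1 11.2 5:34 java");
        System.out.println(hotNids(rows, 10.0)); // prints [0x3233, 0x3238]
    }
}
```

The two values it prints are the ones to search for in the jstack output.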
💻 Step 2 – Pinpoint Problem Code (within 2 minutes)
Obtain Java thread stack traces
# Convert thread IDs to hex (jstack reports them as nid)
printf "0x%x\n" 12851 # → 0x3233
printf "0x%x\n" 12856 # → 0x3238
# Dump full Java stack
jstack 12847 > /tmp/java_stack.txt
# Search for the threads
grep -A 20 "0x3233" /tmp/java_stack.txt
Stack analysis result
"pool-2-thread-1" #23 prio=5 os_prio=0 nid=0x3233 runnable
at com.company.service.OrderService.calculateDiscount(OrderService.java:245)
at com.company.service.OrderService.processOrder(OrderService.java:189)
at com.company.controller.OrderController.submitOrder(OrderController.java:67)
- locked <0x000000076ab62208> (a java.lang.Object)
"pool-2-thread-2" #24 prio=5 os_prio=0 nid=0x3238 runnable
at com.company.service.OrderService.calculateDiscount(OrderService.java:245)
- waiting to lock <0x000000076ab62208> (a java.lang.Object)
Problem located: OrderService.calculateDiscount, line 245.
Lock contention: multiple threads compete for the same lock (monitor 0x000000076ab62208).
Thread state: the dump lists both threads as runnable, but pool-2-thread-2 is actually waiting for the lock held by pool-2-thread-1.
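The manual reading of the dump above can be automated. The sketch below is a hypothetical helper: the class name `JstackContention` and its regular expressions are assumptions based only on the dump format shown here. It groups threads by monitor address and reports which thread holds each monitor and which ones wait for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JstackContention {
    private static final Pattern HEADER = Pattern.compile("^\"([^\"]+)\".*\\bnid=(0x[0-9a-f]+)");
    private static final Pattern LOCKED = Pattern.compile("- locked <(0x[0-9a-f]+)>");
    private static final Pattern WAITING = Pattern.compile("- waiting to lock <(0x[0-9a-f]+)>");

    /** Holder and waiters recorded for one monitor address. */
    static final class Monitor {
        String holder;
        final List<String> waiters = new ArrayList<>();
    }

    /** The two threads from the dump above, used as sample input. */
    static final String SAMPLE = String.join("\n",
        "\"pool-2-thread-1\" #23 prio=5 os_prio=0 nid=0x3233 runnable",
        "at com.company.service.OrderService.calculateDiscount(OrderService.java:245)",
        "- locked <0x000000076ab62208> (a java.lang.Object)",
        "\"pool-2-thread-2\" #24 prio=5 os_prio=0 nid=0x3238 runnable",
        "at com.company.service.OrderService.calculateDiscount(OrderService.java:245)",
        "- waiting to lock <0x000000076ab62208> (a java.lang.Object)");

    /** Maps each monitor address in the dump to its holder and waiting threads. */
    static Map<String, Monitor> analyze(String dump) {
        Map<String, Monitor> monitors = new LinkedHashMap<>();
        String current = null; // thread whose stack is being read
        for (String line : dump.split("\\R")) {
            Matcher header = HEADER.matcher(line);
            if (header.find()) {
                current = header.group(1) + " (nid=" + header.group(2) + ")";
                continue;
            }
            if (current == null) {
                continue; // preamble before the first thread
            }
            Matcher locked = LOCKED.matcher(line);
            if (locked.find()) {
                monitors.computeIfAbsent(locked.group(1), k -> new Monitor()).holder = current;
            }
            Matcher waiting = WAITING.matcher(line);
            if (waiting.find()) {
                monitors.computeIfAbsent(waiting.group(1), k -> new Monitor()).waiters.add(current);
            }
        }
        return monitors;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Monitor> e : analyze(SAMPLE).entrySet()) {
            Monitor m = e.getValue();
            System.out.println(e.getKey() + " held by " + m.holder + ", waited on by " + m.waiters);
        }
    }
}
```

A monitor with one holder and a growing list of waiters across successive dumps is the signature of the contention described here.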
⚡ Step 3 – Emergency Fix (within 1 minute)
Temporary solution: restart + rate‑limit + degrade
# 1. Restart the application (if short downtime is acceptable)
systemctl restart your-app
# 2. Enable Nginx rate limiting
# (limit_req_zone belongs in the http {} context, the location block inside a server {})
limit_req_zone $binary_remote_addr zone=order:10m rate=10r/s;
location /api/order {
    limit_req zone=order burst=20 nodelay;
    proxy_pass http://backend;
}
# Validate, then reload the Nginx configuration
nginx -t && nginx -s reload
# 3. Temporarily disable coupon validation (business downgrade)
curl -X PUT http://config-center/api/features/coupon-validation \
  -H 'Content-Type: application/json' \
  -d '{"enabled": false}'
🛠️ Step 4 – Permanent Fix
Refactor: async + fine‑grained lock
import java.math.BigDecimal;
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class OrderService {
    private static final Logger log = LoggerFactory.getLogger(OrderService.class);
    private final RedisTemplate<String, Object> redisTemplate;
    private final CouponValidationService couponService;

    public OrderService(RedisTemplate<String, Object> redisTemplate,
                        CouponValidationService couponService) {
        this.redisTemplate = redisTemplate;
        this.couponService = couponService;
    }

    // Remove synchronized, use a per-user Redis distributed lock
    public CompletableFuture<BigDecimal> calculateDiscountAsync(Order order) {
        return CompletableFuture.supplyAsync(() -> {
            String lockKey = "discount_calc_" + order.getUserId();
            // SET NX with a 5 s expiry in one atomic call, so a crash cannot leave the lock behind
            Boolean lockAcquired = redisTemplate.opsForValue()
                    .setIfAbsent(lockKey, "1", Duration.ofSeconds(5));
            if (!Boolean.TRUE.equals(lockAcquired)) {
                return getDefaultDiscount(order); // another request holds the lock
            }
            try {
                return doCalculateDiscount(order);
            } finally {
                redisTemplate.delete(lockKey); // release only a lock this request acquired
            }
        });
    }

    private BigDecimal doCalculateDiscount(Order order) {
        // 1. Check cache
        String cacheKey = "discount_" + order.getCouponCode();
        BigDecimal cached = (BigDecimal) redisTemplate.opsForValue().get(cacheKey);
        if (cached != null) return cached;
        // 2. Async call to third‑party API with timeout (orTimeout requires Java 9+)
        CompletableFuture<CouponValidationResult> apiCall =
                couponService.validateCouponAsync(order.getCouponCode())
                        .orTimeout(2, TimeUnit.SECONDS)
                        .exceptionally(ex -> {
                            log.warn("Coupon validation failed or timed out, using default", ex);
                            return CouponValidationResult.defaultResult();
                        });
        try {
            CouponValidationResult result = apiCall.get();
            BigDecimal discount = calculateFinalDiscount(result, order);
            // 3. Cache result
            redisTemplate.opsForValue().set(cacheKey, discount, Duration.ofMinutes(10));
            return discount;
        } catch (Exception e) {
            log.error("Discount calculation error", e);
            return getDefaultDiscount(order);
        }
    }
    // getDefaultDiscount(order) and calculateFinalDiscount(result, order) are not shown
}
Performance monitoring improvement
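@Aspect
@Component
public class ExecutionTimeAspect {
    private static final Logger log = LoggerFactory.getLogger(ExecutionTimeAspect.class);
    // "Timed" must resolve to your annotation type; use its fully qualified name if it lives in another package
    @Around("@annotation(Timed)")
    public Object logExecutionTime(ProceedingJoinPoint joinPoint) throws Throwable {
        long start = System.nanoTime();
        try {
            return joinPoint.proceed();
        } finally {
            // Runs even when the method throws, so slow failures are logged too
            long execMs = (System.nanoTime() - start) / 1_000_000;
            if (execMs > 1000) {
                log.warn("Method execution too long: {} ms, method: {}", execMs, joinPoint.getSignature());
            }
        }
    }
}
📊 Step 5 – Effect Verification & Long‑Term Monitoring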
Before‑after comparison
Key indicators improved dramatically after the fix:
CPU usage: 98 % → 25 % (down 73 percentage points).
Response time: 8‑12 s → 200‑500 ms (about ‑95 %).
Throughput: 10 TPS → 200 TPS (↑1900 %).
System load: 8.5 → 1.2 (‑86 %).
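The arithmetic behind these figures is worth making explicit, because two different measures are in play: a relative change (throughput, load, response time) and a difference in percentage points (CPU usage, which is already a percentage). A minimal sketch, with the class and method names chosen purely for illustration:

```java
import java.util.Locale;

class PercentChange {
    /** Relative change from before to after, in percent (negative means a decrease). */
    static double relative(double before, double after) {
        return (after - before) / before * 100.0;
    }

    /** Difference between two values that are already percentages, in percentage points. */
    static double points(double beforePercent, double afterPercent) {
        return afterPercent - beforePercent;
    }

    private static String fmt(double value) {
        return String.format(Locale.ROOT, "%+.1f", value);
    }

    public static void main(String[] args) {
        System.out.println("CPU usage:  " + fmt(points(98, 25)) + " points, "
                + fmt(relative(98, 25)) + " % relative");
        System.out.println("Throughput: " + fmt(relative(10, 200)) + " %");
        System.out.println("Load:       " + fmt(relative(8.5, 1.2)) + " %");
    }
}
```

For CPU usage this gives a 73-point drop, which is roughly 74.5 % in relative terms; throughput comes out at +1900 % and load at about ‑85.9 %, in line with the figures quoted.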
Establish alert rules
# Prometheus alert rules
groups:
  - name: cpu_alerts
    rules:
      - alert: HighCPUUsage
        expr: cpu_usage_percent > 80
        for: 2m
        annotations:
          summary: "Server CPU usage too high"
          description: "CPU usage reached {{ $value }}% for more than 2 minutes"
      - alert: JavaThreadBlocked
        expr: jvm_threads_blocked_count > 10
        for: 1m
        annotations:
          summary: "Java thread blockage abnormal"
          description: "Blocked thread count: {{ $value }}"
💰 Business Impact Summary
Incident handling time: reduced from ~30 minutes to 3 minutes.
User experience: page response time dropped from 10 s to 0.5 s.
Business loss avoided: an estimated 500 k CNY of orders per hour.
🎯 Experience Summary: 5 Golden Rules for Ops Engineers
1. Build layered monitoring
# System layer: CPU/Memory/Disk/Network, load average, process state
# Application layer: JVM heap, GC, thread state, API latency, error rate, TPS
# Business layer: key business metrics, user behavior anomalies
2. Master the rapid‑diagnosis toolchain
# CPU troubleshooting triad
top → jstack → code analysis
# Common commands
ps aux | grep java # find Java process
top -H -p <pid> # view threads
jstack <pid> | grep -A 10 <nid> # analyze the stack of one thread (nid = thread ID in hex)
3. Standardize the emergency playbook
2 min – confirm the problem and roughly locate it
5 min – apply temporary solution
30 min – root‑cause analysis and permanent fix
1 h – post‑mortem and preventive measures
4. Emphasize code performance review
Lock usage principle: minimize granularity and hold time.
Asynchronous refactor: move time‑consuming operations out of locks.
Cache strategy: multi‑level cache to avoid repeated calculations.
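These three principles can be combined in a few lines. The sketch below is an in-process illustration only, not the article's Redis-based fix: the class `DiscountCache`, the `slowLookup` function, and the per-coupon lock map are assumptions. Cached values are served without any lock, and a cache miss blocks only callers asking for the same coupon.

```java
import java.math.BigDecimal;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class DiscountCache {
    private final Map<String, BigDecimal> cache = new ConcurrentHashMap<>();
    // One lock object per coupon code; entries are never evicted in this sketch.
    private final Map<String, Object> locks = new ConcurrentHashMap<>();
    // Stand-in for the expensive call (for example a remote validation); must not return null.
    private final Function<String, BigDecimal> slowLookup;

    DiscountCache(Function<String, BigDecimal> slowLookup) {
        this.slowLookup = slowLookup;
    }

    BigDecimal discountFor(String couponCode) {
        BigDecimal cached = cache.get(couponCode); // fast path: no lock taken
        if (cached != null) {
            return cached;
        }
        Object lock = locks.computeIfAbsent(couponCode, k -> new Object());
        synchronized (lock) { // only callers with the same coupon code wait here
            cached = cache.get(couponCode); // re-check: another thread may have filled it
            if (cached == null) {
                cached = slowLookup.apply(couponCode);
                cache.put(couponCode, cached);
            }
            return cached;
        }
    }

    public static void main(String[] args) {
        AtomicInteger lookups = new AtomicInteger();
        DiscountCache discounts = new DiscountCache(code -> {
            lookups.incrementAndGet();
            return new BigDecimal("0.10");
        });
        discounts.discountFor("SAVE10");
        discounts.discountFor("SAVE10");
        discounts.discountFor("WELCOME");
        System.out.println("slow lookups: " + lookups.get()); // prints slow lookups: 2
    }
}
```

Compared with a single `synchronized` method, a slow lookup for one coupon no longer stalls requests for every other coupon. A production version would also need expiry for both maps.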
5. Build knowledge base and toolbox
Maintain a fault‑case library with diagnosis steps.
Automate diagnostic and remediation scripts.
Provide visual monitoring dashboards.
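As one example of an automated diagnostic, part of the `top -H` plus jstack routine can run inside the JVM itself through the standard `java.lang.management` API. The sketch below is an assumption about how such a helper might look (the class `ThreadCpuReport` and its methods are illustrative names): it ranks live threads by accumulated CPU time and counts threads in the BLOCKED state.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class ThreadCpuReport {
    /** One row of the report: thread name, state, and CPU time consumed so far. */
    static final class Entry {
        final String name;
        final Thread.State state;
        final long cpuMillis;

        Entry(String name, Thread.State state, long cpuMillis) {
            this.name = name;
            this.state = state;
            this.cpuMillis = cpuMillis;
        }
    }

    /** Returns up to `limit` live threads, highest accumulated CPU time first. */
    static List<Entry> topThreads(int limit) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<Entry> entries = new ArrayList<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return entries; // CPU time measurement is unavailable on this JVM
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id);
            if (info == null || cpuNanos < 0) {
                continue; // thread ended between the two calls
            }
            entries.add(new Entry(info.getThreadName(), info.getThreadState(), cpuNanos / 1_000_000));
        }
        entries.sort(Comparator.comparingLong((Entry e) -> e.cpuMillis).reversed());
        return new ArrayList<>(entries.subList(0, Math.min(limit, entries.size())));
    }

    /** Counts threads currently blocked waiting to enter a monitor. */
    static long blockedCount() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long blocked = 0;
        for (ThreadInfo info : mx.getThreadInfo(mx.getAllThreadIds())) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                blocked++;
            }
        }
        return blocked;
    }

    public static void main(String[] args) {
        for (Entry e : topThreads(5)) {
            System.out.println(e.cpuMillis + " ms  " + e.state + "  " + e.name);
        }
        System.out.println("blocked threads: " + blockedCount());
    }
}
```

Exposed through an admin endpoint or a scheduled log line, output like this gives the first two steps of the investigation without shell access to the host.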
MaGe Linux Operations
