
How to Diagnose and Fix 100% CPU Overload with Smart Scheduling

This guide explains how CPU scheduling works and why 100% CPU usage occurs, then walks through a step-by-step troubleshooting workflow: monitoring with top/vmstat, identifying offending threads, analyzing stack traces, and applying both quick-fix and long-term remediation techniques to keep systems stable.


Think of a computer as a 24‑hour factory where the CPU is the production floor and the CPU scheduler is the foreman who assigns work to the assembly lines. When many tasks compete for CPU time, the floor can become overloaded, leading to sluggish performance or complete system stalls.

Station 1 – When the CPU Gets Overloaded

Opening several applications simultaneously (browser, IDE, video call, download manager) can push CPU usage to 100%, turning the cursor into a spinning wheel and causing input lag. This is essentially a "computing resource battle" where every program is an order waiting to be processed.

Station 2 – Core Mission of the Scheduler

The scheduler aims to keep the floor running smoothly by balancing three goals:

Fairness: Every task gets a chance to use the CPU, preventing background jobs from starving interactive ones.

Efficiency: The scheduler avoids idle lines by immediately assigning waiting tasks to free cores.

Responsiveness: High-priority actions (e.g., keyboard input) are scheduled before low-priority background work.

Even with a perfect scheduler, the CPU can still become chaotic due to problematic tasks.

Station 3 – When the Scheduler Fails

Four typical issues cause the scheduler to break down:

Tasks that never release resources – infinite loops or unbounded memory allocation.

Lock contention and deadlocks – two threads each hold a lock the other needs (see the jstack sketch after this list).

Memory‑leak‑induced GC storms – excessive garbage‑collection cycles consume CPU.

Scheduling rule blind spots – priority inversion and excessive context switches.
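
When a deadlock is suspected, jstack can confirm it directly: the JVM appends a deadlock report to the end of a thread dump whenever it detects one. A minimal sketch, assuming the stuck JVM's PID is 12345 (thread names and addresses below are invented for illustration):

    jstack 12345 | grep -A 8 "Java-level deadlock"

    Found one Java-level deadlock:
    =============================
    "worker-1":
      waiting to lock monitor 0x00007f3a2c006528 (object 0x000000076b2a8f10, a java.lang.Object),
      which is held by "worker-2"
    "worker-2":
      waiting to lock monitor 0x00007f3a2c003d18 (object 0x000000076b2a8f20, a java.lang.Object),
      which is held by "worker-1"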

Station 4 – Three‑Step Troubleshooting Workflow

Step 1: Observe the anomaly

Business‑side symptoms: UI freezes, API timeouts, message‑queue backlog.

System‑side metrics: run top and look at %Cpu(s), load average, and per‑process %CPU columns.

The example top output below highlights a Java process with %CPU > 120% (a single process can exceed 100% because top sums the usage of all its threads across cores).
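
A hedged illustration of what that looks like on an 8-core machine (the PID, user, and every number here are invented):

    $ top
    top - 14:32:07 up 12 days,  3:11,  2 users,  load average: 18.42, 17.90, 16.25
    Tasks: 213 total,   4 running, 209 sleeping,   0 stopped,   0 zombie
    %Cpu(s): 91.8 us,  7.6 sy,  0.0 ni,  0.4 id,  0.2 wa,  0.0 hi,  0.0 si,  0.0 st

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    12345 app       20   0 9862404   2.1g  18704 S 128.7 13.4 212:05.33 java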

Key indicators to watch: %Cpu(s) with user (us) + system (sy) ≈ 100% and idle (id) ≈ 0, and a load average far exceeding the core count, which signals tasks queuing for the CPU.

Process list: high %CPU values point to the culprit.

Run vmstat and focus on the r column (tasks that are runnable and waiting for a CPU); values much larger than the core count indicate overload, as in the sample below.
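
A hedged vmstat sample (same hypothetical 8-core machine, invented numbers; note that the first row is the average since boot). An r value around 30 against 8 cores means tasks are piling up in the run queue:

    $ vmstat 1 3
    procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
     r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
    31  0      0 812344  10212 904312    0    0     2    14 1200 3400 90  8  2  0  0
    30  0      0 811980  10212 904320    0    0     0     0 4175 9756 91  8  1  0  0
    33  0      0 811652  10212 904328    0    0     0     8 4298 9903 93  6  1  0  0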

Step 2: Pinpoint the offending thread

Once a high-CPU process is identified (e.g., PID = 12345), list its threads:

    top -Hp 12345

Find the thread with the highest CPU share (e.g., thread ID 12350 at 90%). Convert the thread ID to hexadecimal, since jstack reports native thread IDs (the nid field) in hex:

    printf "%x\n" 12350

Then use jstack (for Java) to dump the process's stack traces and locate that thread by its nid.
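
Putting the three commands together (12350 in hexadecimal is 303e; the amount of grep context to print is a judgment call):

    jstack 12345 | grep -A 30 "nid=0x303e"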

Step 3: Trace the problem back to code

Analyze the stack trace for three common patterns:

Infinite loop: a method that never reaches its exit condition.

Blocking: lines containing "waiting for monitor entry" indicate lock contention (see the excerpt after this list).

GC storm: frequent "GC task thread" entries show aggressive garbage collection.
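
For instance, a thread stuck on the blocking pattern looks roughly like this in a jstack dump (thread, class, and method names are invented):

    "order-worker-3" #27 prio=5 os_prio=0 tid=0x00007f3a2c1b8000 nid=0x303e waiting for monitor entry [0x00007f3a1d5f4000]
       java.lang.Thread.State: BLOCKED (on object monitor)
            at com.example.OrderService.process(OrderService.java:88)
            - waiting to lock <0x000000076b0a1c60> (a java.lang.Object)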

For GC‑related issues, monitor with jstat and look at FGC (full GC count) and FGCT (full GC time). A jump from 30 to 31 FGC in one second with FGCT ≈ 1 s signals a storm.
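
A hedged jstat session showing that pattern (PID from the running example, all numbers invented; the trailing 1000 samples every second). Between the two samples, FGC ticks from 30 to 31 and FGCT grows by roughly one second, so the JVM is spending almost all of its time in full GC:

    $ jstat -gcutil 12345 1000
      S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
      0.00   0.00  99.87  99.99  95.12  92.30    412    8.321    30   29.845   38.166
      0.00   0.00 100.00  99.99  95.12  92.30    412    8.321    31   30.870   39.191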

Use jmap to inspect object distribution; a huge number of OrderDTO instances indicates a memory leak feeding the GC.
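
A hedged jmap histogram (the OrderDTO class comes from the article's example; counts and sizes are invented). One caveat worth knowing: the :live option forces a full GC before counting, so use it deliberately on a box that is already struggling:

    $ jmap -histo:live 12345 | head -n 5
     num     #instances         #bytes  class name
    ----------------------------------------------
       1:      18324571     1172772544  com.example.OrderDTO
       2:      18324571      879579408  [C
       3:      18324571      439789704  java.lang.String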

Station 5 – Immediate and Long‑Term Remedies

Quick‑fix (within 5 minutes)

Save a thread dump first (the JVM prints it to its console/stdout log):

    kill -3 12345

Force-kill the offending process:

    kill -9 12345

Restart the service, e.g.:

    nohup java -jar app.jar &

Prioritize critical workloads

Raise the priority of key processes (e.g., PID = 67890):

    renice -n -20 67890

Limit runaway processes with cgroups: create a group cpu_limit and cap its CPU share to 50% before adding the PID, as sketched below.
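
A minimal sketch of that cgroup setup, assuming a cgroup v1 hierarchy mounted at /sys/fs/cgroup/cpu (on cgroup v2 systems the equivalent knob is the cpu.max file):

    # Create the group
    sudo mkdir /sys/fs/cgroup/cpu/cpu_limit
    # Allow 50 ms of CPU time per 100 ms period, i.e. ~50% of one core
    echo 50000  | sudo tee /sys/fs/cgroup/cpu/cpu_limit/cpu.cfs_quota_us
    echo 100000 | sudo tee /sys/fs/cgroup/cpu/cpu_limit/cpu.cfs_period_us
    # Move the runaway process (PID 12345 from the example) into the group
    echo 12345  | sudo tee /sys/fs/cgroup/cpu/cpu_limit/tasks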

Long‑term fixes

Infinite loops: add timeout guards or watchdog timers (see the sketch after this list).

Lock contention: replace coarse-grained locks with finer-grained or lock-free structures.

GC storms: use caches with expiration, avoid long-lived static collections, and regularly profile heap usage.

Excessive context switches: size thread pools to roughly the number of CPU cores (cores ± 1) to prevent over-subscription.
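
For the infinite-loop case, the bluntest watchdog sits outside the process entirely. A hedged sketch using the GNU coreutils timeout command (the jar name and time limits are invented; an in-code timeout around the loop itself is the more surgical fix):

    # Kill the batch job if it runs longer than 10 minutes,
    # escalating from SIGTERM to SIGKILL if it refuses to exit
    timeout --signal=TERM --kill-after=30s 10m java -jar batch-job.jar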

Conclusion – The Ultimate CPU‑Scheduling Mindset

High CPU usage is rarely the scheduler’s fault; it reflects tasks that exceed the scheduler’s capacity. A robust strategy combines proactive monitoring (alert at 80% CPU), disciplined code (timeouts, proper locking, bounded caches), and thorough load testing (e.g., JMeter at 10× traffic) to keep the "factory" running smoothly.

Linux · System monitoring · CPU scheduling · Performance debugging
Written by

NiuNiu MaTe

Joined Tencent (nicknamed "Goose Factory") through campus recruitment from a second-tier university. Career path: Tencent → foreign firm → ByteDance → Tencent. Served as an interviewer at the foreign firm and hopes to help other candidates.
