Mastering CPU and Load: A Practical Guide to Linux Performance Troubleshooting
This article explains how to monitor and interpret CPU usage and load average on Linux servers, details the calculations behind these metrics, illustrates their meaning with examples and images, and provides step‑by‑step troubleshooting methods for high load, high CPU, and high load with low CPU scenarios.
Introduction
During traffic spikes, servers see higher CPU usage and load. Understanding these metrics is essential when preparing for large-scale promotions, and it is a core troubleshooting skill in its own right.
1. Using the top Command
The most common way to view CPU and load is with top, which refreshes every three seconds by default (the -d option changes the interval). The output includes CPU percentages, load averages, memory, and swap usage.
A detailed table (shown in the second image) highlights the meaning of each field, with important attributes marked in red and bold.
2. How CPU Usage Is Calculated
CPU statistics are gathered from /proc/stat for system-wide values, /proc/{pid}/stat for per-process values, and /proc/{pid}/task/{tid}/stat for per-thread values. All numbers are cumulative counters: system-wide values count from boot, while per-process and per-thread values count from the start of that process or thread.
CPU usage is computed by sampling at two moments (t1 and t2):
Sum all CPU counters at t1 → s1.
Sum all CPU counters at t2 → s2.
Total CPU time for the interval = s2 – s1.
Idle time = idle2 – idle1.
CPU usage % = 100 × (totalCpuTime – idle) / totalCpuTime.
Other fields (us, sy, ni, etc.) are calculated similarly; the resulting value reflects CPU activity during the sampled period.
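The two-sample calculation above can be sketched in Python. This is a minimal illustration, not the code top itself uses; the function names are hypothetical. One deliberate deviation from "sum all counters": the sketch sums only the first eight fields (user through steal), on the assumption that guest and guest_nice are already included in user and nice and would otherwise be counted twice.

```python
import time


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line of /proc/stat")
    # Keep user, nice, system, idle, iowait, irq, softirq, steal.
    # Assumption: guest/guest_nice are already part of user/nice.
    return [int(v) for v in fields[1:9]]


def cpu_usage_percent(sample1, sample2):
    """CPU usage between two samples: 100 * (total - idle) / total."""
    c1 = parse_cpu_line(sample1)
    c2 = parse_cpu_line(sample2)
    total = sum(c2) - sum(c1)   # s2 - s1
    idle = c2[3] - c1[3]        # idle is the fourth counter
    if total <= 0:
        return 0.0              # no ticks elapsed between the samples
    return 100.0 * (total - idle) / total


def read_cpu_line(path="/proc/stat"):
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as f:
        return f.readline()


def sample_live(interval=1.0):
    """Take two live samples `interval` seconds apart and return usage %."""
    first = read_cpu_line()
    time.sleep(interval)
    return cpu_usage_percent(first, read_cpu_line())
```

Calling sample_live(3.0) on a Linux host approximates the overall busy percentage over one default top refresh interval. As in the formula above, iowait time counts as non-idle here.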
3. Understanding Load Average
Load average can be visualized as traffic on a single‑lane bridge: 0.00 means no cars, 1.00 means the bridge is at capacity, and values above 1.00 indicate overload (e.g., 2.00 means twice the bridge’s capacity).
Linux defines load as the average number of tasks in the running (or runnable) state or the uninterruptible (typically I/O-waiting) state over the last 1, 5, and 15 minutes. Load is a system-wide figure, not a per-CPU one, so it must be read against the number of cores: on a 4-core machine, a load of 1 means roughly 75 % of CPU capacity is idle.
Note that threads are counted individually, so a process with 1,000 active threads can produce a load of 1,000.
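To make the core-count comparison concrete, here is a small sketch that parses /proc/loadavg and normalizes the load by the number of cores. The function names are hypothetical; the field layout assumed is the usual one (three averages, runnable/total scheduling entities, last PID).

```python
import os


def parse_loadavg(text):
    """Parse /proc/loadavg contents, e.g. '0.20 0.18 0.12 1/80 11206'."""
    one, five, fifteen, entities, last_pid = text.split()
    runnable, total = entities.split("/")
    return {
        "load1": float(one),
        "load5": float(five),
        "load15": float(fifteen),
        "runnable": int(runnable),   # currently runnable scheduling entities
        "total": int(total),         # all scheduling entities on the system
        "last_pid": int(last_pid),
    }


def load_per_core(load, cores):
    """Normalize a load average by core count; 1.0 means 'at capacity'."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load / cores


def live_load_per_core():
    """Normalized 1-minute load of the current host (Unix only)."""
    load1, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1      # cpu_count() may return None
    return load_per_core(load1, cores)
```

With these helpers, a load of 1 on a 4-core machine normalizes to 0.25, which matches the "roughly 75 % idle" reading, while a load of 8 on the same machine normalizes to 2.0, twice the bridge's capacity.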
4. Relationship Between Request Count and Load
Many assume that a surge of requests automatically raises load, but this is incorrect. For example, Redis executes commands on a single thread: regardless of how many client requests arrive, only one command is processed at a time. Load rises only when that single worker thread is continuously busy.
Thus, load correlates with the number of active worker threads (main thread, timer thread, GC thread) rather than raw request volume.
5. Troubleshooting High Load & High CPU
Key principle: high CPU alone is not a problem; high CPU that causes high load is.
Identify the Java process: ps -ef | grep java.
Find the hottest thread: top -H -p <pid>.
Convert the thread’s decimal TID to hexadecimal (e.g., 2000 → 0x7d0).
Inspect the stack: jstack <pid> | grep -A 20 '0x7d0'.
Because CPU usage is an average over time while jstack captures a single instant, it is advisable to capture several stacks (5 to 10) and look for recurring patterns. Common culprits include tight loops and native method calls. Keep in mind that a thread caught in network I/O is waiting rather than consuming CPU, even if it appears in the dump.
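The last two steps and the "capture several stacks" advice can be sketched as follows. This is an illustrative helper, not part of any JDK tooling: the function names are hypothetical, and it assumes thread headers carry the hexadecimal nid=0x... field that the grep step above relies on (some JDK builds may print the native ID differently).

```python
import re
from collections import Counter


def tid_to_nid(tid):
    """Convert a decimal thread ID from top -H to jstack's hex form."""
    return hex(tid)  # e.g. 2000 -> '0x7d0'


def thread_block(dump, nid):
    """Return the lines of the thread whose header contains nid=<nid>."""
    pattern = re.compile(r"\bnid=" + re.escape(nid) + r"\b")
    # Thread entries in a dump are separated by blank lines.
    for block in re.split(r"\n\s*\n", dump):
        lines = block.strip().splitlines()
        if lines and pattern.search(lines[0]):
            return lines
    return []


def top_frame(block):
    """Return the innermost stack frame of a thread block, or None."""
    for line in block:
        stripped = line.strip()
        if stripped.startswith("at "):
            return stripped[3:]
    return None


def recurring_frames(dumps, tid):
    """Count how often each innermost frame appears for one thread
    across several dumps, most frequent first."""
    nid = tid_to_nid(tid)
    counts = Counter()
    for dump in dumps:
        frame = top_frame(thread_block(dump, nid))
        if frame is not None:
            counts[frame] += 1
    return counts.most_common()
```

Feeding it the text of 5 to 10 dumps taken a few seconds apart gives a quick tally: a frame that tops the hot thread in most of the dumps is a much stronger lead than one seen in a single snapshot.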
6. Troubleshooting High Load & Low CPU
When load is high but CPU usage is low, the bottleneck is usually I/O. Check the wa (IO wait) column; if it’s elevated, investigate disk or network I/O.
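Since load counts tasks in the uninterruptible state as well as running ones, listing which tasks are currently in state D is a useful complement to the wa column. The sketch below parses the output of ps -eo state,pid,comm; the function names are hypothetical, and because ps -e lists processes rather than individual threads, the result is only a rough view.

```python
import subprocess


def parse_ps_states(ps_output):
    """Parse `ps -eo state,pid,comm` output into (state, pid, comm) tuples."""
    tasks = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        # Skip the header row and any malformed line.
        if len(parts) == 3 and parts[1].isdigit():
            state, pid, comm = parts
            tasks.append((state, int(pid), comm.strip()))
    return tasks


def load_contributors(tasks):
    """Keep tasks that count toward load: running (R) or uninterruptible (D)."""
    return [t for t in tasks if t[0] in ("R", "D")]


def uninterruptible(tasks):
    """Keep only tasks in uninterruptible sleep, usually waiting on I/O."""
    return [t for t in tasks if t[0] == "D"]


def snapshot():
    """Run ps once and return the parsed task list (Linux/Unix only)."""
    out = subprocess.run(
        ["ps", "-eo", "state,pid,comm"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps_states(out)
```

A single snapshot can miss short waits, so it helps to call uninterruptible(snapshot()) several times and see which commands keep reappearing in state D.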
Typical high‑load I/O sources include:
Database queries.
Redis lookups.
HTTP calls to external services (e.g., Alipay).
Dubbo RPC calls.
Understanding the system’s dependency graph helps pinpoint slow calls. Log analysis and adding monitoring points (interface name, parameters, success flag, latency) are recommended. If logs are insufficient, repeatedly printing stacks can reveal blocking I/O calls (e.g., java.net.SocketInputStream.read).
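A monitoring point of the kind described above (interface name, parameters, success flag, latency) can be sketched as a wrapper around each outbound call. The example is in Python for brevity; the same pattern applies to a Java interceptor or aspect. The decorator name, the record layout, and the emit hook are all hypothetical choices for illustration.

```python
import functools
import logging
import time

logger = logging.getLogger("monitor")


def monitored(interface_name, emit=None):
    """Wrap a call and report interface, parameters, success, and latency.

    `emit` receives one dict per call; by default the record is logged.
    """
    if emit is None:
        emit = lambda record: logger.info("%s", record)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            success = False
            try:
                result = func(*args, **kwargs)
                success = True
                return result
            finally:
                # Runs on both the normal and the exception path.
                emit({
                    "interface": interface_name,
                    "args": args,
                    "kwargs": kwargs,
                    "success": success,
                    "latency_ms": (time.perf_counter() - start) * 1000.0,
                })
        return wrapper
    return decorator
```

Decorating, say, a database query function with @monitored("db.query_order") yields one record per call, so slow or failing dependencies show up in the logs without taking repeated stack dumps.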
7. Common Causes of High Load in Java Applications
Infinite loops or excessive CPU‑intensive loops.
Frequent Young GC.
Frequent Full GC (often triggered by bugs or misused third‑party libraries).
High disk I/O.
High network I/O.
Usually, a newly deployed code change or a misbehaving third-party JAR is the root cause. First determine whether the load is driven by CPU or by I/O, then apply the matching investigation steps from sections 5 and 6.
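The CPU-versus-I/O decision can be written down as a small triage function. This is only a sketch of the reasoning in this article: the function name, the category labels, and especially the thresholds are arbitrary assumptions that would need tuning for a real system.

```python
def triage(load_1min, cores, cpu_busy_pct, iowait_pct,
           load_per_core_limit=1.0, cpu_limit=80.0, iowait_limit=20.0):
    """Classify a host into one of four rough categories.

    All three limits are illustrative defaults, not recommended values.
    """
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if load_1min / cores <= load_per_core_limit:
        return "ok"                  # load is within CPU capacity
    if cpu_busy_pct >= cpu_limit:
        return "cpu-bound"           # high load, high CPU: find hot threads
    if iowait_pct >= iowait_limit:
        return "io-bound"            # high load, low CPU, elevated wa
    return "blocked-or-waiting"      # high load, low CPU, low wa:
                                     # inspect stacks for blocking calls
```

For example, a 4-core host with a 1-minute load of 8, 15 % CPU usage, and 40 % wa lands in "io-bound", which points to the steps in section 6 rather than section 5.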
Conclusion
Grasping the theory behind CPU and load metrics enables clear thinking when real incidents occur. Continuous practice (simulating failures, mastering top, sar, and iostat, and studying others' troubleshooting stories) will steadily improve one's operational expertise.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
