Mastering Linux Load Average: What the Numbers Really Mean
This article explains Linux Load Average’s definition, how the three numbers are calculated, their relationship with CPU and I/O, practical interpretation rules, step‑by‑step troubleshooting workflows, monitoring setups, and optimization techniques for both CPU‑bound and I/O‑bound load spikes.
Problem Background
When a server becomes slow, SSH login lags, or service response times rise, many operators first assume "high load" and check CPU usage, but most misunderstand what the three Load Average numbers actually represent and how they relate to CPU utilization.
Core Concept: What Load Average Is
1.1 Textbook Definition
Load Average is the average number of processes in the Running state and the Uninterruptible Sleep (D) state over the past 1, 5, and 15 minutes.
Key points:
Running : processes currently using the CPU or ready to run.
Uninterruptible Sleep (D) : processes waiting for I/O (disk, network) that cannot be interrupted by signals.
Average : an exponentially weighted moving average, giving more weight to recent data.
1.2 Meaning of the Three Numbers
$ uptime
10:15:32 up 45 days, 3:22, 2 users, load average: 3.52, 2.85, 2.60
The three values correspond to the 1‑minute, 5‑minute, and 15‑minute averages. Each value is a count of processes, not a percentage.
3.52 (1 min): average number of processes in the last minute.
2.85 (5 min): average over the last five minutes.
2.60 (15 min): average over the last fifteen minutes.
Thus a value of 3.52 means that, on average, 3.52 processes were either running or waiting for I/O during the past minute.
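In a script, the three averages can be pulled apart with bash's read builtin; a minimal sketch (Linux only, since it reads /proc/loadavg):

```shell
# Split /proc/loadavg into the three averages; remaining fields land in _rest.
read one five fifteen _rest < /proc/loadavg
echo "1m=$one 5m=$five 15m=$fifteen"
```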
1.3 How to Judge Whether Load Average Is Too High
A common rule of thumb says Load Average should not exceed the number of CPU cores, but this only applies to pure CPU‑bound workloads.
If the workload is CPU‑intensive, a Load Average close to the core count is normal (e.g., 8‑core CPU, Load ≈ 8).
If the workload is I/O‑intensive, Load Average can be far higher than the core count (e.g., 8‑core CPU, Load ≈ 50) while CPU usage stays low.
Correct assessment combines Load Average with CPU usage:
# Show CPU core count
nproc
# Show Load Average and CPU usage together
uptime
# Interpretation examples
# Load > cores && top shows low CPU → I/O bottleneck
# Load > cores && top shows high CPU → CPU bottleneck
# Load < cores && CPU idle → system is fine
Step 2: Understanding the Calculation Mechanism
2.1 Kernel Calculation
The kernel computes Load Average in kernel/sched/loadavg.c using an Exponentially Weighted Moving Average (EWMA):
load(t) = a * load(t-1) + (1 - a) * n
load(t): current Load Average.
load(t-1): previous value.
n: number of active processes (Running + Uninterruptible Sleep).
a: decay factor, e.g., exp(-5/60) ≈ 0.9200 for the 1‑minute window.
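One update step of this formula can be reproduced with awk; the previous load (2.85) and active-task count (4) below are made-up numbers for illustration:

```shell
# One EWMA step for the 1-minute window (illustrative numbers).
awk 'BEGIN {
  a = exp(-5/60)                      # 1-minute decay factor
  load_prev = 2.85; n = 4             # hypothetical previous load, active tasks
  load = a * load_prev + (1 - a) * n
  printf "a=%.4f next_load=%.2f\n", a, load
}'
# → a=0.9200 next_load=2.94
```

Note how heavily the previous value is weighted: even with 4 active tasks, one 5‑second tick only nudges the 1‑minute average from 2.85 to 2.94.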
The kernel recalculates the averages roughly every 5 seconds; reading /proc/loadavg simply returns the most recently computed values.
Decay factors for the three windows:
1 min: a = exp(-5/60) ≈ 0.9200
5 min: a = exp(-5/300) ≈ 0.9835
15 min: a = exp(-5/900) ≈ 0.9945
2.2 /proc/loadavg File
The file /proc/loadavg provides the same data in the format:
# cat /proc/loadavg
3.52 2.85 2.60 4/1234 56789
First three fields: 1‑, 5‑, 15‑minute averages.
Fourth field 4/1234: currently runnable tasks / total tasks (processes and threads).
Fifth field: PID of the most recently created process (useful for spotting rapid process churn).
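All five fields can be labeled in one awk pass; a sketch assuming the field layout described above:

```shell
# Label every field of /proc/loadavg.
awk '{
  split($4, rt, "/")   # e.g. "4/1234" → runnable, total
  printf "1m=%s 5m=%s 15m=%s runnable=%s total=%s last_pid=%s\n",
         $1, $2, $3, rt[1], rt[2], $5
}' /proc/loadavg
```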
2.3 Uninterruptible Sleep (D State)
D state processes wait for hardware I/O and cannot be killed with signals. Typical scenarios:
1. NFS mount loses network connectivity → processes wait on NFS I/O
2. Severe disk I/O slowdown → processes wait for disk reads/writes
3. Rarely, a parent blocked in vfork() waiting for the child to exec or exit
2.4 How to View D‑State Processes
# Using top (the S column shows process state; press f to change the sort field)
top
# Using ps
ps aux | awk '$8 ~ /D/ {print}'
# Directly inspect /proc/<pid>/stat (field 3 shows state code)
cat /proc/1234/stat
Many D‑state processes usually indicate an I/O bottleneck.
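For watch loops and alert scripts, a plain count of D‑state processes is often all you need; a minimal sketch:

```shell
# Count processes currently in uninterruptible sleep (D state).
# n+0 forces numeric output (0) even when no process matches.
ps -eo state= | awk '$1 ~ /^D/ {n++} END {print n+0}'
```

A count that stays above zero across repeated samples is the signal; a single transient D‑state hit is normal.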
Step 3: Relationship Between Load Average and CPU Usage
3.1 Load Average and CPU Info in top
top -bn1 | head -5
# Output example
# top - 10:15:32 up 45 days, 3:22, 2 users, load average: 3.52, 2.85, 2.60
# Tasks: 1234 total, 4 running, 1230 sleeping, 0 stopped, 0 zombie
# %Cpu(s): 15.2 us, 3.1 sy, 0.0 ni, 81.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
Interpretation:
CPU idle 81.7 % → only 18.3 % of CPU is doing work.
Load Average 3.52 means on average 3.52 processes are either running or waiting for I/O.
If the system has 8 cores, 3.52 < 8, so CPU capacity is sufficient.
Load noticeably higher than what CPU usage alone explains indicates processes blocked on I/O.
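The capacity check above can be automated; a minimal sketch comparing the 1‑minute load against the core count (the verdict labels are my own, not standard tool output):

```shell
# Flag whether the 1-minute load exceeds the CPU core count.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
awk -v l="$load1" -v c="$cores" 'BEGIN {
  printf "load=%s cores=%d verdict=%s\n", l, c,
         (l < c ? "capacity-ok" : "saturated")
}'
```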
3.2 iowait Is the Key Indicator
The wa field in %Cpu(s) shows the proportion of time the CPU spends waiting for I/O. High iowait (e.g., > 20 %) signals an I/O bottleneck.
# Show iowait via top
top -bn1 | grep -E '^%Cpu|^Cpu'
# Or use vmstat for a clearer view
vmstat 1 5
# Sample vmstat output (relevant columns highlighted)
# r  b  swpd     free    buff     cache  si so bi bo  in  cs us sy id wa st
# 4  2  0     8000000  500000 10000000   0  0  0  0 100 200 10  5 80  5  0
# r: runnable processes (compare with the 1‑minute load)
# b: processes in uninterruptible sleep (D state)
# wa: iowait
3.3 Real‑World Analysis Example
# Simulated uptime output
load average: 12.5, 10.2, 8.0
# Simulated top output
%Cpu(s): 10.5 us, 2.1 sy, 0.0 ni, 45.0 id, 42.4 wa
# CPU core count
nproc   # returns 8
Analysis:
Load 12.5 > 8 cores → load is high.
CPU usage (us + sy) ≈ 12.6 % while idle is 45 % → CPU is not the bottleneck.
iowait 42.4 % → most of the load comes from processes waiting on I/O.
A high b column in vmstat (or many D‑state processes in ps) confirms I/O pressure.
Next step: identify which I/O subsystem (disk, network, etc.) is slow.
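The reasoning above can be scripted; here the vmstat line is a hard‑coded sample matching the numbers in this example (us≈10, id=45, wa≈42), and the thresholds are the illustrative ones used throughout this article:

```shell
# Classify one vmstat sample line (columns as above: us=$13, id=$15, wa=$16).
echo "12 6 0 800000 50000 1000000 0 0 500 9000 1200 2500 10 2 45 42 0" |
awk '{
  us = $13; id = $15; wa = $16
  if (wa > 20 && id > 40)                 print "I/O bottleneck"
  else if (wa < 10 && id < 20 && us > 60) print "CPU bottleneck"
  else                                    print "mixed or unclear"
}'
# → I/O bottleneck
```

In production you would feed it a live sample, e.g. `vmstat 1 2 | tail -1 | awk '...'`.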
# Disk I/O statistics
iostat -x 1 5
# Look for high %util, avgqu‑sz, await, etc.
# Identify top I/O consumers
iotop -oa
pidstat -d 1 5
# If iotop unavailable, fall back to ps filtering D state
ps aux | awk '$8 ~ /D/ {print}'
Step 4: Complete Load Average Troubleshooting Workflow
4.1 Confirm Load Average Is Actually High
# Show core count and current load
nproc
uptime
# Rough thresholds
# Load < 0.7 × cores → likely fine
# Load > 2 × cores → serious issue
4.2 Determine CPU vs I/O Bottleneck
# Quick view of CPU metrics
vmstat 1 5
# Decision matrix
# iowait > 20% && idle > 40% → I/O bottleneck
# iowait low && idle low && us high → CPU bottleneck
# iowait high && idle low → both CPU and I/O issues
4.3 If CPU Bottleneck, Find Hungry Processes
# List top CPU consumers
ps aux --sort=-%cpu | head -20
# Or via top
top -bn1 -o %CPU | head -20
# Drill down for threads, child processes, etc.
ps -eLf | grep <pid> | wc -l # thread count
pstree -p <pid> # process tree
# For Java, check GC; for Python/Go, look for busy loops
4.4 If I/O Bottleneck, Find Hungry I/O Processes
# Show top I/O consumers
sudo iotop -oa
# If iotop missing, use pidstat
sudo pidstat -d 1 5
# Disk‑level stats
iostat -x 1 3
# Inspect per‑process I/O counters
cat /proc/<pid>/io
4.5 Investigate D‑State Processes Directly
# List all D‑state processes with PID, user, command
ps aux | awk '$8 ~ /D/ {print "PID:"$2" USER:"$1" CMD:"$11}'
# Examine kernel stack to see what each process is waiting for
cat /proc/<pid>/stack
# Check open file descriptors for I/O devices
ls -la /proc/<pid>/fd
cat /proc/<pid>/fd/* 2>/dev/null | head -20
Step 5: Monitoring and Alerting Configuration
5.1 Bash Script for Simple Load Alert
#!/bin/bash
# check_load.sh – run via cron every 5 minutes
THRESHOLD=$(nproc)
LOAD=$(awk '{print $1}' /proc/loadavg)
# Compare as floats via awk (shell arithmetic is integer-only)
if awk -v l="$LOAD" -v t="$THRESHOLD" 'BEGIN { exit !(l > t) }'; then
    echo "Alert: Load Average ($LOAD) > CPU cores ($THRESHOLD)" | tee -a /var/log/load_alert.log
    # Integrate with DingTalk/WeChat/Email here
fi
5.2 Collecting Load Average with Prometheus Node Exporter
The node_load* metrics expose the three averages. Example scrape config:
- job_name: node
  static_configs:
    - targets: ['localhost:9100']
  relabel_configs:
    - source_labels: [__address__]
      target_label: instance
Query the 1‑minute load and compare to core count:
# Current 1‑minute load
node_load1{instance="your-host"}
# Ratio of load to CPU cores
node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"})
# Trigger when the ratio exceeds 1
# (node_exporter ≥ 0.16 names the metric node_cpu_seconds_total; counting only
# the mode="idle" series yields one series per CPU core)
5.3 Prometheus Alert Rules
groups:
  - name: node-load
    rules:
      - alert: NodeHighLoad
        expr: node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.instance }} load is high"
          description: "Load Average 1m / CPU cores = {{ $value }}"
      - alert: NodeCriticalLoad
        expr: node_load1 / on(instance) count by (instance) (node_cpu_seconds_total{mode="idle"}) > 1.2
        for: 2m
        labels:
          severity: critical
Step 6: Load Average Optimization Practices
6.1 CPU Bottleneck Optimization
# Find top CPU consumers
ps aux --sort=-%cpu | head -20
# Horizontal scaling: add more processes/containers
# Vertical optimization: improve algorithms, reduce unnecessary work
# Profile hotspots and rewrite critical sections
# For system processes, check interrupt rates (cat /proc/interrupts) and context switches (vmstat 1)
# Excessive context switches → investigate with pidstat -w
6.2 I/O Bottleneck Optimization
# Identify top I/O consumers
sudo iotop -oa
# If many small file writes → batch writes or use SSD/tmpfs
# Adjust I/O scheduler (e.g., none for NVMe/SSD, mq-deadline or bfq for HDD)
cat /sys/block/sda/queue/scheduler
# Temporary change (lost on reboot)
echo none | sudo tee /sys/block/sda/queue/scheduler
# Permanent change via GRUB
# GRUB_CMDLINE_LINUX="elevator=none"
# If swap activity is high → add RAM or limit memory usage
free -m
swapon -s
6.3 Network I/O Bottleneck
# Check network traffic
sar -n DEV 1 5
# Count TCP states
netstat -an | awk '/^tcp/ {print $6}' | sort | uniq -c
# Excessive TIME_WAIT connections → enable tcp_tw_reuse, adjust tcp_max_tw_buckets, shorten timeouts
Conclusion
Understanding Linux Load Average is essential because it measures the number of "busy" processes—not CPU utilization. The three numbers reflect short‑term, medium‑term, and long‑term load trends. Properly judging high load requires comparing the values to CPU core count, examining CPU idle versus iowait, and investigating D‑state processes. Load Average is a symptom; the real cause is either CPU, I/O, network, or memory pressure, which must be identified and resolved.
MaGe Linux Operations
