
Why Does Linux Load Spike? Deep Dive into Load Average Calculation & Troubleshooting

During high‑traffic events like Double‑11, Linux systems often see load averages surge, affecting response times and command execution; this article explains what load averages represent, how the kernel computes them using exponential weighted moving averages, and outlines common causes and systematic methods for root‑cause analysis.

Alibaba Cloud Developer

Dec 15, 2021

What Is Load

Linux load averages measure demand on the system as the number of tasks (processes or threads) that are runnable or in uninterruptible sleep, averaged over 1, 5, and 15 minutes. The kernel exposes the values in /proc/loadavg, which tools such as uptime and top read.

If the load is close to 0, the system is idle.

If the 1‑minute average exceeds the 5‑ or 15‑minute averages, load is increasing.

If the 1‑minute average is lower than the 5‑ or 15‑minute averages, load is decreasing.

When any average exceeds the number of CPU cores, performance problems are likely.
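The rules of thumb above can be sketched in C. This is a minimal illustration, not a standard tool: the names parse_loadavg, classify_trend, and load_exceeds_cpus are hypothetical, and the trend check compares the 1-minute value against the 15-minute value only, which is a simplification.

```c
#include <stdio.h>

/* Trend of the 1-minute average relative to the 15-minute average. */
enum load_trend { LOAD_FALLING = -1, LOAD_STEADY = 0, LOAD_RISING = 1 };

/*
 * Parse the first three fields of a /proc/loadavg line, for example
 * "0.52 0.41 0.30 2/345 6789". Returns 0 on success, -1 on malformed input.
 */
int parse_loadavg(const char *line, double avg[3])
{
    if (sscanf(line, "%lf %lf %lf", &avg[0], &avg[1], &avg[2]) != 3)
        return -1;
    return 0;
}

/* Simplification: only the 1-minute and 15-minute values are compared. */
enum load_trend classify_trend(const double avg[3])
{
    if (avg[0] > avg[2])
        return LOAD_RISING;
    if (avg[0] < avg[2])
        return LOAD_FALLING;
    return LOAD_STEADY;
}

/* Nonzero when any of the three averages exceeds the CPU count. */
int load_exceeds_cpus(const double avg[3], long ncpus)
{
    for (int i = 0; i < 3; i++)
        if (avg[i] > (double)ncpus)
            return 1;
    return 0;
}
```

In practice the line would come from reading /proc/loadavg with fopen and fgets, and the CPU count from sysconf(_SC_NPROCESSORS_ONLN).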

How Load Is Calculated

Core Algorithm

The kernel uses an Exponential Weighted Moving Average (EWMA):

#define EXP_1 1884   /* 1/exp(5sec/1min) */
#define EXP_5 2014   /* 1/exp(5sec/5min) */
#define EXP_15 2037  /* 1/exp(5sec/15min) */
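As a worked check of these constants, assume the fixed-point scale FIXED_1 = 2048 and the 5-second sampling interval named in the comments. Each constant is then the decay factor for its window, scaled and rounded to the nearest integer:

```latex
\begin{aligned}
\mathrm{EXP\_1}  &= \operatorname{round}\!\left(2048 \cdot e^{-5/60}\right)  \approx \operatorname{round}(1884.25) = 1884 \\
\mathrm{EXP\_5}  &= \operatorname{round}\!\left(2048 \cdot e^{-5/300}\right) \approx \operatorname{round}(2014.15) = 2014 \\
\mathrm{EXP\_15} &= \operatorname{round}\!\left(2048 \cdot e^{-5/900}\right) \approx \operatorname{round}(2036.65) = 2037
\end{aligned}
```

The longer the window, the closer the factor is to 2048 (that is, to 1.0), so each new sample moves the 15-minute average far less than it moves the 1-minute average.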

At each sampling interval (5 seconds, per the constants above), the kernel updates each load value with:

/*
 * a1 = a0 * e + a * (1 - e)
 */
static inline unsigned long calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
    unsigned long newload;
    // FIXED_1 = 2048
    newload = load * exp + active * (FIXED_1 - exp);
    if (active >= load)
        newload += FIXED_1-1;
    return newload / FIXED_1;
}

Here a0 is the previous load, a1 the new load, e the decay factor for the averaging window (EXP_1, EXP_5, or EXP_15, each equal to exp(-5sec/window) scaled by FIXED_1), and a the number of active (runnable + uninterruptible) tasks.
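To see how the update rule above behaves over time, the following user-space sketch repeats the same arithmetic with a constant number of active tasks. It assumes FSHIFT = 11 (so FIXED_1 = 2048) and that the active count is converted to fixed point before each update; simulate_load and load_to_double are hypothetical helpers added for illustration.

```c
#define FSHIFT   11                 /* bits of fractional precision */
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point, i.e. 2048 */
#define EXP_1    1884UL             /* decay factor for the 1-minute average */

/* Same arithmetic as the kernel snippet quoted above. */
unsigned long calc_load(unsigned long load, unsigned long exp,
                        unsigned long active)
{
    unsigned long newload = load * exp + active * (FIXED_1 - exp);

    if (active >= load)
        newload += FIXED_1 - 1;     /* round up while the load is rising */
    return newload / FIXED_1;
}

/*
 * Hypothetical helper: apply `steps` updates with a constant number of
 * active tasks, starting from `load` (fixed point). Each step stands for
 * one 5-second sampling interval.
 */
unsigned long simulate_load(unsigned long load, unsigned long exp,
                            unsigned long nr_active, unsigned int steps)
{
    unsigned long active = nr_active * FIXED_1;  /* convert to fixed point */

    for (unsigned int i = 0; i < steps; i++)
        load = calc_load(load, exp, active);
    return load;
}

/* Convert a fixed-point load to a double for display. */
double load_to_double(unsigned long load)
{
    return (double)load / (double)FIXED_1;
}
```

Starting from zero with one task always active, simulate_load(0, EXP_1, 1, 12) covers one minute and yields a 1-minute load of roughly 0.63. The average approaches the true task count gradually, which is why a short burst shows up as a damped bump rather than a step.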

Calculation Process

The kernel performs two steps periodically:

Each CPU updates a global counter with its runnable and uninterruptible tasks.

A designated CPU (the timer CPU) computes the three load values from that counter.
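The two steps can be modeled in user space as follows. This is a simplified sketch, not the kernel's actual data structures or locking: cpu_state, load_state, fold_cpu, and update_averages are hypothetical names, and the real code uses atomic per-CPU accounting.

```c
#define FSHIFT   11
#define FIXED_1  (1UL << FSHIFT)    /* 1.0 in fixed point, i.e. 2048 */
#define EXP_1    1884UL
#define EXP_5    2014UL
#define EXP_15   2037UL

/* Hypothetical stand-in for one CPU's run-queue counters. */
struct cpu_state {
    long nr_running;          /* runnable tasks on this CPU */
    long nr_uninterruptible;  /* D-state tasks attributed to this CPU */
    long last_active;         /* what this CPU contributed at its last fold */
};

/* Hypothetical stand-in for the global state. */
struct load_state {
    long active_tasks;        /* global counter fed by every CPU */
    unsigned long avenrun[3]; /* 1-, 5-, 15-minute averages, fixed point */
};

unsigned long calc_load(unsigned long load, unsigned long exp,
                        unsigned long active)
{
    unsigned long newload = load * exp + active * (FIXED_1 - exp);

    if (active >= load)
        newload += FIXED_1 - 1;
    return newload / FIXED_1;
}

/* Step 1: a CPU folds the change in its active count into the counter. */
void fold_cpu(struct load_state *g, struct cpu_state *cpu)
{
    long now = cpu->nr_running + cpu->nr_uninterruptible;

    g->active_tasks += now - cpu->last_active;
    cpu->last_active = now;
}

/* Step 2: the timer CPU derives the three averages from the counter. */
void update_averages(struct load_state *g)
{
    unsigned long active = g->active_tasks > 0
        ? (unsigned long)g->active_tasks * FIXED_1 : 0;

    g->avenrun[0] = calc_load(g->avenrun[0], EXP_1, active);
    g->avenrun[1] = calc_load(g->avenrun[1], EXP_5, active);
    g->avenrun[2] = calc_load(g->avenrun[2], EXP_15, active);
}
```

Folding only the difference since the last fold lets each CPU contribute without rescanning every run queue, so the timer CPU reads a single counter.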

The flow is illustrated below:

Common Causes of High Load

1. Periodic Spikes

Sometimes a kernel bug related to the load sampling frequency (LOAD_FREQ) causes regular spikes; this was fixed in kernels ali2016, ali3000, and ali4000.

2. I/O Issues

Disk Bottlenecks

When a disk reaches its IOPS or bandwidth limit, many threads can block in uninterruptible state waiting for I/O. iostat -dx 1 shows per-device utilization and queueing, while vmstat shows elevated b (blocked) and wa (I/O wait) values.
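The blocked-task count can also be read programmatically. The sketch below assumes the text of /proc/stat has already been read into a buffer and looks up the procs_blocked line, which this example takes to be the counter behind vmstat's b column. proc_stat_value is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/*
 * Find a "key value" line (for example "procs_blocked 3") in the text of
 * /proc/stat and return the value, or -1 if the key is not present.
 */
long proc_stat_value(const char *stat_text, const char *key)
{
    size_t klen = strlen(key);
    const char *line = stat_text;

    while (line && *line) {
        if (strncmp(line, key, klen) == 0 && line[klen] == ' ') {
            long value;

            if (sscanf(line + klen, "%ld", &value) == 1)
                return value;
            return -1;
        }
        line = strchr(line, '\n');
        if (line)
            line++;             /* step past the newline to the next line */
    }
    return -1;
}
```

Sampling proc_stat_value(text, "procs_blocked") once per second and comparing it with the load average gives a quick indication of whether the load is dominated by tasks waiting on I/O.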

Cloud Disk Anomalies

Cloud disks may show 100 % I/O utilization, indicating a persistent queue of unfinished requests, which can stall both kernel and application threads.

JBD2 Bugs

Failures in the ext4 journal daemon (jbd2) can block all disk I/O, pushing many tasks into uninterruptible state.

3. Memory Issues

Memory Reclamation

Aggressive memory reclaim can stall tasks until reclamation finishes, raising load and CPU usage.

Memory Bandwidth Contention

Beyond capacity, memory bandwidth can become a bottleneck; specialized tools (e.g., aprof) are needed to observe it.

4. Locks

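Lock contention inflates load in two ways. Tasks contending for spin-locks in critical kernel paths (especially networking) busy-wait on the CPU and stay runnable, while tasks waiting for a held mutex sleep in D (uninterruptible) state. Both are counted in the load average.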

5. User‑Space CPU

When load spikes are driven by legitimate user‑space work, you’ll see high user CPU, increased run queue length, and higher scheduler delay.
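To quantify the user CPU share, one option is to take two samples of the aggregate cpu line in /proc/stat and compare them. The sketch below assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal); cpu_times, parse_cpu_line, and user_cpu_percent are hypothetical names.

```c
#include <stdio.h>
#include <string.h>

struct cpu_times {
    unsigned long long user, nice, system, idle, iowait, irq, softirq, steal;
};

/*
 * Parse the aggregate "cpu ..." line of /proc/stat (not "cpu0", "cpu1", ...).
 * Returns 0 on success, -1 on malformed input.
 */
int parse_cpu_line(const char *line, struct cpu_times *t)
{
    if (strncmp(line, "cpu ", 4) != 0)
        return -1;
    int n = sscanf(line + 4, "%llu %llu %llu %llu %llu %llu %llu %llu",
                   &t->user, &t->nice, &t->system, &t->idle,
                   &t->iowait, &t->irq, &t->softirq, &t->steal);
    return n == 8 ? 0 : -1;
}

static unsigned long long total_time(const struct cpu_times *t)
{
    return t->user + t->nice + t->system + t->idle +
           t->iowait + t->irq + t->softirq + t->steal;
}

/* Share of elapsed CPU time spent in user mode (user + nice), in percent. */
double user_cpu_percent(const struct cpu_times *prev,
                        const struct cpu_times *cur)
{
    unsigned long long total = total_time(cur) - total_time(prev);
    unsigned long long user = (cur->user + cur->nice) -
                              (prev->user + prev->nice);

    if (total == 0)
        return 0.0;
    return 100.0 * (double)user / (double)total;
}
```

A high value here, together with a load average that tracks the run queue rather than the blocked-task count, points toward user-space work rather than I/O or locking.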

Root‑Cause Analysis Techniques

Runnable‑Type Load

Usually tied to increased business traffic or code bugs (e.g., infinite loops). On‑CPU profiling tools like perf or Alibaba’s ali-diagnose help locate hot spots.

Uninterruptible‑Type Load

Identify tasks stuck in D state via /proc/${pid}/stat (third field) and examine /proc/${pid}/stack for the waiting location. Example screenshots:

If D-state tasks are transient and disappear before they can be inspected, latency analysis with kernel probes (SystemTap, kprobes, eBPF) is required; Alibaba’s ali-diagnose provides such delay analyses.
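For the persistent case, the state check on /proc/${pid}/stat can be sketched as below. The helper names task_state and is_uninterruptible are hypothetical. The one subtlety is that the second field, the command name, is wrapped in parentheses and may itself contain spaces or parentheses, so the sketch searches for the last closing parenthesis rather than splitting on whitespace.

```c
#include <string.h>

/*
 * Extract the state character (third field) from a /proc/<pid>/stat line.
 * Returns the state character, or '\0' if the line is malformed.
 */
char task_state(const char *stat_line)
{
    /* The command name may contain ')' itself, so use the LAST one. */
    const char *close = strrchr(stat_line, ')');

    if (!close || close[1] != ' ' || close[2] == '\0')
        return '\0';
    return close[2];
}

/* Nonzero when the task is in uninterruptible sleep (state D). */
int is_uninterruptible(const char *stat_line)
{
    return task_state(stat_line) == 'D';
}
```

Running this over every /proc/[0-9]*/stat entry yields the candidate PIDs, and /proc/${pid}/stack for each of them then shows where in the kernel the task is waiting.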

Conclusion

The Linux kernel’s load average is a concise indicator of runnable and uninterruptible task pressure. By examining both dimensions, checking I/O, memory, locking, and using appropriate tracing tools, you can reliably pinpoint the root cause of load spikes and restore system stability.
