Why High Load Doesn’t Always Mean High CPU: Decoding Linux Load Average & CPU Metrics

This guide covers Linux load average and CPU utilization: process states, how load is calculated, how load differs from CPU usage, common bottlenecks, and step-by-step troubleshooting with tools such as top, vmstat, pidstat, iostat, and perf.

Alibaba Cloud Developer

Background Knowledge

Since kernel 2.6, Linux processes have seven basic states: D (uninterruptible sleep), R (running), S (interruptible sleep), T (stopped), t (traced), X (dead), Z (zombie). These are the state codes that the ps command shows in its STAT (or S) column.

D (TASK_UNINTERRUPTIBLE) : Uninterruptible sleep, usually caused by waiting on I/O. The process does not handle signals while in this state, so even kill -9 has no effect until it wakes up; it does not consume CPU.

R (TASK_RUNNING) : Runnable or running on the CPU.

S (TASK_INTERRUPTIBLE) : Interruptible sleep, waiting for events such as socket connections or semaphores; also does not consume CPU.

T/t (__TASK_STOPPED & __TASK_TRACED) : Stopped (by a signal such as SIGSTOP) or traced (halted by a debugger); the process does not run, and so uses no CPU, until it is resumed.

Z (EXIT_ZOMBIE) : Process has exited but the parent has not yet reaped it.

X (EXIT_DEAD) : Final dead state, rarely observed.
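
To see how many processes are currently in each of these states, the state letter can be read from /proc/<pid>/stat. The following is a minimal Python sketch, assuming a Linux host with /proc mounted; the helper names parse_state and count_states are illustrative and not part of any tool mentioned here.

```python
import os
from collections import Counter


def parse_state(stat_line):
    """Return the one-letter state from the contents of /proc/<pid>/stat.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split after the LAST ')'.
    """
    return stat_line.rsplit(")", 1)[1].split()[0]


def count_states(proc_root="/proc"):
    """Tally the processes found under proc_root by state letter."""
    counts = Counter()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a PID directory
        path = os.path.join(proc_root, entry, "stat")
        try:
            with open(path, errors="replace") as f:
                counts[parse_state(f.read())] += 1
        except OSError:
            continue  # the process exited while we were scanning
    return counts
```

Calling count_states() returns a Counter keyed by state letter. A persistently large D count is the pattern that matters most for the load discussion that follows.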

Load Average & CPU Utilization

Load average and CPU usage are the two most intuitive performance metrics, but they are calculated differently and are not equivalent.

Load Average

Many assume load represents the number of processes running or waiting for CPU, but in Linux the calculation also includes processes in uninterruptible sleep (I/O wait). The kernel source shows that both TASK_RUNNING and TASK_UNINTERRUPTIBLE contribute to the load count.

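Therefore, Linux load average reflects overall demand on the system rather than CPU demand alone: tasks running or waiting for a CPU, plus tasks blocked in uninterruptible waits on disk I/O, network I/O (for example, network file systems), or other device I/O. It cannot be equated with CPU utilization.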
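
The three load figures can be read directly from /proc/loadavg. The sketch below parses that file's contents and divides a load value by the number of logical CPUs; the function names are illustrative. Because load also counts tasks in uninterruptible sleep, the per-CPU figure is only a rough guide to CPU pressure.

```python
import os


def parse_loadavg(text):
    """Parse /proc/loadavg contents into (load1, load5, load15, runnable, total).

    The fourth field has the form "runnable/total" scheduling entities.
    """
    fields = text.split()
    load1, load5, load15 = (float(x) for x in fields[:3])
    runnable, total = (int(x) for x in fields[3].split("/"))
    return load1, load5, load15, runnable, total


def load_per_cpu(load, cpus=None):
    """Divide a load figure by the number of logical CPUs."""
    cpus = cpus or os.cpu_count() or 1
    return load / cpus
```

On a live system the input would come from open("/proc/loadavg").read(). A per-CPU value well above 1 alongside mostly idle CPUs suggests that the load comes from blocked tasks rather than from computation.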

CPU Utilization

CPU time is divided into four main categories: user time, system time, idle time, and steal time. Overall CPU utilization is typically the sum of user and system time.

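Performance tools split these further into eight categories (as shown by top): us (user), sy (system), ni (niced user processes), id (idle), wa (I/O wait), hi (hardware interrupts), si (software interrupts), and st (steal).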
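
These categories correspond to the counters on the aggregate cpu line of /proc/stat, which holds cumulative ticks since boot, so a utilization figure comes from the difference between two samples. The sketch below assumes the kernel reports at least eight counters on that line, in the order user, nice, system, idle, iowait, irq, softirq, steal. The function names are illustrative.

```python
# top-style names, in the order the counters appear on the 'cpu' line
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Map the aggregate 'cpu' line of /proc/stat to tick counters by name."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Only the first eight counters are used; any later fields are ignored.
    return dict(zip(FIELDS, (int(v) for v in parts[1:9])))


def cpu_percentages(before, after):
    """Share of elapsed CPU time per category between two samples, in percent."""
    deltas = {k: after[k] - before[k] for k in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {k: 0.0 for k in FIELDS}
    return {k: 100.0 * d / total for k, d in deltas.items()}
```

Taking two samples a few seconds apart and adding the us and sy shares gives the "user plus system" utilization described above, while a large wa share points toward I/O.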

Resource & Bottleneck Analysis

Different combinations of load and CPU metrics point to distinct bottlenecks:

High Load & High CPU : The load increase is driven by CPU demand. The sub-cases are distinguished by which CPU time category dominates, for example us, sy, or si.

High Load & Low CPU : Many processes are in uninterruptible sleep (I/O bound). Identify whether disk I/O or network I/O is the cause.
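
The two combinations above can be expressed as a small decision function. This is only a sketch: the thresholds (load above the CPU count, 80% busy, 20% I/O wait) are hypothetical defaults chosen for illustration, not values taken from this article or from any tool, and they should be tuned per system.

```python
def classify(load1, cpus, busy_pct, iowait_pct,
             load_factor=1.0, busy_threshold=80.0, iowait_threshold=20.0):
    """Map load and CPU figures to one of the cases described above.

    busy_pct is user + system CPU time in percent; iowait_pct is the wa share.
    All three thresholds are illustrative assumptions.
    """
    if load1 <= load_factor * cpus:
        return "load within capacity"
    if busy_pct >= busy_threshold:
        return "high load, high CPU: CPU-bound"
    if iowait_pct >= iowait_threshold:
        return "high load, low CPU: likely I/O-bound"
    return "high load, low CPU: check for tasks in D state"
```

For example, classify(12.0, 4, 15.0, 45.0) falls into the I/O-bound case, which is the cue to look at disk or network I/O next.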

Investigation Strategy

The troubleshooting workflow consists of four stages:

Resource Bottleneck Location : Use global performance tools (top, vmstat, tsar) to spot abnormal resource consumption. Also inspect /proc/softirqs, /proc/interrupts, iostat, dstat.

Hot Process Identification : After locating the bottleneck, find the offending processes with pidstat -w, pidstat -u, iotop, pidstat -d, or ps for zombies.

Thread & Process Internal Resource Location : Drill down into a specific PID using pidstat -w -p [pid], pidstat -u -p [pid], or lsof for I/O.

Hot Event & Method Analysis : Capture stack traces with perf, jstack, strace, or network captures with tcpdump to pinpoint the exact code path.
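
The hot process identification stage can be approximated without extra tools by sampling per-process CPU ticks twice and ranking the differences. The sketch below is a rough stand-in for what pidstat -u reports, not a description of how pidstat works; it assumes the /proc/<pid>/stat layout in which fields 14 and 15 are utime and stime, and the function names are illustrative.

```python
def parse_cpu_ticks(stat_line):
    """Return utime + stime (clock ticks) from the contents of /proc/<pid>/stat."""
    # Split after the last ')' because the command name may contain spaces.
    rest = stat_line.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    return int(rest[11]) + int(rest[12])


def hottest(before, after, n=5):
    """Rank PIDs by CPU ticks consumed between two {pid: ticks} snapshots."""
    # PIDs that are missing from either snapshot are skipped.
    deltas = {pid: after[pid] - before[pid] for pid in after if pid in before}
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]
```

Each snapshot would be built by applying parse_cpu_ticks to every PID directory under /proc. The top entries are then candidates for the per-PID drill-down and stack capture stages.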

Key Tools

top, vmstat, tsar (historical)

/proc/softirqs, /proc/interrupts

iostat, dstat

pidstat (-w, -u, -d, -p)

perf, jstack, strace, tcpdump

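The kernel excerpt referred to in the Load Average section follows. count_active_tasks() counts tasks that are either runnable or in uninterruptible sleep, and calc_load() folds that count into the 1-, 5-, and 15-minute averages.

/* Count the tasks that contribute to load, as a fixed-point value. */
static unsigned long count_active_tasks(void)
{
    struct task_struct *p;
    unsigned long nr = 0;
    read_lock(&tasklist_lock);
    for_each_task(p) {
        /* Runnable (R) and uninterruptible (D) tasks both count. */
        if ((p->state == TASK_RUNNING) || (p->state & TASK_UNINTERRUPTIBLE))
            nr += FIXED_1;
    }
    read_unlock(&tasklist_lock);
    return nr;
}

/* Refresh the three averages once every LOAD_FREQ ticks. */
static inline void calc_load(unsigned long ticks)
{
    unsigned long active_tasks; /* fixed-point */
    static int count = LOAD_FREQ;
    count -= ticks;
    if (count < 0) {
        count += LOAD_FREQ;
        active_tasks = count_active_tasks();
        CALC_LOAD(avenrun[0], EXP_1, active_tasks);  /* 1-minute average */
        CALC_LOAD(avenrun[1], EXP_5, active_tasks);  /* 5-minute average */
        CALC_LOAD(avenrun[2], EXP_15, active_tasks); /* 15-minute average */
    }
}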
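
The averaging step in calc_load can be illustrated in floating point. The sketch below assumes that CALC_LOAD is an exponentially weighted moving average, that LOAD_FREQ corresponds to a 5-second sampling interval, and that EXP_1, EXP_5, and EXP_15 are the decay factors for 60-, 300-, and 900-second windows. The kernel itself works in fixed-point arithmetic, so this is an approximation of the idea rather than a copy of the macro, and all names in it are illustrative.

```python
import math

SAMPLE_INTERVAL_S = 5  # assumption: LOAD_FREQ corresponds to 5 seconds


def decay_factor(window_s, interval_s=SAMPLE_INTERVAL_S):
    """Weight given to the previous average at each sample."""
    return math.exp(-interval_s / window_s)


def update_load(load, active_tasks, window_s):
    """One averaging step: blend the old average with the current task count."""
    d = decay_factor(window_s)
    return load * d + active_tasks * (1.0 - d)


def simulate(active_tasks, seconds, window_s=60, start=0.0):
    """Apply update_load every SAMPLE_INTERVAL_S seconds with a constant task count."""
    load = start
    for _ in range(seconds // SAMPLE_INTERVAL_S):
        load = update_load(load, active_tasks, window_s)
    return load
```

Under these assumptions, one task that stays runnable (or blocked in D state) on an otherwise idle system lifts the 1-minute figure to about 0.63 after 60 seconds, and the figure approaches 1.0 only gradually. This is why short bursts barely register in the 15-minute value.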

By correlating the metrics and using the above tools, engineers can quickly pinpoint whether a performance issue stems from CPU saturation, I/O wait, network bottlenecks, or misbehaving application code.
