How Linux PSI Quantifies Resource Bottlenecks and Boosts Performance

This article explains Linux's Pressure Stall Information (PSI) mechanism, its /proc and cgroup interfaces, how to monitor CPU, memory, and I/O pressure, and presents code‑level optimizations to reduce PSI overhead and improve system performance.

ByteDance SYS Tech

Dec 30, 2022

Background

From operating-system fundamentals, the performance of a business process depends on how system resources such as CPU, memory, and I/O are allocated to it. A process alternates between waiting for resources and executing, and excessive waiting harms both throughput and latency. PSI quantifies this waiting time.

Purpose

PSI provides a real‑time metric that reflects business process throughput and latency, helping identify resource bottlenecks, guide deployment density, and dynamically adjust resource allocation to meet performance requirements.

Monitoring Interfaces

PSI exports three files under /proc/pressure that report system‑level CPU, memory, and I/O pressure.

# cat /proc/pressure/cpu
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0

some is the share of time in which at least one task was stalled on the resource; full is the share of time in which all non-idle tasks were stalled at once. avg10, avg60, and avg300 are moving-average percentages over 10 s, 60 s, and 300 s windows, while total is the accumulated stall time in microseconds.
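The line format above is regular enough to parse with sscanf(). The following is a minimal sketch, not part of the original article: struct psi_line, psi_parse_line(), and psi_stall_percent() are hypothetical names chosen for illustration. The second helper shows how a custom-window percentage could be derived from two samples of total, assuming the caller measures the interval between the samples itself.

```c
#include <assert.h>
#include <stdio.h>

/* One line of a pressure file, e.g.
 * "some avg10=0.00 avg60=0.00 avg300=0.00 total=0". */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* percentages over 10 s, 60 s, 300 s */
    unsigned long long total_us;  /* cumulative stall time in microseconds */
};

/* Parse one line; returns 0 on success, -1 if the line does not match. */
static int psi_parse_line(const char *line, struct psi_line *out)
{
    assert(line != NULL && out != NULL);
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total_us);
    return n == 5 ? 0 : -1;
}

/* Stall percentage between two samples of total taken interval_us apart. */
static double psi_stall_percent(unsigned long long prev_total_us,
                                unsigned long long cur_total_us,
                                unsigned long long interval_us)
{
    if (interval_us == 0 || cur_total_us < prev_total_us)
        return 0.0;
    return 100.0 * (double)(cur_total_us - prev_total_us) / (double)interval_us;
}
```

For example, if total grows by 150000 µs between two reads taken one second apart, tasks were stalled for 15% of that second.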

Trigger Interface

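Besides reading these files, a program can register a trigger by writing to one of them. The trigger names the stall type (some or full), the stall threshold in microseconds, and the tracking window in microseconds. The program can then wait for events using select(), poll(), or epoll().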

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Monitor memory partial stall with 1s tracking window size and 150ms threshold. */
int main(void) {
    /* Trigger format: stall type, threshold (us), window (us). */
    const char trig[] = "some 150000 1000000";
    struct pollfd fds;
    int n;

    fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fds.fd < 0) {
        printf("/proc/pressure/memory open error: %s\n", strerror(errno));
        return 1;
    }
    /* PSI trigger events are delivered as POLLPRI. */
    fds.events = POLLPRI;

    /* Writing the trigger string registers it on this file descriptor. */
    if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
        printf("/proc/pressure/memory write error: %s\n", strerror(errno));
        return 1;
    }

    printf("waiting for events...\n");
    while (1) {
        /* Block until the threshold is exceeded within the window. */
        n = poll(&fds, 1, -1);
        if (n < 0) {
            printf("poll error: %s\n", strerror(errno));
            return 1;
        }
        if (fds.revents & POLLERR) {
            printf("got POLLERR, event source is gone\n");
            return 0;
        }
        if (fds.revents & POLLPRI) {
            printf("event triggered!\n");
        } else {
            printf("unknown event received: 0x%x\n", fds.revents);
            return 1;
        }
    }
    return 0;
}
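The trigger string in the program above is hard-coded. A small helper could build and validate it before the write. This is a sketch under stated assumptions: psi_format_trigger() is a hypothetical name, and the 500 ms to 10 s window range is the limit described in the kernel's PSI documentation, so it may differ on other kernel versions or for privileged callers.

```c
#include <assert.h>
#include <stdio.h>

/* Window limits as described in the kernel PSI documentation (assumed here). */
#define PSI_WIN_MIN_US   500000ULL
#define PSI_WIN_MAX_US 10000000ULL

/*
 * Hypothetical helper: format "some|full threshold_us window_us" into buf.
 * Returns the string length, or -1 if the arguments are invalid or
 * buf is too small.
 */
static int psi_format_trigger(char *buf, size_t len, int full,
                              unsigned long long threshold_us,
                              unsigned long long window_us)
{
    assert(buf != NULL);
    if (window_us < PSI_WIN_MIN_US || window_us > PSI_WIN_MAX_US)
        return -1;
    /* A threshold of zero or one longer than the window can never be useful. */
    if (threshold_us == 0 || threshold_us > window_us)
        return -1;
    int n = snprintf(buf, len, "%s %llu %llu",
                     full ? "full" : "some", threshold_us, window_us);
    return (n < 0 || (size_t)n >= len) ? -1 : n;
}
```

Calling psi_format_trigger(buf, sizeof buf, 0, 150000, 1000000) produces the same "some 150000 1000000" string used in the program above. Validating early gives a clearer error than a failed write().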

cgroup Interface

When using cgroup‑v2, PSI also provides per‑cgroup pressure files: cpu.pressure, memory.pressure, and io.pressure inside each cgroup directory. They are read and written in the same way as the /proc/pressure files.
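Because the per-cgroup files share the /proc/pressure format, reading them only requires a different path. The sketch below assumes the cgroup-v2 hierarchy is mounted at a path the caller supplies (commonly /sys/fs/cgroup, though this depends on the system). psi_cgroup_path() and psi_dump() are hypothetical helper names, and "web.slice" in the usage note is a made-up cgroup.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Build "mount/cgroup/resource.pressure", e.g. resource = "memory".
 * Returns 0 on success, -1 if the result does not fit in buf.
 */
static int psi_cgroup_path(char *buf, size_t len, const char *mount,
                           const char *cgroup, const char *resource)
{
    assert(buf && mount && cgroup && resource);
    int n = snprintf(buf, len, "%s/%s/%s.pressure", mount, cgroup, resource);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Print every line of a pressure file; returns -1 if it cannot be opened. */
static int psi_dump(const char *path)
{
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f)
        return -1;
    while (fgets(line, sizeof line, f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```

A caller would build the path with psi_cgroup_path(path, sizeof path, "/sys/fs/cgroup", "web.slice", "memory") and pass the result to psi_dump().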

Performance Optimization

While PSI offers valuable real‑time metrics, its hooks add overhead by instrumenting scheduler, I/O, and memory paths. To keep PSI enabled in production, the overhead must be mitigated.

Code Analysis

psi_task_change() is invoked at every PSI hook point (for example, when a task starts or stops waiting for CPU or memory). It also updates every cgroup the task belongs to, so it is both called frequently and costly per call.

void psi_task_change(struct task_struct *task, int clear, int set) {
    int cpu = task_cpu(task);
    struct psi_group *group;
    bool wake_clock = true;
    void *iter = NULL;

    if (!task->pid)
        return;

    psi_flags_change(task, clear, set);
    if (unlikely((clear & TSK_RUNNING) &&
                 (task->flags & PF_WQ_WORKER) &&
                 wq_worker_last_func(task) == psi_avgs_work))
        wake_clock = false;

    while ((group = iterate_groups(task, &iter)))
        psi_group_change(group, cpu, clear, set, wake_clock);
}

Optimization Strategies

Leverage common cgroups: when the scheduler switches from task A to task B and the two share a cgroup, skip the redundant state updates for the shared groups and reduce psi_task_change() calls.

Reduce sleep-induced state switches: fold the psi_group_change() call triggered by psi_dequeue() into psi_task_switch(), eliminating unnecessary cgroup updates.
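The first strategy can be illustrated with a toy model. This is not kernel code: struct toy_group and the toy_* functions are hypothetical, the model covers a single CPU, and it tracks only an on-CPU counter per group. It compares a naive task switch, which touches every ancestor of both tasks, with one that stops at the first cgroup the two tasks share. The sleep-path change from the second strategy is not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Toy cgroup: only a parent link and an on-CPU task counter. */
struct toy_group {
    struct toy_group *parent;  /* NULL for the root */
    int oncpu;                 /* on-CPU tasks in this group or its descendants */
};

static void toy_group_change(struct toy_group *g, int delta)
{
    g->oncpu += delta;
    assert(g->oncpu >= 0);
}

/* Naive switch: clear prev and set next on every ancestor. */
static int toy_switch_naive(struct toy_group *prev, struct toy_group *next)
{
    int updates = 0;
    for (struct toy_group *g = prev; g; g = g->parent, updates++)
        toy_group_change(g, -1);
    for (struct toy_group *g = next; g; g = g->parent, updates++)
        toy_group_change(g, +1);
    return updates;
}

/* Shared-cgroup switch: groups common to prev and next keep their state. */
static int toy_switch_shared(struct toy_group *prev, struct toy_group *next)
{
    struct toy_group *common = NULL;
    int updates = 0;

    /* Walk up from next until a group that prev already marks as on-CPU. */
    for (struct toy_group *g = next; g; g = g->parent) {
        if (g->oncpu) {
            common = g;
            break;
        }
        toy_group_change(g, +1);
        updates++;
    }
    /* Clear prev only below the common ancestor. */
    for (struct toy_group *g = prev; g && g != common; g = g->parent) {
        toy_group_change(g, -1);
        updates++;
    }
    return updates;
}
```

With a hierarchy root / svc / {a, b} and a switch from a task in a to a task in b, the naive version performs six group updates while the shared version performs two. When both tasks sit in the same leaf cgroup, the shared version performs none.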

Conclusion

PSI support has been merged into the mainline kernel (since Linux 4.20) and is enabled by default in the 5.4 veLinux kernel. Combined with cgroup‑v2, it helps discover resource bottlenecks, enables dynamic resource management, and powers projects such as oomd, a userspace daemon for out‑of‑memory handling.
