
Understanding Linux PSI: Pressure Stall Information for System Resource Monitoring

Pressure Stall Information (PSI) is a Linux kernel feature that measures real‑time CPU, memory, and I/O pressure by tracking task wait times, offering finer granularity than load average or vmpressure, and enabling more accurate scheduling, cgroup management, and out‑of‑memory handling.

OPPO Kernel Craftsman

Pressure Stall Information (PSI) provides a method for evaluating system resource pressure. A system has three fundamental resources: CPU, Memory, and IO. Regardless of how much these resources are increased, they never seem to fully satisfy software demands. Once resource competition occurs, it can lead to increased latency, causing users to experience system lag.

Without an accurate way to detect resource pressure, one of two things tends to happen: resource users are overly conservative and leave capacity unused, or they overcommit, so that contention becomes frequent and wait latency grows. Accurate detection helps resource users size their workloads appropriately, and it helps the system apply efficient scheduling policies that maximize utilization and improve user experience.

In 2018, Facebook open-sourced a set of Linux kernel components and related tools for solving critical compute cluster management problems. PSI is an important resource measurement tool among them: it detects the degree of resource contention in real time and reports it as time spent waiting, which gives users and resource schedulers a simple and accurate basis for decisions.

Why PSI Was Introduced

Before PSI, Linux had other resource pressure evaluation methods, most notably Load Average and Vmpressure.

Load Average

System load average refers to the average number of processes in the running queue (running on CPU or waiting to run) during a specific time interval. The sum of running and uninterruptible state processes in Linux represents the current system load. The algorithm is:

for_each_possible_cpu(cpu)
    nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;

avenrun[n] = avenrun[n] * exp_n + nr_active * (1 - exp_n)
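The update rule above is an exponentially damped moving average in which each average is refreshed from its own previous value. A minimal floating-point sketch follows, assuming the 5-second sampling interval mentioned in this article and the usual 1, 5, and 15 minute windows. The helper names `decay_factor` and `update_loadavg` are hypothetical, and the kernel itself uses fixed-point arithmetic rather than floats.

```python
import math

SAMPLE_INTERVAL_S = 5  # sampling interval described in the article


def decay_factor(window_s: float, interval_s: float = SAMPLE_INTERVAL_S) -> float:
    """Return exp_n for one averaging window (hypothetical helper)."""
    return math.exp(-interval_s / window_s)


def update_loadavg(avenrun, nr_active, windows=(60, 300, 900)):
    """Apply one sampling step to the 1, 5, and 15 minute averages."""
    return [
        avg * decay_factor(w) + nr_active * (1.0 - decay_factor(w))
        for avg, w in zip(avenrun, windows)
    ]


# Feed a constant load of 4 active tasks for 30 minutes of samples.
avenrun = [0.0, 0.0, 0.0]
for _ in range(30 * 60 // SAMPLE_INTERVAL_S):
    avenrun = update_loadavg(avenrun, nr_active=4)
print([round(a, 2) for a in avenrun])
```

Under a constant load the 1-minute average converges quickly, while the 15-minute average is still catching up after half an hour, which illustrates how slowly the metric reacts.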

Load Average has inherent limitations: because uninterruptible tasks are counted together with runnable ones, the metric cannot tell whether tasks are waiting for CPU or for IO; the finest time granularity is 1 minute, built from samples taken every 5 seconds; and the result is a task count, which has to be compared against the number of CPUs before it means anything.
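To illustrate the last limitation, a raw load average only becomes comparable across machines once it is divided by the CPU count. The sketch below uses a hypothetical helper, `load_per_cpu`, together with the standard `os.getloadavg()` and `os.cpu_count()` calls; treating a per-CPU value near 1.0 as "fully busy" is a rule of thumb, not something this article defines.

```python
import os


def load_per_cpu(load: float, cpu_count: int) -> float:
    """Divide a raw load average by the CPU count (hypothetical helper)."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load / cpu_count


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()  # Unix only
    cpus = os.cpu_count() or 1
    print(f"1-min load per CPU: {load_per_cpu(one_min, cpus):.2f}")
```

A load of 8 means very different things on a 4-CPU and a 64-CPU machine, and even the normalized value still mixes CPU and IO waiters.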

Vmpressure

Vmpressure is calculated each time the system attempts memory reclamation through do_try_to_free_pages. The formula is (1 - reclaimed/scanned) * 100: the larger the share of scanned pages that could not be reclaimed, the greater the memory pressure. Vmpressure also provides a notification mechanism with three pressure levels: low, medium, and critical.
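The formula can be sketched in a few lines. The function names below are hypothetical, and the 60 and 95 cut-offs for the medium and critical levels are assumed defaults that are not stated in this article; check mm/vmpressure.c in your kernel tree before relying on them.

```python
def vmpressure_percent(scanned: int, reclaimed: int) -> int:
    """Return (1 - reclaimed/scanned) * 100 using integer arithmetic."""
    if scanned <= 0:
        return 0  # nothing was scanned, so there is no signal
    reclaimed = min(reclaimed, scanned)
    return 100 * (scanned - reclaimed) // scanned


def vmpressure_level(pressure: int, medium: int = 60, critical: int = 95) -> str:
    """Map a pressure percentage to a level (thresholds are assumptions)."""
    if pressure >= critical:
        return "critical"
    if pressure >= medium:
        return "medium"
    return "low"


for scanned, reclaimed in [(1000, 900), (1000, 300), (1000, 10)]:
    pct = vmpressure_percent(scanned, reclaimed)
    print(scanned, reclaimed, pct, vmpressure_level(pct))
```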

PSI Software Architecture

PSI exposes two kinds of interface through the filesystem: a system-level interface that reports overall resource pressure, and a finer-grained per-group interface integrated with control groups. PSI tracks the time tasks spend waiting for memory, IO, and CPU by instrumenting the memory management and scheduler code.

Task-level information is aggregated into per-CPU stall times for each PSI group. Each PSI group then recomputes its PSI values at a fixed period, using running averages to smooth out short-term fluctuation.

PSI User Interface Definition

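Each resource's pressure information is exposed through a separate file under /proc/pressure/: cpu, memory, and io. The CPU file has the following format:

some avg10=2.98 avg60=2.81 avg300=1.41 total=26810

The avg10, avg60, and avg300 fields give the percentage of time stalled over the last 10, 60, and 300 seconds, and total is the cumulative stall time in microseconds. Memory and IO report both "some" and "full" lines: "some" is the time during which at least one task was blocked, and "full" is the time during which all non-idle tasks were blocked simultaneously.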
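Reading these files from user space is straightforward. The sketch below parses the line format shown above into a dictionary. `parse_pressure` and `read_pressure` are hypothetical helper names, and the "full" line in the second sample uses made-up values purely for illustration.

```python
def parse_pressure(text: str) -> dict:
    """Parse the contents of a /proc/pressure/<resource> file."""
    result = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        kind, *fields = line.split()  # kind is "some" or "full"
        values = dict(field.split("=", 1) for field in fields)
        result[kind] = {
            "avg10": float(values["avg10"]),    # percent of time stalled
            "avg60": float(values["avg60"]),
            "avg300": float(values["avg300"]),
            "total_us": int(values["total"]),   # cumulative microseconds
        }
    return result


def read_pressure(resource: str) -> dict:
    """Read and parse /proc/pressure/<resource> (needs a PSI-enabled kernel)."""
    with open(f"/proc/pressure/{resource}") as f:
        return parse_pressure(f.read())


cpu_sample = "some avg10=2.98 avg60=2.81 avg300=1.41 total=26810\n"
mem_sample = (
    "some avg10=1.50 avg60=0.75 avg300=0.20 total=5000\n"
    "full avg10=0.50 avg60=0.25 avg300=0.10 total=1200\n"
)  # illustrative values only
print(parse_pressure(cpu_sample))
print(parse_pressure(mem_sample))
```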

Source Code Analysis

PSI source code is relatively simple, with core functionality implemented in kernel/sched/psi.c. Initialization involves creating proc filesystem nodes in psi_proc_init and initializing statistical management structures in psi_init. The system has six PSI states: PSI_IO_SOME, PSI_IO_FULL, PSI_MEM_SOME, PSI_MEM_FULL, PSI_CPU_SOME, and PSI_NONIDLE.

The core challenge of PSI is accurately capturing task state changes and measuring how long each state lasts. PSI adds a psi_flags member to task_struct and marks task states with TSK_IOWAIT, TSK_MEMSTALL, and TSK_RUNNING. The marking is done mainly in the psi_task_change function, which is called whenever a task enters or leaves the scheduling queue.
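The relationship between the per-task flags and the six PSI states can be sketched as a pure function over per-CPU task counts. This is a simplified model that follows the "some" and "full" definitions given earlier, not a copy of the kernel code; the exact conditions differ between kernel versions, and the `TaskCounts` type and `psi_states` function are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class TaskCounts:
    """Number of tasks on one CPU carrying each flag (hypothetical type)."""
    iowait: int = 0    # tasks marked TSK_IOWAIT
    memstall: int = 0  # tasks marked TSK_MEMSTALL
    running: int = 0   # tasks marked TSK_RUNNING


def psi_states(c: TaskCounts) -> set:
    """Return the set of PSI states that are active for these counts."""
    states = set()
    if c.iowait:
        states.add("PSI_IO_SOME")
        if not c.running:
            states.add("PSI_IO_FULL")   # nobody is making progress
    if c.memstall:
        states.add("PSI_MEM_SOME")
        if not c.running:
            states.add("PSI_MEM_FULL")
    if c.running > 1:
        states.add("PSI_CPU_SOME")      # a runnable task is waiting for the CPU
    if c.iowait or c.memstall or c.running:
        states.add("PSI_NONIDLE")
    return states


print(sorted(psi_states(TaskCounts(memstall=1))))
print(sorted(psi_states(TaskCounts(memstall=1, running=1))))
print(sorted(psi_states(TaskCounts(running=2))))
```

The time a CPU spends in each state is what later gets accumulated and averaged.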

The periodic worker psi_update_work refreshes the statistics and schedules its next wakeup one PSI_FREQ period (2 seconds) later. Each update calls get_recent_times to collect the state times from every CPU and calc_avgs to update the proportion of time stalled over the 10s, 60s, and 300s windows.
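The averaging step follows the same exponential decay idea as the load average, only with a 2-second period and 10s, 60s, and 300s windows. The sketch below is a floating-point approximation under that assumption; the kernel works in fixed-point, and `update_avgs` is a hypothetical name.

```python
import math

PSI_PERIOD_S = 2.0          # update period described above
WINDOWS_S = (10, 60, 300)   # avg10, avg60, avg300


def update_avgs(avgs, stall_s, period_s=PSI_PERIOD_S):
    """Fold one period's stall time into the three running averages."""
    pct = 100.0 * stall_s / period_s  # share of this period spent stalled
    updated = []
    for avg, window in zip(avgs, WINDOWS_S):
        decay = math.exp(-period_s / window)
        updated.append(avg * decay + pct * (1.0 - decay))
    return updated


# Simulate 60 seconds in which 0.5 s of every 2 s period is stalled.
avgs = [0.0, 0.0, 0.0]
for _ in range(30):
    avgs = update_avgs(avgs, stall_s=0.5)
print([round(a, 2) for a in avgs])
```

The short window approaches the true 25% stall rate within a minute, while the 300s window is still far below it, which is why the three values are useful together.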

PSI Applications

With PSI providing an accurate measure of resource pressure, many useful features can be built to maximize resource utilization. Facebook developed cgroup2 and oomd (a user-space out-of-memory monitoring service). Android replaced the default in-kernel LMK (low memory killer) with the user-space daemon LMKD, which sets thresholds on the SOME and FULL values of /proc/pressure/memory and is woken to terminate processes when stall latency exceeds those thresholds.
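One way to express such a threshold from user space is a PSI trigger: a string of the form "<some|full> <stall in us> <window in us>" written to the pressure file, after which the file descriptor can be polled for POLLPRI events. The sketch below assumes that interface as described in the kernel's PSI documentation, including the 500 ms to 10 s window limit. The function names and the 150 ms per 1 s threshold are illustrative, this is not how LMKD itself is implemented, and registering a trigger may require elevated privileges on some kernels.

```python
import os
import select


def format_trigger(kind: str, stall_us: int, window_us: int) -> str:
    """Build a PSI trigger string (hypothetical helper)."""
    if kind not in ("some", "full"):
        raise ValueError("kind must be 'some' or 'full'")
    if not 500_000 <= window_us <= 10_000_000:
        raise ValueError("window must be between 500 ms and 10 s")
    if not 0 < stall_us <= window_us:
        raise ValueError("stall must be positive and no larger than the window")
    return f"{kind} {stall_us} {window_us}"


def wait_for_pressure(path: str, trigger: str, timeout_ms: int = 10_000) -> bool:
    """Register a trigger and wait once; True if the threshold was crossed."""
    fd = os.open(path, os.O_RDWR | os.O_NONBLOCK)
    try:
        os.write(fd, trigger.encode() + b"\0")  # NUL-terminated, as in the C example
        poller = select.poll()
        poller.register(fd, select.POLLPRI)
        events = poller.poll(timeout_ms)
        return any(event & select.POLLPRI for _, event in events)
    finally:
        os.close(fd)


# Illustrative threshold: 150 ms of partial memory stall within any 1 s window.
trigger = format_trigger("some", 150_000, 1_000_000)
print(trigger)
# wait_for_pressure("/proc/pressure/memory", trigger)  # needs a PSI-enabled kernel
```

A daemon would call the wait function in a loop and run its reclaim or kill policy each time it returns True.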

Tags: Memory Management, Linux kernel, system performance, CPU scheduling, Pressure Stall Information, PSI, Resource Monitoring