How Linux PSI Quantifies Resource Bottlenecks and Boosts Performance
This article explains Linux's Pressure Stall Information (PSI) mechanism, its /proc and cgroup interfaces, how to monitor CPU, memory, and I/O pressure, and presents code‑level optimizations to reduce PSI overhead and improve system performance.
Background
From an operating-system perspective, the performance of a business process depends on how system resources such as CPU, memory, and I/O are allocated to it. A process alternates between waiting for resources and executing, and excessive waiting harms throughput and latency. PSI quantifies this waiting time.
Purpose
PSI provides a real‑time metric that reflects business process throughput and latency, helping identify resource bottlenecks, guide deployment density, and dynamically adjust resource allocation to meet performance requirements.
Monitoring Interfaces
PSI exports three files under /proc/pressure that report system‑level CPU, memory, and I/O pressure.
<code># cat /proc/pressure/cpu
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# cat /proc/pressure/memory
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0
# cat /proc/pressure/io
some avg10=0.00 avg60=0.00 avg300=0.00 total=0
full avg10=0.00 avg60=0.00 avg300=0.00 total=0</code>
The some line reports the share of time in which at least one task was stalled on the resource; the full line reports the share of time in which all non-idle tasks were stalled simultaneously. avg10, avg60, and avg300 are moving-average percentages over 10 s, 60 s, and 300 s windows, while total is the accumulated stall time in microseconds.
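To make the field layout concrete, the following is a minimal C sketch that parses one line of a pressure file and derives a stall percentage from two samples of total. The names struct psi_line, psi_parse_line(), and psi_stall_percent() are hypothetical helpers introduced here for illustration; they are not part of any kernel or library API. The sketch assumes total is reported in microseconds.

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of a PSI pressure file (hypothetical helper type). */
struct psi_line {
    char kind[8];                 /* "some" or "full" */
    double avg10, avg60, avg300;  /* percentages over 10 s, 60 s, 300 s */
    unsigned long long total;     /* accumulated stall time, microseconds */
};

/* Parse a line such as
 *   "some avg10=0.00 avg60=0.00 avg300=0.00 total=0".
 * Returns 0 on success, -1 if the line does not match that layout. */
int psi_parse_line(const char *line, struct psi_line *out)
{
    int n = sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
                   out->kind, &out->avg10, &out->avg60, &out->avg300,
                   &out->total);
    if (n != 5)
        return -1;
    if (strcmp(out->kind, "some") != 0 && strcmp(out->kind, "full") != 0)
        return -1;
    return 0;
}

/* Stall percentage between two samples of total taken interval_us apart.
 * Returns 0.0 for a zero interval or a counter that went backwards. */
double psi_stall_percent(unsigned long long total_prev,
                         unsigned long long total_now,
                         unsigned long long interval_us)
{
    if (interval_us == 0 || total_now < total_prev)
        return 0.0;
    return 100.0 * (double)(total_now - total_prev) / (double)interval_us;
}
```

Sampling total twice and calling psi_stall_percent() gives a pressure figure over any interval the caller chooses, which can be useful when the fixed 10 s, 60 s, and 300 s averages are too coarse or too slow.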
Trigger Interface
Besides being read, these files can be written to register a trigger; a program can then wait for events using select(), poll(), or epoll().
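A trigger is a single string: the stall type (some or full), then the stall threshold and the tracking window, both in microseconds. The sketch below builds and validates such a string. psi_format_trigger() and the two PSI_WINDOW_* macros are hypothetical names used only for illustration. The 500 ms to 10 s window range follows the kernel's PSI documentation and should be treated as an assumption that may vary between kernel versions.

```c
#include <stdio.h>
#include <string.h>

/* Assumed window limits, taken from the kernel PSI documentation. */
#define PSI_WINDOW_MIN_US 500000UL    /* 500 ms */
#define PSI_WINDOW_MAX_US 10000000UL  /* 10 s */

/* Build "<some|full> <threshold_us> <window_us>" into buf.
 * Returns the string length, or -1 if an argument is invalid or the
 * buffer is too small. */
int psi_format_trigger(char *buf, size_t size, const char *kind,
                       unsigned long threshold_us, unsigned long window_us)
{
    int n;

    if (strcmp(kind, "some") != 0 && strcmp(kind, "full") != 0)
        return -1;
    if (window_us < PSI_WINDOW_MIN_US || window_us > PSI_WINDOW_MAX_US)
        return -1;
    /* A threshold of zero, or one longer than the window, cannot fire sensibly. */
    if (threshold_us == 0 || threshold_us > window_us)
        return -1;

    n = snprintf(buf, size, "%s %lu %lu", kind, threshold_us, window_us);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

For example, psi_format_trigger(buf, sizeof buf, "some", 150000, 1000000) produces the string "some 150000 1000000", meaning a partial stall of 150 ms within any 1 s window.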
<code>#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <poll.h>
#include <string.h>
#include <unistd.h>

/* Monitor memory partial stall with 1s tracking window size and 150ms threshold. */
int main(void) {
    const char trig[] = "some 150000 1000000";
    struct pollfd fds;
    int n;

    fds.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    if (fds.fd < 0) {
        printf("/proc/pressure/memory open error: %s\n", strerror(errno));
        return 1;
    }
    fds.events = POLLPRI;

    if (write(fds.fd, trig, strlen(trig) + 1) < 0) {
        printf("/proc/pressure/memory write error: %s\n", strerror(errno));
        return 1;
    }

    printf("waiting for events...\n");
    while (1) {
        n = poll(&fds, 1, -1);
        if (n < 0) {
            printf("poll error: %s\n", strerror(errno));
            return 1;
        }
        if (fds.revents & POLLERR) {
            printf("got POLLERR, event source is gone\n");
            return 0;
        }
        if (fds.revents & POLLPRI) {
            printf("event triggered!\n");
        } else {
            printf("unknown event received: 0x%x\n", fds.revents);
            return 1;
        }
    }
    return 0;
}
</code>
cgroup Interface
With cgroup-v2, PSI also provides per-cgroup pressure files: cpu.pressure, memory.pressure, and io.pressure inside each cgroup directory. They are read and written in the same way as the /proc/pressure files.
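As a small illustration, the helper below builds the path of a per-cgroup pressure file. It is a sketch under assumptions: psi_cgroup_path() is a hypothetical name, the cgroup-v2 hierarchy is assumed to be mounted at the conventional /sys/fs/cgroup, and the cgroup name used in the usage note is made up.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: cgroup v2 is mounted at the conventional /sys/fs/cgroup. */
#define CGROUP2_ROOT "/sys/fs/cgroup"

/* Write "<root>/<cgroup>/<resource>.pressure" into buf.
 * resource must be "cpu", "memory", or "io".
 * Returns the path length, or -1 on a bad argument or a too-small buffer. */
int psi_cgroup_path(char *buf, size_t size,
                    const char *cgroup, const char *resource)
{
    int n;

    if (strcmp(resource, "cpu") != 0 && strcmp(resource, "memory") != 0 &&
        strcmp(resource, "io") != 0)
        return -1;

    while (*cgroup == '/')      /* accept "/name" as well as "name" */
        cgroup++;
    if (*cgroup == '\0')        /* this sketch requires a named cgroup */
        return -1;

    n = snprintf(buf, size, "%s/%s/%s.pressure", CGROUP2_ROOT, cgroup, resource);
    if (n < 0 || (size_t)n >= size)
        return -1;
    return n;
}
```

For a hypothetical cgroup named workload.slice, psi_cgroup_path(buf, sizeof buf, "workload.slice", "memory") yields /sys/fs/cgroup/workload.slice/memory.pressure. That path can then be opened, read, or given a trigger exactly like /proc/pressure/memory.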
Performance Optimization
While PSI offers valuable real‑time metrics, its hooks add overhead by instrumenting scheduler, I/O, and memory paths. To keep PSI enabled in production, the overhead must be mitigated.
Code Analysis
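psi_task_change() is invoked at every PSI hook point (for example, when a task starts or stops waiting for CPU or memory). Each call also walks every cgroup the task belongs to, up through its ancestors, and updates each group's state, so the function is both called very frequently and costly per call.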
<code>void psi_task_change(struct task_struct *task, int clear, int set) {
    int cpu = task_cpu(task);
    struct psi_group *group;
    bool wake_clock = true;
    void *iter = NULL;

    if (!task->pid)
        return;

    psi_flags_change(task, clear, set);

    if (unlikely((clear & TSK_RUNNING) &&
                 (task->flags & PF_WQ_WORKER) &&
                 wq_worker_last_func(task) == psi_avgs_work))
        wake_clock = false;

    while ((group = iterate_groups(task, &iter)))
        psi_group_change(group, cpu, clear, set, wake_clock);
}
</code>
Optimization Strategies
Leverage common cgroups: when the scheduler switches from task A to task B and both share a cgroup, skip the redundant state updates for the shared groups and reduce psi_task_change() calls.
Reduce sleep-induced state switches: fold the psi_group_change() triggered by psi_dequeue() into psi_task_switch(), eliminating unnecessary cgroup updates.
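The first strategy can be illustrated with a user-space toy model. This is only a sketch under stated assumptions: struct toy_group and the toy_* functions are hypothetical names, not kernel code, and each group tracks a single counter where the kernel tracks considerably more state. The model compares a switch that updates every level of both tasks' hierarchies with one that stops at the first group the two tasks share.

```c
#include <stddef.h>

/* Toy stand-in for per-cgroup PSI state; not the kernel's struct psi_group. */
struct toy_group {
    struct toy_group *parent; /* NULL for the root group */
    int running;              /* tasks from this subtree currently on the CPU */
};

/* Returns 1 if anc is g itself or one of g's ancestors, else 0. */
int toy_is_ancestor(const struct toy_group *anc, const struct toy_group *g)
{
    for (; g != NULL; g = g->parent)
        if (g == anc)
            return 1;
    return 0;
}

/* Naive switch: update every level for prev, then every level for next.
 * Returns the number of group updates performed. */
int toy_switch_naive(struct toy_group *prev, struct toy_group *next)
{
    int updates = 0;

    for (struct toy_group *g = prev; g != NULL; g = g->parent, updates++)
        g->running--;
    for (struct toy_group *g = next; g != NULL; g = g->parent, updates++)
        g->running++;
    return updates;
}

/* Shared-ancestor switch: find the first group that contains both tasks
 * and leave it and everything above it untouched, because one task
 * leaving and another arriving cancel out there.
 * Returns the number of group updates performed. */
int toy_switch_shared(struct toy_group *prev, struct toy_group *next)
{
    struct toy_group *common = prev;
    int updates = 0;

    while (common != NULL && !toy_is_ancestor(common, next))
        common = common->parent;

    for (struct toy_group *g = prev; g != common; g = g->parent, updates++)
        g->running--;
    for (struct toy_group *g = next; g != common; g = g->parent, updates++)
        g->running++;
    return updates;
}
```

With two sibling groups under one parent and a root above it, the naive switch touches six groups while the shared-ancestor switch touches two, and both leave the counters in the same final state. The saving grows with the depth of the shared part of the hierarchy.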
Conclusion
PSI support has been merged into the mainline kernel (since Linux 4.20) and is enabled by default in the 5.4 veLinux kernel. Combined with cgroup-v2, it helps discover resource bottlenecks, enables dynamic resource management, and underpins projects such as oomd, a userspace out-of-memory handler.
ByteDance SYS Tech
Focused on system technology, sharing cutting‑edge developments, innovation and practice, and analysis of industry tech hotspots.