How to Fix nvidia-smi Missing GPU Process Info Inside Containers
The article explains why nvidia-smi cannot display GPU processes when run inside a container, analyzes the underlying PID-namespace isolation and kernel-level restrictions, and provides three practical solutions (sharing the host PID namespace, a custom kernel interception module, and the nvitop tool), plus a workaround for gpu-operator deployments.
Problem
Running nvidia-smi inside a container often shows no GPU processes. Two root causes are identified:
1. The container runs in its own PID namespace, so the PIDs seen inside the container do not match the host PIDs that the NVIDIA kernel module records for GPU clients.
2. nvidia-smi queries the kernel module directly via the NVML API, and the module does not translate the PIDs it returns across namespaces, so the reported processes cannot be matched against the container's /proc (illustrated by the sketch below).
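To see this from inside a container, here is a minimal sketch (assuming the nvidia-ml-py package, which provides the pynvml module and is not part of the original article) that issues the same NVML query nvidia-smi relies on and checks whether the reported PIDs resolve in the container's PID namespace:
# Minimal sketch: query NVML the way nvidia-smi does and check whether the
# reported PIDs exist in this (container) PID namespace.
# Assumes: pip install nvidia-ml-py
import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
if not procs:
    print("NVML reports no compute processes (typical inside an isolated PID namespace)")
for p in procs:
    # p.pid is a host-namespace PID; inside the container it usually has no /proc entry.
    visible = os.path.exists(f"/proc/{p.pid}")
    print(f"pid={p.pid} visible in this PID namespace: {visible}")
pynvml.nvmlShutdown()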
Solution 1 – Share host PID namespace
Configure the pod to use the host PID namespace, eliminating isolation:
apiVersion: v1
kind: Pod
metadata:
  name: view-pid
spec:
  hostPID: true
  containers:
  - name: view-pid
    image: ubuntu:22.04
    command: ["sh", "-c"]
    args: ["while true; do echo 'foo'; done;"]
Or start Docker with --pid=host:
docker run -d --pid=host --gpus all docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
Advantage: simple, no extra components. Drawback: the container can see all host GPU processes, which may be unacceptable in multi-tenant environments.
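As a quick sanity check (a small sketch, not from the original article): when the host PID namespace is shared, PID 1 inside the container is the host's init process rather than the container entrypoint:
# Quick check that the container shares the host PID namespace.
# With hostPID: true (or --pid=host), PID 1 is the host init (e.g. systemd),
# not the container's entrypoint shell.
with open("/proc/1/comm") as f:
    print("PID 1 is:", f.read().strip())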
Solution 2 – Kernel interception module
Write a kernel module that intercepts ioctl calls on nvidiactl and translates PIDs between namespaces. Existing projects (gh2o/nvidia-pidns, matpool/mpu, gpues/nvidia-pidns) are outdated for newer kernels. A fork adds support for kernel 5.7+ and can be installed via Helm:
helm install mpu oci://ghcr.io/lengrongfu/mpu --version 0.0.1
Key module functions:
Replace nvidiactl's unlocked_ioctl and compat_ioctl pointers with custom handlers (nvidia_pidns_unlocked_ioctl, nvidia_pidns_compat_ioctl).
PID conversion logic uses helper functions (fixer_0x0ee4, fixer_0x1f48, fixer_0x2588, fixer_0x3848) selected by the driver version tag (arg.tag).
Map PIDs across namespaces with the kernel helpers find_vpid and find_pid_ns (the same host-to-namespace mapping is sketched from user space after this list).
Lifecycle hooks nvidia_pidns_init (store original ioctl pointers) and nvidia_pidns_exit (restore them).
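The mapping the module has to perform can be observed from user space on the host: the NSpid field in /proc/<pid>/status lists a process's PID in every PID namespace it belongs to, which is exactly the host-PID to container-PID translation the fixer functions apply. A conceptual sketch (not code from the module):
# Conceptual sketch (run on the host): show a process's PID in every PID
# namespace it belongs to. The kernel module performs this same host-PID to
# container-PID translation before handing results back to nvidia-smi.
import os

def pids_across_namespaces(host_pid: int) -> list[int]:
    with open(f"/proc/{host_pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                # Outermost (host) PID first, innermost namespace PID last.
                return [int(p) for p in line.split()[1:]]
    return []

print(pids_across_namespaces(os.getpid()))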
Solution 3 – nvitop monitor
nvitop is an interactive NVIDIA device and process monitor that can display container GPU processes, but it also shows all host processes.
Install: pip3 install --upgrade nvitop
Run: python3 -m nvitop
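nvitop also exposes a Python API; a short sketch (method names taken from nvitop's documentation, not from the original article) lists the GPU processes each device reports:
# Sketch using nvitop's Python API: list GPU processes per device.
from nvitop import Device

for device in Device.all():           # one entry per visible GPU
    processes = device.processes()    # dict: pid -> GpuProcess
    print(device.name(), "->", sorted(processes.keys()))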
gpu-operator driver installation issue
When the gpu-operator installs the driver, the host itself may lack a usable nvidia-smi command, because the driver and its utilities are installed inside the driver pod's root filesystem rather than on the host. Run the command through the driver pod's filesystem to restore access:
chroot /run/nvidia/driver nvidia-smi
Verification script
A minimal PyTorch script keeps the GPU busy, so that once one of the solutions above is applied, nvidia-smi (or nvitop) should list its process:
import torch, time

if not torch.cuda.is_available():
    print("GPU is not available. Please run this script on a machine with a GPU.")
    exit()

device = torch.device("cuda")
matrix_size = 10240

# Allocate two large matrices on the GPU and multiply them in a loop so the
# process keeps showing up as a GPU client.
matrix_a = torch.randn(matrix_size, matrix_size, device=device)
matrix_b = torch.randn(matrix_size, matrix_size, device=device)

while True:
    result = torch.matmul(matrix_a, matrix_b)
    time.sleep(0.1)
Reference repository
https://github.com/lengrongfu/mpu
