How to Fix nvidia-smi Missing GPU Process Info Inside Containers
The article explains why nvidia-smi cannot display GPU processes when run inside a container, analyzes the underlying PID-namespace isolation and kernel-level restrictions, and provides three practical solutions (sharing the host PID namespace, a custom kernel interception module, and the nvitop tool), plus a workaround for gpu-operator deployments.
Problem
Running nvidia-smi inside a container often shows no GPU processes. Two root causes are identified:
1. The container runs in its own PID namespace, so the PIDs seen inside the container do not match the host PIDs that the NVIDIA kernel module records for GPU clients.
2. nvidia-smi queries the kernel module directly via the NVML API, and the module does not translate the PIDs it returns across namespaces, so the reported processes cannot be matched against the container's /proc (illustrated by the sketch below).
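To see this from inside a container, here is a minimal sketch (assuming the nvidia-ml-py package, which provides the pynvml module and is not part of the original article) that issues the same NVML query nvidia-smi relies on and checks whether the reported PIDs resolve in the container's PID namespace:
# Minimal sketch: query NVML the way nvidia-smi does and check whether the
# reported PIDs exist in this (container) PID namespace.
# Assumes: pip install nvidia-ml-py
import os
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)
procs = pynvml.nvmlDeviceGetComputeRunningProcesses(handle)
if not procs:
    print("NVML reports no compute processes (typical inside an isolated PID namespace)")
for p in procs:
    # p.pid is a host-namespace PID; inside the container it usually has no /proc entry.
    visible = os.path.exists(f"/proc/{p.pid}")
    print(f"pid={p.pid} visible in this PID namespace: {visible}")
pynvml.nvmlShutdown()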
Solution 1 – Share host PID namespace
Configure the pod to use the host PID namespace, eliminating isolation:
apiVersion: v1
kind: Pod
metadata:
  name: view-pid
spec:
  hostPID: true
  containers:
  - name: view-pid
    image: ubuntu:22.04
    command: ["sh", "-c"]
    args: ["while true; do echo 'foo'; done;"]
Or start Docker with --pid=host:
docker run -d --pid=host --gpus all docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime
Advantage: simple, no extra components. Drawback: the container can see all host GPU processes, which may be unacceptable in multi-tenant environments.
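As a quick sanity check (a small sketch, not from the original article): when the host PID namespace is shared, PID 1 inside the container is the host's init process rather than the container entrypoint:
# Quick check that the container shares the host PID namespace.
# With hostPID: true (or --pid=host), PID 1 is the host init (e.g. systemd),
# not the container's entrypoint shell.
with open("/proc/1/comm") as f:
    print("PID 1 is:", f.read().strip())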
Solution 2 – Kernel interception module
Write a kernel module that intercepts ioctl calls on nvidiactl and translates PIDs between namespaces. Existing projects (gh2o/nvidia-pidns, matpool/mpu, gpues/nvidia-pidns) are outdated for newer kernels. A fork adds support for kernel 5.7+ and can be installed via Helm:
helm install mpu oci://ghcr.io/lengrongfu/mpu --version 0.0.1
Key module functions:
Replace nvidiactl's unlocked_ioctl and compat_ioctl pointers with custom handlers (nvidia_pidns_unlocked_ioctl, nvidia_pidns_compat_ioctl).
PID conversion logic uses helper functions (fixer_0x0ee4, fixer_0x1f48, fixer_0x2588, fixer_0x3848) selected by the driver version tag (arg.tag).
Map PIDs across namespaces with the kernel helpers find_vpid and find_pid_ns (the same host-to-namespace mapping is sketched from user space after this list).
Lifecycle hooks nvidia_pidns_init (store original ioctl pointers) and nvidia_pidns_exit (restore them).
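The mapping the module has to perform can be observed from user space on the host: the NSpid field in /proc/<pid>/status lists a process's PID in every PID namespace it belongs to, which is exactly the host-PID to container-PID translation the fixer functions apply. A conceptual sketch (not code from the module):
# Conceptual sketch (run on the host): show a process's PID in every PID
# namespace it belongs to. The kernel module performs this same host-PID to
# container-PID translation before handing results back to nvidia-smi.
import os

def pids_across_namespaces(host_pid: int) -> list[int]:
    with open(f"/proc/{host_pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                # Outermost (host) PID first, innermost namespace PID last.
                return [int(p) for p in line.split()[1:]]
    return []

print(pids_across_namespaces(os.getpid()))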
Solution 3 – nvitop monitor
nvitop is an interactive NVIDIA device and process monitor that can display container GPU processes, but it also shows all host processes.
Install: pip3 install --upgrade nvitop
Run: python3 -m nvitop
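nvitop also exposes a Python API; a short sketch (method names taken from nvitop's documentation, not from the original article) lists the GPU processes each device reports:
# Sketch using nvitop's Python API: list GPU processes per device.
from nvitop import Device

for device in Device.all():           # one entry per visible GPU
    processes = device.processes()    # dict: pid -> GpuProcess
    print(device.name(), "->", sorted(processes.keys()))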
gpu-operator driver installation issue
When the gpu-operator installs the driver, the host itself may lack a usable nvidia-smi command, because the driver and its utilities are installed inside the driver pod's root filesystem rather than on the host. Run the command through the driver pod's filesystem to restore access:
chroot /run/nvidia/driver nvidia-smi
Verification script
A minimal PyTorch script keeps the GPU busy, so that once one of the solutions above is applied, nvidia-smi (or nvitop) should list its process:
import torch, time

if not torch.cuda.is_available():
    print("GPU is not available. Please run this script on a machine with a GPU.")
    exit()

device = torch.device("cuda")
matrix_size = 10240

# Allocate two large matrices on the GPU and multiply them in a loop so the
# process keeps showing up as a GPU client.
matrix_a = torch.randn(matrix_size, matrix_size, device=device)
matrix_b = torch.randn(matrix_size, matrix_size, device=device)

while True:
    result = torch.matmul(matrix_a, matrix_b)
    time.sleep(0.1)
Reference repository
https://github.com/lengrongfu/mpu
