
How to Fix nvidia-smi Missing GPU Process Info Inside Containers

This article explains why nvidia-smi cannot display GPU processes when run inside a container, analyzes the underlying PID-namespace isolation and kernel-level restrictions, and presents three practical solutions, including sharing the host PID namespace (hostPID), a custom kernel interception module, and the nvitop tool, plus a workaround for gpu-operator deployments.


Problem

Running nvidia-smi inside a container often shows no GPU processes. Two root causes are identified:

The container runs in its own PID namespace, so the NVIDIA kernel module cannot map the container's PIDs to host PIDs.

nvidia-smi queries the kernel directly via the NVML API, which does not translate PIDs across namespaces, so the host PIDs it receives cannot be matched to processes visible inside the container.
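The namespace boundary is easy to observe from userspace. A minimal sketch (Linux only; runs the same way inside or outside a container):

```python
# Each PID namespace is exposed as a distinct inode under /proc/<pid>/ns/pid.
# A container with its own PID namespace reports a different inode number
# than the host, which is why a PID from one side is meaningless on the other.
import os

ns = os.readlink("/proc/self/ns/pid")
print(ns)  # e.g. "pid:[4026531836]"; the number differs per namespace
```

Comparing this value from inside a container against the host's shows immediately whether the two share a PID namespace.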

Solution 1 – Share host PID namespace

Configure the pod to use the host PID namespace, eliminating isolation:

apiVersion: v1
kind: Pod
metadata:
  name: view-pid
spec:
  hostPID: true
  containers:
    - name: view-pid
      image: ubuntu:22.04
      command: ["sh", "-c"]
      args: ["while true; do echo 'foo'; sleep 1; done"]

Or start Docker with --pid=host:

docker run -d --pid=host --gpus all docker.io/pytorch/pytorch:2.5.1-cuda12.4-cudnn9-runtime

Advantage: simple, no extra components. Drawback: the container can see all host GPU processes, which may be unacceptable in multi‑tenant environments.
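A quick check that the container really shares the host's PID namespace, based on reading /proc (a sketch, not part of the original article):

```python
# With hostPID: true (or docker run --pid=host), PID 1 inside the container
# is the host's init process (e.g. systemd); with an isolated PID namespace
# it is the container entrypoint. Reading /proc/1/comm distinguishes the two.
with open("/proc/1/comm") as f:
    init_name = f.read().strip()
print(init_name)  # "systemd" (or similar) when the host namespace is shared
```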

Solution 2 – Kernel interception module

Write a kernel module that intercepts nvidiactl ioctl calls and translates PIDs between namespaces. Existing projects (gh2o/nvidia-pidns, matpool/mpu, gpues/nvidia-pidns) are outdated for newer kernels. A fork adds support for kernel 5.7+ and can be installed via Helm:

helm install mpu oci://ghcr.io/lengrongfu/mpu --version 0.0.1

Key module functions:

Replace nvidiactl's unlocked_ioctl and compat_ioctl pointers with custom handlers (nvidia_pidns_unlocked_ioctl, nvidia_pidns_compat_ioctl).

PID conversion logic uses helper functions (fixer_0x0ee4, fixer_0x1f48, fixer_0x2588, fixer_0x3848) selected by driver version tag (arg.tag).

Map PIDs across namespaces with kernel helpers find_vpid and find_pid_ns.

Lifecycle hooks nvidia_pidns_init (store original ioctl pointers) and nvidia_pidns_exit (restore them).
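The host-to-container PID mapping those helpers compute with find_vpid and find_pid_ns is the same mapping the kernel already exposes in /proc: the NSpid field of a process's status file lists its PID in every nested namespace, outermost first. A small illustration (not part of the module itself; requires kernel 4.1+):

```python
# Parse the NSpid line from /proc/<pid>/status. For a process running inside
# a container, this yields e.g. [12345, 1]: the host PID first, then the PID
# as seen inside the container's namespace -- the same translation the
# interception module performs for values returned by the NVIDIA driver.
import os

def nspid(pid: int) -> list[int]:
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("NSpid:"):
                return [int(x) for x in line.split()[1:]]
    return []  # kernels older than 4.1 lack the NSpid field

print(nspid(os.getpid()))  # a single entry when no nested namespace is involved
```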

Solution 3 – nvitop monitor

nvitop is an interactive NVIDIA device and process monitor that can display container GPU processes, but it also shows all host processes.

Install:

pip3 install --upgrade nvitop

Run:

python3 -m nvitop

gpu‑operator driver installation issue

When the gpu-operator installs the driver, the host may lose the nvidia-smi command because the driver lives in the driver pod's root filesystem (mounted on the host at /run/nvidia/driver) rather than on the host root. Execute the command through the driver pod's filesystem to restore access:

chroot /run/nvidia/driver nvidia-smi

Verification script

A minimal PyTorch script continuously consumes the GPU, confirming that the chosen solution exposes the GPU correctly:

import torch, time

if not torch.cuda.is_available():
    raise SystemExit("GPU is not available. Please run this script on a machine with a GPU.")

device = torch.device("cuda")
matrix_size = 10240
matrix_a = torch.randn(matrix_size, matrix_size, device=device)
matrix_b = torch.randn(matrix_size, matrix_size, device=device)

while True:
    result = torch.matmul(matrix_a, matrix_b)
    torch.cuda.synchronize()  # wait for the kernel so the load is visible in nvidia-smi
    time.sleep(0.1)

Reference repository

https://github.com/lengrongfu/mpu

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Kubernetes, GPU, Kernel Module, helm, pid namespace, nvidia-smi, nvitop
Written by

Infra Learning Club

Infra Learning Club shares study notes, cutting-edge technology, and career discussions.
