
Why Containerd 2.x Fails to Find nvidia‑smi with GPU‑Operator and How to Fix It

When a Kubernetes cluster is deployed with kubespray and the NVIDIA runtime, Containerd 2.x reports "nvidia-smi not found" because its go-toml v2 parser handles the "binaryName" key differently from go-toml v1, so the wrong runtime binary is invoked. This article walks through the configuration inspection, a version comparison, a code demonstration, and practical work-arounds.

Infra Learning Club

Issue Overview

Installing a Kubernetes cluster with kubespray and configuring the NVIDIA runtime fails with nvidia-smi not found in $PATH when using Containerd 2.x. The failure is caused by a change in how Containerd 2.x parses its TOML configuration files.

Containerd Configuration Generated by Kubespray

Kubespray writes a runc.options section that includes a lower‑case binaryName key:

containerd_runc_runtime:
  name: runc
  type: "io.containerd.runc.v2"
  options:
    systemdCgroup: "{{ containerd_use_systemd_cgroup | ternary('true', 'false') }}"
    binaryName: "{{ bin_dir }}/runc"

The official Containerd configuration (see https://github.com/containerd/containerd/blob/main/docs/man/containerd-config.toml.5.md) defines the key as BinaryName:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  BinaryName = "/usr/bin/runc"

Interaction with nvidia-container-toolkit

During installation, nvidia-container-toolkit copies all entries from runc.options and adds its own BinaryName under a new nvidia.options block, resulting in a configuration that contains both BinaryName and binaryName fields:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  binaryName = "/usr/local/bin/runc"
  systemdCgroup = true
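
How the two keys end up side by side can be pictured with a small Go sketch. It is not the toolkit's code: mergeOptions and caseCollisions are hypothetical helpers, and the map simply stands in for the options table. Because TOML keys, like Go map keys, are case-sensitive, copying the runc options and then setting BinaryName leaves the original binaryName entry in place.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// mergeOptions (hypothetical) copies every entry of base and then sets
// BinaryName to wrapper. An existing "binaryName" entry is a different key,
// so it survives next to the new "BinaryName" entry.
func mergeOptions(base map[string]any, wrapper string) map[string]any {
	merged := make(map[string]any, len(base)+1)
	for k, v := range base {
		merged[k] = v
	}
	merged["BinaryName"] = wrapper
	return merged
}

// caseCollisions (hypothetical) returns groups of keys that differ only by
// letter case, sorted so the result is deterministic.
func caseCollisions(opts map[string]any) [][]string {
	groups := map[string][]string{}
	for k := range opts {
		lower := strings.ToLower(k)
		groups[lower] = append(groups[lower], k)
	}
	var out [][]string
	for _, keys := range groups {
		if len(keys) > 1 {
			sort.Strings(keys)
			out = append(out, keys)
		}
	}
	sort.Slice(out, func(i, j int) bool { return out[i][0] < out[j][0] })
	return out
}

func main() {
	runc := map[string]any{
		"binaryName":    "/usr/local/bin/runc",
		"systemdCgroup": true,
	}
	nvidia := mergeOptions(runc, "/usr/local/nvidia/toolkit/nvidia-container-runtime")
	fmt.Println(len(nvidia))            // 3
	fmt.Println(caseCollisions(nvidia)) // [[BinaryName binaryName]]
}
```

A check like caseCollisions could be run against a parsed options table before restarting Containerd to flag the ambiguous pair early.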

Version Comparison: Containerd 1.x vs 2.x

The difference can be reproduced with a RuntimeClass that points to runc:

Containerd 1.7.23 creates the nginx pod successfully, indicating that the lower-case binaryName field is ignored.

Containerd 2.0.3 hangs while creating the same pod, showing that it uses the lower-case binaryName value (e.g., /usr/local/bin/test) as the runtime executable. Applied to the nvidia.options block, the same behaviour makes Containerd 2.x run plain runc instead of the nvidia-container-runtime wrapper, so nvidia-smi is not found inside the container.

Root Cause: go‑toml Library Upgrade

Containerd 2.x switched from go-toml v1.9.5 to v2.2.4. When a table contains both BinaryName and binaryName, the two library versions fill the BinaryName struct field differently: v1 uses the value of BinaryName, whereas v2 ends up with the value of binaryName. As a result, Containerd 2.x bypasses the NVIDIA wrapper binary that was configured under BinaryName.

containerd 1.x uses go-toml v1.9.5
containerd 2.x uses go-toml v2.2.4
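
To decide whether a node falls on the affected side of that split, the major version can be read from the version string. The sketch below is only an illustration: containerdMajor is a hypothetical helper, and the input shape ("containerd <module path> v2.0.3 <commit>") is an assumption about what `containerd --version` prints, so adjust it if your build reports something different.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// containerdMajor (hypothetical) extracts the major version from text shaped
// like "containerd <module path> v2.0.3 <commit>". That shape is assumed.
func containerdMajor(versionOutput string) (int, error) {
	for _, field := range strings.Fields(versionOutput) {
		if !strings.HasPrefix(field, "v") {
			continue
		}
		// Take the digits between the leading "v" and the first dot.
		majorText, _, _ := strings.Cut(field[1:], ".")
		if major, err := strconv.Atoi(majorText); err == nil {
			return major, nil
		}
	}
	return 0, fmt.Errorf("no version token found in %q", versionOutput)
}

func main() {
	samples := []string{
		"containerd github.com/containerd/containerd v1.7.23 0000000",
		"containerd github.com/containerd/containerd/v2 v2.0.3 0000000",
	}
	for _, s := range samples {
		major, err := containerdMajor(s)
		if err != nil {
			fmt.Println(err)
			continue
		}
		// Per the comparison above, 2.x is the line that uses go-toml v2.
		fmt.Printf("major=%d uses go-toml v2: %t\n", major, major >= 2)
	}
}
```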

Demonstration with Go Code

Two small programs parse a TOML fragment containing both keys. The program using github.com/pelletier/go-toml/v2 prints the value of binaryName, while the program using the legacy github.com/pelletier/go-toml prints the value of BinaryName, confirming that the two versions match keys to struct fields differently.

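package main

import (
	"fmt"

	toml "github.com/pelletier/go-toml/v2"
)

// Config carries protobuf and json tags but no toml tag, so the TOML
// decoder has to match document keys against the Go field name.
type Config struct {
	BinaryName string `protobuf:"bytes,6,opt,name=binary_name,json=binaryName,proto3" json:"binary_name,omitempty"`
}

func main() {
	// TOML keys are case-sensitive, so these are two distinct keys.
	config := `BinaryName = "BinaryName"
binaryName = "binaryName"`

	var c Config
	if err := toml.Unmarshal([]byte(config), &c); err != nil {
		fmt.Println(err)
		return
	}
	// With go-toml v2 this prints the binaryName value.
	fmt.Println(c.BinaryName)
}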

Running the same code with the older library prints BinaryName, illustrating the different parsing behaviour.
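
One way to picture how the same document can yield two answers is to compare two lookup policies. The sketch below does not reproduce either library's implementation and has not been checked against their sources. exactFirst and lastMatchWins are hypothetical policies, chosen only because they produce the two outputs reported above.

```go
package main

import (
	"fmt"
	"strings"
)

// entry is one key/value pair in document order.
type entry struct{ key, value string }

// exactFirst (hypothetical policy) returns the value whose key equals field
// exactly, falling back to the first case-insensitive match otherwise.
func exactFirst(doc []entry, field string) (string, bool) {
	fallback, found := "", false
	for _, e := range doc {
		if e.key == field {
			return e.value, true
		}
		if !found && strings.EqualFold(e.key, field) {
			fallback, found = e.value, true
		}
	}
	return fallback, found
}

// lastMatchWins (hypothetical policy) assigns every case-insensitively
// matching key in document order, so later keys overwrite earlier ones.
func lastMatchWins(doc []entry, field string) (string, bool) {
	result, found := "", false
	for _, e := range doc {
		if strings.EqualFold(e.key, field) {
			result, found = e.value, true
		}
	}
	return result, found
}

func main() {
	doc := []entry{
		{"BinaryName", "BinaryName"},
		{"binaryName", "binaryName"},
	}
	a, _ := exactFirst(doc, "BinaryName")
	b, _ := lastMatchWins(doc, "BinaryName")
	fmt.Println(a) // BinaryName
	fmt.Println(b) // binaryName
}
```

Under either policy a document containing only binaryName would still populate the field, which is consistent with the lone binaryName key going unnoticed until a second spelling appeared beside it.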

Work‑arounds and Fixes

Temporarily delete the binaryName entry from the Kubespray‑generated config.toml.

Align the configuration keys with the official definition by using only BinaryName.

Upgrade to a Containerd version that back‑ports the correct parsing logic or downgrade to a version that still uses go‑toml v1.

These steps prevent Containerd from invoking the wrong binary and restore visibility of nvidia-smi inside containers.
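
The first work-around can be sketched as a small text filter. This is a minimal illustration, not a TOML parser: dropShadowedBinaryName and keyOf are hypothetical helpers, and the code assumes one key per line with table headers on their own lines (multi-line arrays would confuse it). It removes binaryName only from tables that also define BinaryName and leaves every other line untouched.

```go
package main

import (
	"fmt"
	"strings"
)

// keyOf returns the key of a "key = value" line, or "" for blank lines,
// comments, table headers, and anything without an equals sign.
func keyOf(line string) string {
	trimmed := strings.TrimSpace(line)
	if trimmed == "" || strings.HasPrefix(trimmed, "#") || strings.HasPrefix(trimmed, "[") {
		return ""
	}
	key, _, found := strings.Cut(trimmed, "=")
	if !found {
		return ""
	}
	return strings.TrimSpace(key)
}

// dropShadowedBinaryName removes "binaryName" lines from every table that
// also defines "BinaryName".
func dropShadowedBinaryName(config string) string {
	lines := strings.Split(config, "\n")
	var out []string
	start := 0
	// flush copies lines[start:end] (one table) to out, filtering as needed.
	flush := func(end int) {
		table := lines[start:end]
		hasCanonical := false
		for _, l := range table {
			if keyOf(l) == "BinaryName" {
				hasCanonical = true
			}
		}
		for _, l := range table {
			if hasCanonical && keyOf(l) == "binaryName" {
				continue
			}
			out = append(out, l)
		}
		start = end
	}
	for i, l := range lines {
		// A new table header closes the previous table.
		if strings.HasPrefix(strings.TrimSpace(l), "[") && i > start {
			flush(i)
		}
	}
	flush(len(lines))
	return strings.Join(out, "\n")
}

func main() {
	config := `[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
  BinaryName = "/usr/local/nvidia/toolkit/nvidia-container-runtime"
  binaryName = "/usr/local/bin/runc"
  systemdCgroup = true`
	fmt.Println(dropShadowedBinaryName(config))
}
```

Run it against a copy of config.toml first and compare the result by hand before replacing the live file.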
