Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.


Background

Previous articles noted low GPU utilization and large‑scale task failures caused by GPU anomalies. This summary consolidates those observations, introduces common GPU monitoring metrics, and examines GPU failure modes.

MFU & HFU

Model FLOPs Utilization (MFU) = FLOPs the model mathematically requires / theoretical peak hardware FLOPs. Hardware FLOPs Utilization (HFU) = FLOPs actually executed / theoretical peak hardware FLOPs, where the actual count includes extra work such as the recomputation introduced by gradient (activation) checkpointing; therefore HFU ≥ MFU. Meta’s “Maximizing training throughput using PyTorch FSDP” reports MFU of 50‑60 % on A100/A800 clusters and ≤50 % on H100 clusters.
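For intuition, here is a minimal Python sketch of an MFU estimate, assuming the common ~6 × parameters × tokens approximation of per-token training FLOPs for a decoder-only transformer; the peak-FLOPS constant and the throughput figures are illustrative assumptions, not measurements from this article.

PEAK_BF16_FLOPS = 989e12  # assumed H100 SXM dense BF16 peak (no sparsity)

def estimate_mfu(n_params, tokens_per_sec, n_gpus, peak_flops=PEAK_BF16_FLOPS):
    # Model FLOPs: forward + backward ~= 6 * params per trained token.
    model_flops_per_sec = 6.0 * n_params * tokens_per_sec
    return model_flops_per_sec / (n_gpus * peak_flops)

# Hypothetical example: a 3B model at 400k tokens/s on 16 H100s -> ~45.5 %.
# Counting checkpointing recomputation as well (~8 * params per token is a
# common approximation) yields HFU instead of MFU.
print(f"MFU = {estimate_mfu(3e9, 400_000, 16):.1%}")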

GPU Monitoring Integration

NVIDIA DCGM (Data Center GPU Manager) provides health checks, diagnostics, alerts, and governance. DCGM‑Exporter exposes metrics to Prometheus; Grafana dashboards (github.com/NVIDIA/dcgm-exporter/tree/main/grafana) visualize the data. A typical deployment runs one dcgm-exporter per node, scraped periodically by Prometheus.
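As an illustration, a short Python sketch that pulls one of these metrics from Prometheus’ HTTP API; the Prometheus address and the label names (dcgm-exporter attaches Hostname and gpu labels by default) are assumptions to adapt to your deployment.

import requests

PROMETHEUS = "http://prometheus:9090"  # assumed address of your Prometheus
QUERY = "avg by (Hostname, gpu) (DCGM_FI_PROF_GR_ENGINE_ACTIVE)"

resp = requests.get(f"{PROMETHEUS}/api/v1/query",
                    params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    labels = sample["metric"]
    _, value = sample["value"]  # (timestamp, value-as-string)
    print(labels.get("Hostname"), labels.get("gpu"), float(value))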

Key DCGM Metrics

GPU Utilization – DCGM_FI_PROF_GR_ENGINE_ACTIVE, the fraction of time the graphics engine is active. Low values indicate under‑utilization, but high values do not guarantee the GPU is well utilized: as the T4 experiment below shows, a single thread can keep it at 100 %.

SM Active – DCGM_FI_PROF_SM_ACTIVE, the fraction of SMs with at least one warp resident. It is insensitive to how many threads each block runs.

SM Occupancy – DCGM_FI_PROF_SM_OCCUPANCY, the ratio of resident warps to the maximum warps an SM can hold. Higher occupancy does not always mean higher overall utilization.

Tensor Core Active – DCGM_FI_PROF_PIPE_TENSOR_ACTIVE, the fraction of cycles the tensor pipe is active. Useful for BF16 matrix‑multiply‑heavy workloads such as LLM training.

NVLink Bandwidth – DCGM_FI_DEV_NVLINK_BANDWIDTH_TOTAL (aggregate over all links) and DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 (per‑link bandwidth, here for link 0).
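To watch these counters live on a node, dcgmi can sample them directly; the sketch below wraps it from Python. The field IDs 1001‑1004 are assumed to map to GR_ENGINE_ACTIVE, SM_ACTIVE, SM_OCCUPANCY, and PIPE_TENSOR_ACTIVE per DCGM’s field list (see the dcgm_fields.cpp reference below); verify against your DCGM version.

import subprocess

# Sample GR_ENGINE_ACTIVE, SM_ACTIVE, SM_OCCUPANCY and PIPE_TENSOR_ACTIVE
# (assumed DCGM field IDs 1001-1004) once per second on every GPU.
cmd = ["dcgmi", "dmon", "-e", "1001,1002,1003,1004", "-d", "1000"]
with subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) as proc:
    for line in proc.stdout:
        print(line.rstrip())  # one row per GPU per sampling interval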

Experimental Validation

GPU Util & SM Active on T4

1 Block × 1 Thread → GPU Util 100 % but SM Active ≈ 2.5 %: the single block occupies one of the T4’s 40 SMs (1/40 = 2.5 %).

40 Blocks × 1 Thread each → GPU Util 100 % and SM Active ≈ 100 %: one block lands on each of the 40 SMs.

40 Blocks × 128 Threads each → GPU Util 100 %, SM Active ≈ 100 %, SM Occupancy ≈ 12.5 %: 128 threads are 4 warps out of the 32 a Turing SM can hold resident (4/32 = 12.5 %).
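The experiment is easy to reproduce with any long-running kernel; the sketch below uses numba (an assumption, any CUDA kernel launcher works) so the grid and block sizes can be varied while the DCGM counters are watched.

import numpy as np
from numba import cuda

@cuda.jit
def spin(out, iters):
    # Busy-loop so the kernel stays resident long enough to observe.
    i = cuda.grid(1)
    acc = 0.0
    for _ in range(iters):
        acc += 1.0
    if i < out.size:
        out[i] = acc

out = cuda.to_device(np.zeros(40 * 128, dtype=np.float32))
spin[1, 1](out, 200_000_000)      # 1 block x 1 thread: SM Active ~2.5 % on T4
# spin[40, 1](out, 200_000_000)   # 40 blocks: SM Active ~100 %
# spin[40, 128](out, 200_000_000) # 40 x 128: SM Occupancy ~12.5 %
cuda.synchronize()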

Tensor Active on H100

Matrix‑multiply kernels (C = A·B) on an H100 (132 SMs, 4 Tensor Cores per SM) show Tensor Active scaling roughly proportionally with the number of active blocks: 1 block → ~0.7 % Tensor Active (about 1/132 of the SMs busy), 128 blocks → ~96 %.
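A simple way to see this from Python without writing CUDA: drive the Tensor Cores with BF16 matrix multiplies in PyTorch while watching DCGM_FI_PROF_PIPE_TENSOR_ACTIVE (the matrix size below is an arbitrary choice, large enough to occupy many SMs).

import torch

n = 8192  # assumed size; large enough that the GEMM spans many SMs
a = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
b = torch.randn(n, n, device="cuda", dtype=torch.bfloat16)
for _ in range(1000):
    c = a @ b  # BF16 GEMMs dispatch to Tensor Core kernels
torch.cuda.synchronize()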

Tensor Active as HFU Proxy

When MFU is unavailable, Tensor Active approximates an upper bound on HFU, because most LLM FLOPs are BF16 matrix multiplications executed on the Tensor Cores. In a 2 × 8 H100 setup training a 3B LLM (16‑way data parallel), MFU ≈ 45.5 %, SM Active ≈ 80 %, and Tensor Active ≈ 48 %, consistent with HFU sitting slightly above MFU.

NVLink Bandwidth Measurements

alltoall_perf -b 4G -e 4G -N 10000 -g 8

All‑to‑all on an 8‑GPU H100 node yields ~350 GB/s total bus bandwidth and ~290 GiB/s of NVLink traffic per GPU (≈ 16 GiB/s per link across the H100’s 18 NVLink links). Enabling NVLink SHARP (NVLS) with NCCL_NVLS_ENABLE=1 raises total bus bandwidth to ~480 GiB/s while per‑GPU NVLink bandwidth drops to 100‑130 GiB/s, indicating that reduction work is offloaded to the NVSwitch.
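The command above comes from the nccl-tests suite. To toggle NVLS for a training job rather than a benchmark, it is enough to set the environment variable before NCCL initializes; a minimal launcher-side sketch:

import os

# Must be set before the first NCCL communicator is created.
os.environ["NCCL_NVLS_ENABLE"] = "1"  # "0" disables NVLink SHARP offload

import torch.distributed as dist
# dist.init_process_group("nccl")  # subsequent collectives may then
# offload reductions to the NVSwitch, matching the bandwidth shift above.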

GPU Failure and Error Handling

Xid Errors

Xid errors are the NVIDIA driver’s error reports, surfaced in kernel logs; the DCGM field DCGM_FI_DEV_XID_ERRORS reports the most recent Xid code. Errors fall into two broad classes: user‑application‑induced (e.g., illegal memory access) and hardware‑induced (requiring GPU reset or repair).
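Because Xid events land in the kernel log, a lightweight scraper is possible; the sketch below greps dmesg for the driver’s usual “NVRM: Xid (PCI:...): <code>” format (the regex is an assumption about your driver’s log format).

import re
import subprocess

log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for m in re.finditer(r"NVRM: Xid \((PCI:[0-9a-fA-F:.]+)\): (\d+)", log):
    bus, xid = m.group(1), int(m.group(2))
    # Cross-check the code against NVIDIA's Xid table, e.g. 31 is a GPU
    # memory page fault, typically an application's illegal memory access.
    print(f"GPU {bus}: Xid {xid}")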

SXid Errors

Specific to NVSwitch, SXid errors appear in kernel messages. Non‑fatal SXid errors do not terminate CUDA applications; fatal SXid errors propagate as Xid codes and terminate the process. Recovery steps include nvidia-smi -r or restarting the nvidia-fabricmanager service.

Memory Row Remap

Field DCGM_FI_DEV_ROW_REMAP_PENDING indicates that a memory row has been marked for remapping after ECC errors; the remap takes effect only on the next GPU reset. A prompt reset is recommended to avoid hidden training‑time loss spikes.
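nvidia-smi can report remap state directly on A100/H100-class GPUs; a small sketch follows (the query field names are assumptions based on nvidia-smi’s remapped-rows query and should be verified against your driver version).

import subprocess

out = subprocess.run(
    ["nvidia-smi",
     "--query-remapped-rows=gpu_bus_id,remapped_rows.pending,remapped_rows.failure",
     "--format=csv"],
    capture_output=True, text=True).stdout
print(out)  # any GPU with a pending remap should be scheduled for a reset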

GSP and Other Errors

GSP (GPU System Processor) errors and other hardware faults (e.g., PCIe errors, NVSwitch timeouts) are documented in NVIDIA’s Xid error guide and have been observed in large‑scale clusters (Meta, IBM, Alibaba). Detecting them may require automated level‑3 DCGM diagnostics (dcgmi diag -r 3).

Case Studies

Task Slowdown (Straggler)

In a production job, one GPU showed significantly higher SM Active than its peers, a sign of resource contention that turned the rank into a straggler. Evicting the offending GPU restored throughput. Megatron‑LM now ships straggler detection (Megatron-LM/megatron/training/training.py); the sketch below shows the basic idea.
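This is not Megatron‑LM’s actual implementation, but the core idea fits in a few lines: gather per-rank step times and flag ranks far above the median (the helper name and the 1.5× threshold are hypothetical choices).

import torch
import torch.distributed as dist

def flag_stragglers(step_time_s: float, ratio: float = 1.5) -> None:
    # Gather this step's wall time from every rank.
    t = torch.tensor([step_time_s], device="cuda")
    gathered = [torch.empty_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    times = torch.cat(gathered)
    median = times.median().item()
    for rank, ts in enumerate(times.tolist()):
        if ts > ratio * median:
            print(f"rank {rank}: {ts:.3f}s per step vs median {median:.3f}s")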

Periodic Slowdown (GC)

Python garbage collection held the GIL on individual ranks at different times, creating periodic stragglers. Enabling Megatron‑LM’s manual GC, which collects at the same step on every rank, mitigated the issue; a minimal version of the idea follows.
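A sketch of the mitigation, assuming a step counter is available: turn off automatic collection and collect at a fixed interval, so every rank pauses at the same point instead of at random.

import gc

gc.disable()       # stop automatic, unsynchronized collections
GC_INTERVAL = 100  # assumed interval; tune per workload

def maybe_collect(step: int) -> None:
    # Every rank reaches the same step together, so the GC pause is
    # synchronized instead of turning one rank into a straggler.
    if step % GC_INTERVAL == 0:
        gc.collect()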

Fabric Manager Hang

On an H100 system, NCCL AllReduce hung when NVLink SHARP (NVLS) was active because an out‑of‑memory kill had terminated nvidia-fabricmanager. Restarting the service (or the node) resolved the hang.
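A simple watchdog can catch this failure mode before a job silently stalls; the service name below matches the fabric manager’s systemd unit, and the polling interval is an arbitrary choice.

import subprocess
import time

while True:
    state = subprocess.run(
        ["systemctl", "is-active", "nvidia-fabricmanager"],
        capture_output=True, text=True).stdout.strip()
    if state != "active":
        print(f"nvidia-fabricmanager is {state}; restart before NCCL hangs")
    time.sleep(60)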

Lemon Nodes

Meta’s “lemon node” detection identified a small fraction of nodes with disproportionately high failure rates, reducing overall job failure from 14 % to 4 % after eviction.

Other Real‑World Errors

Various Xid, SXid, and memory‑row‑remap incidents were observed across clusters (Meta, IBM, Alibaba), often correlating with high temperature or network issues.

References

https://pytorch.org/blog/maximizing-training/

https://github.com/NVIDIA/DCGM

https://github.com/NVIDIA/dcgm-exporter

https://github.com/NVIDIA/dcgm-exporter/tree/main/grafana

https://github.com/NVIDIA/DCGM/blob/b0ec3c624ea21e688b0d93cf9b214ae0eeb6fe52/dcgmlib/src/dcgm_fields.cpp

https://www.alibabacloud.com/help/zh/ack/ack-managed-and-ack-dedicated/user-guide/introduction-to-metrics

https://docs.nvidia.com/datacenter/tesla/fabric-manager-user-guide/index.html

https://docs.nvidia.com/deploy/xid-errors/index.html

https://www.volcengine.com/docs/6459/974350

https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/training/training.py

https://github.com/NVIDIA/nccl/issues/976#issuecomment-1697103183

https://arxiv.org/abs/2410.21680

https://arxiv.org/abs/2407.05467

https://www.arxiv.org/abs/2408.14158

https://arxiv.org/abs/2407.21783

https://arxiv.org/abs/2403.07648

https://github.com/NVIDIA/DCGM/issues/64
