Linux Kernel Journey
Dec 22, 2024 · Artificial Intelligence
Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios
This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.
DCGM · GPU failures · GPU monitoring
26 min read
