Tagged articles
6 articles
Didi Tech
Mar 11, 2026 · Cloud Native

How Huatuo Now Monitors MetaX GPUs for Cloud‑Native AI Workloads

Huatuo, the open‑source deep‑observability platform backed by Didi, now supports real‑time monitoring of MetaX GPUs, offering detailed hardware metrics via Docker or Kubernetes deployments and exposing them through a /metrics endpoint for cloud‑native AI and operations use cases.

AI Infrastructure · Cloud Native · GPU monitoring
0 likes · 4 min read
Efficient Ops
Dec 15, 2025 · Operations

Mastering nvitop: Interactive NVIDIA GPU Monitoring and Management

This guide introduces nvitop, an interactive NVIDIA‑GPU process viewer and resource manager, explains its key features, shows how to install it via uvx/pipx, demonstrates basic device and process commands as well as the real‑time monitoring mode, and provides troubleshooting tips for common issues.

CLI · GPU monitoring · Linux
0 likes · 5 min read
Infra Learning Club
Feb 16, 2025 · Operations

GPUprobe: Using eBPF to Monitor CUDA Memory Leaks

The article introduces GPUprobe, an eBPF‑based tool that provides lightweight, continuous, application‑level monitoring of CUDA memory allocation, leaks, and kernel launches, compares it with NSight Systems and DCGM, and demonstrates near‑zero overhead integration with Prometheus and Grafana through detailed code examples and real‑world output analysis.

GPU monitoring · Grafana · Prometheus
0 likes · 13 min read
Linux Kernel Journey
Dec 22, 2024 · Artificial Intelligence

Understanding GPU Monitoring: Utilization Metrics and Failure Scenarios

This article systematically reviews GPU monitoring for large‑scale AI training, covering MFU/HFU definitions, key DCGM metrics, NVLink bandwidth, common failure codes such as Xid and SXid, experimental insights on T4 and H100 GPUs, and practical case studies for diagnosing and mitigating performance drops.

DCGM · GPU failures · GPU monitoring
0 likes · 26 min read
Liangxu Linux
Mar 16, 2020 · Operations

How to Monitor CPU and GPU Temperatures on Ubuntu with Sensors, Glances, and i7z

This guide explains how to install and use three command‑line tools (lm‑sensors, Glances, and i7z) on Ubuntu to monitor CPU and GPU temperatures, fan speeds, and other hardware metrics, with step‑by‑step commands and example outputs for diagnosing laptop cooling problems.

CPU temperature · GPU monitoring · Glances
0 likes · 6 min read
Qunar Tech Salon
Mar 27, 2019 · Artificial Intelligence

Profiling TensorFlow Performance with TensorBoard and Timeline

This article explains how to use TensorBoard and the Timeline tool to monitor TensorFlow GPU utilization, identify operation bottlenecks, and visualize execution times, including code examples and steps for exporting and merging profiling data for repeated runs.

GPU monitoring · TensorBoard · TensorFlow
0 likes · 7 min read