
How Huatuo Now Monitors MetaX GPUs for Cloud‑Native AI Workloads

Huatuo, the open‑source deep‑observability platform backed by Didi, now supports real‑time monitoring of MetaX GPUs, offering detailed hardware metrics via Docker or Kubernetes deployments and exposing them through a /metrics endpoint for cloud‑native AI and operations use cases.

Didi Tech

Project Overview

Huatuo is an open‑source deep‑observability project initiated by Didi and incubated by the China Computer Federation (CCF). It provides kernel‑level monitoring for cloud‑native general computing, AI workloads, and core services, covering components such as GPU, CPU, caches, TLB, memory ECC, PCIe, NIC links, and ACPI.

MetaX GPU Support

Huatuo now integrates with MetaX GPUs via the libmxsml library. When enabled, it can collect real‑time GPU information including model, identifier, driver version, power consumption, temperature, utilization, clock frequencies, PCIe bandwidth, and MetaXLink communication metrics.

Exposed Metrics

GPU basic info: model, identifier, driver version
GPU status: power, temperature, utilization, clock frequencies
GPU communication: PCIe speed/bandwidth, MetaXLink speed/bandwidth

Container Deployment

To enable MetaX GPU monitoring in a container, mount the required system paths and run the Huatuo image. Example Docker command:

docker run --privileged --cgroupns=host --network=host \
    -v /sys:/sys \
    -v /proc:/proc \
    -v /run:/run \
    -v /opt/maca:/opt/maca \
    -v /opt/mxdriver:/opt/mxdriver \
    -v /dev/dri:/dev/dri \
    huatuo/huatuo-bamai:latest
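
Before starting the container, it can help to confirm that the mount sources in the command above actually exist on the host. The following is a minimal sketch, not part of Huatuo itself: the helper name missing_paths is hypothetical, and it only checks that each path exists, not that the MetaX driver is working.

```shell
# Hypothetical pre-flight helper (not shipped with Huatuo).
# Prints each given path that does not exist on the host, one per line,
# and returns non-zero if at least one path is missing.
missing_paths() {
    rc=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            printf '%s\n' "$p"
            rc=1
        fi
    done
    return "$rc"
}

# Usage, with the mount sources from the docker command above:
#   missing_paths /sys /proc /run /opt/maca /opt/mxdriver /dev/dri \
#       || echo "some mount sources are missing"
```

If the helper prints nothing, every mount source is present and the docker run command can bind all of them.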

In Kubernetes, create the appropriate PersistentVolume and PersistentVolumeClaim, then query the service’s /metrics endpoint. If the response contains metrics prefixed with metax_, GPU data collection is working.
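
The metax_ prefix check can be scripted. The sketch below assumes the endpoint returns the usual Prometheus text format, where each sample line begins with the metric name. The function name count_metax_samples and the HUATUO_ADDR variable are illustrative placeholders, so set the address to wherever your Huatuo instance listens.

```shell
# Hypothetical helper: reads metrics text from stdin and prints the number
# of sample lines whose metric name starts with "metax_".
# "# HELP" and "# TYPE" comment lines are not counted because they start with "#".
count_metax_samples() {
    # grep -c exits non-zero when the count is 0; "|| true" keeps the
    # function's exit status at 0 while still printing "0".
    grep -c '^metax_' || true
}

# Usage (HUATUO_ADDR is a placeholder such as host:port of your deployment):
#   curl -s "http://${HUATUO_ADDR}/metrics" | count_metax_samples
```

A result of 0 suggests the MetaX collector is not producing data, for example because a driver path was not mounted into the container.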

Metric Index Definitions

GPU index: starts at 0 for Native and VF (virtual function) modes, at 100 for PF (physical function) mode
CE: Correctable Errors
UE: Uncorrectable Errors
MetaXLink: proprietary GPU‑to‑GPU interconnect, indices start at 1
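
The GPU index convention above can be turned into a small lookup when post-processing metrics. This is only a sketch under one stated assumption: a node in Native or VF mode has fewer than 100 GPUs, so any index of 100 or more can be read as PF mode. The function name gpu_index_mode is hypothetical.

```shell
# Hypothetical helper: maps a GPU index to the mode implied by the
# numbering convention (0-based for Native/VF, 100-based for PF).
# Assumption: Native/VF indices never reach 100 on a single node.
gpu_index_mode() {
    case "$1" in
        ''|*[!0-9]*)
            # Reject empty or non-numeric input.
            echo "invalid"
            return 1
            ;;
    esac
    if [ "$1" -ge 100 ]; then
        echo "PF"
    else
        echo "Native/VF"
    fi
}

# Usage:
#   gpu_index_mode 3      # prints Native/VF
#   gpu_index_mode 101    # prints PF
```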

Repository

Huatuo project GitHub: https://github.com/ccfos/huatuo
