
How Alibaba Cloud Prometheus Enables Ultra‑Fast Host Monitoring for Supercomputing

This article explains the unique business characteristics of large‑scale supercomputing workloads, outlines the observability challenges they pose, and details how Alibaba Cloud Prometheus host monitoring provides automated service discovery, rapid probe deployment, fine‑grained metrics, and ready‑to‑use Grafana dashboards to achieve second‑level monitoring at massive scale.

Alibaba Cloud Observability

May 29, 2024


Business Characteristics of Supercomputing Scenarios

Large‑scale computing: thousands of processor cores cooperate to split and accelerate tasks, with elastic task scheduling on cloud ECS instances.

High performance and throughput: sustained high‑throughput processing for big‑data analytics, climate simulation, bio‑informatics, etc.

Elastic computing: workloads vary from hours to days; resources are provisioned on demand and released after completion.

Business peaks and troughs: demand spikes during specific periods and drops during others, causing obvious workload fluctuations.

Mixed compute tasks: simultaneous use of CPU, GPU, RDMA and other heterogeneous resources.

Observability Challenges in Supercomputing

Fine‑grained monitoring: second‑level tracking of node status, load, network latency, etc.

Process‑level monitoring: visibility into the resource consumption of individual compute processes and their threads.

Automated service discovery: instant identification of newly added or removed nodes during elastic scaling.

Automatic probe deployment: rapid installation of appropriate exporters (Node, Process, GPU, middleware) on new hosts.

Data tag classification: attaching metadata such as organization, environment, and business tags to metrics for better filtering and aggregation.
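The service‑discovery challenge above can be pictured as a diff between two snapshots of the host inventory. The following is a minimal sketch of that idea, not Alibaba Cloud's implementation; the `Host` record, its fields, and `diff_inventory` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    """Hypothetical inventory record; the field names are assumptions."""
    instance_id: str
    ip: str
    has_gpu: bool = False


def diff_inventory(previous, current):
    """Return (added, removed) hosts between two inventory snapshots.

    Hosts are matched by instance_id, so an IP change alone is not
    treated as a new host.
    """
    prev_by_id = {h.instance_id: h for h in previous}
    curr_by_id = {h.instance_id: h for h in current}
    added_ids = sorted(curr_by_id.keys() - prev_by_id.keys())
    removed_ids = sorted(prev_by_id.keys() - curr_by_id.keys())
    added = [curr_by_id[i] for i in added_ids]
    removed = [prev_by_id[i] for i in removed_ids]
    return added, removed


if __name__ == "__main__":
    before = [Host("i-001", "10.0.0.1"), Host("i-002", "10.0.0.2")]
    after = [Host("i-002", "10.0.0.2"), Host("i-003", "10.0.0.3", has_gpu=True)]
    added, removed = diff_inventory(before, after)
    print("start scraping:", [h.instance_id for h in added])
    print("stop scraping:", [h.instance_id for h in removed])
```

Running such a diff on every discovery cycle yields the two lists a collector needs: hosts that require exporters and scrape targets, and hosts whose collection should stop.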

Alibaba Cloud Prometheus Host Monitoring Solution

Alibaba Cloud Prometheus provides a comprehensive host‑monitoring solution for ECS instances, on‑premises servers, and servers hosted on other clouds. It automatically installs open‑source exporters based on host type, manages a hosted Prometheus Agent for unified data collection, and supports unified storage, visualization, and alerting.

Key Features

Second‑level host discovery: automatic service discovery adapts to dynamic cloud resources, so new instances are brought under monitoring shortly after they start.

Second‑level probe installation: exporters are installed automatically, enabling immediate metric collection without manual intervention.

Second‑level metric collection: collection configuration is generated automatically, which simplifies setup and allows the collection interval to be adjusted from 1 to 60 seconds.

Serverless probe management: a hosted Prometheus Agent centralizes data collection, reducing operational overhead.

Intelligent metric tags: host tags (region, resource group, etc.) are injected automatically, and users can add custom business or environment tags.

Massive‑scale data ingestion and storage: supports very large host fleets with dynamic resource allocation and high‑performance queries.

End‑to‑end monitoring data: integrates hardware, OS, application, and external service metrics for full‑stack observability.

Process‑level monitoring: tracks CPU, memory, disk I/O, start time, file handles, thread counts, and more for each process.

Built‑in Grafana dashboards: ready‑to‑use panels for ECS overview, ECS detail, GPU overview, GPU detail, and node‑process views.

A host typically joins monitoring within 30 to 60 seconds, and the metric collection interval can be set anywhere from 1 to 60 seconds.
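To get a feel for what the choice of collection interval means at scale, the arithmetic below estimates ingestion volume for a given fleet size and interval. The 500‑host fleet echoes the scaling event cited in this article; the figure of 1,000 time series per host is purely an illustrative assumption, and the function names are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(hosts, series_per_host, interval_seconds):
    """Average ingest rate: every series yields one sample per scrape."""
    if not 1 <= interval_seconds <= 60:
        raise ValueError("interval must be within the 1-60 s range")
    return hosts * series_per_host / interval_seconds


def samples_per_day(hosts, series_per_host, interval_seconds):
    """Total samples ingested over 24 hours at a steady rate."""
    return samples_per_second(hosts, series_per_host, interval_seconds) * SECONDS_PER_DAY


if __name__ == "__main__":
    # Assumed fleet: 500 hosts, 1,000 series per host (illustrative only).
    for interval in (1, 15, 60):
        rate = samples_per_second(500, 1_000, interval)
        daily = samples_per_day(500, 1_000, interval)
        print(f"{interval:>2} s interval: {rate:,.0f} samples/s, {daily:,.0f} samples/day")
```

Under these assumptions, moving from a 60‑second to a 1‑second interval multiplies ingestion by 60, which is why second‑level collection depends on the storage and query capacity described above.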

Integration and Access

In the real‑time monitoring console, users select the ECS environment and add hosts; the system then automatically installs the appropriate exporters (Node‑exporter and Process‑exporter for CPU hosts; GPU‑exporter for GPU hosts). Multiple service‑discovery methods are supported, and the console displays the installation and running status of each exporter.
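The exporter selection and tag injection described here can be sketched as code that emits target groups in the JSON layout used by Prometheus file‑based service discovery (a list of objects with `targets` and `labels`). This is an illustration under stated assumptions rather than the managed service's actual behavior: the host dictionaries and function names are hypothetical, the ports are the conventional defaults of the open‑source exporters, and the sketch assumes GPU hosts receive the GPU exporter in addition to the node and process exporters.

```python
import json

# Conventional default ports of the open-source exporters. The ports used by
# the managed service are not stated in the article, so these are assumptions.
EXPORTER_PORTS = {
    "node-exporter": 9100,
    "process-exporter": 9256,
    "gpu-exporter": 9400,  # assumed: the default port of NVIDIA's dcgm-exporter
}


def exporters_for(has_gpu):
    """Pick exporters by host type (GPU hosts also keep the CPU set: assumption)."""
    names = ["node-exporter", "process-exporter"]
    if has_gpu:
        names.append("gpu-exporter")
    return names


def build_target_groups(hosts, custom_labels=None):
    """Build file_sd-style target groups with host and custom labels attached."""
    groups = []
    for host in hosts:
        for name in exporters_for(host["has_gpu"]):
            labels = {
                "exporter": name,
                "region": host["region"],
                "resource_group": host["resource_group"],
            }
            labels.update(custom_labels or {})
            target = f'{host["ip"]}:{EXPORTER_PORTS[name]}'
            groups.append({"targets": [target], "labels": labels})
    return groups


if __name__ == "__main__":
    fleet = [
        {"ip": "10.0.0.1", "has_gpu": False, "region": "cn-hangzhou", "resource_group": "rg-sim"},
        {"ip": "10.0.0.2", "has_gpu": True, "region": "cn-hangzhou", "resource_group": "rg-sim"},
    ]
    print(json.dumps(build_target_groups(fleet, {"env": "prod"}), indent=2))
```

Because the labels travel with each target, every metric scraped from a host can later be filtered or aggregated by region, resource group, or any custom tag.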

Performance Highlights

Rapid service discovery: in an elastic scaling event of about 500 hosts, new nodes are detected within one minute.

Fast exporter deployment: exporters are installed in under a minute, so metrics are produced in near real time.

Low observability latency: metrics become visible within two minutes of node creation.

Fast collection shutdown: data collection for decommissioned nodes also stops within two minutes.

High concurrency: the system processes many hosts in parallel to handle large‑scale expansions.

Summary

By providing adaptive service discovery, automated exporter deployment, auto‑generated Prometheus configurations, and a managed Prometheus Agent, Alibaba Cloud Prometheus delivers a fast, reliable, and scalable host‑monitoring solution that meets the demanding observability needs of supercomputing workloads.

Alibaba Cloud Prometheus host monitoring architecture

