Achieving Full Observability for AI Inference Apps with Prometheus
This article explores the observability challenges of AI inference services, outlines a comprehensive Prometheus‑based metric collection strategy, and demonstrates practical monitoring implementations for Ray Serve, vLLM, GPU resources, and custom metrics to build stable, high‑performance inference pipelines.
AI Inference Observability Challenges
With the rapid rise of large language models (LLMs) such as DeepSeek, AI inference workloads have exploded, exposing performance bottlenecks in latency, throughput, resource usage, model loading, and distributed coordination.
Prometheus‑Based Monitoring Solution
Prometheus offers a multi-dimensional data model, pull-based collection, a rich ecosystem, alerting, and Grafana visualization, making it well suited to cloud-native AI inference services.
Multi‑dimensional metrics via labels (e.g., GPU ID, model name, request type).
Efficient pull mechanism avoids data loss.
Extensive exporters and client libraries integrate with Ray Serve, vLLM, and other frameworks.
Alertmanager provides rule‑based alerts.
Grafana dashboards give real‑time insight.
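As a minimal sketch of the multi-dimensional label model, the official Prometheus Python client can expose a counter broken down by GPU ID and model name. The metric and label names below are illustrative, not taken from any specific framework:

```python
from prometheus_client import CollectorRegistry, Counter, generate_latest

# Use a dedicated registry so the example is self-contained.
registry = CollectorRegistry()

# Hypothetical metric: inference requests labeled by GPU and model.
INFERENCE_REQUESTS = Counter(
    "inference_requests_total",
    "Total inference requests served",
    ["gpu_id", "model"],
    registry=registry,
)

# Each distinct label combination becomes its own time series.
INFERENCE_REQUESTS.labels(gpu_id="0", model="deepseek-r1").inc()
INFERENCE_REQUESTS.labels(gpu_id="1", model="deepseek-r1").inc(2)

# generate_latest() renders the text exposition format that a
# Prometheus scraper would pull from a /metrics endpoint.
exposition = generate_latest(registry).decode()
print(exposition)
```

In a real service the registry would be served over HTTP (for example via `prometheus_client.start_http_server`) rather than printed.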
Ray Serve Full‑Stack Monitoring
Ray Serve exposes built‑in metrics (e.g., request counters, error counters, replica starts) in Prometheus format. Custom metrics can be added via ray.serve.metrics to track model‑specific tags, request types, or priority levels.
from ray import serve
from ray.serve import metrics

import time
import requests


@serve.deployment
class MyDeployment:
    def __init__(self):
        self.num_requests = 0
        self.my_counter = metrics.Counter(
            "my_counter",
            description="Number of odd-numbered requests",
            tag_keys=("model",),
        )
        self.my_counter.set_default_tags({"model": "123"})

    def __call__(self):
        self.num_requests += 1
        if self.num_requests % 2 == 1:
            self.my_counter.inc()


my_deployment = MyDeployment.bind()
serve.run(my_deployment)

while True:
    requests.get("http://localhost:8000/")
    time.sleep(1)

In a Kubernetes cluster, PodMonitor resources can automatically scrape metrics from the Ray Head and Worker pods.
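A PodMonitor for this setup might look roughly like the following sketch. The pod label selector, port name, and `release` label are assumptions and must match your actual RayCluster pod labels, metrics port, and Prometheus Operator configuration:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-metrics
  labels:
    release: prometheus            # assumed: must match the Operator's podMonitorSelector
spec:
  selector:
    matchLabels:
      ray.io/cluster: my-raycluster  # assumed label on the Ray Head/Worker pods
  podMetricsEndpoints:
    - port: metrics                  # assumed name of the pods' metrics port
```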
vLLM Built‑In Metrics
vLLM provides a comprehensive set of Prometheus metrics covering system state, iteration statistics, request latency, token processing, and speculative decoding. Examples include vllm:num_requests_running, vllm:gpu_cache_usage_perc, vllm:e2e_request_latency_seconds, and vllm:spec_decode_efficiency. These metrics enable fine‑grained performance analysis of large‑scale model serving.
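These metrics are served in the standard Prometheus text exposition format on vLLM's /metrics endpoint, so they can be scraped directly or inspected ad hoc. A rough sketch of parsing such output with the official client library's parser; the sample payload below is fabricated for illustration:

```python
from prometheus_client.parser import text_string_to_metric_families

# Fabricated snippet of what a vLLM /metrics response can contain;
# real output has many more metric families and labels.
sample = """\
# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-r1"} 3.0
# HELP vllm:gpu_cache_usage_perc GPU KV-cache usage.
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="deepseek-r1"} 0.42
"""

# Flatten the parsed families into a name -> value map.
values = {
    s.name: s.value
    for family in text_string_to_metric_families(sample)
    for s in family.samples
}
print(values)
```

In production these values would be pulled by the Prometheus server itself; parsing by hand like this is mainly useful for debugging or one-off scripts.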
Custom Metric Implementation
Two approaches are recommended:
Leverage Ray’s ray.util.metrics utilities when vLLM runs inside Ray.
Use the Prometheus Python client directly for standalone services.
A reference implementation can be found in the vLLM source (vllm/engine/metrics.py).
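For the standalone case, the following is a hedged sketch using the official Prometheus Python client; the metric names, labels, and histogram buckets are illustrative and not taken from vLLM:

```python
import time

from prometheus_client import CollectorRegistry, Counter, Histogram, generate_latest

registry = CollectorRegistry()

REQUESTS = Counter(
    "inference_requests_total",
    "Total inference requests",
    ["model"],
    registry=registry,
)
LATENCY = Histogram(
    "inference_request_latency_seconds",
    "End-to-end request latency",
    ["model"],
    buckets=(0.1, 0.5, 1.0, 2.5, 5.0),  # illustrative bucket boundaries
    registry=registry,
)


def handle_request(model: str) -> None:
    REQUESTS.labels(model=model).inc()
    # Histogram.time() observes the elapsed time of the block.
    with LATENCY.labels(model=model).time():
        time.sleep(0.01)  # stand-in for the actual inference work


handle_request("deepseek-r1")

# In a real service, call prometheus_client.start_http_server(port,
# registry=registry) once at startup so Prometheus can scrape /metrics.
text = generate_latest(registry).decode()
print(text)
```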
Infrastructure‑Level Monitoring
Beyond application metrics, monitor GPU utilization, memory, temperature, node CPU/memory/I/O, and network latency using Prometheus node exporters and eBPF probes. Kubernetes HPA can be combined with Ray Serve’s autoscaling for elastic scaling.
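At this layer, alerting rules are typically written against exporter metrics. A sketch of a Prometheus alerting rule over NVIDIA's dcgm-exporter GPU utilization gauge; the threshold and duration are illustrative, not recommended values:

```yaml
groups:
  - name: gpu-alerts
    rules:
      - alert: GpuUtilizationSaturated
        expr: avg by (gpu) (DCGM_FI_DEV_GPU_UTIL) > 95   # illustrative threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "GPU {{ $labels.gpu }} above 95% utilization for 10 minutes"
```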
Conclusion & Future Outlook
Full‑stack observability—from gateway to GPU—ensures stable, high‑throughput AI inference. Future work includes AI‑driven analysis of monitoring data to auto‑detect bottlenecks and trigger optimization actions, reducing operational complexity and supporting massive scale deployments.
Alibaba Cloud Observability
Driving continuous progress in observability technology!