How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus

This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.

Alibaba Cloud Developer

AI Inference Observability Requirements and Pain Points

With the rapid adoption of large language models (LLMs) such as DeepSeek, the demand for AI inference services has grown exponentially, exposing performance bottlenecks in both cloud‑based and self‑hosted deployments. Key observability needs include fine‑grained performance monitoring, resource usage tracking, model load/unload overhead, model behavior monitoring, and distributed architecture health.

Complete Prometheus‑Based Solution

Prometheus offers a multi‑dimensional data model, pull‑based collection, a rich ecosystem of exporters, built‑in alerting, and seamless integration with Grafana for visualization. By leveraging these features, developers can instrument AI inference services end‑to‑end.

Prometheus Advantages

Multi‑dimensional data model with label‑based filtering.

Efficient pull-based collection that makes unreachable targets immediately visible as scrape failures.

Extensive exporter ecosystem (e.g., Ray Serve, vLLM).

Powerful Alertmanager for rule‑based alerts.

Grafana integration for intuitive dashboards.
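To make these pieces concrete, the following is a minimal sketch of a static Prometheus scrape job for an inference service exposing a /metrics endpoint. The job name and target address are placeholders, not values from any specific deployment:

# prometheus.yml (fragment) -- hypothetical static scrape target
scrape_configs:
  - job_name: ai-inference          # placeholder job name
    metrics_path: /metrics
    scrape_interval: 15s
    static_configs:
      - targets:
          - inference-svc:8000      # placeholder host:port of the inference server

In Kubernetes environments, the PodMonitor resources shown later replace this static configuration with label-based service discovery.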

Full‑Link Observability Practice

Ray Serve Integration

Ray Serve addresses flexibility, performance, and scalability shortcomings of traditional inference frameworks. Its features include dynamic scaling, multi-model support, batch processing, and seamless Kubernetes deployment. Ray Serve already exposes a set of built-in metrics in Prometheus format, such as request counters, error counters, and replica restarts. To enable metric export, start the head node with an explicit export port:

ray start --head --metrics-export-port=8080

In a Kubernetes environment, a PodMonitor can be used to scrape metrics from both head and worker nodes.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  labels:
    release: prometheus
spec:
  jobLabel: ray-workers
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: worker
  podMetricsEndpoints:
  - port: metrics
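
The PodMonitor above selects worker pods only. To also scrape the head node, as the text suggests, a second PodMonitor can target head pods. This sketch assumes the KubeRay default label ray.io/node-type: head; verify the actual labels on your cluster before applying it:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-head-monitor
  labels:
    release: prometheus
spec:
  jobLabel: ray-head
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: head    # assumed KubeRay head-node label
  podMetricsEndpoints:
  - port: metrics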

Custom Metrics in Ray Serve

Beyond the built‑in metrics, developers can define custom counters, gauges, or histograms using ray.serve.metrics to track business‑specific dimensions such as model version, request priority, or token latency.

from ray import serve
from ray.serve import metrics

@serve.deployment
class MyDeployment:
    def __init__(self):
        # Custom counter exported in Prometheus format alongside
        # Ray Serve's built-in metrics.
        self.my_counter = metrics.Counter(
            "my_counter",
            description="Total number of requests to this deployment",
            tag_keys=("model",),
        )
        # Default tag values applied when inc() is called without
        # explicit tags.
        self.my_counter.set_default_tags({"model": "123"})

    def __call__(self):
        # Business logic goes here; increment the counter once per request.
        self.my_counter.inc()
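
Once such a counter is scraped, it can be aggregated per model with a recording rule. Note an assumption here: Ray typically exports user-defined metrics with a ray_ prefix, so my_counter may appear as ray_my_counter; confirm the exact exported name against the /metrics endpoint before relying on this rule:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: serve-custom-metrics-rules
  labels:
    release: prometheus
spec:
  groups:
  - name: serve.rules
    rules:
    # Per-model request rate over 5 minutes; metric name assumes
    # Ray's "ray_" prefix for custom metrics.
    - record: model:my_counter:rate5m
      expr: sum by (model) (rate(ray_my_counter[5m]))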

vLLM Integration

vLLM is a high‑performance LLM inference engine that provides its own set of Prometheus metrics covering system state, iteration statistics, request latency, token processing, and speculative decoding. Example metric groups include:

Running and waiting request counts.

GPU/CPU KV cache usage percentages.

Token processing histograms (prompt, generation, total).

End‑to‑end request latency and queue times.

Speculative decoding acceptance rates and efficiency.

These metrics are exposed via the /metrics endpoint of the vLLM OpenAI‑compatible server.

$ curl http://0.0.0.0:8000/metrics
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
...

When deployed in Kubernetes, a PodMonitor with the label app: vllm-server can collect these metrics.

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-monitor
  labels:
    release: prometheus
spec:
  jobLabel: vllm-monitor
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      app: vllm-server
  podMetricsEndpoints:
  - port: metrics
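
Beyond collection, these metrics feed naturally into alerting. As one hedged example, vLLM's gpu_cache_usage_perc gauge (reported as a 0–1 fraction) can drive an alert when the KV cache is close to exhaustion; adjust the threshold and duration to your workload:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alert-rules
  labels:
    release: prometheus
spec:
  groups:
  - name: vllm.alerts
    rules:
    - alert: VLLMKVCacheNearFull
      # Fires if GPU KV cache usage stays above 90% for 5 minutes.
      expr: vllm:gpu_cache_usage_perc > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "vLLM GPU KV cache usage above 90%"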

Infrastructure‑Level Monitoring

GPU Device Monitoring

Key GPU metrics include utilization, memory usage, and temperature. Consistent collection across Alibaba Cloud GPU services (e.g., Elastic GPU, ECS GPU, PAI) is essential.
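A common way to collect these GPU metrics is NVIDIA's dcgm-exporter, which exposes gauges such as DCGM_FI_DEV_GPU_UTIL in Prometheus format. The sketch below assumes the exporter runs as a DaemonSet with an app: nvidia-dcgm-exporter pod label and a port named metrics; both are deployment-specific and should be verified:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: dcgm-exporter-monitor
  labels:
    release: prometheus
spec:
  jobLabel: dcgm-exporter
  namespaceSelector:
    matchNames:
    - gpu-monitoring            # assumed namespace
  selector:
    matchLabels:
      app: nvidia-dcgm-exporter # assumed pod label
  podMetricsEndpoints:
  - port: metrics               # assumed port name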

Compute Node Monitoring

Monitor high‑performance network, CPU, memory, and disk I/O. Prometheus node discovery can automatically scrape new nodes and apply a full set of probes.
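The node-discovery pattern mentioned above can be sketched with Prometheus's Kubernetes service discovery, rewriting each discovered node address to the node-exporter's default port 9100. This fragment is illustrative; managed Prometheus offerings usually ship an equivalent job out of the box:

# prometheus.yml (fragment) -- auto-discover cluster nodes
scrape_configs:
  - job_name: node-exporter
    kubernetes_sd_configs:
      - role: node
    relabel_configs:
      # Replace the kubelet port with node-exporter's default port 9100.
      - source_labels: [__address__]
        regex: '(.*):.*'
        replacement: '${1}:9100'
        target_label: __address__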

Kubernetes Orchestration Monitoring

Combining Ray Serve and vLLM on Kubernetes leverages Ray’s dynamic scaling and vLLM’s GPU efficiency. Kubernetes HPA can further auto‑scale pods based on load, while Prometheus monitors API server, scheduler, controller, node, and pod metrics, including eBPF‑based network analysis.
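As a sketch of the HPA path, the manifest below scales a hypothetical vllm-server Deployment on a per-pod queue-depth metric. This assumes vLLM's num_requests_waiting gauge has been exposed to the HPA through an adapter such as prometheus-adapter, which also rewrites the metric name (Prometheus metric names with colons are not valid custom-metric names); both the Deployment name and the adapter-side metric name are placeholders:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-server             # placeholder Deployment name
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: vllm_num_requests_waiting   # assumed adapter-exposed name
      target:
        type: AverageValue
        averageValue: "5"         # scale out above ~5 queued requests/pod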

Conclusion and Outlook

Full‑link monitoring of AI inference services must cover the entire path from traffic entry to GPU compute. Prometheus provides a flexible, multi‑stack compatible foundation for collecting, alerting, and visualizing these metrics. Future work will integrate AI‑driven analysis of monitoring data to automatically detect bottlenecks and trigger optimization actions, reducing operational complexity and enabling self‑optimizing inference platforms.
