How to Build a Full‑Stack Observability Solution for AI Inference with Prometheus
This article explores the monitoring challenges of large‑scale AI inference services, outlines the key observability requirements, and provides a complete Prometheus‑based metric collection framework—including Ray Serve and vLLM integrations—to help developers build stable, high‑performance inference applications.
AI Inference Observability Requirements and Pain Points
With the rapid adoption of large language models (LLMs) such as DeepSeek, the demand for AI inference services has grown exponentially, exposing performance bottlenecks in both cloud‑based and self‑hosted deployments. Key observability needs include fine‑grained performance monitoring, resource usage tracking, model load/unload overhead, model behavior monitoring, and distributed architecture health.
Complete Prometheus‑Based Solution
Prometheus offers a multi‑dimensional data model, pull‑based collection, a rich ecosystem of exporters, built‑in alerting, and seamless integration with Grafana for visualization. By leveraging these features, developers can instrument AI inference services end‑to‑end.
Prometheus Advantages
Multi‑dimensional data model with label‑based filtering (see the sketch after this list).
Pull‑based collection that keeps scrape scheduling and timeouts under central control.
Extensive exporter ecosystem, plus frameworks such as Ray Serve and vLLM that expose Prometheus metrics natively.
Powerful Alertmanager for rule‑based alerts.
Grafana integration for intuitive dashboards.
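To make the first two points concrete, here is a minimal sketch of direct instrumentation with the official prometheus_client library; the metric names, labels, and port are illustrative, not part of any framework's built-in metrics:
import time
import random
from prometheus_client import Counter, Histogram, start_http_server

# Multi-dimensional data model: every label combination becomes its own
# time series, which PromQL can filter and aggregate.
REQUESTS = Counter("inference_requests_total", "Inference requests served",
                   ["model", "status"])
LATENCY = Histogram("inference_latency_seconds", "End-to-end request latency",
                    ["model"])

def handle_request(model: str) -> None:
    with LATENCY.labels(model=model).time():
        time.sleep(random.uniform(0.01, 0.05))  # stand-in for real inference work
    REQUESTS.labels(model=model, status="ok").inc()

if __name__ == "__main__":
    # Pull model: Prometheus scrapes http://<host>:9000/metrics on its own
    # schedule; the application never pushes anything.
    start_http_server(9000)
    while True:
        handle_request("deepseek-r1")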
Full‑Link Observability Practice
Ray Serve Integration
Ray Serve addresses flexibility, performance, and scalability shortcomings of traditional inference frameworks. Its features include dynamic scaling, multi‑model support, batch processing, and seamless Kubernetes deployment. Ray Serve already exposes a set of built‑in metrics in Prometheus format, such as request counters, error counters, and replica restarts. The export port is set when starting the cluster:
ray start --head --metrics-export-port=8080
In a Kubernetes environment, PodMonitor resources can scrape metrics from the head and worker nodes; the example below targets the workers:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: ray-workers-monitor
  labels:
    release: prometheus
spec:
  jobLabel: ray-workers
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      ray.io/node-type: worker
  podMetricsEndpoints:
    - port: metrics
Custom Metrics in Ray Serve
Beyond the built‑in metrics, developers can define custom counters, gauges, or histograms using ray.serve.metrics to track business‑specific dimensions such as model version, request priority, or token latency.
from ray import serve
from ray.serve import metrics

@serve.deployment
class MyDeployment:
    def __init__(self):
        self.num_requests = 0
        # Counter with a "model" tag so Prometheus can filter per model.
        self.my_counter = metrics.Counter(
            "my_counter",
            description="Number of odd-numbered requests",
            tag_keys=("model",),
        )
        self.my_counter.set_default_tags({"model": "123"})

    def __call__(self):
        # Custom logic: only odd-numbered requests increment the counter.
        self.num_requests += 1
        if self.num_requests % 2 == 1:
            self.my_counter.inc()
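To see the counter move, the deployment can be bound and served locally. A minimal sketch, assuming a local Ray cluster and Serve's default HTTP port 8000:
import requests
from ray import serve

# Deploy the instrumented application on the local cluster.
serve.run(MyDeployment.bind())

# Every second request is odd-numbered, so two of these increment my_counter.
for _ in range(4):
    requests.get("http://localhost:8000/")

# The counter now appears on each node's Prometheus metrics endpoint
# with the default tag model="123".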
vLLM Integration
vLLM is a high‑performance LLM inference engine that provides its own set of Prometheus metrics covering system state, iteration statistics, request latency, token processing, and speculative decoding. Example metric groups include:
Running and waiting request counts.
GPU/CPU KV cache usage percentages.
Token processing histograms (prompt, generation, total).
End‑to‑end request latency and queue times.
Speculative decoding acceptance rates and efficiency.
These metrics are exposed via the /metrics endpoint of the vLLM OpenAI‑compatible server.
$ curl http://0.0.0.0:8000/metrics
# HELP vllm:iteration_tokens_total Histogram of number of tokens per engine step.
# TYPE vllm:iteration_tokens_total histogram
vllm:iteration_tokens_total_sum{model_name="unsloth/Llama-3.2-1B-Instruct"} 0.0
...
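It can be useful to sanity-check this endpoint programmatically before wiring up cluster-level scraping. A small sketch using prometheus_client's exposition-format parser; the URL matches the curl example above, and filtering on the vllm: prefix is an assumption based on the metric namespace shown:
import requests
from prometheus_client.parser import text_string_to_metric_families

# Fetch the raw exposition text from the vLLM server.
raw = requests.get("http://0.0.0.0:8000/metrics", timeout=5).text

# Walk the parsed metric families and print the vLLM-specific series.
for family in text_string_to_metric_families(raw):
    if family.name.startswith("vllm:"):
        for sample in family.samples:
            print(sample.name, sample.labels, sample.value)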
When deployed in Kubernetes, a PodMonitor that selects pods labeled app: vllm-server can collect these metrics:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: vllm-monitor
  labels:
    release: prometheus
spec:
  jobLabel: vllm-monitor
  namespaceSelector:
    matchNames:
      - default
  selector:
    matchLabels:
      app: vllm-server
  podMetricsEndpoints:
    - port: metrics
Infrastructure‑Level Monitoring
GPU Device Monitoring
Key GPU metrics include utilization, memory usage, and temperature. Consistent collection across Alibaba Cloud GPU services (e.g., Elastic GPU, ECS GPU, PAI) is essential.
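In production these signals usually come from the NVIDIA DCGM exporter or the cloud provider's monitoring agent. Purely as an illustration of what gets collected, here is a sketch using the pynvml bindings (from nvidia-ml-py) and prometheus_client; the port and metric names are illustrative:
import time
import pynvml
from prometheus_client import Gauge, start_http_server

# Gauges mirroring the key GPU signals named above.
gpu_util = Gauge("gpu_utilization_percent", "GPU utilization", ["gpu"])
gpu_mem = Gauge("gpu_memory_used_bytes", "GPU memory in use", ["gpu"])
gpu_temp = Gauge("gpu_temperature_celsius", "GPU temperature", ["gpu"])

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

start_http_server(9100)  # Prometheus scrapes this port.
while True:
    util = pynvml.nvmlDeviceGetUtilizationRates(handle)
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
    gpu_util.labels(gpu="0").set(util.gpu)
    gpu_mem.labels(gpu="0").set(mem.used)
    gpu_temp.labels(gpu="0").set(temp)
    time.sleep(15)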
Compute Node Monitoring
Monitor high‑performance network, CPU, memory, and disk I/O. Prometheus node discovery can automatically scrape new nodes and apply a full set of probes.
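node_exporter typically supplies these probes. For the discovery side, one option is Prometheus' http_sd_configs, which polls a small inventory endpoint so new nodes are picked up automatically. A minimal sketch with the standard library; the addresses, labels, and port are placeholders:
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Targets would normally come from your inventory or cloud API;
# the addresses here are placeholders.
def current_nodes():
    return [
        {"targets": ["10.0.0.11:9100", "10.0.0.12:9100"],
         "labels": {"role": "gpu-worker"}},
    ]

class SDHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the JSON list format that http_sd_configs expects.
        body = json.dumps(current_nodes()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# Point a Prometheus http_sd_configs entry at http://<host>:8900/
HTTPServer(("", 8900), SDHandler).serve_forever()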
Kubernetes Orchestration Monitoring
Combining Ray Serve and vLLM on Kubernetes leverages Ray’s dynamic scaling and vLLM’s GPU efficiency. Kubernetes HPA can further auto‑scale pods based on load, while Prometheus monitors API server, scheduler, controller, node, and pod metrics, including eBPF‑based network analysis.
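Custom-metric autoscaling usually goes through an adapter such as prometheus-adapter rather than raw queries, but the underlying signal is just a PromQL query against Prometheus' HTTP API. A sketch of reading such a signal, assuming an in-cluster Prometheus address and using vLLM's queue-depth metric:
import requests

PROM_URL = "http://prometheus.monitoring:9090"  # assumed in-cluster address

# Average queue depth across vLLM replicas over the last 5 minutes.
QUERY = "avg(avg_over_time(vllm:num_requests_waiting[5m]))"

resp = requests.get(f"{PROM_URL}/api/v1/query",
                    params={"query": QUERY}, timeout=5)
result = resp.json()["data"]["result"]
waiting = float(result[0]["value"][1]) if result else 0.0

# A controller (or prometheus-adapter) would compare this value against a
# target to decide how many replicas the HPA should request.
print(f"average waiting requests: {waiting:.2f}")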
Conclusion and Outlook
Full‑link monitoring of AI inference services must cover the entire path from traffic entry to GPU compute. Prometheus provides a flexible, multi‑stack compatible foundation for collecting, alerting, and visualizing these metrics. Future work will integrate AI‑driven analysis of monitoring data to automatically detect bottlenecks and trigger optimization actions, reducing operational complexity and enabling self‑optimizing inference platforms.