How to Monitor Large Model Applications: A Beginner‑Friendly Metric System
This guide walks you through building a production‑grade monitoring solution for large language model inference services using a three‑layer metric hierarchy, Prometheus, Grafana, DCGM Exporter, and custom Python metrics, with step‑by‑step deployment, alerting policies, and real‑world troubleshooting examples.
1. Overview
Deploying a large‑model inference service is only the first step; real challenges appear once traffic starts flowing. High GPU utilization alone does not guarantee a healthy service – latency matters. The article proposes a layered monitoring system that lets operators detect problems before users notice them.
1.1 Monitoring Layers
Infrastructure layer : hardware health (GPU utilization, memory, temperature, power, ECC errors).
Inference engine layer : model‑service health (QPS, TTFT, E2E latency, KV‑Cache usage, queue depth).
Application layer : business‑level metrics (input/output token counts, error rate, request routing, cost estimation).
The three layers are independent but correlated, enabling fast root‑cause analysis.
2. Detailed Implementation
2.1 Architecture Design
The architecture consists of three scrape jobs (DCGM Exporter, Node Exporter, vLLM) feeding a single Prometheus instance, which stores data locally and optionally forwards it to a remote store (e.g., VictoriaMetrics). Grafana reads from Prometheus to display three dashboards (GPU, inference engine, business).
# Prometheus scrape configuration (simplified)
scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
    scrape_interval: 15s
    scrape_timeout: 8s
  - job_name: "node-exporter"
    static_configs:
      - targets: ["gpu-node-01:9100", "gpu-node-02:9100"]
    scrape_interval: 15s
    scrape_timeout: 8s
  - job_name: "vllm"
    static_configs:
      - targets: ["gpu-node-01:8000", "gpu-node-02:8000"]
    scrape_interval: 15s
    scrape_timeout: 8s
2.2 Infrastructure Metrics
Key DCGM metrics are collected with the official dcgm-exporter Docker image.
# Example Docker run for DCGM Exporter
docker run -d \
  --name dcgm-exporter \
  --gpus all \
  --restart always \
  -p 9400:9400 \
  --cap-add SYS_ADMIN \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.6.0-4.5.0-ubuntu22.04 \
  -f /etc/dcgm-exporter/dcp-metrics-included.csv
Important metrics include:
DCGM_FI_DEV_GPU_UTIL : GPU utilization (%).
DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE : used and free video memory (MiB).
DCGM_FI_DEV_GPU_TEMP : GPU temperature (°C).
DCGM_FI_DEV_POWER_USAGE : power draw (W).
DCGM_FI_DEV_ECC_SBE_VOL_TOTAL : cumulative ECC single-bit errors.
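To make the memory metrics concrete, here is a minimal sketch of deriving a used-memory percentage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE samples in Prometheus text exposition format. The sample text, its reduced label set, and the helper names parse_samples and fb_used_percent are illustrative assumptions, not output captured from a real exporter.

```python
import re

# Hypothetical exposition text; real dcgm-exporter output carries more labels.
SAMPLE = """\
DCGM_FI_DEV_FB_USED{gpu="0"} 70000
DCGM_FI_DEV_FB_FREE{gpu="0"} 10000
DCGM_FI_DEV_FB_USED{gpu="1"} 20000
DCGM_FI_DEV_FB_FREE{gpu="1"} 60000
"""

# Matches only the simplified "name{gpu="N"} value" shape used in SAMPLE.
LINE = re.compile(r'^(\w+)\{gpu="(\d+)"\}\s+([0-9.eE+-]+)$')


def parse_samples(text):
    """Return {(metric_name, gpu_index): value} for the simplified lines above."""
    samples = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            samples[(match.group(1), match.group(2))] = float(match.group(3))
    return samples


def fb_used_percent(samples, gpu):
    """Used framebuffer memory as a percentage of used + free (both in MiB)."""
    used = samples[("DCGM_FI_DEV_FB_USED", gpu)]
    free = samples[("DCGM_FI_DEV_FB_FREE", gpu)]
    return used / (used + free) * 100


samples = parse_samples(SAMPLE)
for gpu in ("0", "1"):
    print(f"GPU {gpu}: {fb_used_percent(samples, gpu):.1f}% of framebuffer memory in use")
```

The same ratio, used / (used + free), is what a PromQL expression over these two metrics would compute per GPU.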
2.3 Inference Engine Metrics (vLLM)
vLLM 0.7.x exposes a /metrics endpoint that already provides most of the metrics needed at this layer.
# Note: vLLM itself exports these series with a "vllm:" prefix (for example vllm:num_requests_waiting);
# adjust the names below to match what your /metrics endpoint actually shows.
# QPS (successful requests per second, averaged over 5 minutes)
rate(vllm_request_success_total[5m])
# TTFT P99 (seconds)
histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m]))
# KV-Cache usage (percentage)
vllm_gpu_cache_usage_perc * 100
# Queue depth (waiting requests)
vllm_num_requests_waiting
Observed baselines (from production): TTFT P50 = 150-250 ms, P95 = 400-600 ms, P99 = 800-1200 ms. A TTFT above 3 s triggers user complaints, and above 5 s the service is considered unavailable.
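histogram_quantile estimates a percentile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The sketch below reimplements that idea in plain Python for a single series and then classifies the result against the 3 s and 5 s thresholds quoted above. The bucket counts, function names, and status strings are assumptions for illustration; Prometheus handles further edge cases that are not covered here.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative (upper_bound, count) pairs.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the bucket that holds the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


def classify_ttft(seconds):
    """Map a TTFT value to the thresholds quoted in this section."""
    if seconds > 5:
        return "service-unavailable"
    if seconds > 3:
        return "user-complaints"
    return "ok"


# Made-up cumulative TTFT bucket counts (upper bound in seconds, count).
ttft_buckets = [(0.25, 400), (0.5, 800), (1.0, 950), (2.5, 990), (5.0, 1000), (math.inf, 1000)]

p50 = histogram_quantile(0.5, ttft_buckets)
p99 = histogram_quantile(0.99, ttft_buckets)
print(f"P50 = {p50:.3f} s, P99 = {p99:.2f} s, status = {classify_ttft(p99)}")
```

Because the estimate is interpolated, its precision depends on bucket boundaries: a P99 that lands in a wide bucket such as 2.5 s to 5 s can be off by a large margin.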
2.4 Application‑Level Metrics
Custom Python code uses prometheus_client to expose business metrics such as token consumption, request latency, error counters, and concurrent request gauges. The example below shows the metric definitions and a decorator that records request details.
# Python metric definitions
import functools
import time

from prometheus_client import Counter, Histogram, Gauge, start_http_server

INPUT_TOKENS = Counter(
    "app_input_tokens_total",
    "Total input tokens consumed",
    ["model_name", "tenant_id", "api_key_hash"],
)
OUTPUT_TOKENS = Counter(
    "app_output_tokens_total",
    "Total output tokens generated",
    ["model_name", "tenant_id", "api_key_hash"],
)
REQUEST_LATENCY = Histogram(
    "app_request_duration_seconds",
    "End-to-end request duration from application perspective",
    ["model_name", "endpoint"],
    buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60],
)
REQUEST_ERRORS = Counter(
    "app_request_errors_total",
    "Total request errors",
    ["model_name", "error_type", "http_status"],
)
CONCURRENT_REQUESTS = Gauge(
    "app_concurrent_requests",
    "Current concurrent requests",
    ["model_name"],
)


def track_request(model_name, tenant_id, api_key_hash):
    """Decorator that records latency, token usage, errors, and concurrency."""
    def decorator(func):
        @functools.wraps(func)  # keep the wrapped handler's name and docstring
        async def wrapper(*args, **kwargs):
            CONCURRENT_REQUESTS.labels(model_name=model_name).inc()
            start = time.time()
            try:
                result = await func(*args, **kwargs)
                duration = time.time() - start
                REQUEST_LATENCY.labels(model_name=model_name, endpoint="/v1/chat/completions").observe(duration)
                # Token counts come from the "usage" field of the response dict
                usage = result.get("usage", {})
                INPUT_TOKENS.labels(model_name=model_name, tenant_id=tenant_id, api_key_hash=api_key_hash).inc(
                    usage.get("prompt_tokens", 0)
                )
                OUTPUT_TOKENS.labels(model_name=model_name, tenant_id=tenant_id, api_key_hash=api_key_hash).inc(
                    usage.get("completion_tokens", 0)
                )
                return result
            except Exception as e:
                # Bind the exception so its class name can be used as a label value
                REQUEST_ERRORS.labels(model_name=model_name, error_type=type(e).__name__, http_status="500").inc()
                raise
            finally:
                # Always decrement, whether the request succeeded or failed
                CONCURRENT_REQUESTS.labels(model_name=model_name).dec()
        return wrapper
    return decorator
2.5 Grafana Dashboards
The article provides JSON fragments for three dashboards (GPU, inference engine, business). When importing, wrap the panels arrays inside a full dashboard JSON object and POST to the Grafana API.
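As a rough sketch of the wrapping step, the snippet below places a panels array inside the request body that Grafana's POST /api/dashboards/db endpoint accepts: a top-level "dashboard" object plus an "overwrite" flag. The panel fragment, the dashboard title, and the helper name build_dashboard_payload are hypothetical stand-ins for the article's JSON.

```python
import json


def build_dashboard_payload(title, panels, overwrite=True):
    """Wrap a list of panel dicts in the body expected by POST /api/dashboards/db."""
    return {
        "dashboard": {
            "id": None,   # null id asks Grafana to create a new dashboard
            "uid": None,  # null uid lets Grafana generate one
            "title": title,
            "panels": panels,
        },
        "overwrite": overwrite,
    }


# Hypothetical single-panel fragment standing in for the article's JSON.
panels = [
    {
        "type": "timeseries",
        "title": "GPU utilization",
        "gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
        "targets": [{"expr": "DCGM_FI_DEV_GPU_UTIL"}],
    }
]

payload = build_dashboard_payload("GPU Overview", panels)
print(json.dumps(payload, indent=2))
```

Saving the printed JSON as dashboard.json gives a file in the shape the import request in this section expects.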
# Minimal Grafana API import
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_GRAFANA_API_KEY" \
  -d @dashboard.json \
  http://localhost:3000/api/dashboards/db
2.6 Alerting Rules (P0-P3)
Alert severity is split into four levels:
P0 – immediate phone call.
P1 – response within 15 minutes.
P2 – response within 1 hour.
P3 – handled next workday.
Key alerts include:
# The node and port labels used in the annotations are assumed to be attached by relabeling;
# the simplified scrape config in section 2.1 only yields job and instance.
# P0: Inference service down
- alert: InferenceServiceDown
  expr: up{job="vllm"} == 0
  for: 1m
  labels:
    severity: P0
    team: ai-infra
  annotations:
    summary: "Inference service unreachable on {{ $labels.node }}:{{ $labels.port }}"
    description: "vLLM instance {{ $labels.node }}:{{ $labels.port }} has not responded for over 1 minute."
# P1: TTFT P99 exceeds 5 s
- alert: HighTTFT
  expr: histogram_quantile(0.99, rate(vllm_time_to_first_token_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: P1
    team: ai-infra
  annotations:
    summary: "TTFT P99 > 5 s on {{ $labels.node }}:{{ $labels.port }}"
    description: "Current TTFT P99 = {{ $value }} s, user experience severely degraded."
# P2: GPU memory usage > 95 %
- alert: HighGPUMemoryUsage
  expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
  for: 10m
  labels:
    severity: P2
    team: ai-infra
  annotations:
    summary: "GPU memory > 95 % on {{ $labels.node }} GPU-{{ $labels.gpu }}"
    description: "Memory utilization at {{ $value }} %, OOM risk high."
3. Best Practices & Pitfalls
3.1 Metric Collection Tuning
GPU utilization / memory : scrape every 15 seconds – fast‑changing, need timely detection.
GPU temperature / power : scrape every 30 seconds – relatively slow variation.
vLLM inference metrics : scrape every 15 seconds – latency and QPS can spike quickly.
System metrics (CPU, memory) : scrape every 30 seconds – less volatile.
Business metrics : scrape every 30 seconds – business layer does not require sub‑second precision.
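The cost side of these intervals is simple arithmetic: halving the interval doubles the number of samples stored per series. The sketch below works that out for the intervals listed above; the helper name samples_per_day is an assumption, and the calculation deliberately ignores compression and storage overhead.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(scrape_interval_s, series=1):
    """Samples written per day for `series` time series at a fixed scrape interval."""
    return SECONDS_PER_DAY // scrape_interval_s * series


# Intervals taken from the list in this section (seconds).
intervals = {
    "GPU utilization / memory": 15,
    "GPU temperature / power": 30,
    "vLLM inference metrics": 15,
    "System metrics (CPU, memory)": 30,
    "Business metrics": 30,
}

for name, interval in intervals.items():
    print(f"{name}: {samples_per_day(interval)} samples per series per day")
```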
3.2 Label Cardinality Control
Avoid high‑cardinality labels such as request_id, user_id, or trace_id. Keep only stable dimensions like model_name (≤ 10 values) and, if necessary, aggregate tenant_id with recording rules.
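A quick way to judge whether a label set is safe is to multiply the number of distinct values each label can take, since every combination becomes its own time series. The sketch below does that for two of the application metrics from section 2.4. All value counts other than the model_name limit of 10 are made-up assumptions, and worst_case_series is a hypothetical helper.

```python
from math import prod


def worst_case_series(label_cardinalities, series_per_label_set=1):
    """Upper bound on time series: product of label value counts times series per set."""
    return prod(label_cardinalities.values()) * series_per_label_set


# Counter with three labels (tenant and key counts are assumed for illustration).
token_counter = worst_case_series(
    {"model_name": 10, "tenant_id": 50, "api_key_hash": 500}
)

# A histogram with 9 explicit buckets exposes 9 bucket series, one +Inf bucket,
# plus _sum and _count: 12 series per label combination.
latency_histogram = worst_case_series(
    {"model_name": 10, "endpoint": 3}, series_per_label_set=12
)

print(f"app_input_tokens_total: up to {token_counter} series")
print(f"app_request_duration_seconds: up to {latency_histogram} series")
```

The product is a worst case: real series counts are lower when label values do not occur in every combination, but the bound shows how quickly one extra high-cardinality label multiplies the total.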
3.3 Remote Storage Selection
For long‑term retention, the article compares several options. The production environment uses VictoriaMetrics single‑node (180‑day retention, ~12 GiB storage for four GPU servers).
3.4 Security Hardening
Prometheus 3.x supports native Basic Auth; Grafana is configured with RBAC (Viewer, Editor, Admin). Example snippets show how to generate password hashes and enable HTTPS‑only cookies.
3.5 High Availability
When servers span multiple data centers, a federation topology is recommended: each site runs its own Prometheus, and a central instance scrapes /federate endpoints.
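The /federate endpoint returns only the series selected by one or more match[] parameters. In the central Prometheus this is normally expressed as a scrape job with metrics_path set to /federate and match[] listed under params. The sketch below builds the equivalent URL in Python, which can be handy for checking by hand what a site would return. The host name prometheus-site-a and the chosen selectors are hypothetical.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


selectors = ['{job="vllm"}', '{job="dcgm-exporter"}']
url = federate_url("http://prometheus-site-a:9090", selectors)
print(url)

# Decoding the query string recovers the original selectors.
decoded = parse_qs(urlparse(url).query)["match[]"]
print(decoded)
```

Keeping the selector list narrow matters here, because the central instance re-ingests every matched series from every site.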
4. Troubleshooting & Common Issues
4.1 Prometheus TSDB Problems
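Check prometheus_tsdb_head_series for the active series count; a value above 5 million suggests label explosion. Use metric_relabel_configs to drop unnecessary labels or series. If the WAL is corrupted, stop Prometheus, delete /var/lib/prometheus/wal/*, and restart. This discards every sample that has not yet been compacted into a block, which can cover the last few hours rather than minutes, so confirm that the loss is acceptable first.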
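The series-count check can be scripted against the Prometheus HTTP API, whose instant-query endpoint /api/v1/query returns a JSON document with a status field and a result vector. The sketch below parses such a response and compares it with the 5 million threshold. The response body is a hand-written sample, not real output, and the helper name is an assumption. Fetching the body over HTTP is left out so the example runs offline.

```python
import json

HEAD_SERIES_LIMIT = 5_000_000

# Hand-written sample in the shape returned by
# GET /api/v1/query?query=prometheus_tsdb_head_series
SAMPLE_RESPONSE = """{
  "status": "success",
  "data": {
    "resultType": "vector",
    "result": [
      {
        "metric": {"__name__": "prometheus_tsdb_head_series", "job": "prometheus"},
        "value": [1700000000.0, "6200000"]
      }
    ]
  }
}"""


def head_series_from_response(body):
    """Extract the head series count from an instant-query response body."""
    data = json.loads(body)
    if data.get("status") != "success":
        raise ValueError(f"query failed: {data.get('error', 'unknown error')}")
    result = data["data"]["result"]
    if not result:
        raise ValueError("query returned no series")
    # Sample values are encoded as strings: [timestamp, "value"].
    return int(float(result[0]["value"][1]))


head_series = head_series_from_response(SAMPLE_RESPONSE)
is_exploding = head_series > HEAD_SERIES_LIMIT
print(f"head series = {head_series}, above limit: {is_exploding}")
```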
4.2 Missing Metrics in Grafana
Verify the PromQL query directly in the Prometheus UI. Ensure dashboard variables match the label values (e.g., $node, $gpu) and that the time range is appropriate.
4.3 DCGM Exporter Issues
Make sure the Docker container runs with --gpus all and that the NVIDIA driver version matches the exporter (Driver ≥ 560.28 for Exporter 3.6.0).
5. Conclusion
The presented monitoring system combines hardware health, inference performance, and business impact into a single observable stack. By following the layered metric design, using open‑source Prometheus/Grafana, and applying a disciplined alert‑severity model, teams can detect and resolve issues before they affect end users, keep operational costs under control, and maintain high service reliability for large‑model AI platforms.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
