
How to Monitor Large Model Applications: A Beginner‑Friendly Metric System

This guide walks you through building a production‑grade monitoring solution for large language model inference services using a three‑layer metric hierarchy, Prometheus, Grafana, DCGM Exporter, and custom Python metrics, with step‑by‑step deployment, alerting policies, and real‑world troubleshooting examples.

Raymond Ops

1. Overview

Deploying a large-model inference service is only the first step; the real challenges appear once traffic starts flowing. High GPU utilization alone does not guarantee a healthy service: a GPU can be fully busy while requests queue and latency climbs. This article proposes a layered monitoring system that lets operators detect problems before users notice them.

1.1 Monitoring Layers

Infrastructure layer : hardware health (GPU utilization, memory, temperature, power, ECC errors).

Inference engine layer : model‑service health (QPS, TTFT, E2E latency, KV‑Cache usage, queue depth).

Application layer : business‑level metrics (input/output token counts, error rate, request routing, cost estimation).

The three layers are independent but correlated, enabling fast root‑cause analysis.

2. Detailed Implementation

2.1 Architecture Design

The architecture consists of three scrape jobs (DCGM Exporter, Node Exporter, vLLM) feeding a single Prometheus instance, which stores data locally and optionally forwards it to a remote store (e.g., VictoriaMetrics). Grafana reads from Prometheus to display three dashboards (GPU, inference engine, business).

# Prometheus scrape configuration (simplified)
scrape_configs:
  - job_name: "dcgm-exporter"
    static_configs:
      - targets: ["gpu-node-01:9400", "gpu-node-02:9400"]
    scrape_interval: 15s
    scrape_timeout: 8s
  - job_name: "node-exporter"
    static_configs:
      - targets: ["gpu-node-01:9100", "gpu-node-02:9100"]
    scrape_interval: 15s
    scrape_timeout: 8s
  - job_name: "vllm"
    static_configs:
      - targets: ["gpu-node-01:8000", "gpu-node-02:8000"]
    scrape_interval: 15s
    scrape_timeout: 8s

2.2 Infrastructure Metrics

Infrastructure metrics are collected with NVIDIA's dcgm-exporter, run here from its official container image.

# Example: run DCGM Exporter as a container and publish metrics on port 9400
docker run -d \
  --name dcgm-exporter \
  --gpus all \
  --restart always \
  -p 9400:9400 \
  --cap-add SYS_ADMIN \
  nvcr.io/nvidia/k8s/dcgm-exporter:3.6.0-4.5.0-ubuntu22.04 \
  -f /etc/dcgm-exporter/dcp-metrics-included.csv

Important metrics include:

DCGM_FI_DEV_GPU_UTIL : GPU utilization (%).

DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE : used and free framebuffer (video) memory (MiB).

DCGM_FI_DEV_GPU_TEMP : GPU temperature (°C).

DCGM_FI_DEV_POWER_USAGE : power draw (W).

DCGM_FI_DEV_ECC_SBE_VOL_TOTAL : volatile count of single-bit ECC errors (the counter resets when the driver is reloaded).
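To make the memory metrics concrete, the sketch below parses a few lines in the Prometheus text exposition format and derives a per-GPU memory percentage from DCGM_FI_DEV_FB_USED and DCGM_FI_DEV_FB_FREE. The sample text is invented for illustration (it is not captured exporter output), the label set is reduced to gpu, and parse_samples and memory_percent are hypothetical helper names.

```python
import re

# Invented sample in Prometheus text exposition format. Real exporter
# output carries more labels; this reduced form is an assumption.
SAMPLE = """\
DCGM_FI_DEV_FB_USED{gpu="0"} 72000
DCGM_FI_DEV_FB_FREE{gpu="0"} 8000
DCGM_FI_DEV_FB_USED{gpu="1"} 20000
DCGM_FI_DEV_FB_FREE{gpu="1"} 60000
"""

# Matches only the simplified 'NAME{gpu="N"} VALUE' lines used above.
LINE_RE = re.compile(r'^(\w+)\{gpu="(\d+)"\}\s+([0-9.eE+-]+)$')


def parse_samples(text):
    """Return {(metric_name, gpu_index): value} for the simplified lines."""
    samples = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            name, gpu, value = match.groups()
            samples[(name, gpu)] = float(value)
    return samples


def memory_percent(samples, gpu):
    """Used framebuffer memory as a percentage of used + free."""
    used = samples[("DCGM_FI_DEV_FB_USED", gpu)]
    free = samples[("DCGM_FI_DEV_FB_FREE", gpu)]
    return used / (used + free) * 100


if __name__ == "__main__":
    parsed = parse_samples(SAMPLE)
    for gpu in ("0", "1"):
        print(f"GPU {gpu}: {memory_percent(parsed, gpu):.1f}% memory used")
```

In practice this ratio would normally be computed in PromQL rather than in application code; the Python form is only meant to show what the two gauges represent.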

2.3 Inference Engine Metrics (vLLM)

vLLM 0.7.x exposes a /metrics endpoint that already provides most of the metrics needed. Its metric names carry a vllm: prefix (with a colon, not an underscore).

# QPS: successfully completed requests per second (5-minute rate, summed over all series)
sum(rate(vllm:request_success_total[5m]))

# TTFT P99 (seconds)
histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m]))

# KV-Cache usage (percentage; the gauge itself ranges from 0 to 1)
vllm:gpu_cache_usage_perc * 100

# Queue depth (waiting requests)
vllm:num_requests_waiting
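PromQL's histogram_quantile never sees individual request latencies; it estimates a quantile by linear interpolation inside cumulative buckets, so the result is only as precise as the bucket boundaries. The following is a simplified Python re-implementation for intuition, not Prometheus's own code, and the bucket counts in the example are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    `buckets` is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with at least one finite bucket and a final +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    for index, (upper, count) in enumerate(buckets):
        if count >= rank:
            break
    if math.isinf(upper):
        # The quantile lies beyond the last finite bound; report that bound.
        return buckets[-2][0]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    below = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the bucket that contains the rank.
    return lower + (upper - lower) * (rank - below) / (count - below)


if __name__ == "__main__":
    # Hypothetical TTFT buckets: 50 requests <= 0.5 s, 90 <= 1 s, 100 <= 2 s.
    ttft = [(0.5, 50), (1.0, 90), (2.0, 100), (math.inf, 100)]
    print(round(histogram_quantile(0.99, ttft), 3))
```

One consequence worth noting: if most slow requests fall into a single wide bucket, the reported P99 can move very little even when real latency changes noticeably, so bucket boundaries should bracket the thresholds you alert on.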

Observed baselines (from production): TTFT P50 = 150‑250 ms, P95 = 400‑600 ms, P99 = 800‑1200 ms; TTFT > 3 s triggers user complaints, > 5 s is considered service‑unavailable.
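The two thresholds above can be expressed as a tiny classifier, which is sometimes handy in load-test scripts or synthetic probes. This is a minimal sketch; the band names "ok", "degraded", and "unavailable" and the function name classify_ttft are assumptions chosen for illustration.

```python
COMPLAINT_THRESHOLD_S = 3.0     # above this, users start to complain
UNAVAILABLE_THRESHOLD_S = 5.0   # above this, treat the service as unavailable


def classify_ttft(ttft_seconds):
    """Map one TTFT measurement (in seconds) to a coarse health band."""
    if ttft_seconds > UNAVAILABLE_THRESHOLD_S:
        return "unavailable"
    if ttft_seconds > COMPLAINT_THRESHOLD_S:
        return "degraded"
    return "ok"


if __name__ == "__main__":
    for sample in (0.2, 1.1, 3.5, 6.0):
        print(sample, classify_ttft(sample))
```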

2.4 Application‑Level Metrics

Custom Python code uses prometheus_client to expose business metrics such as token consumption, request latency, error counters, and concurrent request gauges. The example below shows the metric definitions and a decorator that records request details.

# Python metric definitions and a request-tracking decorator
import time
from functools import wraps

from prometheus_client import Counter, Histogram, Gauge, start_http_server

# Token counters, labelled by model, tenant, and hashed API key
INPUT_TOKENS = Counter(
    "app_input_tokens_total",
    "Total input tokens consumed",
    ["model_name", "tenant_id", "api_key_hash"]
)
OUTPUT_TOKENS = Counter(
    "app_output_tokens_total",
    "Total output tokens generated",
    ["model_name", "tenant_id", "api_key_hash"]
)
# Request duration as seen by the application (seconds)
REQUEST_LATENCY = Histogram(
    "app_request_duration_seconds",
    "End-to-end request duration from application perspective",
    ["model_name", "endpoint"],
    buckets=[0.1, 0.25, 0.5, 1, 2, 5, 10, 30, 60]
)
REQUEST_ERRORS = Counter(
    "app_request_errors_total",
    "Total request errors",
    ["model_name", "error_type", "http_status"]
)
# Requests currently in flight
CONCURRENT_REQUESTS = Gauge(
    "app_concurrent_requests",
    "Current concurrent requests",
    ["model_name"]
)

def track_request(model_name, tenant_id, api_key_hash):
    """Record latency, token usage, errors, and concurrency for an async handler."""
    def decorator(func):
        @wraps(func)  # keep the wrapped handler's name and docstring
        async def wrapper(*args, **kwargs):
            CONCURRENT_REQUESTS.labels(model_name=model_name).inc()
            start = time.perf_counter()
            try:
                result = await func(*args, **kwargs)
                duration = time.perf_counter() - start
                REQUEST_LATENCY.labels(model_name=model_name, endpoint="/v1/chat/completions").observe(duration)
                # Token counts come from the OpenAI-style "usage" field of the response
                usage = result.get("usage", {})
                INPUT_TOKENS.labels(model_name=model_name, tenant_id=tenant_id, api_key_hash=api_key_hash).inc(
                    usage.get("prompt_tokens", 0)
                )
                OUTPUT_TOKENS.labels(model_name=model_name, tenant_id=tenant_id, api_key_hash=api_key_hash).inc(
                    usage.get("completion_tokens", 0)
                )
                return result
            except Exception as e:
                # Count the failure by exception type, then re-raise for the caller
                REQUEST_ERRORS.labels(model_name=model_name, error_type=type(e).__name__, http_status="500").inc()
                raise
            finally:
                CONCURRENT_REQUESTS.labels(model_name=model_name).dec()
        return wrapper
    return decorator

2.5 Grafana Dashboards

The article provides JSON fragments for three dashboards (GPU, inference engine, business). When importing, place each panels array inside a full dashboard JSON object, wrap that object under a top-level "dashboard" key, and POST the result to the Grafana API.

# Minimal Grafana API import (dashboard.json holds the wrapped payload)
curl -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_GRAFANA_API_KEY" \
  -d @dashboard.json \
  http://localhost:3000/api/dashboards/db
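The wrapping step can be scripted. The sketch below builds the request body for POST /api/dashboards/db from a list of panel fragments; build_import_payload is a hypothetical helper name, and the single panel in the example is a made-up placeholder rather than one of the article's dashboards.

```python
import json


def build_import_payload(title, panels, overwrite=True):
    """Wrap panel dicts in the body expected by POST /api/dashboards/db."""
    return {
        "dashboard": {
            "id": None,    # null id tells Grafana to create a new dashboard
            "uid": None,   # null uid lets Grafana generate one
            "title": title,
            "panels": panels,
        },
        "overwrite": overwrite,
    }


if __name__ == "__main__":
    # Hypothetical single-panel fragment, for illustration only.
    panels = [{
        "type": "timeseries",
        "title": "GPU utilization",
        "targets": [{"expr": "DCGM_FI_DEV_GPU_UTIL"}],
    }]
    print(json.dumps(build_import_payload("GPU Overview", panels), indent=2))
```

Redirecting the script's output to dashboard.json produces a file that the curl command above can send as-is.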

2.6 Alerting Rules (P0‑P3)

Alert severity is split into four levels:

P0 – immediate phone call.

P1 – response within 15 minutes.

P2 – response within 1 hour.

P3 – handled next workday.
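As a rough illustration of how these levels translate into response targets, the sketch below computes a deadline per severity. The 09:00 workday start, the Monday-to-Friday working week, and the name response_deadline are assumptions made for the example; they are not part of the article's policy.

```python
from datetime import datetime, timedelta

# Response windows from the severity list; P0 is modelled as zero delay.
RESPONSE_WINDOWS = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}
WORKDAY_START_HOUR = 9  # assumption: the workday starts at 09:00


def response_deadline(severity, fired_at):
    """Return the time by which an alert of this severity should be picked up."""
    if severity in RESPONSE_WINDOWS:
        return fired_at + RESPONSE_WINDOWS[severity]
    if severity == "P3":
        day = fired_at + timedelta(days=1)
        while day.weekday() >= 5:  # skip Saturday (5) and Sunday (6)
            day += timedelta(days=1)
        return day.replace(hour=WORKDAY_START_HOUR, minute=0, second=0, microsecond=0)
    raise ValueError(f"unknown severity: {severity}")


if __name__ == "__main__":
    friday_night = datetime(2024, 5, 3, 22, 0)
    for level in ("P0", "P1", "P2", "P3"):
        print(level, response_deadline(level, friday_night))
```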

Key alerts include:

# P0 – Inference service down
- alert: InferenceServiceDown
  expr: up{job="vllm"} == 0
  for: 1m
  labels:
    severity: P0
    team: ai-infra
  annotations:
    summary: "Inference service unreachable on {{ $labels.instance }}"
    description: "vLLM instance {{ $labels.instance }} has not responded for over 1 minute."

# P1 – TTFT exceeds 5 s
- alert: HighTTFT
  expr: histogram_quantile(0.99, rate(vllm:time_to_first_token_seconds_bucket[5m])) > 5
  for: 5m
  labels:
    severity: P1
    team: ai-infra
  annotations:
    summary: "TTFT P99 > 5 s on {{ $labels.instance }}"
    description: "Current TTFT P99 = {{ $value }} s, user experience severely degraded."

# P2 – GPU memory usage > 95 %
- alert: HighGPUMemoryUsage
  expr: (DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE)) * 100 > 95
  for: 10m
  labels:
    severity: P2
    team: ai-infra
  annotations:
    summary: "GPU memory > 95 % on {{ $labels.instance }} GPU-{{ $labels.gpu }}"
    description: "Memory utilization at {{ $value }} %, OOM risk high."
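The for: clause in each rule means the expression must stay true across consecutive evaluations for the whole duration before the alert fires; a single dip below the threshold resets the timer. The sketch below is a simplified model of that behavior, not Prometheus's implementation, and the sample values are invented.

```python
def first_firing_time(samples, threshold, hold_seconds):
    """Return the timestamp at which the alert would fire, or None.

    `samples` is a list of (unix_timestamp, value) pairs in time order,
    one pair per rule evaluation.
    """
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp   # alert enters the pending state
            if timestamp - active_since >= hold_seconds:
                return timestamp           # pending long enough: firing
        else:
            active_since = None            # condition cleared: timer resets
    return None


if __name__ == "__main__":
    # Invented GPU memory percentages, evaluated once per minute.
    samples = (
        [(t, 96.0) for t in range(0, 301, 60)]        # 5 minutes above 95 %
        + [(360, 90.0)]                               # brief dip resets the timer
        + [(t, 96.0) for t in range(420, 1081, 60)]   # sustained pressure
    )
    print(first_firing_time(samples, threshold=95.0, hold_seconds=600))
```

This is why a longer for: value suppresses short spikes at the cost of slower detection, and why the P0 rule uses 1m while the P2 memory rule tolerates 10m.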

3. Best Practices & Pitfalls

3.1 Metric Collection Tuning

GPU utilization / memory : scrape every 15 seconds – fast‑changing, need timely detection.

GPU temperature / power : scrape every 30 seconds – relatively slow variation.

vLLM inference metrics : scrape every 15 seconds – latency and QPS can spike quickly.

System metrics (CPU, memory) : scrape every 30 seconds – less volatile.

Business metrics : scrape every 30 seconds – business layer does not require sub‑second precision.
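The cost side of these intervals is simple arithmetic: halving the interval doubles the number of samples Prometheus ingests and stores. The sketch below shows the calculation; the series count of 1,000 is a hypothetical figure used only to make the ratio visible.

```python
SECONDS_PER_DAY = 86_400


def samples_per_day(series_count, scrape_interval_seconds):
    """Samples ingested per day for series that share one scrape interval."""
    return series_count * SECONDS_PER_DAY // scrape_interval_seconds


if __name__ == "__main__":
    # Hypothetical group of 1,000 series at the two intervals used above.
    print(samples_per_day(1_000, 15))  # prints 5760000
    print(samples_per_day(1_000, 30))  # prints 2880000
```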

3.2 Label Cardinality Control

Avoid high‑cardinality labels such as request_id, user_id, or trace_id. Keep only stable dimensions like model_name (≤ 10 values) and, if necessary, aggregate tenant_id with recording rules.
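A quick way to judge a label set before shipping it is to multiply the number of distinct values each label can take, which gives the worst-case series count for one metric. The sketch below does that; the value counts are hypothetical, and worst_case_series is an illustrative helper name.

```python
from math import prod


def worst_case_series(label_value_counts):
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


if __name__ == "__main__":
    # Hypothetical value counts for illustration.
    stable = {"model_name": 10, "tenant_id": 200}
    with_key = {**stable, "api_key_hash": 5_000}
    print(worst_case_series(stable))    # prints 2000
    print(worst_case_series(with_key))  # prints 10000000
```

Real series counts are usually lower than this bound because not every combination occurs, but the bound shows how quickly one extra per-key or per-user label can dominate. For a histogram, each label combination is further multiplied by the number of buckets.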

3.3 Remote Storage Selection

For long‑term retention, the article compares several options. The production environment uses VictoriaMetrics single‑node (180‑day retention, ~12 GiB storage for four GPU servers).

3.4 Security Hardening

Prometheus 3.x supports native Basic Auth, configured in the file passed to --web.config.file with bcrypt-hashed passwords; Grafana is configured with RBAC (Viewer, Editor, Admin). The original article's snippets show how to generate password hashes and enable HTTPS-only cookies.

3.5 High Availability

When servers span multiple data centers, a federation topology is recommended: each site runs its own Prometheus, and a central instance scrapes /federate endpoints.
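The central instance selects which series to pull through match[] parameters on the /federate endpoint. The sketch below builds such a URL with the standard library; the host name prom-dc1 and the helper name federate_url are assumptions for illustration.

```python
from urllib.parse import urlencode


def federate_url(base_url, matchers):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", matcher) for matcher in matchers])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    # Hypothetical per-site Prometheus; pull only the three scrape jobs.
    url = federate_url(
        "http://prom-dc1:9090",
        ['{job="vllm"}', '{job="dcgm-exporter"}', '{job="node-exporter"}'],
    )
    print(url)
```

Federating pre-aggregated series (for example, recording-rule outputs) rather than every raw series keeps the central instance small; pulling everything through /federate tends to scale poorly.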

4. Troubleshooting & Common Issues

4.1 Prometheus TSDB Problems

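Check prometheus_tsdb_head_series for the active series count; a value above 5 million indicates label explosion. Use metric_relabel_configs to drop unnecessary labels. If the WAL is corrupted, stop Prometheus, delete /var/lib/prometheus/wal/*, and restart. Deleting the WAL discards every sample that has not yet been compacted into a block, which can cover up to the last couple of hours, so do this only when that loss is acceptable.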
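The series-count check can be automated against the Prometheus HTTP API. The sketch below only parses the JSON body returned by an instant query (/api/v1/query) for prometheus_tsdb_head_series and compares it with the 5 million threshold; fetching the body is left out, the example response is hand-written, and the function names are hypothetical.

```python
import json

SERIES_LIMIT = 5_000_000  # threshold quoted in the text


def head_series_from_response(body):
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("query failed")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("no samples returned")
    # Instant vectors encode each sample as [timestamp, "value-as-string"].
    return int(float(result[0]["value"][1]))


def label_explosion_suspected(body, limit=SERIES_LIMIT):
    """True when the head series count exceeds the limit."""
    return head_series_from_response(body) > limit


if __name__ == "__main__":
    # Hand-written example body, not captured from a real server.
    example = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{
                "metric": {"__name__": "prometheus_tsdb_head_series"},
                "value": [1700000000, "6200000"],
            }],
        },
    })
    print(label_explosion_suspected(example))
```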

4.2 Missing Metrics in Grafana

Verify the PromQL query directly in the Prometheus UI. Ensure dashboard variables match the label values (e.g., $node, $gpu) and that the time range is appropriate.

4.3 DCGM Exporter Issues

Make sure the Docker container runs with --gpus all and that the NVIDIA driver version matches the exporter (Driver ≥ 560.28 for Exporter 3.6.0).
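A version check like this is easy to get wrong with string comparison (for example, "560.9" sorts after "560.28" as text). The sketch below compares dotted versions numerically; the default minimum is the figure quoted above, and the helper names are illustrative.

```python
def parse_version(text):
    """Turn a dotted version such as '560.35.03' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def driver_meets_minimum(installed, minimum="560.28"):
    """True when the installed driver version is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    for version in ("550.54.15", "560.28.03", "560.35.03"):
        print(version, driver_meets_minimum(version))
```

The installed version can be read with nvidia-smi --query-gpu=driver_version --format=csv,noheader and passed to driver_meets_minimum.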

5. Conclusion

The presented monitoring system combines hardware health, inference performance, and business impact into a single observable stack. By following the layered metric design, using open‑source Prometheus/Grafana, and applying a disciplined alert‑severity model, teams can detect and resolve issues before they affect end users, keep operational costs under control, and maintain high service reliability for large‑model AI platforms.
