How to Deploy Scalable LLM Inference with vLLM on Kubernetes and GPU Scheduling

This guide explains how to deploy vLLM for large‑language‑model serving on Kubernetes, covering GPU resource management, tensor‑parallel configuration, continuous batching, quantization choices, autoscaling with HPA and KEDA, multi‑model routing, and best‑practice recommendations for performance, cost control, and high availability.


Overview

Large language model (LLM) inference in production faces four core challenges: GPU memory consumption, low‑latency high‑throughput requirements, elastic scaling, and multi‑model management.

Key Technical Challenges

GPU memory: a 7B FP16 model needs ~14 GB for weights alone, while a 70B model exceeds 140 GB. KV‑cache fragmentation can drop effective memory utilization below 60%.

Latency & throughput: P99 latency must stay under a few seconds while serving many concurrent requests.

Elastic scaling: GPU instances are expensive, so capacity should shrink during traffic valleys.

Multi‑model service: production often runs several model versions simultaneously.

vLLM Core Technologies

PagedAttention: applies OS‑style virtual‑memory paging to the KV‑cache, raising KV‑cache memory utilization from roughly 50‑60% to above 95% and increasing concurrent throughput 2‑4×.

Continuous Batching: releases finished requests immediately and admits new ones mid‑batch, reducing latency by 30‑50% compared with static batching.

Tensor Parallelism: splits attention heads and weight shards across GPUs, enabling models that do not fit on a single card.

Feature Comparison (selected)

vLLM – high throughput, low first‑token latency, OpenAI‑compatible API, native Kubernetes support.

TGI – moderate throughput, similar latency, native K8s support.

TensorRT‑LLM – highest raw performance, but no native K8s integration.

Ollama – low throughput, strong local‑dev focus.

llama.cpp – CPU‑oriented and lightweight; low throughput for high‑concurrency serving.

GPU Scheduling Basics

Kubernetes uses the NVIDIA Device Plugin to expose nvidia.com/gpu resources. The NVIDIA GPU Operator automates driver, container‑toolkit, and device‑plugin installation. MIG can partition a GPU into up to seven independent instances, and GPU time‑slice sharing allows multiple pods to share a card for development.
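
For development clusters, time‑slicing is configured through a device‑plugin ConfigMap that the GPU Operator's ClusterPolicy then references. The sketch below is illustrative only; the ConfigMap name, data key, and replica count are assumptions to adapt to your cluster.

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config      # hypothetical name
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
        - name: nvidia.com/gpu
          replicas: 4            # each physical GPU is advertised as 4 schedulable GPUs

Time‑sliced GPUs provide no memory or fault isolation between pods, so reserve this mode for development and testing.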

Environment Requirements

vLLM 0.6.x (or newer)

CUDA 12.1+ (driver ≥ 550)

Kubernetes 1.31+

NVIDIA GPU Operator 24.6+

GPU hardware: A100/H100/L40S/A10 (Ampere or newer)

Detailed Deployment Steps

1. GPU Node Preparation

Verify driver and CUDA versions with nvidia-smi and nvcc --version.

Deploy the NVIDIA GPU Operator via Helm (disable driver if already installed):

# Add Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install operator (driver disabled if present)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true \
  --version v24.6.2
# Wait for all pods to become ready
kubectl -n gpu-operator get pods -w

Confirm GPU resources are visible to the scheduler:

# Show node capacity
kubectl describe node <NODE_NAME> | grep -A5 "Capacity"
# Run a test pod (recent kubectl no longer accepts --limits on kubectl run,
# so the GPU request is passed via --overrides)
kubectl run gpu-test --restart=Never \
  --image=nvidia/cuda:12.4.0-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-test","image":"nvidia/cuda:12.4.0-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
# Check the output, then clean up
kubectl logs gpu-test && kubectl delete pod gpu-test

Optional MIG configuration (A100 example):

# Enable MIG mode on GPU 0 (may require a GPU reset)
sudo nvidia-smi -i 0 -mig 1
# List the available GPU instance profiles
sudo nvidia-smi mig -lgip
# Create seven instances from profile 19 (1g.5gb on a 40 GB A100) plus their compute instances
sudo nvidia-smi mig -i 0 -cgi 19,19,19,19,19,19,19 -C
# Adjust the MIG policy in the GPU Operator ConfigMap
kubectl -n gpu-operator edit configmap mig-parted-config
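
Once MIG instances exist and the device plugin advertises them, a pod can request an individual slice instead of a whole GPU. This sketch assumes the operator's MIG strategy exposes per‑profile resource names (the "mixed" strategy); the pod name and image are illustrative.

apiVersion: v1
kind: Pod
metadata:
  name: mig-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice rather than a full GPU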

2. vLLM Single‑Node Deployment

Pull the official image and start the container:

# Pull image
docker pull vllm/vllm-openai:v0.6.6
# Run container (example for Qwen2.5‑7B)
docker run -d \
  --name vllm-server \
  --gpus "device=0" \
  --shm-size=8g \
  -p 8000:8000 \
  -v /data/models:/models \
  vllm/vllm-openai:v0.6.6 \
  --model /models/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --enable-prefix-caching \
  --trust-remote-code

Important startup parameters:

--gpu-memory-utilization (default 0.9) – fraction of GPU memory reserved for model weights plus KV‑cache.

--max-model-len – maximum context length; smaller values free KV‑cache for more concurrent requests.

--max-num-seqs – maximum concurrent sequences; limited by available KV‑cache memory.

--tensor-parallel-size – number of GPUs used for tensor parallelism; must match the nvidia.com/gpu limit.

--enable-prefix-caching – reuses KV‑cache for identical system prompts, reducing first‑token latency.
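
Once the container is up, the OpenAI‑compatible endpoint can be smoke‑tested with curl; the model name matches --served-model-name above, and the prompt is arbitrary.

# List served models
curl -s http://localhost:8000/v1/models
# Send a small chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "qwen2.5-7b",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'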

3. Kubernetes Deployment

Create a dedicated namespace and a PodDisruptionBudget to keep at least one replica during updates:

apiVersion: v1
kind: Namespace
metadata:
  name: llm-serving
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: vllm-pdb
  namespace: llm-serving
spec:
  minAvailable: 1
  selector:
    matchLabels:
      app: vllm
      model: qwen2.5-7b

Store runtime parameters in a ConfigMap so they can be changed without rebuilding the image:

apiVersion: v1
kind: ConfigMap
metadata:
  name: vllm-config
  namespace: llm-serving
data:
  MODEL_PATH: "/models/Qwen2.5-7B-Instruct"
  SERVED_MODEL_NAME: "qwen2.5-7b"
  GPU_MEMORY_UTILIZATION: "0.90"
  MAX_MODEL_LEN: "8192"
  MAX_NUM_SEQS: "64"
  TENSOR_PARALLEL_SIZE: "1"

Deployment manifest (single‑GPU example):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen2.5-7b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model: qwen2.5-7b
  template:
    metadata:
      labels:
        app: vllm
        model: qwen2.5-7b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.6
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=$(MODEL_PATH)"
        - "--served-model-name=$(SERVED_MODEL_NAME)"
        - "--gpu-memory-utilization=$(GPU_MEMORY_UTILIZATION)"
        - "--max-model-len=$(MAX_MODEL_LEN)"
        - "--max-num-seqs=$(MAX_NUM_SEQS)"
        - "--tensor-parallel-size=$(TENSOR_PARALLEL_SIZE)"
        - "--enable-prefix-caching"
        - "--trust-remote-code"
        envFrom:
        - configMapRef:
            name: vllm-config
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            cpu: "4"
            memory: "16Gi"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
      terminationGracePeriodSeconds: 120

Key points:

Use nvidia.com/gpu limits to request a GPU.

Mount a /dev/shm volume (≥8 Gi) because vLLM's worker processes use shared memory for inter‑process communication, particularly with tensor parallelism.

Set a generous terminationGracePeriodSeconds (e.g., 120 s) to allow in‑flight requests to finish.

Expose the service:

apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen2.5-7b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: qwen2.5-7b
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
  type: ClusterIP

Optional Ingress (adjust timeouts for long generation):

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
  - host: llm-api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: vllm-qwen2.5-7b
            port:
              number: 8000

Horizontal Pod Autoscaler (GPU‑aware) scaling on GPU utilization; exposing DCGM_FI_DEV_GPU_UTIL as a per‑pod metric requires a custom‑metrics pipeline such as Prometheus Adapter backed by the DCGM Exporter:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen2.5-7b
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 180

KEDA can provide queue‑aware scaling using Prometheus metrics (e.g., waiting request count, P95 latency).
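
A minimal KEDA ScaledObject for queue‑aware scaling might look like the sketch below; the Prometheus address is an assumption for your cluster, and the vllm:num_requests_waiting metric name should be checked against what your vLLM version actually exposes.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-keda
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-qwen2.5-7b          # the Deployment created above
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300              # dampens scale-down churn
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090   # assumed Prometheus location
      query: sum(vllm:num_requests_waiting{model_name="qwen2.5-7b"})
      threshold: "20"              # target of ~20 waiting requests per replica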

Best‑Practice Recommendations

Memory optimization

Adjust --gpu-memory-utilization between 0.85‑0.92 depending on whether the GPU is dedicated.

Set --max-model-len to the longest context actually needed; shorter limits free KV‑cache for more concurrent requests.

If GPU memory is insufficient, switch to a quantized model (AWQ 4‑bit, GPTQ 4‑bit, or FP8 on Hopper); a serving sketch follows this list.

Provide a large /dev/shm volume (8‑16 Gi) for tensor‑parallel workloads.
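
A sketch of serving an AWQ‑quantized model with the same image used earlier, assuming pre‑quantized weights are already available under /data/models; the model directory name is illustrative.

docker run -d \
  --name vllm-awq \
  --gpus "device=0" \
  --shm-size=8g \
  -p 8001:8000 \
  -v /data/models:/models \
  vllm/vllm-openai:v0.6.6 \
  --model /models/Qwen2.5-7B-Instruct-AWQ \
  --served-model-name qwen2.5-7b-awq \
  --quantization awq \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192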

Throughput tuning

Increase --max-num-batched-tokens for mixed‑length workloads to improve batch efficiency.

Raise --max-num-seqs for short‑request, high‑concurrency scenarios.

Enable --enable-prefix-caching when many requests share the same system prompt.

Fine‑tune --num-scheduler-steps: use 1 for low latency, 10 for batch‑oriented workloads.

Cost control

Target GPU utilization between 70 % and 85 %; set Prometheus alerts on DCGM_FI_DEV_GPU_UTIL > 90 % for 5 min.

Combine on‑demand and Spot instances: keep a baseline of on‑demand pods and let Spot pods handle traffic spikes.

Configure HPA/KEDA cooldown periods to avoid rapid scaling churn.

Multi‑model serving

Deploy a separate Deployment per model version; use HTTP headers (e.g., x-model-id) or Gateway API routing to direct traffic (a routing sketch follows this list).

On A100/H100, consider MIG partitions to isolate models at the hardware level.

For development, GPU time‑slice sharing (no MIG) is acceptable.
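
A Gateway API HTTPRoute that routes on the x-model-id header might look like the following sketch; the Gateway name llm-gateway is an assumption, and each backend Service corresponds to one model Deployment.

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: llm-model-routing
  namespace: llm-serving
spec:
  parentRefs:
  - name: llm-gateway              # assumed pre-existing Gateway
  rules:
  - matches:
    - headers:
      - name: x-model-id
        value: qwen2.5-7b
    backendRefs:
    - name: vllm-qwen2.5-7b
      port: 8000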

High availability

Run at least two replicas with anti‑affinity to spread pods across nodes.

Configure readiness, liveness, and startup probes; give a long initialDelaySeconds (e.g., 300 s for 70B models) and a high failureThreshold for the startup probe (a probe sketch follows this list).

Set terminationGracePeriodSeconds to 120 s and add a preStop hook that drains traffic before the pod exits.

Persist model files on a PVC (ReadOnlyMany) so a restarted pod does not need to re‑download the model.
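
The probe and preStop settings described above could be added to the vLLM container roughly as follows; the delays and thresholds are starting points to tune per model size, and /health is the endpoint vLLM serves.

# Added under the vLLM container in the Deployment spec
startupProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60          # allow up to ~10 min for large models to load
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 30
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 30"]   # let the endpoint be removed before shutdown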

Troubleshooting Checklist

CUDA out of memory: lower --gpu-memory-utilization, reduce --max-model-len, or switch to a quantized model.

Model not found: verify the PVC mount path and ensure config.json and tokenizer.json exist.

Tensor parallel size mismatch: --tensor-parallel-size must equal the nvidia.com/gpu limit.

Health‑check timeout: increase startupProbe.failureThreshold and periodSeconds (up to 600 s for 70B models).

OOMKilled: raise resources.limits.memory (≥20 Gi for 7B, ≥80 Gi for 70B).

GPU Diagnostics

# Show driver version (run plain nvidia-smi to also see the CUDA version in the header)
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
# Show utilization and memory usage
nvidia-smi --query-gpu=index,name,utilization.gpu,memory.used,memory.total,temperature.gpu,power.draw --format=csv
# List processes using the GPU
nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

Monitoring Stack

vLLM exposes Prometheus metrics at /metrics. Deploy a ServiceMonitor to scrape them.
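
A ServiceMonitor for the vLLM Service could look like the following sketch; it assumes the Prometheus Operator is installed and selects this namespace, and that the Service carries an app: vllm label. targetPort is used because the Service above does not name its port.

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm-metrics
  namespace: llm-serving
spec:
  selector:
    matchLabels:
      app: vllm                  # assumes the Service is labeled app: vllm
  endpoints:
  - targetPort: 8000
    path: /metrics
    interval: 15s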

Deploy DCGM Exporter to collect GPU hardware metrics (utilization, temperature, power, ECC).

Define PrometheusRule alerts (a rule sketch follows this list) for:

KV‑cache usage > 90 % for 3 min.

Request queue length > 20 for 2 min.

GPU utilization > 90 % for 5 min.

GPU temperature > 85 °C for 5 min.
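
A PrometheusRule covering these four conditions might look like the sketch below; the vLLM and DCGM metric names should be verified against the exact versions deployed.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: llm-serving
spec:
  groups:
  - name: vllm-serving
    rules:
    - alert: VLLMKVCacheHigh
      expr: vllm:gpu_cache_usage_perc > 0.9      # KV-cache usage reported as 0-1
      for: 3m
      labels: {severity: warning}
    - alert: VLLMQueueBacklog
      expr: vllm:num_requests_waiting > 20
      for: 2m
      labels: {severity: warning}
    - alert: GPUUtilizationHigh
      expr: DCGM_FI_DEV_GPU_UTIL > 90
      for: 5m
      labels: {severity: warning}
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels: {severity: critical}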

Build Grafana dashboards that combine vLLM metrics (running requests, waiting requests, token throughput) with DCGM metrics.

Backup & Restore

Store model files on a PVC (ReadOnlyMany) and back up the PVC regularly (e.g., rsync to a secondary storage).

Validate config.json and tokenizer.json before marking a backup as successful.

If a pod fails, delete it or perform a rollout restart; the Deployment controller will recreate it and the /health endpoint can be used to verify readiness before sending traffic.
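
The restart‑and‑verify flow can be scripted with standard kubectl commands; the port-forward here is only a quick local check before routing traffic back.

# Restart the Deployment and wait for the rollout to complete
kubectl -n llm-serving rollout restart deployment/vllm-qwen2.5-7b
kubectl -n llm-serving rollout status deployment/vllm-qwen2.5-7b --timeout=15m
# Quick readiness check against the vLLM health endpoint
kubectl -n llm-serving port-forward svc/vllm-qwen2.5-7b 8000:8000 &
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/health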

Conclusion

Deploying vLLM on Kubernetes with the NVIDIA GPU Operator provides a production‑grade LLM inference platform that balances latency, throughput, and cost. By carefully tuning memory allocation (--gpu-memory-utilization, --max-model-len, --max-num-seqs), leveraging continuous batching and optional prefix caching, and integrating GPU‑aware autoscaling (HPA + KEDA), operators can achieve high GPU utilization while maintaining reliability through health probes, PodDisruptionBudgets, and robust monitoring.
