Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes
This article presents a detailed, step‑by‑step analysis of deploying the high‑performance vLLM inference engine on Kubernetes, covering GPU memory management, tensor parallelism, quantization choices, continuous batching, and automated scaling with HPA/KEDA to achieve low latency and high throughput for large language models.
Overview
Large‑model inference services face four core challenges:
GPU memory management : a 7B FP16 model needs ~14 GB, a 70B model >140 GB. KV‑Cache grows linearly with concurrency, and memory fragmentation can drop utilization below 60%.
High‑concurrency low‑latency : online chat requires sub‑second P99 latency. Static batching wastes resources when request lengths vary.
Elastic scaling : GPU instances cost $2/h (A100). Traffic spikes demand rapid scale‑down to avoid waste.
Multi‑model serving : production often runs several model versions side‑by‑side, needing graceful rollout and traffic splitting.
vLLM (UC Berkeley) solves these with three techniques:
PagedAttention : virtual‑memory paging splits KV‑Cache into fixed‑size blocks, raising GPU memory utilisation from ~50% to >95% and increasing concurrent throughput 2‑4×.
Continuous Batching : finished requests release resources immediately, cutting first‑token latency by 30‑50% versus static batches.
Tensor Parallelism : distributes a single model across multiple GPUs, enabling >70B models on a 4‑GPU node.
Engine Comparison (vLLM vs alternatives)
Throughput : vLLM (PagedAttention) – high; TGI – medium‑high; TensorRT‑LLM – highest (deep optimisation); Ollama – low; llama.cpp – low‑medium.
First‑token latency : vLLM – low; TGI – low; TensorRT‑LLM – lowest; Ollama – medium; llama.cpp – medium.
GPU utilisation : vLLM – 95%+; TGI – 80‑90%; TensorRT‑LLM – 95%+; Ollama – 60‑70%; llama.cpp – 50‑70%.
Ease of use : vLLM – OpenAI‑compatible API, native K8s support; TGI – native; TensorRT‑LLM – requires Triton wrapper; Ollama – highest; llama.cpp – high.
Multi‑GPU support : vLLM – tensor parallel; TGI – tensor parallel; TensorRT‑LLM – tensor + pipeline; Ollama – none; llama.cpp – partial.
For production, vLLM offers the best balance of performance and usability.
GPU Scheduling Basics in Kubernetes
Resources are declared with nvidia.com/gpu: 1.
The NVIDIA GPU Operator installs driver, container toolkit and device plugin automatically.
MIG (Multi‑Instance GPU) can split an A100/H100 into up to seven independent instances.
GPU time‑sharing (multiple Pods on the same GPU) is possible via slice scheduling, suitable for dev environments.
Deployment Walk‑through
1. Prepare a GPU node
# Verify hardware
lspci | grep -i nvidia
# Check driver (>=550 for CUDA 12.4)
nvidia-smi
# Check CUDA version
nvcc --version2. Install NVIDIA GPU Operator
# Add Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install operator (skip driver install if host already has it)
helm install gpu-operator nvidia/gpu-operator \
--namespace gpu-operator --create-namespace \
--set driver.enabled=false \
--set toolkit.enabled=true \
--set devicePlugin.enabled=true \
--set migManager.enabled=true \
--set dcgmExporter.enabled=true \
--version v24.6.2
# Wait for all components to become Ready
kubectl -n gpu-operator get pods -w3. Deploy vLLM as a container (single‑GPU example)
# Pull official image
docker pull vllm/vllm-openai:v0.6.6
# Run a 7B model (Qwen2.5‑7B‑Instruct) on GPU 0
docker run -d \
--name vllm-server \
--gpus 'device=0' \
--shm-size=8g \
-p 8000:8000 \
-v /data/models:/models \
vllm/vllm-openai:v0.6.6 \
--model /models/Qwen2.5-7B-Instruct \
--served-model-name qwen2.5-7b \
--tensor-parallel-size 1 \
--gpu-memory-utilisation 0.90 \
--max-model-len 8192 \
--max-num-seqs 64 \
--enable-prefix-caching \
--trust-remote-code4. Kubernetes Manifests
Key resources: Namespace, PodDisruptionBudget, ConfigMap (centralised launch arguments), Deployment, Service, Ingress, HorizontalPodAutoscaler and optional ScaledObject for KEDA.
# vllm-deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
name: vllm-qwen25-7b
namespace: llm-serving
spec:
replicas: 2
selector:
matchLabels:
app: vllm
model: qwen25-7b
template:
metadata:
labels:
app: vllm
model: qwen25-7b
spec:
containers:
- name: vllm
image: vllm/vllm-openai:v0.6.6
command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
args:
- "--model=$(MODEL_PATH)"
- "--served-model-name=$(SERVED_MODEL_NAME)"
- "--gpu-memory-utilisation=$(GPU_MEMORY_UTILISATION)"
- "--max-model-len=$(MAX_MODEL_LEN)"
- "--max-num-seqs=$(MAX_NUM_SEQS)"
- "--tensor-parallel-size=$(TENSOR_PARALLEL_SIZE)"
- "--enable-prefix-caching"
- "--trust-remote-code"
envFrom:
- configMapRef:
name: vllm-config
resources:
limits:
nvidia.com/gpu: "1"
memory: "32Gi"
requests:
cpu: "4"
memory: "16Gi"
ports:
- containerPort: 8000
volumeMounts:
- name: model-storage
mountPath: /models
readOnly: true
- name: shm
mountPath: /dev/shm
volumes:
- name: model-storage
persistentVolumeClaim:
claimName: model-pvc
- name: shm
emptyDir:
medium: Memory
sizeLimit: 8Gi
terminationGracePeriodSeconds: 120 # Service exposing port 8000
apiVersion: v1
kind: Service
metadata:
name: vllm-qwen25-7b
namespace: llm-serving
spec:
selector:
app: vllm
model: qwen25-7b
ports:
- name: http
port: 8000
targetPort: 8000
protocol: TCP
type: ClusterIP # Ingress with generous timeouts for long‑running generation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: vllm-ingress
namespace: llm-serving
annotations:
nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
ingressClassName: nginx
rules:
- host: llm-api.example.com
http:
paths:
- path: /v1
pathType: Prefix
backend:
service:
name: vllm-qwen25-7b
port:
number: 8000 # HorizontalPodAutoscaler based on GPU utilisation (requires DCGM Exporter + Prometheus Adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: vllm-qwen25-7b-hpa
namespace: llm-serving
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: vllm-qwen25-7b
minReplicas: 1
maxReplicas: 8
metrics:
- type: Pods
pods:
metric:
name: DCGM_FI_DEV_GPU_UTIL
target:
type: AverageValue
averageValue: "70"
behavior:
scaleUp:
stabilizationWindowSeconds: 30
policies:
- type: Pods
value: 2
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Pods
value: 1
periodSeconds: 180 # Optional KEDA ScaledObject for event‑driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: vllm-qwen25-7b-scaler
namespace: llm-serving
spec:
scaleTargetRef:
name: vllm-qwen25-7b
minReplicaCount: 1
maxReplicaCount: 8
cooldownPeriod: 300
pollingInterval: 15
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: vllm_waiting_requests
query: |
avg(vllm:num_requests_waiting{model_name="qwen2.5-7b"})
threshold: "10"
activationThreshold: "3"
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring:9090
metricName: gpu_memory_util
query: |
avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE) by (pod)
threshold: "0.85"Key vLLM Launch Parameters
--gpu-memory-utilisation <float>(default 0.9) – fraction of GPU memory reserved for model weights + KV‑Cache. Lower for multi‑model sharing, raise for single‑model exclusive use. --max-model-len <int> – maximum context length. Reducing this frees KV‑Cache and increases concurrent capacity. --max-num-seqs <int> – maximum concurrent requests (subject to KV‑Cache size). --tensor-parallel-size <int> – number of GPUs used for tensor parallelism; must equal nvidia.com/gpu limit. --quantization <awq|gptq|fp8|none> – choose a quantisation scheme to shrink memory. AWQ 4‑bit is a good default for most GPUs; FP8 is optimal on Hopper (H100) or Ada (L40S). --enable-prefix-caching – reuse KV‑Cache for identical system prompts; useful for RAG or chat bots. --enable-lora and --max-loras – load LoRA adapters at runtime for multi‑tenant fine‑tuned models. --num-scheduler-steps <int> – 1 for low‑latency interactive traffic, 10+ for batch‑oriented workloads.
Memory Optimisation Checklist
Start with --gpu-memory-utilisation 0.90. If CUDA out of memory occurs, lower to 0.85 and monitor nvidia-smi.
Set --max-model-len to the longest request you actually expect (e.g., 8192 for most chat use‑cases). Over‑provisioning wastes KV‑Cache.
Adjust --max-num-seqs to match the KV‑Cache budget after the two steps above.
When GPU memory is still insufficient, switch to a 4‑bit quantisation (AWQ or GPTQ) or, on H100/L40S, use FP8.
Ensure the container has enough shared memory: mount /dev/shm with at least 8 GiB (e.g., emptyDir: medium: Memory, sizeLimit: 8Gi).
Allocate sufficient pod memory limits (7B FP16 ≈ 20 Gi, 70B ≈ 80 Gi) to avoid OOM Killer.
Throughput Optimisation
--max-num-batched-tokenscontrols the maximum tokens processed per iteration. Larger values improve batch efficiency for long‑text workloads. --max-num-seqs should be increased for short‑text high‑concurrency scenarios (e.g., 512) and decreased for long‑text workloads.
Enable --enable-prefix-caching when many requests share the same system prompt – reduces first‑token latency by 30‑60%.
Fine‑tune --num-scheduler-steps: use 1 for low latency, 10+ for batch processing.
Cost Control Strategies
Target GPU utilisation 70‑85 %. Below 70 % indicates over‑provisioning; above 85 % may cause request queuing.
Run a minimum of one On‑Demand replica to guarantee baseline capacity, and add Spot/Preemptible replicas for burst traffic.
Configure HPA/KEDA cooldown windows (e.g., scale‑down stabilization 5 min, max 1 pod per 2 min) to avoid thrashing.
High‑Availability Design
Deploy at least two replicas with podAntiAffinity so they land on different nodes.
Use a PodDisruptionBudget with minAvailable: 1 to keep one pod alive during node maintenance.
Health checks:
# Liveness / readiness (port 8000)
httpGet:
path: /health
port: 8000
initialDelaySeconds: 300 # 5‑8 min for 70B model load
periodSeconds: 30
failureThreshold: 3Graceful shutdown: set terminationGracePeriodSeconds: 120 and a preStop hook that removes the pod from the Service before exiting.
Common Errors & Fixes
CUDA out of memory : lower --gpu-memory-utilisation, reduce --max-model-len, or switch to 4‑bit quantisation.
Model not found / tokenizer error : verify the PVC mount path, ensure config.json exists, and check file permissions.
Tensor parallel size does not match : make sure --tensor-parallel-size equals the nvidia.com/gpu limit in the pod spec.
Health check timeout : increase
startupProbe failureThreshold × periodSeconds(e.g., 20 × 30 s ≈ 10 min for a 70B model).
torch.cuda.CudaError: invalid device ordinal : restart nvidia-device-plugin and verify nvidia-smi reports continuous GPU IDs.
KV cache too small : decrease --max-model-len or increase --gpu-memory-utilisation.
Connection refused on port 8000 : wait for the readiness probe to pass; model loading can take several minutes.
Monitoring Stack
vLLM exposes Prometheus metrics at /metrics. Recommended stack:
Prometheus scrapes /metrics via a ServiceMonitor.
DCGM Exporter provides GPU hardware metrics (utilisation, temperature, memory usage).
Grafana dashboards visualise:
GPU utilisation (target 70‑85 %).
GPU memory usage (avoid >95 %).
KV‑Cache utilisation (warn >90 %).
Number of waiting requests (alert >20).
TTFT, P99 latency, tokens‑per‑second.
# Example PrometheusRule for vLLM alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: vllm-alerts
namespace: llm-serving
spec:
groups:
- name: vllm.rules
rules:
- alert: VLLMKVCacheHigh
expr: vllm:gpu_cache_usage_perc > 0.9
for: 3m
labels:
severity: warning
annotations:
summary: "vLLM KV Cache usage > 90%"
description: "Pod {{ $labels.pod }} KV Cache at {{ $value | humanizePercentage }}"
- alert: VLLMRequestQueueHigh
expr: vllm:num_requests_waiting > 20
for: 2m
labels:
severity: critical
annotations:
summary: "vLLM request queue > 20"
description: "Consider scaling up the deployment"
- alert: GPUTemperatureHigh
expr: DCGM_FI_DEV_GPU_TEMP > 85
for: 5m
labels:
severity: warning
annotations:
summary: "GPU temperature high"
description: "GPU {{ $labels.gpu }} at {{ $value }}°C"Backup & Restore
Store model files on a PersistentVolume (NFS, Lustre, or local SSD) and mount it read‑only in the vLLM pods. This avoids re‑downloading large models after a pod restart.
Keep a timestamped backup copy (hard‑link cp -al) before updating the model, enabling quick rollback.
Recovery steps: delete failing pods, let the Deployment recreate them, verify /health returns 200, and test the OpenAI‑compatible endpoint.
Conclusion
Combining vLLM with Kubernetes, the NVIDIA GPU Operator, and a Prometheus‑based observability stack delivers a production‑ready LLM inference platform. By tuning memory utilisation, KV‑Cache size, and scaling policies, you can achieve low‑latency, high‑throughput serving while keeping GPU costs under control and maintaining high availability.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Raymond Ops
Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
