Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes

This article presents a detailed, step‑by‑step analysis of deploying the high‑performance vLLM inference engine on Kubernetes, covering GPU memory management, tensor parallelism, quantization choices, continuous batching, and automated scaling with HPA/KEDA to achieve low latency and high throughput for large language models.

Raymond Ops
Raymond Ops
Raymond Ops
Elastic Deployment and GPU Scheduling for Large‑Model Inference with vLLM on Kubernetes

Overview

Large‑model inference services face four core challenges:

GPU memory management : a 7B FP16 model needs ~14 GB, a 70B model >140 GB. KV‑Cache grows linearly with concurrency, and memory fragmentation can drop utilization below 60%.

High‑concurrency low‑latency : online chat requires sub‑second P99 latency. Static batching wastes resources when request lengths vary.

Elastic scaling : GPU instances cost $2/h (A100). Traffic spikes demand rapid scale‑down to avoid waste.

Multi‑model serving : production often runs several model versions side‑by‑side, needing graceful rollout and traffic splitting.

vLLM (UC Berkeley) solves these with three techniques:

PagedAttention : virtual‑memory paging splits KV‑Cache into fixed‑size blocks, raising GPU memory utilisation from ~50% to >95% and increasing concurrent throughput 2‑4×.

Continuous Batching : finished requests release resources immediately, cutting first‑token latency by 30‑50% versus static batches.

Tensor Parallelism : distributes a single model across multiple GPUs, enabling >70B models on a 4‑GPU node.

Engine Comparison (vLLM vs alternatives)

Throughput : vLLM (PagedAttention) – high; TGI – medium‑high; TensorRT‑LLM – highest (deep optimisation); Ollama – low; llama.cpp – low‑medium.

First‑token latency : vLLM – low; TGI – low; TensorRT‑LLM – lowest; Ollama – medium; llama.cpp – medium.

GPU utilisation : vLLM – 95%+; TGI – 80‑90%; TensorRT‑LLM – 95%+; Ollama – 60‑70%; llama.cpp – 50‑70%.

Ease of use : vLLM – OpenAI‑compatible API, native K8s support; TGI – native; TensorRT‑LLM – requires Triton wrapper; Ollama – highest; llama.cpp – high.

Multi‑GPU support : vLLM – tensor parallel; TGI – tensor parallel; TensorRT‑LLM – tensor + pipeline; Ollama – none; llama.cpp – partial.

For production, vLLM offers the best balance of performance and usability.

GPU Scheduling Basics in Kubernetes

Resources are declared with nvidia.com/gpu: 1.

The NVIDIA GPU Operator installs driver, container toolkit and device plugin automatically.

MIG (Multi‑Instance GPU) can split an A100/H100 into up to seven independent instances.

GPU time‑sharing (multiple Pods on the same GPU) is possible via slice scheduling, suitable for dev environments.

Deployment Walk‑through

1. Prepare a GPU node

# Verify hardware
lspci | grep -i nvidia
# Check driver (>=550 for CUDA 12.4)
nvidia-smi
# Check CUDA version
nvcc --version

2. Install NVIDIA GPU Operator

# Add Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
# Install operator (skip driver install if host already has it)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set migManager.enabled=true \
  --set dcgmExporter.enabled=true \
  --version v24.6.2
# Wait for all components to become Ready
kubectl -n gpu-operator get pods -w

3. Deploy vLLM as a container (single‑GPU example)

# Pull official image
docker pull vllm/vllm-openai:v0.6.6
# Run a 7B model (Qwen2.5‑7B‑Instruct) on GPU 0
docker run -d \
  --name vllm-server \
  --gpus 'device=0' \
  --shm-size=8g \
  -p 8000:8000 \
  -v /data/models:/models \
  vllm/vllm-openai:v0.6.6 \
  --model /models/Qwen2.5-7B-Instruct \
  --served-model-name qwen2.5-7b \
  --tensor-parallel-size 1 \
  --gpu-memory-utilisation 0.90 \
  --max-model-len 8192 \
  --max-num-seqs 64 \
  --enable-prefix-caching \
  --trust-remote-code

4. Kubernetes Manifests

Key resources: Namespace, PodDisruptionBudget, ConfigMap (centralised launch arguments), Deployment, Service, Ingress, HorizontalPodAutoscaler and optional ScaledObject for KEDA.

# vllm-deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen25-7b
  namespace: llm-serving
spec:
  replicas: 2
  selector:
    matchLabels:
      app: vllm
      model: qwen25-7b
  template:
    metadata:
      labels:
        app: vllm
        model: qwen25-7b
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.6.6
        command: ["python3", "-m", "vllm.entrypoints.openai.api_server"]
        args:
        - "--model=$(MODEL_PATH)"
        - "--served-model-name=$(SERVED_MODEL_NAME)"
        - "--gpu-memory-utilisation=$(GPU_MEMORY_UTILISATION)"
        - "--max-model-len=$(MAX_MODEL_LEN)"
        - "--max-num-seqs=$(MAX_NUM_SEQS)"
        - "--tensor-parallel-size=$(TENSOR_PARALLEL_SIZE)"
        - "--enable-prefix-caching"
        - "--trust-remote-code"
        envFrom:
        - configMapRef:
            name: vllm-config
        resources:
          limits:
            nvidia.com/gpu: "1"
            memory: "32Gi"
          requests:
            cpu: "4"
            memory: "16Gi"
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: model-storage
          mountPath: /models
          readOnly: true
        - name: shm
          mountPath: /dev/shm
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
      - name: shm
        emptyDir:
          medium: Memory
          sizeLimit: 8Gi
      terminationGracePeriodSeconds: 120
# Service exposing port 8000
apiVersion: v1
kind: Service
metadata:
  name: vllm-qwen25-7b
  namespace: llm-serving
spec:
  selector:
    app: vllm
    model: qwen25-7b
  ports:
  - name: http
    port: 8000
    targetPort: 8000
    protocol: TCP
  type: ClusterIP
# Ingress with generous timeouts for long‑running generation
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-ingress
  namespace: llm-serving
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "300"
    nginx.ingress.kubernetes.io/proxy-buffering: "off"
spec:
  ingressClassName: nginx
  rules:
  - host: llm-api.example.com
    http:
      paths:
      - path: /v1
        pathType: Prefix
        backend:
          service:
            name: vllm-qwen25-7b
            port:
              number: 8000
# HorizontalPodAutoscaler based on GPU utilisation (requires DCGM Exporter + Prometheus Adapter)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vllm-qwen25-7b-hpa
  namespace: llm-serving
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vllm-qwen25-7b
  minReplicas: 1
  maxReplicas: 8
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_GPU_UTIL
      target:
        type: AverageValue
        averageValue: "70"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 1
        periodSeconds: 180
# Optional KEDA ScaledObject for event‑driven scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-qwen25-7b-scaler
  namespace: llm-serving
spec:
  scaleTargetRef:
    name: vllm-qwen25-7b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300
  pollingInterval: 15
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: vllm_waiting_requests
      query: |
        avg(vllm:num_requests_waiting{model_name="qwen2.5-7b"})
      threshold: "10"
      activationThreshold: "3"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring:9090
      metricName: gpu_memory_util
      query: |
        avg(DCGM_FI_DEV_FB_USED / DCGM_FI_DEV_FB_FREE) by (pod)
      threshold: "0.85"

Key vLLM Launch Parameters

--gpu-memory-utilisation <float>

(default 0.9) – fraction of GPU memory reserved for model weights + KV‑Cache. Lower for multi‑model sharing, raise for single‑model exclusive use. --max-model-len <int> – maximum context length. Reducing this frees KV‑Cache and increases concurrent capacity. --max-num-seqs <int> – maximum concurrent requests (subject to KV‑Cache size). --tensor-parallel-size <int> – number of GPUs used for tensor parallelism; must equal nvidia.com/gpu limit. --quantization <awq|gptq|fp8|none> – choose a quantisation scheme to shrink memory. AWQ 4‑bit is a good default for most GPUs; FP8 is optimal on Hopper (H100) or Ada (L40S). --enable-prefix-caching – reuse KV‑Cache for identical system prompts; useful for RAG or chat bots. --enable-lora and --max-loras – load LoRA adapters at runtime for multi‑tenant fine‑tuned models. --num-scheduler-steps <int> – 1 for low‑latency interactive traffic, 10+ for batch‑oriented workloads.

Memory Optimisation Checklist

Start with --gpu-memory-utilisation 0.90. If CUDA out of memory occurs, lower to 0.85 and monitor nvidia-smi.

Set --max-model-len to the longest request you actually expect (e.g., 8192 for most chat use‑cases). Over‑provisioning wastes KV‑Cache.

Adjust --max-num-seqs to match the KV‑Cache budget after the two steps above.

When GPU memory is still insufficient, switch to a 4‑bit quantisation (AWQ or GPTQ) or, on H100/L40S, use FP8.

Ensure the container has enough shared memory: mount /dev/shm with at least 8 GiB (e.g., emptyDir: medium: Memory, sizeLimit: 8Gi).

Allocate sufficient pod memory limits (7B FP16 ≈ 20 Gi, 70B ≈ 80 Gi) to avoid OOM Killer.

Throughput Optimisation

--max-num-batched-tokens

controls the maximum tokens processed per iteration. Larger values improve batch efficiency for long‑text workloads. --max-num-seqs should be increased for short‑text high‑concurrency scenarios (e.g., 512) and decreased for long‑text workloads.

Enable --enable-prefix-caching when many requests share the same system prompt – reduces first‑token latency by 30‑60%.

Fine‑tune --num-scheduler-steps: use 1 for low latency, 10+ for batch processing.

Cost Control Strategies

Target GPU utilisation 70‑85 %. Below 70 % indicates over‑provisioning; above 85 % may cause request queuing.

Run a minimum of one On‑Demand replica to guarantee baseline capacity, and add Spot/Preemptible replicas for burst traffic.

Configure HPA/KEDA cooldown windows (e.g., scale‑down stabilization 5 min, max 1 pod per 2 min) to avoid thrashing.

High‑Availability Design

Deploy at least two replicas with podAntiAffinity so they land on different nodes.

Use a PodDisruptionBudget with minAvailable: 1 to keep one pod alive during node maintenance.

Health checks:

# Liveness / readiness (port 8000)
httpGet:
  path: /health
  port: 8000
initialDelaySeconds: 300   # 5‑8 min for 70B model load
periodSeconds: 30
failureThreshold: 3

Graceful shutdown: set terminationGracePeriodSeconds: 120 and a preStop hook that removes the pod from the Service before exiting.

Common Errors & Fixes

CUDA out of memory : lower --gpu-memory-utilisation, reduce --max-model-len, or switch to 4‑bit quantisation.

Model not found / tokenizer error : verify the PVC mount path, ensure config.json exists, and check file permissions.

Tensor parallel size does not match : make sure --tensor-parallel-size equals the nvidia.com/gpu limit in the pod spec.

Health check timeout : increase

startupProbe
failureThreshold × periodSeconds

(e.g., 20 × 30 s ≈ 10 min for a 70B model).

torch.cuda.CudaError: invalid device ordinal : restart nvidia-device-plugin and verify nvidia-smi reports continuous GPU IDs.

KV cache too small : decrease --max-model-len or increase --gpu-memory-utilisation.

Connection refused on port 8000 : wait for the readiness probe to pass; model loading can take several minutes.

Monitoring Stack

vLLM exposes Prometheus metrics at /metrics. Recommended stack:

Prometheus scrapes /metrics via a ServiceMonitor.

DCGM Exporter provides GPU hardware metrics (utilisation, temperature, memory usage).

Grafana dashboards visualise:

GPU utilisation (target 70‑85 %).

GPU memory usage (avoid >95 %).

KV‑Cache utilisation (warn >90 %).

Number of waiting requests (alert >20).

TTFT, P99 latency, tokens‑per‑second.

# Example PrometheusRule for vLLM alerts
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
  namespace: llm-serving
spec:
  groups:
  - name: vllm.rules
    rules:
    - alert: VLLMKVCacheHigh
      expr: vllm:gpu_cache_usage_perc > 0.9
      for: 3m
      labels:
        severity: warning
      annotations:
        summary: "vLLM KV Cache usage > 90%"
        description: "Pod {{ $labels.pod }} KV Cache at {{ $value | humanizePercentage }}"
    - alert: VLLMRequestQueueHigh
      expr: vllm:num_requests_waiting > 20
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "vLLM request queue > 20"
        description: "Consider scaling up the deployment"
    - alert: GPUTemperatureHigh
      expr: DCGM_FI_DEV_GPU_TEMP > 85
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "GPU temperature high"
        description: "GPU {{ $labels.gpu }} at {{ $value }}°C"

Backup & Restore

Store model files on a PersistentVolume (NFS, Lustre, or local SSD) and mount it read‑only in the vLLM pods. This avoids re‑downloading large models after a pod restart.

Keep a timestamped backup copy (hard‑link cp -al) before updating the model, enabling quick rollback.

Recovery steps: delete failing pods, let the Deployment recreate them, verify /health returns 200, and test the OpenAI‑compatible endpoint.

Conclusion

Combining vLLM with Kubernetes, the NVIDIA GPU Operator, and a Prometheus‑based observability stack delivers a production‑ready LLM inference platform. By tuning memory utilisation, KV‑Cache size, and scaling policies, you can achieve low‑latency, high‑throughput serving while keeping GPU costs under control and maintaining high availability.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

DockerQuantizationKubernetesvLLMTensor ParallelismPrometheusGPU schedulingLLM inference
Raymond Ops
Written by

Raymond Ops

Linux ops automation, cloud-native, Kubernetes, SRE, DevOps, Python, Golang and related tech discussions.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.