How to Build a Resilient GPU Inference Autoscaling System on Kubernetes
This article explains why scaling GPU inference services on Kubernetes is challenging and presents a multi‑layer control architecture, metric upgrades, and production‑ready implementations using HPA, KEDA, KServe, and Karpenter to achieve stable, cost‑effective autoscaling.
Why GPU Inference Autoscaling Is Hard
Running inference on GPUs in Kubernetes requires balancing latency, throughput, cost, and stability. GPU resources are discrete, expensive, and have long cold‑start times, so naive Horizontal Pod Autoscaler (HPA) configurations are insufficient.
GPU is a discrete, strongly‑constrained resource; you either get a whole card or none.
Model loading and container start‑up are slow (seconds to minutes).
GPU utilization alone can mislead; queue length, first‑token latency, P95/P99 latency, KV‑cache usage, and active request count drive user experience.
Five‑Layer Elastic Control System
Service layer: Pod horizontal scaling.
Resource layer: GPU node supply and reclamation.
Scheduling layer: GPU model selection, topology, MIG/vGPU slicing.
Application layer: dynamic batching, concurrency control, model warm‑up, queue protection.
Observability layer: upgrade from raw GPU metrics to business‑level SLO metrics.
Production Goals
Performance: keep P95 latency within limits, avoid P99 latency avalanches, stabilize time to first token (TTFT) for large models, and keep queue wait time under business thresholds.
Cost: maximize average GPU utilization, recycle nodes during low‑load periods, use Spot GPUs for non‑critical traffic, and increase density with MIG/vGPU for small models.
Stability: new Pods must finish model warm‑up before becoming ready, scale‑down must never interrupt ongoing inference sessions, and node failures must not make an entire model unavailable.
Engineering governance: standardize and reuse scaling logic, make metrics auditable and thresholds explainable, decouple service scaling from node scaling, and support multiple inference frameworks (Triton, vLLM, TensorRT‑LLM, custom FastAPI/gRPC).
Recommended Architecture: Four‑Layer Elastic Closed‑Loop
Application Layer
API Gateway / Ingress.
Inference service (vLLM, Triton, custom server).
Model management (weight download, version routing, canary).
Application metric exporter (queue length, TTFT, tokens/s, batch size, KV‑cache usage).
Metrics Layer
Prometheus collects two categories of metrics:
Basic resource metrics: GPU utilization, memory, temperature, power, PCIe/NVLink.
Business metrics: queue length, concurrent requests, TTFT, generation throughput, error rate.
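For the business metrics, a ServiceMonitor is the usual way to get the application exporter scraped by the Prometheus Operator. A minimal sketch, assuming the inference Service carries the label app: llm-inference and exposes /metrics on a port named http (both assumptions):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: llm-inference
  namespace: inference
spec:
  selector:
    matchLabels:
      app: llm-inference        # assumption: label on the inference Service
  endpoints:
    - port: http                # assumption: Service port name that serves /metrics
      path: /metrics
      interval: 15s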
Control Layer
HPA for steady workloads using standard/custom metrics.
KEDA for bursty, queue‑driven traffic (supports scale‑to‑zero).
Custom controllers for complex policies (multi‑metric, per‑model strategies, pre‑scaling).
Resource Layer
Karpenter – fast, flexible GPU node provisioning.
Cluster Autoscaler – traditional node scaling.
NVIDIA GPU Operator – driver, toolkit, device plugin, DCGM exporter management.
Metric System Upgrade: From Resource to Service Indicators
GPU‑only metrics (e.g., utilization) are useful for capacity planning but not reliable for autoscaling. Production systems should prioritize business‑level signals and use resource metrics as guard‑rails.
Queue Metrics:
request_queue_length, waiting_requests – direct backlog signal
Latency Metrics:
p95_latency_ms, ttft_ms – user‑experience mapping
Concurrency Metrics:
active_requests, inflight_requests – instance load
LLM‑Specific Metrics:
kv_cache_usage_ratio, prefill_tokens, decode_tokens_per_second – model pressure
Resource Guard Metrics:
gpu_utilization, gpu_memory_ratio – hardware safety net
Queue length decides "whether to scale", latency decides "how much to scale", and GPU metrics decide "whether hardware limits have been reached".
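One way to make these three decision signals concrete is to pre‑aggregate them as recording rules. A sketch using a PrometheusRule, assuming the Prometheus Operator is installed and the metric names match the exporter example later in this article; DCGM_FI_DEV_GPU_UTIL is the standard GPU utilization series from the DCGM exporter:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-inference-signals
  namespace: inference
spec:
  groups:
    - name: inference-autoscaling
      rules:
        # "Whether to scale": average backlog per replica
        - record: inference:requests_waiting:per_replica
          expr: avg(inference_requests_waiting{namespace="inference"})
        # "How much to scale": p95 time to first token
        - record: inference:ttft_ms:p95
          expr: histogram_quantile(0.95, sum by (le) (rate(inference_ttft_ms_bucket{namespace="inference"}[5m])))
        # "Hardware limit reached": guard-rail from DCGM
        - record: inference:gpu_utilization:avg
          expr: avg(DCGM_FI_DEV_GPU_UTIL)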
Choosing Between HPA, KEDA, and KServe
HPA
Best for stable loads, native Kubernetes integration, low ops cost.
Weak for sudden spikes, limited scale‑to‑zero, less expressive policies.
KEDA
Ideal for bursty, queue‑driven traffic and scale‑to‑zero.
Adds an extra control layer and may suffer cold‑start latency if not tuned.
KServe
Suitable for platform‑wide governance, multi‑model, canary releases.
Higher learning and governance cost.
Recommendation: single‑model or few services → HPA + Prometheus Adapter; unified AI platform → KServe + Karpenter + GPU Operator; bursty workloads → KEDA + Prometheus trigger.
Node Elasticity: GPU Node Supply Must Keep Up
When Pods scale out but no GPU node is available, Pods stay Pending and requests appear to time out upstream. The full node‑supply chain includes:
Pod count increases.
Pending Pods appear.
Node provisioner detects the gap.
Correct GPU node type is launched.
Node registers and becomes schedulable.
Pod starts and finishes warm‑up.
Karpenter is preferred over the traditional Cluster Autoscaler because it selects instance types flexibly, launches faster, supports fine‑grained capacity optimization, and works well with mixed On‑Demand/Spot pools.
GPU Resource Sharing Options
MIG
Strong isolation, stable performance, good for multi‑tenant.
Fixed slice granularity, constraints on model size.
vGPU
Higher density, suitable for small models and many tenants.
Weaker isolation and predictability, higher engineering complexity.
Time‑Slice Reuse
Works for offline or weak‑real‑time jobs; risky for online inference due to tail‑latency amplification.
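For reference, time‑slicing is typically enabled through a device‑plugin sharing configuration consumed by the NVIDIA GPU Operator. A minimal sketch is shown below; the ConfigMap name, namespace, and replica count are assumptions, and how it is referenced from the operator's ClusterPolicy depends on the operator version:
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config       # assumption: referenced from the GPU Operator ClusterPolicy
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4           # each physical GPU is advertised as 4 schedulable slices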
Cold‑Start Is the Biggest Enemy
Cold‑start phases (node launch, image pull, weight download, model load, engine warm‑up) can take 60–300 seconds, which makes the scaling loop lag visibly behind demand.
Cold‑Start Mitigations
Keep warm instances (minReplicas ≥ 1 or ≥ 2 for critical models).
Pre‑pull images via DaemonSet (sketched after this list).
Cache models locally on NVMe or shared high‑speed FS.
Readiness probe must wait for model load, CUDA context, and a dummy inference.
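A sketch of the DaemonSet pre‑pull pattern from the list above: one lightweight pod per GPU node pulls the heavy inference image ahead of scale‑out and then idles. The image, labels, and node selector are assumed to match the deployment example later in this article, and the image is assumed to ship a shell:
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: llm-inference-prepull
  namespace: inference
spec:
  selector:
    matchLabels:
      app: llm-inference-prepull
  template:
    metadata:
      labels:
        app: llm-inference-prepull
    spec:
      nodeSelector:
        accelerator: nvidia-l40s
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
      containers:
        - name: prepull
          image: registry.example.com/llm-inference:1.0.0
          # Pull the heavy image onto the node, then idle so the pod stays cheap
          command: ["/bin/sh", "-c", "while true; do sleep 3600; done"]
          resources:
            requests:
              cpu: "10m"
              memory: "16Mi"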
Production‑Ready Implementation Checklist
Core Components
Kubernetes 1.27+
NVIDIA GPU Operator
Prometheus Operator & DCGM Exporter
Prometheus Adapter (or KEDA) for custom metrics
Karpenter for GPU node provisioning
Inference framework (vLLM, Triton, custom service)
Business Metric Export Example (Python/FastAPI)
import asyncio, time
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from prometheus_client import Counter, Gauge, Histogram, generate_latest
MAX_CONCURRENCY = 16   # hard cap on requests executing on the GPU at the same time
QUEUE_LIMIT = 256      # bound on the admission queue; requests beyond this are rejected
request_waiting_gauge = Gauge("inference_requests_waiting", "Number of requests waiting in queue")
request_inflight_gauge = Gauge("inference_requests_inflight", "Number of requests currently executing")
ttft_histogram = Histogram("inference_ttft_ms", "Time to first token in ms", buckets=(50,100,200,400,800,1200,2000,5000))
request_total = Counter("inference_requests_total", "Total inference requests", ["status"])
kv_cache_usage_gauge = Gauge("inference_kv_cache_usage_ratio", "KV cache usage ratio")
# ... FastAPI app with /healthz, /readyz, /metrics, /generate endpoints ...
This code demonstrates three principles: bounded queue, explicit metric export, and separating readiness from warm‑up.
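To make those principles concrete, the sketch below fills in one possible shape of the elided application code. It is a minimal illustration under assumptions, not the exact implementation: run_model_inference is a hypothetical placeholder for the real engine call, and the endpoint paths simply match the probes used in the deployment manifest later on.
# Additional imports beyond those above
from fastapi import Response
from prometheus_client import CONTENT_TYPE_LATEST

model_ready = False                              # flipped only after warm-up succeeds
semaphore = asyncio.Semaphore(MAX_CONCURRENCY)   # bounds concurrent GPU work
waiting_count = 0                                # backlog size for the bounded-queue check
# kv_cache_usage_gauge would be updated from the engine's stats loop (not shown)

async def run_model_inference(prompt: str) -> str:
    # Hypothetical placeholder for the real engine call (vLLM, Triton client, ...)
    await asyncio.sleep(0.05)
    return "generated text"

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Warm-up: load weights, create the CUDA context, run one dummy inference
    global model_ready
    await run_model_inference("warm-up")
    model_ready = True
    yield

app = FastAPI(lifespan=lifespan)

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}

@app.get("/readyz")
async def readyz():
    # Readiness is separate from liveness: only pass once warm-up has finished
    if not model_ready:
        raise HTTPException(status_code=503, detail="model not warmed up")
    return {"status": "ready"}

@app.get("/metrics")
async def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.post("/generate")
async def generate(payload: dict):
    global waiting_count
    # Bounded queue: reject early instead of building an unbounded backlog
    if waiting_count >= QUEUE_LIMIT:
        request_total.labels(status="rejected").inc()
        raise HTTPException(status_code=429, detail="queue full")
    waiting_count += 1
    request_waiting_gauge.set(waiting_count)
    start = time.monotonic()
    async with semaphore:                # waits here while all GPU slots are busy
        waiting_count -= 1
        request_waiting_gauge.set(waiting_count)
        request_inflight_gauge.inc()
        try:
            result = await run_model_inference(payload.get("prompt", ""))
            ttft_histogram.observe((time.monotonic() - start) * 1000)
            request_total.labels(status="ok").inc()
            return {"output": result}
        except Exception:
            request_total.labels(status="error").inc()
            raise
        finally:
            request_inflight_gauge.dec()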
Key HPA Configuration (Queue + KV‑Cache)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_waiting
        target:
          type: AverageValue
          averageValue: "4"
    - type: Pods
      pods:
        metric:
          name: inference_kv_cache_usage_ratio
        target:
          type: AverageValue
          averageValue: "0.75"
Key points: keep minReplicas ≥ 2 for fault tolerance, use a long scale‑down window (600 s) to avoid thrashing, and combine queue length (primary) with KV‑cache usage (guard‑rail).
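These Pods metrics only exist if something serves them through the custom‑metrics API. With the Prometheus Adapter, that mapping is a rule in its configuration; the sketch below assumes the adapter's standard rules format and that the scraped series carry namespace and pod labels:
rules:
  - seriesQuery: 'inference_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "inference_requests_waiting"
      as: "inference_requests_waiting"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
  - seriesQuery: 'inference_kv_cache_usage_ratio{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "inference_kv_cache_usage_ratio"
      as: "inference_kv_cache_usage_ratio"
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'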
KEDA ScaledObject Example
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaledobject
  namespace: inference
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 30
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: inference_requests_waiting_total
        query: sum(inference_requests_waiting{namespace="inference",app="llm-inference"})
        threshold: "8"
When using scale‑to‑zero, ensure the business can tolerate cold‑start, the gateway protects against queue overflow, and model caching is in place.
Deployment Example with Full Probes and Lifecycle
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      terminationGracePeriodSeconds: 180
      nodeSelector:
        accelerator: nvidia-l40s
        workload-type: online-inference
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm-inference:1.0.0
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "24Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all
Important: configure preStop, graceful termination, and ensure readiness only passes after warm‑up to avoid request loss during scaling.
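The preStop sleep only delays SIGTERM; the server still has to finish in‑flight (including streaming) requests before the grace period expires. One way to line this up, assuming the FastAPI service is served by a reasonably recent uvicorn; the module path and timeout value here are assumptions:
# entrypoint.py – a minimal sketch, not the article's exact entrypoint
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "server:app",                    # assumption: the FastAPI app from the exporter example
        host="0.0.0.0",
        port=8080,
        timeout_graceful_shutdown=150,   # drain in-flight requests, staying under the 180 s grace period
    )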
NodePool Example for Karpenter
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-elastic
spec:
  template:
    metadata:
      labels:
        workload-type: online-inference
        accelerator: nvidia-l40s
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-default        # placeholder: reference your EC2NodeClass
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1", "4", "8"]
      expireAfter: 720h
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 300s
  limits:
    cpu: "2000"
Separate NodePools per GPU model, use Spot for low‑priority pools, and keep a core pool of stable instances for latency‑critical traffic.
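Following the Spot recommendation above, a companion pool for non‑critical traffic might look like the sketch below; the pool name, labels, and EC2NodeClass reference are hypothetical:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-spot-batch               # hypothetical low-priority pool
spec:
  template:
    metadata:
      labels:
        workload-type: batch-inference
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu-default            # placeholder: same EC2NodeClass as the core pool
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]           # Spot only; interruptions are acceptable here
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
  limits:
    cpu: "1000"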
Common Pitfalls
Relying solely on GPU utilization for scaling.
Readiness probes passing before model warm‑up.
Over‑aggressive scale‑down causing thrashing.
Pod autoscaler and node autoscaler not integrated, so Pods scale out on paper but stay Pending ("fake" scaling).
Ignoring long‑running or streaming requests during termination.
Final Checklist
Deploy NVIDIA GPU Operator, DCGM Exporter, and Prometheus.
Expose business metrics (queue, concurrency, TTFT, KV‑cache, error rate).
Start with HPA or KEDA for pod‑level autoscaling.
Add Karpenter to provision GPU nodes on demand.
Optimize readiness, warm‑up, image pre‑pull, and model caching.
Consider MIG/vGPU to increase density after stability is proven.
Build platform‑level governance (model routing, canary, quota, cost allocation).
Conclusion
Scaling GPU inference on Kubernetes is not just “attach an HPA to a Deployment”. It requires a multi‑layer control loop that drives scaling from business‑level SLO metrics, coordinates pod and node elasticity, mitigates cold‑start, and chooses the right resource‑sharing strategy (MIG, vGPU, or time‑slice). When these layers are correctly wired, GPU inference services become stable, cost‑effective, and truly elastic.
Ray's Galactic Tech
Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!