
How to Build a Resilient GPU Inference Autoscaling System on Kubernetes

This article explains why scaling GPU inference services on Kubernetes is challenging and presents a multi‑layer control architecture, metric upgrades, and production‑ready implementations using HPA, KEDA, KServe, and Karpenter to achieve stable, cost‑effective autoscaling.


Why GPU Inference Autoscaling Is Hard

Running inference on GPUs in Kubernetes requires balancing latency, throughput, cost, and stability. GPU resources are discrete, expensive, and have long cold‑start times, so naive Horizontal Pod Autoscaler (HPA) configurations are insufficient.

A GPU is a discrete, strongly‑constrained resource: you either get a whole card or none.

Model loading and container start‑up are slow (seconds to minutes).

GPU utilization alone can mislead; queue length, first‑token latency, P95/P99 latency, KV‑cache usage, and active request count drive user experience.

Five Layers of Elastic Control

Service layer : Pod horizontal scaling.

Resource layer : GPU node supply and reclamation.

Scheduling layer : GPU model, topology, MIG/vGPU slicing.

Application layer : Dynamic batching, concurrency control, model warm‑up, queue protection.

Observability layer : Upgrade from raw GPU metrics to business‑level SLO metrics.

Production Goals

Performance : keep P95 latency within limits, avoid P99 latency avalanches, stabilize first‑token time (TTFT) for large models, and keep queue wait time under business thresholds.

Cost : maximize average GPU utilization, recycle nodes during low‑load periods, use Spot GPUs for non‑critical traffic, and increase density with MIG/vGPU for small models.

Stability : new Pods must finish model warm‑up before becoming ready, scaling‑down must never interrupt ongoing inference sessions, and node failures must not make an entire model unavailable.

Engineering governance : standardize and reuse scaling logic, make metrics auditable and thresholds explainable, decouple service scaling from node scaling, and support multiple inference frameworks (Triton, vLLM, TensorRT‑LLM, custom FastAPI/gRPC).

Recommended Architecture: Four‑Layer Elastic Closed‑Loop

Application Layer

API Gateway / Ingress.

Inference service (vLLM, Triton, custom server).

Model management (weight download, version routing, canary).

Application metric exporter (queue length, TTFT, tokens/s, batch size, KV‑cache usage).

Metrics Layer

Prometheus collects two categories of metrics:

Basic resource metrics: GPU utilization, memory, temperature, power, PCIe/NVLink.

Business metrics: queue length, concurrent requests, TTFT, generation throughput, error rate.

Control Layer

HPA for steady workloads using standard/custom metrics.

KEDA for bursty, queue‑driven traffic (supports scale‑to‑zero).

Custom controllers for complex policies (multi‑metric, per‑model strategies, pre‑scaling).

Resource Layer

Karpenter – fast, flexible GPU node provisioning.

Cluster Autoscaler – traditional node scaling.

NVIDIA GPU Operator – driver, toolkit, device plugin, DCGM exporter management.

Metric System Upgrade: From Resource to Service Indicators

GPU‑only metrics (e.g., utilization) are useful for capacity planning but not reliable for autoscaling. Production systems should prioritize business‑level signals and use resource metrics as guard‑rails.

Queue Metrics:
  request_queue_length, waiting_requests – direct backlog signal
Latency Metrics:
  p95_latency_ms, ttft_ms – user‑experience mapping
Concurrency Metrics:
  active_requests, inflight_requests – instance load
LLM‑Specific Metrics:
  kv_cache_usage_ratio, prefill_tokens, decode_tokens_per_second – model pressure
Resource Guard Metrics:
  gpu_utilization, gpu_memory_ratio – hardware safety net
Queue length decides "whether to scale", latency decides "how much to scale", and GPU metrics decide "whether hardware limits are reached".
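To make this split concrete, the Prometheus recording rules below pre‑aggregate the three classes of signal so the autoscaler queries stable, cheap series. The rule names are illustrative, and the `Hostname` label is an assumption about the local DCGM exporter setup; `DCGM_FI_DEV_GPU_UTIL` is the exporter's standard GPU utilization metric.

```yaml
groups:
  - name: inference-autoscaling
    rules:
      # Primary signal ("whether to scale"): average backlog per app.
      - record: app:inference_requests_waiting:avg
        expr: avg by (namespace, app) (inference_requests_waiting)
      # Sizing signal ("how much to scale"): P95 time-to-first-token over 5m.
      - record: app:inference_ttft_ms:p95_5m
        expr: histogram_quantile(0.95,
          sum by (le, namespace, app) (rate(inference_ttft_ms_bucket[5m])))
      # Guard-rail ("hardware limits"): GPU utilization from DCGM exporter.
      - record: node:gpu_utilization:avg
        expr: avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)
```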

Choosing Between HPA, KEDA, and KServe

HPA

Best for stable loads, native Kubernetes integration, low ops cost.

Weak for sudden spikes, limited scale‑to‑zero, less expressive policies.

KEDA

Ideal for bursty, queue‑driven traffic and scale‑to‑zero.

Adds an extra control layer and may suffer cold‑start latency if not tuned.

KServe

Suitable for platform‑wide governance, multi‑model, canary releases.

Higher learning and governance cost.

Recommendation: single‑model or few services → HPA + Prometheus Adapter; unified AI platform → KServe + Karpenter + GPU Operator; bursty workloads → KEDA + Prometheus trigger.

Node Elasticity: GPU Node Supply Must Keep Up

When Pods scale out but no GPU node is available, the new Pods stay Pending and users see timeouts even though autoscaling appears to have worked. The full node‑supply chain includes:

Pod count increases.

Pending Pods appear.

Node provisioner detects the gap.

Correct GPU node type is launched.

Node registers and becomes schedulable.

Pod starts and finishes warm‑up.

Karpenter is preferred over the traditional Cluster Autoscaler because it selects instance types flexibly, launches faster, supports fine‑grained capacity optimization, and works well with mixed On‑Demand/Spot pools.

GPU Resource Sharing Options

MIG

Strong isolation, stable performance, good for multi‑tenant.

Fixed slice granularity, constraints on model size.

vGPU

Higher density, suitable for small models and many tenants.

Weaker isolation and predictability, higher engineering complexity.

Time‑Slice Reuse

Works for offline or weak‑real‑time jobs; risky for online inference due to tail‑latency amplification.
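If time‑slicing is still chosen for suitable workloads, it is typically enabled through the NVIDIA device plugin's sharing configuration managed by the GPU Operator. The ConfigMap below is a minimal sketch: the ConfigMap name, namespace, and replica count are illustrative, and the exact key expected depends on how the operator is configured.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          # Advertise each physical GPU as 4 schedulable replicas.
          - name: nvidia.com/gpu
            replicas: 4
```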

Cold‑Start Is the Biggest Enemy

Cold‑start phases (node launch, image pull, weight download, model load, engine warm‑up) can take 60–300 seconds, so the scaling control loop visibly lags behind demand.

Cold‑Start Mitigations

Keep warm instances (minReplicas ≥ 1 or ≥ 2 for critical models).

Pre‑pull images via DaemonSet.

Cache models locally on NVMe or shared high‑speed FS.

Readiness probe must wait for model load, CUDA context, and a dummy inference.
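The last point deserves emphasis: readiness should flip to true only after the full warm‑up sequence succeeds. A framework‑agnostic sketch of that gating, where the `ModelServer` class and its stub methods are illustrative stand‑ins rather than any specific framework's API:

```python
import time


class ModelServer:
    """Tracks warm-up state so a /readyz handler can gate on it."""

    def __init__(self):
        self.ready = False

    def load_weights(self):
        time.sleep(0.01)  # stand-in for downloading/loading model weights

    def init_cuda_context(self):
        time.sleep(0.01)  # stand-in for CUDA context / engine initialization

    def dummy_inference(self):
        time.sleep(0.01)  # stand-in for one warm-up forward pass
        return "ok"

    def warm_up(self):
        # Readiness flips only after every phase has actually succeeded.
        self.load_weights()
        self.init_cuda_context()
        assert self.dummy_inference() == "ok"
        self.ready = True

    def readyz(self):
        # Kubernetes readinessProbe target: 200 only once warm-up is done.
        return 200 if self.ready else 503


server = ModelServer()
print(server.readyz())  # 503 before warm-up
server.warm_up()
print(server.readyz())  # 200 afterwards
```

Wiring `readyz` behind the HTTP readiness endpoint guarantees a new replica never receives traffic until the dummy inference has completed.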

Production‑Ready Implementation Checklist

Core Components

Kubernetes 1.27+

NVIDIA GPU Operator

Prometheus Operator & DCGM Exporter

Prometheus Adapter (or KEDA) for custom metrics

Karpenter for GPU node provisioning

Inference framework (vLLM, Triton, custom service)

Business Metric Export Example (Python/FastAPI)

import asyncio, time
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from prometheus_client import Counter, Gauge, Histogram, generate_latest

MAX_CONCURRENCY = 16
QUEUE_LIMIT = 256

request_waiting_gauge = Gauge("inference_requests_waiting", "Number of requests waiting in queue")
request_inflight_gauge = Gauge("inference_requests_inflight", "Number of requests currently executing")
ttft_histogram = Histogram("inference_ttft_ms", "Time to first token in ms", buckets=(50,100,200,400,800,1200,2000,5000))
request_total = Counter("inference_requests_total", "Total inference requests", ["status"])
kv_cache_usage_gauge = Gauge("inference_kv_cache_usage_ratio", "KV cache usage ratio")

# ... FastAPI app with /healthz, /readyz, /metrics, /generate endpoints ...

This code demonstrates three principles: bounded queue, explicit metric export, and separating readiness from warm‑up.
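The bounded‑queue principle can be shown without any web framework. In this asyncio sketch, the `handle_request` helper and the 503 sentinel are illustrative: once waiting requests reach the limit, new work is shed immediately instead of letting the backlog (and tail latency) grow without bound.

```python
import asyncio

MAX_CONCURRENCY = 2  # concurrent inferences per replica
QUEUE_LIMIT = 3      # max requests allowed to wait

waiting = 0   # exported as inference_requests_waiting in the real service
inflight = 0  # exported as inference_requests_inflight


async def handle_request(sem):
    global waiting, inflight
    if waiting >= QUEUE_LIMIT:
        return 503  # shed load instead of queueing forever
    waiting += 1
    async with sem:  # blocks here while MAX_CONCURRENCY slots are busy
        waiting -= 1
        inflight += 1
        await asyncio.sleep(0.05)  # stand-in for model inference
        inflight -= 1
    return 200


async def main():
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    # 10 concurrent requests: 2 run, 3 queue, the remaining 5 are rejected.
    return await asyncio.gather(*(handle_request(sem) for _ in range(10)))


results = asyncio.run(main())
print(sorted(results))  # [200, 200, 200, 200, 200, 503, 503, 503, 503, 503]
```

The rejection count is itself a useful signal: a rising 503 rate with a full queue means the fleet is undersized, which is exactly what the HPA queue metric captures.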

Key HPA Configuration (Queue + KV‑Cache)

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference-hpa
  namespace: inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 30
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
        - type: Percent
          value: 100
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 20
          periodSeconds: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: inference_requests_waiting
        target:
          type: AverageValue
          averageValue: "4"
    - type: Pods
      pods:
        metric:
          name: inference_kv_cache_usage_ratio
        target:
          type: AverageValue
          averageValue: "0.75"

Key points: keep minReplicas ≥ 2 for fault tolerance, use a long scale‑down window (600 s) to avoid thrashing, and combine queue length (primary) with KV‑cache usage (guard‑rail).
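For the HPA above to see inference_requests_waiting as a Pods metric, the Prometheus Adapter needs a matching discovery rule. The fragment below is a sketch under the assumption that the exporter's series carry namespace and pod labels; it goes into the adapter's config file.

```yaml
rules:
  - seriesQuery: 'inference_requests_waiting{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "inference_requests_waiting"
      as: "inference_requests_waiting"
    # One value per pod; the HPA then averages across pods.
    metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```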

KEDA ScaledObject Example

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llm-inference-scaledobject
  namespace: inference
spec:
  scaleTargetRef:
    name: llm-inference
  minReplicaCount: 1
  maxReplicaCount: 30
  pollingInterval: 15
  cooldownPeriod: 300
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-operated.monitoring.svc:9090
        metricName: inference_requests_waiting_total
        query: sum(inference_requests_waiting{namespace="inference",app="llm-inference"})
        threshold: "8"

When using scale‑to‑zero, ensure the business can tolerate cold‑start, the gateway protects against queue overflow, and model caching is in place.

Deployment Example with Full Probes and Lifecycle

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
  namespace: inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      terminationGracePeriodSeconds: 180
      nodeSelector:
        accelerator: nvidia-l40s
        workload-type: online-inference
      tolerations:
        - key: "nvidia.com/gpu"
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/llm-inference:1.0.0
          resources:
            requests:
              cpu: "4"
              memory: "16Gi"
              nvidia.com/gpu: "1"
            limits:
              cpu: "8"
              memory: "24Gi"
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /readyz
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 30
            periodSeconds: 10
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 20"]
          env:
            - name: NVIDIA_VISIBLE_DEVICES
              value: all

Important: configure preStop, graceful termination, and ensure readiness only passes after warm‑up to avoid request loss during scaling.
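The preStop sleep only delays SIGTERM; the server itself must also drain in‑flight work before exiting. A minimal sketch of that drain logic, where the `InflightTracker` class and its timeout values are illustrative rather than part of any framework:

```python
import threading
import time


class InflightTracker:
    """Counts in-flight requests and supports draining on shutdown."""

    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()
        self.draining = False

    def start_request(self):
        with self._lock:
            if self.draining:
                return False  # refuse new work once SIGTERM has arrived
            self._count += 1
            return True

    def finish_request(self):
        with self._lock:
            self._count -= 1

    def drain(self, timeout_s=150.0, poll_s=0.01):
        # Called from the SIGTERM handler: stop accepting work, then wait
        # until every in-flight request completes (or the deadline passes).
        # timeout_s should stay below terminationGracePeriodSeconds.
        self.draining = True
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            with self._lock:
                if self._count == 0:
                    return True
            time.sleep(poll_s)
        return False


tracker = InflightTracker()
tracker.start_request()                              # one request in flight
threading.Timer(0.05, tracker.finish_request).start()  # finishes 50 ms later
drained = tracker.drain(timeout_s=2.0)
print(drained)                   # True: request finished before the deadline
print(tracker.start_request())   # False: new work refused while shutting down
```

For streaming LLM responses, "in flight" should count the whole generation stream, not just the initial request, so long generations are not cut off mid‑token.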

NodePool Example for Karpenter

apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu-elastic
spec:
  template:
    metadata:
      labels:
        workload-type: online-inference
        accelerator: nvidia-l40s
    spec:
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["g"]
        - key: karpenter.k8s.aws/instance-gpu-count
          operator: In
          values: ["1","4","8"]
      expireAfter: 720h
  # disruption and limits are NodePool-level fields in karpenter.sh/v1,
  # not part of the node template.
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 300s
  limits:
    cpu: "2000"

Separate NodePools per GPU model, use Spot for low‑priority pools, and keep a core pool of stable instances for latency‑critical traffic.

Common Pitfalls

Relying solely on GPU utilization for scaling.

Readiness probes passing before model warm‑up.

Over‑aggressive scale‑down causing thrashing.

Pod autoscaler and node autoscaler not integrated, leading to "fake" scaling.

Ignoring long‑running or streaming requests during termination.

Final Checklist

Deploy NVIDIA GPU Operator, DCGM Exporter, and Prometheus.

Expose business metrics (queue, concurrency, TTFT, KV‑cache, error rate).

Start with HPA or KEDA for pod‑level autoscaling.

Add Karpenter to provision GPU nodes on demand.

Optimize readiness, warm‑up, image pre‑pull, and model caching.

Consider MIG/vGPU to increase density after stability is proven.

Build platform‑level governance (model routing, canary, quota, cost allocation).

Conclusion

Scaling GPU inference on Kubernetes is not just “attach an HPA to a Deployment”. It requires a multi‑layer control loop that drives scaling from business‑level SLO metrics, coordinates pod and node elasticity, mitigates cold‑start, and chooses the right resource‑sharing strategy (MIG, vGPU, or time‑slice). When these layers are correctly wired, GPU inference services become stable, cost‑effective, and truly elastic.

Tags: Kubernetes, autoscaling, Prometheus, GPU, Inference, HPA, KEDA, Karpenter
Written by

Ray's Galactic Tech

Practice together, never alone. We cover programming languages, development tools, learning methods, and pitfall notes. We simplify complex topics, guiding you from beginner to advanced. Weekly practical content—let's grow together!
