How to Deploy Scalable LLM Inference on Kubernetes with GPU Autoscaling

This guide walks through building a production‑grade Kubernetes GPU cluster for large language model inference, covering hardware sizing, GPU resource scheduling, model storage options, automated scaling with HPA, health checks, monitoring, troubleshooting, and multi‑model deployment strategies.


Overview

The rapid growth of large language models (LLMs) such as ChatGPT, LLaMA, and GLM has created demand for high‑throughput, low‑latency inference services. Single‑GPU servers cannot meet the memory and compute requirements of models whose footprints run from 10 GB to well over 100 GB, especially under high‑QPS workloads. This article presents a complete, production‑ready solution for deploying LLM inference on a Kubernetes GPU cluster, from hardware selection to monitoring and disaster recovery.

Key Technical Features

GPU Resource Management : Uses the NVIDIA Device Plugin and GPU Operator to expose GPUs as first‑class Kubernetes resources, enabling time‑slicing, shared usage, and per‑node labeling.

Elastic Scaling : Configures Horizontal Pod Autoscaler (HPA) with CPU, memory, and custom GPU utilization metrics to automatically scale pods between 2 and 10 replicas.

Model Storage Options : Compares PVC + NFS, S3 + InitContainer, and container‑image embedding, recommending NFS for multi‑node sharing and S3 for cloud‑native environments.

High‑Availability Design : Deploys multiple replicas with pod anti‑affinity to spread pods across nodes, plus node taints and tolerations to keep GPU nodes dedicated to inference, enabling zero‑downtime rolling updates.

Observability : Integrates Prometheus, Grafana, and custom metrics (GPU utilization, memory usage, inference latency) with alerts for high memory usage, low GPU utilization, and request errors.
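
As a flavour of the alerting side, the rules below are a minimal sketch of Prometheus alerts built on the DCGM Exporter's GPU metrics; the metric names (DCGM_FI_DEV_GPU_UTIL, DCGM_FI_DEV_FB_USED, DCGM_FI_DEV_FB_FREE) and thresholds are assumptions to adapt to your exporter version and SLOs.

groups:
- name: llm-gpu-alerts
  rules:
  - alert: GPUMemoryHigh
    # Framebuffer usage above 90% of total for 5 minutes
    expr: DCGM_FI_DEV_FB_USED / (DCGM_FI_DEV_FB_USED + DCGM_FI_DEV_FB_FREE) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "GPU memory usage above 90% on {{ $labels.instance }}"
  - alert: GPUUtilizationLow
    # Sustained low utilization usually means over-provisioned replicas
    expr: avg_over_time(DCGM_FI_DEV_GPU_UTIL[15m]) < 30
    for: 30m
    labels:
      severity: info
    annotations:
      summary: "GPU utilization below 30% for 30 minutes"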

Step‑by‑Step Deployment

1. System Preparation

Verify OS, kernel, and NVIDIA driver (525+). Install Docker/Containerd, kubeadm, kubelet, and kubectl (v1.28+). Disable swap, load kernel modules, and configure sysctl for networking.
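
On Ubuntu‑family nodes, this preparation boils down to a handful of commands; the sketch below assumes the Kubernetes apt repository has already been configured.

# Disable swap (kubelet will not start with swap enabled)
sudo swapoff -a
sudo sed -i '/ swap / s/^/#/' /etc/fstab

# Kernel modules for container networking
sudo modprobe overlay
sudo modprobe br_netfilter

# Sysctl settings required by kubeadm preflight checks
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward = 1
EOF
sudo sysctl --system

# Container runtime and Kubernetes components (v1.28+)
sudo apt install -y containerd kubeadm kubelet kubectl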

2. Install NVIDIA Stack

Install NVIDIA driver (e.g., sudo apt install -y nvidia-driver-525).

Install NVIDIA Container Toolkit (sudo apt install -y nvidia-container-toolkit) and configure Docker or containerd.

Deploy the NVIDIA Device Plugin via Helm (helm repo add nvdp https://nvidia.github.io/k8s-device-plugin) with a custom gpu-device-plugin-values.yaml that enables time‑slicing (4 replicas per GPU).
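
The snippet below is a minimal sketch of such a values file and install command, assuming the plugin's sharing/timeSlicing configuration format; chart option names vary between plugin versions, so check them against the chart you deploy.

# gpu-device-plugin-values.yaml (sketch)
config:
  map:
    default: |-
      version: v1
      sharing:
        timeSlicing:
          resources:
          - name: nvidia.com/gpu
            replicas: 4   # advertise each physical GPU as 4 schedulable GPUs

helm repo update
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin --create-namespace \
  -f gpu-device-plugin-values.yaml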

3. Build the Kubernetes Cluster

Create a control‑plane node (kubeadm init) and join GPU worker nodes using the generated token. Label GPU nodes (gpu-type: a100, gpu-memory: 80GB) and add an nvidia.com/gpu=true taint so non‑GPU workloads are not scheduled onto them, as sketched below.
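
A sketch of the node setup; the pod CIDR, node name, and join parameters are placeholders.

# Control-plane node
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# On each GPU worker, run the join command printed by kubeadm init:
# sudo kubeadm join <control-plane-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>

# Label and taint GPU nodes (names and values are illustrative)
kubectl label node gpu-node-01 gpu-type=a100 gpu-memory=80GB
kubectl taint node gpu-node-01 nvidia.com/gpu=true:NoSchedule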

4. Configure Model Storage

Three storage patterns are provided (a minimal sketch of the NFS pattern follows the list):

PVC + NFS : Shared read‑only NFS volume for all pods.

S3 + InitContainer : Downloads model files at pod start, suitable for cloud environments.

Container Image : Embeds small models directly in the image.
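
A minimal sketch of the PVC + NFS pattern; the NFS server address, export path, and size are placeholders.

apiVersion: v1
kind: PersistentVolume
metadata:
  name: llm-models-pv
spec:
  capacity:
    storage: 200Gi
  accessModes: ["ReadOnlyMany"]
  nfs:
    server: 10.0.0.10            # placeholder NFS server
    path: /export/llm-models
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llm-models-pvc
spec:
  accessModes: ["ReadOnlyMany"]
  storageClassName: ""           # bind to the statically provisioned PV above
  resources:
    requests:
      storage: 200Gi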

5. Deploy the Inference Service

A FastAPI‑based inference server is containerized. The Deployment includes resource requests (cpu: 4, memory: 16Gi, nvidia.com/gpu: 1), a node selector, tolerations, pod anti‑affinity, an initContainer for model download, and liveness/readiness probes. A Service exposes port 8000, and an Ingress routes external traffic with rate limiting.
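
A trimmed sketch of such a Deployment, mounting the NFS claim from the previous step; the image name, health endpoint, and resource limits are assumptions, and the initContainer, Service, and Ingress are omitted for brevity.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-inference
spec:
  replicas: 2
  selector:
    matchLabels:
      app: llm-inference
  template:
    metadata:
      labels:
        app: llm-inference
    spec:
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Equal
        value: "true"
        effect: NoSchedule
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              topologyKey: kubernetes.io/hostname
              labelSelector:
                matchLabels:
                  app: llm-inference
      containers:
      - name: inference
        image: registry.example.com/llm-inference:latest   # placeholder image
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "4"
            memory: 16Gi
            nvidia.com/gpu: 1
          limits:
            memory: 24Gi
            nvidia.com/gpu: 1
        readinessProbe:
          httpGet:
            path: /health        # assumed FastAPI health endpoint
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
          periodSeconds: 30
        volumeMounts:
        - name: models
          mountPath: /models
          readOnly: true
      volumes:
      - name: models
        persistentVolumeClaim:
          claimName: llm-models-pvc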

6. Autoscaling and Monitoring

The HPA scales based on CPU, memory, and a custom gpu_utilization metric (requires DCGM Exporter). Prometheus scrapes /metrics from the application, exposing GPU utilization, memory usage, inference latency histograms, request rates, and error ratios. Grafana dashboards visualize these metrics and trigger alerts for high GPU memory (>90 %), low utilization (<30 %), or latency spikes.
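
A sketch of the corresponding HPA; the gpu_utilization pods metric assumes the DCGM Exporter data is re-published through the Prometheus Adapter as a custom metric, and the target values are illustrative.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llm-inference
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llm-inference
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: gpu_utilization      # custom metric served by the Prometheus Adapter
      target:
        type: AverageValue
        averageValue: "80"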

Troubleshooting & Performance Tuning

Common issues such as pods stuck in Pending, GPU not recognized, model loading failures, OOM kills, and high latency are addressed with diagnostic commands (kubectl describe pod, nvidia-smi, kubectl logs) and remediation steps (adjust resource limits, fix node labels/taints, increase GPU memory, enable model quantization, or tune batch size).
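
A typical first pass over these symptoms looks like this; pod and node names are placeholders.

# Pending pods: look for "Insufficient nvidia.com/gpu" or taint/toleration mismatches
kubectl describe pod llm-inference-abc123 -n default
kubectl get events -n default --sort-by=.lastTimestamp

# GPU not recognized: confirm the device plugin advertises GPUs and the driver is healthy
kubectl describe node gpu-node-01 | grep -A5 Allocatable
kubectl exec -it llm-inference-abc123 -- nvidia-smi

# Model loading failures and OOM kills: check logs and the last container state
kubectl logs llm-inference-abc123 --previous
kubectl get pod llm-inference-abc123 -o jsonpath='{.status.containerStatuses[0].lastState}'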

Backup, Restore, and Disaster Recovery

A Bash script (backup-llm-inference.sh) archives Kubernetes manifests, model files, logs, and monitoring configurations, uploads them to S3, and retains a 7‑day rotation. A CronJob runs the backup daily at 02:00. Restoration steps include scaling the deployment to zero, applying saved manifests, syncing model data, and verifying service health.
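
The script itself is not reproduced here, but a minimal sketch of the same idea, with placeholder bucket, namespace, and paths, could look like this.

#!/usr/bin/env bash
set -euo pipefail

DATE=$(date +%F)
BACKUP_DIR=/backup/llm-inference/$DATE
S3_BUCKET=s3://example-llm-backups        # placeholder bucket

mkdir -p "$BACKUP_DIR"

# 1. Export Kubernetes manifests for the inference stack
kubectl get deploy,svc,ingress,hpa,configmap,pvc -n llm-inference -o yaml \
  > "$BACKUP_DIR/manifests.yaml"

# 2. Archive model files and monitoring configuration from the NFS share
tar czf "$BACKUP_DIR/models.tar.gz" -C /export/llm-models .
tar czf "$BACKUP_DIR/monitoring.tar.gz" -C /etc/prometheus .

# 3. Upload to S3 and prune local copies older than 7 days
aws s3 cp "$BACKUP_DIR" "$S3_BUCKET/$DATE/" --recursive
find /backup/llm-inference -maxdepth 1 -type d -mtime +7 -exec rm -rf {} +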

Summary of Results

In production, the cluster runs 20+ GPU nodes (A100 80GB, V100 32GB, T4 16GB) supporting over 500 million inference calls per day with average latency under 100 ms (P50) and GPU utilization between 60‑90 %. Autoscaling reduces cost by 30 % during off‑peak periods, and the monitoring stack provides real‑time alerts for SLA compliance.

Tags: Docker, LLM, Kubernetes, autoscaling, Prometheus, GPU, Inference