How to Deploy vLLM Inference Service on Kubernetes with Ingress and Service Load Balancing
This guide walks through deploying a production‑grade vLLM inference service on Kubernetes, covering GPU resource scheduling, Service and Ingress configuration, session affinity, health checks, performance tuning, scaling, monitoring, fault‑tolerance, and best‑practice recommendations for high‑availability AI workloads.
The article explains how to set up a scalable, high‑availability vLLM inference service on a Kubernetes cluster (v1.29+). It starts with an overview of the problem space, emphasizing the need for proper Service and Ingress configuration to expose GPU‑accelerated pods safely and efficiently. The author discusses the trade‑offs between Headless Service and ClusterIP, recommending ClusterIP for stateless inference workloads, and details the selection of Ingress controllers (Nginx, Traefik, AWS ALB) with a focus on Nginx as the default choice due to its maturity and performance.
Key technical sections cover:
GPU resource scheduling: Using native Kubernetes GPU support (resources.limits.nvidia.com/gpu) and node selectors to ensure pods run on GPU nodes, with recommended GPU‑memory‑utilization settings for different model sizes.
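A minimal pod‑spec sketch of that setup; the node label and model name are assumptions (clusters running the NVIDIA GPU Operator typically expose labels such as nvidia.com/gpu.present):

```yaml
# Pod-spec fragment: schedule onto GPU nodes and claim one GPU.
# The nodeSelector label is an assumption; adjust to your cluster's labels.
spec:
  nodeSelector:
    nvidia.com/gpu.present: "true"
  containers:
    - name: vllm
      image: vllm/vllm-openai:latest
      args:
        - "--model=Qwen/Qwen2.5-7B-Instruct"   # hypothetical model choice
        - "--gpu-memory-utilization=0.90"      # leave headroom; lower for larger models
      resources:
        limits:
          nvidia.com/gpu: 1   # extended resource exposed by the NVIDIA device plugin
```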
Session affinity: Enabling cookie‑based affinity via nginx.ingress.kubernetes.io/affinity: cookie and configuring appropriate cookie names and expiration (300‑1800 seconds) to preserve KV‑Cache across requests.
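A sketch of those annotations; the cookie name and the 1800‑second expiry are placeholder choices within the recommended range:

```yaml
# Ingress annotations for cookie-based affinity (ingress-nginx).
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "vllm-session"  # assumed name
    nginx.ingress.kubernetes.io/session-cookie-expires: "1800"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "1800"
```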
Health probes: Detailed liveness, readiness, and startup probe settings (initialDelaySeconds, periodSeconds, failureThreshold) tuned for the long model‑loading times of vLLM.
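A container‑level sketch in this spirit; the /health path and port 8000 match vLLM's OpenAI‑compatible server defaults, while the thresholds are assumptions to tune against real load times:

```yaml
# Probe settings tuned for slow model loading.
startupProbe:
  httpGet:
    path: /health      # vLLM's OpenAI-compatible server exposes /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 60 # allow up to ~10 min for weights to load
livenessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /health
    port: 8000
  periodSeconds: 5
  failureThreshold: 3
```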
Ingress timeout and rate‑limiting: Setting proxy-read-timeout up to 900 seconds for large models, and configuring limit-rps, limit-burst-multiplier, and limit-rate to protect the service from overload.
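A sketch using ingress-nginx annotation names (note the controller expresses burst as a multiplier of limit-rps rather than an absolute burst value); the numbers are placeholders:

```yaml
# ingress-nginx annotations for long-running generation requests and
# basic overload protection.
metadata:
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "900"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "900"
    nginx.ingress.kubernetes.io/limit-rps: "20"              # requests/s per client IP
    nginx.ingress.kubernetes.io/limit-burst-multiplier: "5"  # burst = 5 x limit-rps
    nginx.ingress.kubernetes.io/limit-rate: "1024"           # KB/s per connection
```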
Deployment YAML: A production‑ready manifest with 4 replicas, rolling‑update strategy (maxSurge: 25%, maxUnavailable: 0), pod anti‑affinity, GPU limits, and environment variables such as GPU_MEMORY_UTILIZATION, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS, and SWAP_SPACE for performance tuning.
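A condensed sketch of such a manifest; the image tag and tuning values are assumptions, and the container entrypoint is assumed to translate the env vars into vLLM's corresponding CLI flags:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 0              # never drop below full capacity during updates
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      affinity:
        podAntiAffinity:             # spread replicas across nodes
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: vllm-inference
              topologyKey: kubernetes.io/hostname
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
          env:
            - name: GPU_MEMORY_UTILIZATION
              value: "0.90"
            - name: MAX_NUM_SEQS
              value: "256"           # max concurrent sequences per scheduler step
            - name: MAX_NUM_BATCHED_TOKENS
              value: "8192"          # cap on tokens batched per step
            - name: SWAP_SPACE
              value: "4"             # GiB of CPU swap for preempted sequences
          resources:
            limits:
              nvidia.com/gpu: 1
```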
Service YAML: ClusterIP service exposing port 80 for HTTP and port 8080 for Prometheus metrics, with client‑IP session affinity.
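A sketch matching that description; vLLM serves /metrics on its API port (8000 by default), so the metrics targetPort below is an assumption tied to that behavior:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: vllm-inference
spec:
  type: ClusterIP
  selector:
    app: vllm-inference
  sessionAffinity: ClientIP          # client-IP affinity at the Service layer
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 1800
  ports:
    - name: http
      port: 80
      targetPort: 8000
    - name: metrics
      port: 8080
      targetPort: 8000
```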
Ingress YAML: TLS termination via Cert‑Manager, extensive Nginx annotations for body size, buffer sizes, connection limits, and cookie‑based session handling.
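A minimal sketch combining cert-manager TLS with the annotation families shown earlier; the host, issuer, and secret names are placeholders:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-inference
  annotations:
    cert-manager.io/cluster-issuer: "letsencrypt-prod"   # assumed issuer name
    nginx.ingress.kubernetes.io/proxy-body-size: "10m"
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - vllm.example.com
      secretName: vllm-tls
  rules:
    - host: vllm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-inference
                port:
                  number: 80
```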
Traefik comparison: Shows how to achieve similar functionality with IngressRoute and Middleware resources, highlighting Traefik's slightly higher latency but easier dynamic configuration updates.
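A rough Traefik equivalent under the traefik.io/v1alpha1 CRDs (Traefik v3; v2 used traefik.containo.us/v1alpha1); names and limits are placeholders:

```yaml
apiVersion: traefik.io/v1alpha1
kind: Middleware
metadata:
  name: vllm-ratelimit
spec:
  rateLimit:
    average: 20        # requests per second
    burst: 100
---
apiVersion: traefik.io/v1alpha1
kind: IngressRoute
metadata:
  name: vllm-inference
spec:
  entryPoints:
    - websecure
  routes:
    - match: Host(`vllm.example.com`)
      kind: Rule
      middlewares:
        - name: vllm-ratelimit
      services:
        - name: vllm-inference
          port: 80
          sticky:
            cookie:                  # Traefik's cookie-based affinity
              name: vllm-session
  tls:
    secretName: vllm-tls
```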
Real‑world case studies: Two production scenarios—single‑model deployment for an e‑commerce recommendation engine and multi‑model A/B testing with traffic splitting—illustrate configuration choices, performance results (QPS ≈ 750, P95 latency ≈ 420 ms), and pitfalls such as over‑allocating GPU memory or setting session‑affinity timeouts too long.
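One way to realize the traffic‑splitting scenario is an ingress-nginx canary Ingress; the 80/20 split and resource names below are illustrative, not figures from the case study:

```yaml
# Second Ingress marked as canary: sends ~20% of traffic to model B.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: vllm-model-b
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "20"
spec:
  ingressClassName: nginx
  rules:
    - host: vllm.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: vllm-model-b   # hypothetical Service for the B variant
                port:
                  number: 80
```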
Best practices: Performance optimizations (HTTP/2, keep‑alive, batch processing), GPU memory tuning, KV‑Cache hit‑rate monitoring, pod anti‑affinity, and graceful shutdown via lifecycle hooks.
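A sketch of the graceful‑shutdown piece; the sleep duration and grace period are assumptions that should exceed your longest expected generation:

```yaml
# Pod-spec fragment: drain in-flight requests before SIGTERM reaches vLLM.
spec:
  terminationGracePeriodSeconds: 120   # pod-level: total time before SIGKILL
  containers:
    - name: vllm
      lifecycle:
        preStop:
          exec:
            command: ["/bin/sh", "-c", "sleep 30"]  # let the LB deregister the pod
```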
Security hardening: Enforcing TLS 1.3, configuring basic‑auth or JWT authentication, and applying rate‑limiting to mitigate abuse.
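A sketch of the authentication piece using ingress-nginx basic-auth annotations; note that TLS 1.3 enforcement happens in the controller's ConfigMap (ssl-protocols), not per Ingress:

```yaml
# Per-Ingress basic auth; the Secret must hold an htpasswd entry under the key "auth".
metadata:
  annotations:
    nginx.ingress.kubernetes.io/auth-type: basic
    nginx.ingress.kubernetes.io/auth-secret: vllm-basic-auth   # assumed Secret name
    nginx.ingress.kubernetes.io/auth-realm: "vLLM inference - authentication required"
---
# Cluster-wide TLS policy for ingress-nginx.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller
  namespace: ingress-nginx
data:
  ssl-protocols: "TLSv1.3"
```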
High‑availability: Multi‑replica deployments, pod anti‑affinity, rolling updates with zero downtime, and health‑check configurations to ensure rapid failover.
Monitoring & alerting: Prometheus queries for QPS, P95 latency, GPU utilization, error rate, and KV‑Cache hit‑rate, plus a ready‑to‑apply PrometheusRule manifest with alerts for high QPS, high latency, error spikes, and low cache hit‑rate.
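A sketch of one such alert; the vllm:* metric names come from vLLM's built‑in Prometheus exporter and may differ across versions, and the 500 ms threshold is illustrative:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: vllm-alerts
spec:
  groups:
    - name: vllm.rules
      rules:
        - alert: VLLMHighP95Latency
          expr: |
            histogram_quantile(0.95,
              sum(rate(vllm:e2e_request_latency_seconds_bucket[5m])) by (le)
            ) > 0.5
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "vLLM P95 end-to-end latency above 500 ms"
```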
Backup & restore: Bash scripts for exporting all relevant Kubernetes resources (Deployments, Services, Ingresses, ConfigMaps, Secrets, PVCs, Prometheus rules) and a restore workflow that safely scales down the service, applies the saved manifests, and verifies pod readiness.
The article concludes with a concise technical checklist, recommended next steps (autoscaling, tensor parallelism, service‑mesh integration), a reference table of configuration parameters, and a glossary of key terms.