Deploy Large Language Models on Kubernetes with Ollama and Open-WebUI
This guide walks through deploying a local LLM on Kubernetes, using Ollama for model serving and Open-WebUI as the web interface. It covers namespace creation, storage setup, GPU support, service exposure, validation, and model download, with data privacy, low latency, and high availability as the goals.
Background
With the widespread adoption of large language models (LLMs) in enterprise applications, running models locally is essential for data privacy, cost control, and reduced latency. Ollama simplifies local LLM execution, while Kubernetes provides the orchestration needed for production deployment.
Deployment
1. Create Ollama namespace
apiVersion: v1
kind: Namespace
metadata:
  name: ollama
Apply the manifest:
kubectl apply -f ollama-namespace.yaml
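If you want to confirm the namespace was created before moving on:
kubectl get namespace ollama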
2. Prepare storage class
Install OpenEBS local PV via Helm and set it as the default storage class.
# Add Helm repo
helm repo add openebs-localpv https://openebs.github.io/dynamic-localpv-provisioner
helm repo update
# Install
helm upgrade --install openebs-localpv openebs-localpv/localpv-provisioner \
--namespace openebs --create-namespace \
--set hostpathClass.basePath="/data/openebs/local" \
--set global.imageRegistry="ccr.ccs.tencentyun.com" \
--set localpv.image.repository="chijinjing/provisioner-localpv" \
--set helperPod.image.repository="chijinjing/linux-utils"
# Make it default
kubectl patch storageclass openebs-hostpath -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
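To check that the provisioner is running and that openebs-hostpath is now marked "(default)":
kubectl get pods -n openebs
kubectl get storageclass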
3. Deploy Ollama
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              cpu: "2000m"
              memory: "2Gi"
            limits:
              cpu: "4000m"
              memory: "4Gi"
              nvidia.com/gpu: "0"
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
          tty: true
  volumeClaimTemplates:
    - metadata:
        name: ollama-volume
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 11434
      targetPort: 11434
kubectl apply -f ollama-statefulset.yaml
kubectl apply -f ollama-service.yaml
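Before wiring up the UI, it can be worth confirming that the Ollama API answers inside the cluster. A one-off curl pod works for this; the curlimages/curl image and the pod name api-check are just illustrative choices:
kubectl -n ollama run api-check --rm -it --restart=Never --image=curlimages/curl -- \
  curl -s http://ollama-service.ollama.svc.cluster.local:11434/api/tags
The /api/tags endpoint lists locally available models, so an empty list is expected at this stage.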
4. Ollama with GPU
If the cluster has GPUs, add GPU requests and a node selector.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          resources:
            requests:
              nvidia.com/gpu: 1  # request 1 GPU
              memory: "2Gi"
              cpu: "2000m"
            limits:
              nvidia.com/gpu: 1
              memory: "4Gi"
              cpu: "4000m"
          volumeMounts:
            - name: ollama-volume
              mountPath: /root/.ollama
          tty: true
      nodeSelector:
        gputype: nvidia-tesla-t4
  volumeClaimTemplates:
    - metadata:
        name: ollama-volume
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 30Gi
NVIDIA GPUs are the most common, but Kubernetes also supports AMD (amd.com/gpu) and Intel (gpu.intel.com) devices.
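The nodeSelector above only matches nodes labeled gputype=nvidia-tesla-t4, so label the GPU node first (the node name below is a placeholder). Once the pod is scheduled there, and assuming the NVIDIA device plugin and container toolkit are installed on the node, the GPU should be visible from inside the container:
# Label the GPU node so the nodeSelector can match it
kubectl label node <gpu-node-name> gputype=nvidia-tesla-t4
# After the pod is rescheduled onto that node, check GPU visibility
kubectl -n ollama exec -it ollama-0 -- nvidia-smi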
5. Deploy Open-WebUI
apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
        - name: open-webui
          image: ghcr.io/open-webui/open-webui:main
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "500Mi"
            limits:
              cpu: "1000m"
              memory: "1Gi"
          env:
            - name: OLLAMA_BASE_URL
              value: "http://ollama-service.ollama.svc.cluster.local:11434"
          tty: true
          volumeMounts:
            - name: webui-volume
              mountPath: /app/backend/data
      volumes:
        - name: webui-volume
          persistentVolumeClaim:
            claimName: open-webui-pvc
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: open-webui
  name: open-webui-pvc
  namespace: ollama
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
  namespace: ollama
spec:
  type: NodePort
  selector:
    app: open-webui
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
      nodePort: 30080
kubectl apply -f webui-deployment.yaml
kubectl apply -f webui-pvc.yaml
kubectl apply -f webui-service.yaml
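For a quick local check before exposing anything externally, port-forwarding the service is usually enough:
kubectl -n ollama port-forward svc/open-webui-service 8080:8080
# then open http://localhost:8080 in a browser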
6. Validation
Check that the pods and services are running, then access Open-WebUI via the NodePort or an Ingress in production.
# kubectl get pods -n ollama
NAME                        READY   STATUS    RESTARTS       AGE
ollama-0                    1/1     Running   0              140m
open-webui-deployment-...   1/1     Running   1 (133m ago)   135m
# kubectl get svc -n ollama
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
ollama-service       ClusterIP   10.96.0.85    <none>        11434/TCP        140m
open-webui-service   NodePort    10.96.0.181   <none>        8080:30080/TCP   133m
For production, create an Ingress:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui-ingress
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
    - host: open-webui.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: open-webui-service
                port:
                  number: 8080
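Assuming DNS (or a local hosts entry) points open-webui.example.com at your ingress controller, a simple smoke test could look like this, with <INGRESS_IP> standing in for the controller's external address:
curl -I -H "Host: open-webui.example.com" http://<INGRESS_IP>/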
7. Download a Large Language Model
In the Open-WebUI interface, go to Admin Panel → Settings → Model and select a model from the Ollama library, or download one directly inside the Ollama pod:
# kubectl -n ollama exec -it ollama-0 -- sh
ollama run deepseek-r1:7b
After the model is installed, you can start asking questions through the web interface.
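If you prefer to pre-pull the model without opening an interactive session, ollama pull and ollama list can be run the same way; deepseek-r1:7b here simply mirrors the example above:
kubectl -n ollama exec -it ollama-0 -- ollama pull deepseek-r1:7b
kubectl -n ollama exec -it ollama-0 -- ollama list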
Conclusion
This guide showed how to deploy large language models locally on Kubernetes, using Ollama for model serving and Open-WebUI as the user interface, to gain data privacy, low latency, and high availability.