
Deploy Large Language Models on Kubernetes with Ollama and Open-WebUI

This guide walks through deploying a local LLM on Kubernetes, using Ollama for model serving and Open-WebUI for the web interface. It covers namespace creation, storage setup, GPU support, service exposure, validation, and model download, with an eye toward data privacy, low latency, and high availability.

Instant Consumer Technology Team

Background

With the widespread adoption of large language models (LLMs) in enterprise applications, running models locally is essential for data privacy, cost control, and reduced latency. Ollama simplifies local LLM execution, while Kubernetes provides the orchestration needed for production deployment.

Deployment

1. Create Ollama namespace

apiVersion: v1
kind: Namespace
metadata:
  name: ollama

Apply the manifest:

kubectl apply -f ollama-namespace.yaml
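
If the apply succeeded, the namespace should show up as Active:

kubectl get namespace ollama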

2. Prepare storage class

Install OpenEBS local PV via Helm and set it as the default storage class.

# Add Helm repo
helm repo add openebs-localpv https://openebs.github.io/dynamic-localpv-provisioner
helm repo update
# Install
helm upgrade --install openebs-localpv openebs-localpv/localpv-provisioner \
  --namespace openebs --create-namespace \
  --set hostpathClass.basePath="/data/openebs/local" \
  --set global.imageRegistry="ccr.ccs.tencentyun.com" \
  --set localpv.image.repository="chijinjing/provisioner-localpv" \
  --set helperPod.image.repository="chijinjing/linux-utils"
# Make it default
kubectl patch storageclass openebs-hostpath -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
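
Before moving on, it is worth confirming that the provisioner is running and that openebs-hostpath is now the default class:

# Provisioner pod should be Running
kubectl -n openebs get pods
# The class name should be suffixed with "(default)"
kubectl get storageclass openebs-hostpath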

3. Deploy Ollama

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            cpu: "2000m"
            memory: "2Gi"
          limits:
            cpu: "4000m"
            memory: "4Gi"
            nvidia.com/gpu: "0"
        volumeMounts:
        - name: ollama-volume
          mountPath: /root/.ollama
        tty: true
  volumeClaimTemplates:
  - metadata:
      name: ollama-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 30Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
  namespace: ollama
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 11434
    targetPort: 11434

Apply the manifests:

kubectl apply -f ollama-statefulset.yaml
kubectl apply -f ollama-service.yaml
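
Once the ollama-0 pod reports Running, a quick sanity check is to hit the Ollama HTTP API through the service. A minimal sketch using kubectl port-forward from your workstation (the /api/tags endpoint lists models stored locally, which will be empty until step 7):

# Forward the Ollama service to localhost
kubectl -n ollama port-forward svc/ollama-service 11434:11434 &
# List locally available models
curl http://localhost:11434/api/tags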

4. Ollama with GPU

If the cluster has GPUs, add GPU requests and a node selector.

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: ollama
spec:
  serviceName: "ollama"
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          requests:
            nvidia.com/gpu: 1   # request 1 GPU
            memory: "2Gi"
            cpu: "2000m"
          limits:
            nvidia.com/gpu: 1
            memory: "4Gi"
            cpu: "4000m"
        volumeMounts:
        - name: ollama-volume
          mountPath: /root/.ollama
        tty: true
      nodeSelector:
        gputype: nvidia-tesla-t4
  volumeClaimTemplates:
  - metadata:
      name: ollama-volume
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 30Gi

NVIDIA is the most common GPU vendor, but Kubernetes also supports AMD (amd.com/gpu) and Intel (gpu.intel.com) devices through their respective device plugins.
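
The nodeSelector above only matches nodes carrying the gputype=nvidia-tesla-t4 label, so label your GPU nodes accordingly. A minimal sketch, assuming a node named gpu-node-1 (substitute your own node name) and the NVIDIA device plugin already installed:

# Label the GPU node so the StatefulSet can schedule onto it
kubectl label node gpu-node-1 gputype=nvidia-tesla-t4
# Once the pod is rescheduled, confirm the GPU is visible inside the container
# (nvidia-smi is typically injected by the NVIDIA container runtime)
kubectl -n ollama exec -it ollama-0 -- nvidia-smi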

5. Deploy Open-WebUI

apiVersion: apps/v1
kind: Deployment
metadata:
  name: open-webui-deployment
  namespace: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: open-webui
  template:
    metadata:
      labels:
        app: open-webui
    spec:
      containers:
      - name: open-webui
        image: ghcr.io/open-webui/open-webui:main
        ports:
        - containerPort: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "500Mi"
          limits:
            cpu: "1000m"
            memory: "1Gi"
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama-service.ollama.svc.cluster.local:11434"
        tty: true
        volumeMounts:
        - name: webui-volume
          mountPath: /app/backend/data
      volumes:
      - name: webui-volume
        persistentVolumeClaim:
          claimName: open-webui-pvc
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  labels:
    app: open-webui
  name: open-webui-pvc
  namespace: ollama
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: open-webui-service
  namespace: ollama
spec:
  type: NodePort
  selector:
    app: open-webui
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
    nodePort: 30080

Apply the manifests:

kubectl apply -f webui-deployment.yaml
kubectl apply -f webui-pvc.yaml
kubectl apply -f webui-service.yaml
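
The OLLAMA_BASE_URL above points at the Ollama service through its in-cluster DNS name. One way to confirm that address resolves and answers is a throwaway busybox pod, shown here as a sketch (Ollama responds to GET / with "Ollama is running"):

# One-off pod that calls the Ollama service from inside the cluster
kubectl -n ollama run ollama-check --rm -it --restart=Never --image=busybox -- \
  wget -qO- http://ollama-service.ollama.svc.cluster.local:11434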

6. Validation

Check that the pods and services are running, then access Open-WebUI via the NodePort (or, in production, via an Ingress).

# kubectl get pods -n ollama
NAME                        READY   STATUS    RESTARTS       AGE
ollama-0                    1/1     Running   0              140m
open-webui-deployment-...   1/1     Running   1 (133m ago)   135m

# kubectl get svc -n ollama
NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)          AGE
ollama-service       ClusterIP   10.96.0.85    <none>        11434/TCP        140m
open-webui-service   NodePort    10.96.0.181   <none>        8080:30080/TCP   133m

For production, create an Ingress:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: open-webui-ingress
  namespace: ollama
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx
  rules:
  - host: open-webui.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: open-webui-service
            port:
              number: 8080
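
After applying the Ingress and pointing DNS for open-webui.example.com at your ingress controller, confirm that it was admitted and picked up an address:

kubectl -n ollama get ingress open-webui-ingress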

7. Download a Large Language Model

From the Open-WebUI interface, go to Admin Panel → Settings → Models and select a model from the Ollama library, or download one directly inside the Ollama pod:

# kubectl -n ollama exec -it ollama-0 -- sh
ollama run deepseek-r1:7b

After the model is installed, you can start asking questions through the web interface.
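
Beyond the web interface, you can also confirm the model answers over the raw Ollama API. A minimal sketch against the port-forwarded service from step 3 (the prompt is arbitrary, and "stream": false returns the full answer in a single response):

# Ask the downloaded model a question via the Ollama REST API
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'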

Conclusion

This guide showed how to deploy large language models locally on Kubernetes, using Ollama for model serving and Open-WebUI as the user interface, keeping data private while delivering low latency and high availability.

Tags: Kubernetes, Large Language Model, GPU, Ollama, Open WebUI