Deploy Ollama and Open-WebUI on Kubernetes: A Step‑by‑Step Guide
This article walks through deploying the open‑source LLM serving tool Ollama and the Open‑WebUI interface on a Kubernetes cluster using Helm, covering GPU considerations, configuration files, service exposure, and model management with practical code examples.
Since the beginning of this year, interest in large language models (LLMs) and their deployment on GPU infrastructure has surged, driven by advances in artificial intelligence and machine learning that demand massive compute power. Nvidia's stock has soared, and many large models have emerged; for deploying and managing these models, Ollama and Open‑WebUI are solid choices.
Ollama is an open‑source tool for deploying machine‑learning models that simplifies managing and interacting with LLMs. It serves top‑tier open‑source models such as Llama 3, Phi 3, and Mistral, and can be thought of as a Docker‑like platform focused on machine‑learning models.
Deploying a model with Ollama is as easy as using Docker, but the CLI can be cumbersome for newcomers. To alleviate this, the Open‑WebUI project provides a friendly web interface that makes model deployment easier.
For easier management, Ollama can be deployed onto a Kubernetes cluster, which provides high availability and scalability. A cluster with GPUs is ideal, but even without one the Llama 3 model runs reasonably well on CPU.
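If your cluster does have GPU nodes, you can check whether they are visible to the scheduler. This is a sketch that assumes the NVIDIA device plugin is installed, since the `nvidia.com/gpu` resource is only advertised once the plugin is running:

```shell
# List each node with its allocatable NVIDIA GPU count.
# <none> means the node exposes no nvidia.com/gpu resource.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```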
<code>$ kubectl version
Client Version: v1.28.11
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.7
</code>Deploy Ollama to Kubernetes
The Open‑WebUI project supplies a Helm chart that simplifies installing Ollama and Open‑WebUI. Add the chart repository with:
<code>helm repo add open-webui https://helm.openwebui.com/
helm repo update
</code>The chart deploys Ollama by default; you can customize settings such as GPU usage, data persistence, and more by providing your own values file.
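Before writing the values file, you can confirm the chart is visible from the newly added repository:

```shell
# Show the chart name and available versions in the open-webui repo
helm search repo open-webui
```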
<code># myvalues.yaml
ollama:
  enabled: true  # automatically install Ollama Helm chart
  ollama:
    gpu:
      enabled: false  # set to true to use GPU
  persistentVolume:
    enabled: true
    storageClass: nfs-client  # specify storage class
pipelines:
  enabled: true
  persistence:
    enabled: true
    storageClass: "nfs-client"
service:
  type: NodePort
extraEnvVars:
  - name: HF_ENDPOINT
    value: https://hf-mirror.com
</code>In this configuration you can toggle GPU support, enable persistent storage, and set the Service type to NodePort so that Open‑WebUI can be accessed via a node's IP and port. You may also configure an Ingress if preferred.
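As a sketch, an Ingress could be enabled instead of the NodePort Service by adding something like the following to the values file. The host name and ingress class here are placeholders, and you should verify the exact key names against the chart's values.yaml for your chart version:

```yaml
# Hypothetical ingress values -- confirm key names in the chart's values.yaml
service:
  type: ClusterIP
ingress:
  enabled: true
  class: nginx
  host: open-webui.example.com
```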
Note: Open‑WebUI contacts the Hugging Face model hub by default, which is inaccessible from mainland China. Set the HF_ENDPOINT environment variable to a mirror address (e.g., https://hf-mirror.com) to avoid errors.
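After the release is installed, a quick way to confirm the variable actually reached the container is to read it from the pod's environment; a sketch, using the open-webui-0 pod name reported by `kubectl get pods -n kube-ai`:

```shell
# Print HF_ENDPOINT from inside the Open-WebUI pod
kubectl exec -n kube-ai open-webui-0 -- printenv HF_ENDPOINT
```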
Install the chart with your custom values:
<code>helm upgrade --install ollama open-webui/open-webui -f myvalues.yaml --create-namespace --namespace kube-ai
</code>After deployment, several pods run in the kube-ai namespace. Check their status with:
<code>$ kubectl get pods -n kube-ai
NAME READY STATUS RESTARTS AGE
open-webui-0 1/1 Running 0 2m11s
open-webui-ollama-944dd68fc-wxsjf 1/1 Running 0 24h
open-webui-pipelines-557f6f95cd-dfgh8 1/1 Running 0 25h
</code>Because the Service is of type NodePort, you can reach Open‑WebUI at http://NodeIP:31009:
<code>$ kubectl get svc -n kube-ai
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
open-webui NodePort 10.96.1.212 <none> 80:31009/TCP 25h
open-webui-ollama ClusterIP 10.96.2.112 <none> 11434/TCP 25h
open-webui-pipelines NodePort 10.96.2.170 <none> 9099:32322/TCP 25h
</code>Usage
Open the URL http://NodeIP:31009 in a browser and register a new account to access the Open‑WebUI homepage.
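If the node's IP is not directly reachable from your workstation, port‑forwarding the Service is a simple alternative; a sketch, where the local port 8080 is an arbitrary choice:

```shell
# Forward local port 8080 to port 80 of the open-webui Service
kubectl port-forward -n kube-ai svc/open-webui 8080:80
# Then browse to http://localhost:8080
```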
If you already have an Ollama instance elsewhere, you can add it as an external connection. Configure the Ollama address in the admin panel under Settings → External Connections, then set the Ollama API URL (the Helm deployment already provides this URL).
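The in‑cluster Ollama API URL follows the usual Service DNS pattern, http://open-webui-ollama.kube-ai.svc.cluster.local:11434 for this deployment. A throwaway pod can sanity‑check it; a sketch, where curlimages/curl is just one convenient image choice:

```shell
# Run a one-off pod that queries Ollama's version endpoint, then cleans itself up
kubectl run curl-test -n kube-ai --rm -it --restart=Never \
  --image=curlimages/curl -- curl -s http://open-webui-ollama:11434/api/version
```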
Switch to the Models tab to pull models such as llama3 from the Ollama library (https://ollama.com/library). Click the pull button, watch the download progress, and once it completes, select the model for chat interactions.
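The same pull can be done without the UI by running the ollama CLI inside the serving pod; a sketch, assuming the open-webui-ollama Deployment name implied by the pod listing above:

```shell
# Pull llama3 directly inside the Ollama container, then list local models
kubectl exec -n kube-ai deploy/open-webui-ollama -- ollama pull llama3
kubectl exec -n kube-ai deploy/open-webui-ollama -- ollama list
```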
Conclusion
This guide demonstrated how to deploy the Llama 3 model on a Kubernetes cluster using Ollama and Open‑WebUI. By containerizing and orchestrating the AI service, we achieved a scalable, maintainable environment with minimal manual intervention. The combination of Open‑WebUI’s simple UI and Kubernetes’ powerful automation positions this stack as a key solution for future AI‑driven applications.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.