Deploy Massive LLMs on Kubernetes: Step‑by‑Step Guide for Ollama and DeepSeek‑R1
This guide explains how to serve large language models on a Kubernetes 1.30 cluster using Ollama and DeepSeek‑R1, covering hardware requirements, image pulling, PVC and deployment manifests, service exposure, verification steps, API access, and monitoring with Prometheus and Grafana.
1. Overview
Ollama is a tool for running large language models (LLMs) locally, serving open‑weight models such as Llama, Mistral, and DeepSeek‑R1 for private, on‑prem inference. DeepSeek‑R1 is DeepSeek's reasoning‑focused LLM, released in sizes up to 671B parameters.
Private deployment: run models on‑premises to avoid data leakage.
Multi‑model support: works with a range of pretrained LLMs.
Efficiency: leverages local compute for low‑latency inference.
Easy integration: simple HTTP API for embedding into applications (see the example below).
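As a taste of that API, the upstream Ollama server (listening on port 11434 by default) exposes a documented /api/generate endpoint; your packaged image may differ, so treat this call as illustrative:
curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1",
  "prompt": "Summarize Kubernetes in one sentence.",
  "stream": false
}'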
2. Prerequisites
Hardware
CPU: at least 32 vCPUs (64 vCPUs recommended)
Memory: at least 128 GB (256 GB recommended)
Storage: each node should have ≥ 1 TB SSD
Network
Internal bandwidth ≥ 10 Gbps to avoid bottlenecks during distributed inference.
Kubernetes cluster
Version: 1.30
Multi‑node cluster for scheduling and scaling.
Enable GPU resources if acceleration is needed (see the snippet below).
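GPU scheduling requires a device plugin on the GPU nodes (e.g., NVIDIA's k8s-device-plugin; an assumption about your hardware). Pods then request GPUs in their resource limits, alongside the CPU/memory requests used in the manifests below:
resources:
  limits:
    nvidia.com/gpu: 1  # schedulable only once the device plugin advertises GPUs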
3. Pull Docker images
ctr -n=k8s.io images pull ollama/ollama-model:latest
ctr -n=k8s.io images pull deepseek-r1/deepseek-r1-model:671b
4. Kubernetes manifests
Persistent Volume Claim (PVC)
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Ti  # adjust to model size
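Whether the claim binds immediately depends on your StorageClass (with volumeBindingMode: WaitForFirstConsumer it stays Pending until a pod mounts it), so a quick status check is worthwhile:
kubectl get pvc model-pvc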
Ollama Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama-model:latest
          resources:
            requests:
              memory: "64Gi"
              cpu: "16"
            limits:
              memory: "128Gi"
              cpu: "32"
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: model-storage
              mountPath: /mnt/models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
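Large models can take minutes to load, so it often helps to gate traffic behind probes. A minimal sketch for the ollama container (the health path and timings are assumptions to tune for your image):
          startupProbe:
            httpGet:
              path: /            # assumed health endpoint; adjust to your image
              port: 8080
            failureThreshold: 60 # tolerate up to ~10 minutes of model loading
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /
              port: 8080
            periodSeconds: 10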
DeepSeek‑R1 Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      containers:
        - name: deepseek-r1
          image: deepseek-r1/deepseek-r1-model:671b
          resources:
            requests:
              memory: "128Gi"
              cpu: "32"
            limits:
              memory: "256Gi"
              cpu: "64"
          ports:
            - containerPort: 8081
          volumeMounts:
            - name: model-storage
              mountPath: /mnt/models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: model-pvc
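Both Deployments mount the same ReadWriteOnce claim, which only works while both pods land on the same node. If that is not guaranteed in your cluster, give each model its own claim, for example (a hypothetical second claim; point the DeepSeek‑R1 volume at it instead):
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 2Ti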
Service manifests
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
  type: LoadBalancer

apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-service
spec:
  selector:
    app: deepseek-r1
  ports:
    - protocol: TCP
      port: 8081
      targetPort: 8081
  type: LoadBalancer
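If your cluster has no load‑balancer integration (for example, bare metal without MetalLB), these services will stay in <pending>. A NodePort service is a common fallback; a sketch for Ollama, where nodePort 30000 is an arbitrary pick from the 30000–32767 range:
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080
      nodePort: 30000
  type: NodePort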
5. Deploy to the cluster
kubectl apply -f model-pvc.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f deepseek-r1-deployment.yaml
kubectl apply -f ollama-service.yaml
kubectl apply -f deepseek-r1-service.yaml
6. Verify deployment
Check pod status: kubectl get pods and kubectl describe pod <pod-name>.
Check service status: kubectl get svc and confirm each LoadBalancer service has an external IP.
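For a quicker signal while the pods come up, you can also wait on the rollout and tail the logs (standard kubectl, shown for the Ollama Deployment):
kubectl rollout status deployment/ollama-model
kubectl logs deployment/ollama-model -f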
7. Access the models
Obtain the external IP of each service from kubectl get svc and call the model APIs on the service ports (8080 for Ollama, 8081 for DeepSeek‑R1), or on the node ports (e.g., 30000/30001) if you used the NodePort fallback above. Example for Ollama:
curl -X POST http://<external-ip>:8080/inference -d '{"input": "your text input"}'
8. Monitoring (optional)
Deploy Prometheus to collect cluster and pod metrics.
Configure Grafana dashboards to visualize inference latency, throughput, and resource usage.
Set up Alertmanager alerts for conditions such as memory usage exceeding defined thresholds.
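As a concrete sketch of that last point, if Prometheus runs via the Prometheus Operator / kube-prometheus-stack (an assumption; adapt to your setup), a memory alert for the DeepSeek‑R1 pod might look like:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: model-memory-alerts
spec:
  groups:
    - name: model-memory
      rules:
        - alert: ModelPodHighMemory
          # cAdvisor working-set bytes; 230Gi is roughly 90% of the 256Gi limit
          expr: max by (pod) (container_memory_working_set_bytes{pod=~"deepseek-r1-model.*", container!=""}) > 230 * 1024^3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "DeepSeek-R1 pod memory above ~90% of its limit"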