
Deploy Massive LLMs on Kubernetes: Step‑by‑Step Guide for Ollama and DeepSeek‑R1

This guide explains how to deploy Ollama and the DeepSeek‑R1 model on a Kubernetes 1.30 cluster, covering hardware requirements, PVC and deployment manifests, service exposure, image pulls, verification steps, API access, and monitoring with Prometheus and Grafana.


1. Overview

Ollama is a tool for running large language models (LLMs) such as Llama, Mistral, and DeepSeek‑R1 locally, providing private, on‑prem inference. DeepSeek‑R1 is a reasoning‑focused LLM suited to working through complex queries over large text corpora.

Private deployment: run models on‑premises to avoid data leakage.

Multi‑model support: works with various pretrained LLMs.

Efficiency: leverages local compute for faster inference.

Easy integration: simple API for embedding into applications.

2. Prerequisites

Hardware

CPU: at least 32 vCPUs (64 vCPUs recommended)

Memory: at least 128 GB (256 GB recommended)

Storage: ≥ 1 TB SSD per node

Network

Internal bandwidth ≥ 10 Gbps to avoid bottlenecks during distributed inference.

Kubernetes cluster

Version: 1.30

Multi‑node cluster for scheduling and scaling.

Enable GPU resources if acceleration is needed.
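
If GPU acceleration is used, install the matching device plugin first (for NVIDIA hardware, the NVIDIA device plugin DaemonSet) and request GPUs in the container spec. A minimal sketch of the resources fragment, assuming NVIDIA GPUs and the plugin already running:

resources:
  limits:
    nvidia.com/gpu: 1   # GPUs are requested via limits and cannot be overcommitted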

3. Pull Docker images

ctr resolves only fully qualified image references, so include the registry host:

ctr -n=k8s.io images pull docker.io/ollama/ollama-model:latest
ctr -n=k8s.io images pull docker.io/deepseek-r1/deepseek-r1-model:671b
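
These pulls only populate the node where they run, so repeat them on every worker node that may host the pods. To confirm an image is present, assuming crictl is configured against containerd on the node:

crictl images | grep -E 'ollama|deepseek'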

4. Kubernetes manifests

Persistent Volume Claim (PVC)

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 2Ti   # adjust to model size
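
Both deployments below mount this same claim. With ReadWriteOnce the two pods can only run if scheduled onto the same node; if they may land on different nodes, use a ReadWriteMany‑capable storage class or give each model its own PVC. A hedged RWX variant of the spec (the class name is illustrative):

spec:
  accessModes:
  - ReadWriteMany                # requires RWX-capable storage (e.g., NFS, CephFS)
  storageClassName: nfs-client   # illustrative; substitute your cluster's class
  resources:
    requests:
      storage: 2Ti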

Ollama Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama-model:latest
        resources:
          requests:
            memory: "64Gi"
            cpu: "16"
          limits:
            memory: "128Gi"
            cpu: "32"
        ports:
        - containerPort: 8080
        volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
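
The manifest ships no health checks, and large models can take minutes to load into memory. A readiness probe keeps the Service from routing traffic before the server is up; a TCP‑based sketch (the delay values are illustrative):

readinessProbe:
  tcpSocket:
    port: 8080
  initialDelaySeconds: 60   # give the model time to load
  periodSeconds: 10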

DeepSeek‑R1 Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: deepseek-r1-model
spec:
  replicas: 1
  selector:
    matchLabels:
      app: deepseek-r1
  template:
    metadata:
      labels:
        app: deepseek-r1
    spec:
      containers:
      - name: deepseek-r1
        image: deepseek-r1/deepseek-r1-model:671b
        resources:
          requests:
            memory: "128Gi"
            cpu: "32"
          limits:
            memory: "256Gi"
            cpu: "64"
        ports:
        - containerPort: 8081
        volumeMounts:
        - name: model-storage
          mountPath: /mnt/models
      volumes:
      - name: model-storage
        persistentVolumeClaim:
          claimName: model-pvc
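
With requests of 128 GiB of memory and 32 CPUs, this pod fits only on the largest nodes. One way to steer it explicitly is a nodeSelector on a label you apply yourself; a sketch with an illustrative label (add it to .spec.template.spec above):

# first: kubectl label node <node-name> workload=llm
nodeSelector:
  workload: llm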

Service manifests

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
  type: LoadBalancer
---
apiVersion: v1
kind: Service
metadata:
  name: deepseek-r1-service
spec:
  selector:
    app: deepseek-r1
  ports:
  - protocol: TCP
    port: 8081
    targetPort: 8081
  type: LoadBalancer
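
On clusters without a cloud load‑balancer integration (e.g., bare metal), a LoadBalancer service stays <pending> indefinitely; a NodePort service is the usual fallback. A sketch for Ollama, with an illustrative node port (it must fall in the cluster's NodePort range, 30000–32767 by default):

apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  type: NodePort
  ports:
  - protocol: TCP
    port: 8080
    targetPort: 8080
    nodePort: 30000   # reachable at http://<any-node-ip>:30000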

5. Deploy to the cluster

kubectl apply -f model-pvc.yaml
kubectl apply -f ollama-deployment.yaml
kubectl apply -f deepseek-r1-deployment.yaml
kubectl apply -f ollama-service.yaml
kubectl apply -f deepseek-r1-service.yaml

6. Verify deployment

Check pod status: kubectl get pods and kubectl describe pod <pod-name>

Check service status: kubectl get svc and confirm each LoadBalancer service has an external IP.
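
A quick end‑to‑end check using the names defined above:

kubectl rollout status deployment/ollama-model
kubectl rollout status deployment/deepseek-r1-model
kubectl get svc ollama-service deepseek-r1-service   # EXTERNAL-IP should leave <pending>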

7. Access the models

Obtain the external IP of each LoadBalancer service and call the model APIs on the service ports defined above: 8080 for Ollama and 8081 for DeepSeek‑R1. (With the NodePort variant sketched earlier, use <node-ip>:30000 and <node-ip>:30001 instead.)

curl -X POST http://<ollama-external-ip>:8080/inference -d '{"input": "your text input"}'
curl -X POST http://<deepseek-external-ip>:8081/inference -d '{"input": "your text input"}'
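
The /inference endpoint belongs to the custom images used throughout this guide. If you run the upstream ollama/ollama image instead, note that it listens on port 11434 and exposes endpoints such as /api/generate; an example against that API (the model tag is illustrative):

curl http://<external-ip>:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "your text input",
  "stream": false
}'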

8. Monitoring (optional)

Deploy Prometheus to collect cluster and pod metrics.

Configure Grafana dashboards to visualize inference latency, throughput, and resource usage.

Set up Alertmanager alerts for conditions such as memory usage exceeding defined thresholds.
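
As a concrete starting point, a PrometheusRule sketch for that memory alert, assuming the Prometheus Operator (e.g., kube-prometheus-stack) is installed; names and the threshold are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: llm-memory-alerts
spec:
  groups:
  - name: llm.rules
    rules:
    - alert: LLMPodMemoryHigh
      # Fires when a model pod's working-set memory stays above 90% of its limit
      expr: |
        container_memory_working_set_bytes{pod=~"(ollama|deepseek-r1)-model.*", container!=""}
          / on(namespace, pod, container) kube_pod_container_resource_limits{resource="memory"}
          > 0.9
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "Pod {{ $labels.pod }} memory is above 90% of its limit"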

Tags: AI, Kubernetes, DeepSeek, large language model, Ollama
Written by

Full-Stack DevOps & Kubernetes

Focused on sharing DevOps, Kubernetes, Linux, Docker, Istio, microservices, Spring Cloud, Python, Go, databases, Nginx, Tomcat, cloud computing, and related technologies.
