Deploy the Full‑Size DeepSeek‑R1 Model on Volcengine Cloud with Terraform and Kubernetes
This guide walks through two practical ways to deploy the massive DeepSeek‑R1 model on Volcengine Cloud: a Terraform‑based two‑node GPU setup for quick starts, and cloud‑native multi‑node distributed inference on Kubernetes. It covers resource sizing, environment preparation, model download, monitoring, autoscaling, and storage acceleration.
Introduction
Enterprises increasingly need private, high‑performance AI inference services; deploying the full‑size DeepSeek‑R1 (671B parameters) on Volcengine Cloud addresses data privacy, latency, and scalability challenges.
Deployment Options
Option 1 – Terraform One‑Click Deployment: Use an IaC script to provision two 8‑GPU ECS instances, download the model over the internal network, and launch containers.
Option 2 – Cloud‑Native Multi‑Node Distributed Inference: Deploy a Kubernetes VKE cluster with LeaderWorkerSet (LWS) to manage leader and worker pods, enable RDMA networking, and use SGLang as the inference engine.
Option 1 Steps
Install Terraform and download the deployment template (<code>wget https://public-terraform-cn-beijing.tos-cn-beijing.volces.com/models/deepseek/DeepSeek-R1/main.tf</code>).
Run <code>terraform init</code>, <code>terraform plan</code>, and <code>terraform apply</code> to create the GPU resources.
Verify the container logs for successful model loading.
Test the service with a curl request.
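The final smoke test might look like the following. This is a hedged sketch: the host, port, endpoint path, and model name are assumptions (an OpenAI‑compatible API is typical for SGLang/vLLM‑style servers); substitute the values your deployment actually exposes.

```shell
# Hypothetical smoke test -- replace <ecs-internal-ip>, the port,
# and the model name with the values from your deployment.
curl http://<ecs-internal-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```

A JSON response containing a `choices` array indicates the model loaded and is serving requests.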
Option 2 Steps
Create a High‑Performance Computing (HPC) cluster and a VKE Kubernetes cluster (K8s 1.28, VPC‑CNI).
Install required plugins: nvidia‑device‑plugin, rdma‑device‑plugin, CSI‑TOS, prometheus‑agent.
Configure GPU and RDMA resources (e.g., ecs.ebmhpcpni3l instances with 8 GPUs per node).
Deploy the LeaderWorkerSet CRD (<code>kubectl apply --server-side -f manifest.yaml</code>).
Create the SGLang inference workload using a YAML manifest that defines leader and worker templates, GPU limits, RDMA annotations, and volume mounts for the model.
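A trimmed sketch of such a manifest is shown below. The image, PVC name, RDMA resource name, and pod‑network annotation are placeholders and assumptions, not Volcengine‑confirmed values; the LeaderWorkerSet API group matches the one referenced later by the autoscaler.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                         # leader + 1 worker, one 8-GPU node each
    leaderTemplate:
      metadata:
        annotations:
          # RDMA network annotation -- exact key depends on your VKE setup
          k8s.volcengine.com/pod-networks: "<rdma-network>"   # placeholder
      spec:
        containers:
        - name: sglang-leader
          image: <sglang-image>      # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
              vke.volcengine.com/rdma: "1"   # resource name is an assumption
          volumeMounts:
          - name: model
            mountPath: /models/deepseek-r1
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-pvc     # assumption: PVC holding the model files
    workerTemplate:
      # mirrors the leader template (same image, GPU limits,
      # RDMA annotation, and model volume), minus the serving port
      spec:
        containers:
        - name: sglang-worker
          image: <sglang-image>      # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
```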
Expose the service via a LoadBalancer Service.
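A minimal Service for this, assuming the LWS leader pods carry the standard `leaderworkerset.sigs.k8s.io` labels and the engine listens on port 8000 (both assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    # leader pods have worker-index 0 in a LeaderWorkerSet group
    leaderworkerset.sigs.k8s.io/worker-index: "0"
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
```

Only the leader needs to be exposed; it fans requests out to the workers over the RDMA network.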
Monitoring and Autoscaling
Enable cloud‑native observability (Prometheus‑VMP) to track GPU utilization, latency, and throughput. Create a ServiceMonitor for the inference service and configure a HorizontalPodAutoscaler that scales the LWS based on the <code>k8s_pod_gpu_prof_sm_active</code> metric.
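A ServiceMonitor sketch is shown below, using the standard Prometheus Operator CRD that VMP-compatible agents consume; the port name, label selector, and scrape interval are assumptions to adapt to your Service definition.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-monitor
spec:
  selector:
    matchLabels:
      app: sglang          # assumption: label on the inference Service
  endpoints:
  - port: metrics          # assumption: named metrics port on the Service
    path: /metrics
    interval: 30s
```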
<code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: k8s_pod_gpu_prof_sm_active
      target:
        type: AverageValue
        averageValue: "0.3"
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: sglang
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Max
      policies:
      - type: Pods
        value: 2
        periodSeconds: 300
      - type: Percent
        value: 5
        periodSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
      - type: Pods
        value: 2
        periodSeconds: 15
      - type: Percent
        value: 15
        periodSeconds: 15</code>

Storage Acceleration
Integrate Fluid + CFS Runtime to cache model files from TOS, reducing download time by ~30%.
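The TOS source is described to Fluid with a Dataset object. The sketch below uses the open‑source Fluid Dataset CRD; the `tos://` mount scheme and bucket path are assumptions, and on VKE the runtime binding is handled by the managed Fluid/CFS integration rather than a hand‑written runtime object, so check the platform documentation for the exact form.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-r1
spec:
  mounts:
  - name: model
    # assumption: TOS bucket and path holding the model files
    mountPoint: tos://<bucket>/models/DeepSeek-R1
```

Workloads then consume the cached data through a PVC, opting in via the `vke.volcengine.com/fluid-enable-cfs: "true"` pod label as in the Deployment below.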
<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: users-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        vke.volcengine.com/fluid-enable-cfs: "true"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /path/to/mount
          name: your-volume
      volumes:
      - name: your-volume
        persistentVolumeClaim:
          claimName: your</code>

Conclusion
By combining Volcengine GPU ECS, VKE, Terraform, LWS, and Fluid, you can quickly launch a production‑grade DeepSeek‑R1 inference service, monitor its health, scale elastically, and accelerate model loading, enabling enterprise‑level AI workloads.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.