Deploy the Full‑Size DeepSeek‑R1 Model on Volcengine Cloud with Terraform and Kubernetes
This guide walks through two practical ways to deploy the massive DeepSeek‑R1 model on Volcengine Cloud: a Terraform‑based two‑node GPU setup for quick starts, and cloud‑native multi‑node distributed inference on Kubernetes. It covers resource sizing, environment preparation, model download, monitoring, autoscaling, and storage acceleration.
Introduction
Enterprises increasingly need private, high‑performance AI inference services; deploying the full‑size DeepSeek‑R1 (671B parameters) on Volcengine Cloud addresses data privacy, latency, and scalability challenges.
Deployment Options
Option 1 – Terraform One‑Click Deployment: Use an IaC script to provision two 8‑GPU ECS instances, download the model over the internal network, and launch containers.
Option 2 – Cloud‑Native Multi‑Node Distributed Inference: Deploy a Kubernetes VKE cluster with LeaderWorkerSet (LWS) to manage leader and worker pods, enable RDMA networking, and use SGLang as the inference engine.
Option 1 Steps
Install Terraform and download the deployment template (<code>wget https://public-terraform-cn-beijing.tos-cn-beijing.volces.com/models/deepseek/DeepSeek-R1/main.tf</code>).
Run <code>terraform init</code>, <code>terraform plan</code>, and <code>terraform apply</code> to create the GPU resources.
Verify the container logs for successful model loading.
Test the service with a curl request.
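The final smoke test might look like the following. This is a hedged sketch: the host, port, endpoint path, and model name are assumptions (an OpenAI‑compatible API is typical for SGLang/vLLM‑style servers); substitute the values your deployment actually exposes.

```shell
# Hypothetical smoke test -- replace <ecs-internal-ip>, the port,
# and the model name with the values from your deployment.
curl http://<ecs-internal-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1",
        "messages": [{"role": "user", "content": "Hello"}],
        "max_tokens": 32
      }'
```

A JSON response containing a `choices` array indicates the model loaded and is serving requests.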
Option 2 Steps
Create a High‑Performance Computing (HPC) cluster and a VKE Kubernetes cluster (K8s 1.28, VPC‑CNI).
Install required plugins: nvidia‑device‑plugin, rdma‑device‑plugin, CSI‑TOS, prometheus‑agent.
Configure GPU and RDMA resources (e.g., ecs.ebmhpcpni3l instances with 8 GPUs per node).
Deploy the LeaderWorkerSet CRD (<code>kubectl apply --server-side -f manifest.yaml</code>).
Create the SGLang inference workload using a YAML manifest that defines leader and worker templates, GPU limits, RDMA annotations, and volume mounts for the model.
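A trimmed sketch of such a manifest is shown below. The image, PVC name, RDMA resource name, and pod‑network annotation are placeholders and assumptions, not Volcengine‑confirmed values; the LeaderWorkerSet API group matches the one referenced later by the autoscaler.

```yaml
apiVersion: leaderworkerset.x-k8s.io/v1
kind: LeaderWorkerSet
metadata:
  name: sglang
spec:
  replicas: 1
  leaderWorkerTemplate:
    size: 2                         # leader + 1 worker, one 8-GPU node each
    leaderTemplate:
      metadata:
        annotations:
          # RDMA network annotation -- exact key depends on your VKE setup
          k8s.volcengine.com/pod-networks: "<rdma-network>"   # placeholder
      spec:
        containers:
        - name: sglang-leader
          image: <sglang-image>      # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
              vke.volcengine.com/rdma: "1"   # resource name is an assumption
          volumeMounts:
          - name: model
            mountPath: /models/deepseek-r1
        volumes:
        - name: model
          persistentVolumeClaim:
            claimName: model-pvc     # assumption: PVC holding the model files
    workerTemplate:
      # mirrors the leader template (same image, GPU limits,
      # RDMA annotation, and model volume), minus the serving port
      spec:
        containers:
        - name: sglang-worker
          image: <sglang-image>      # placeholder
          resources:
            limits:
              nvidia.com/gpu: "8"
```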
Expose the service via a LoadBalancer Service.
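A minimal Service for this, assuming the LWS leader pods carry the standard `leaderworkerset.sigs.k8s.io` labels and the engine listens on port 8000 (both assumptions):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    # leader pods have worker-index 0 in a LeaderWorkerSet group
    leaderworkerset.sigs.k8s.io/worker-index: "0"
  ports:
  - port: 8000
    targetPort: 8000
    protocol: TCP
```

Only the leader needs to be exposed; it fans requests out to the workers over the RDMA network.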
Monitoring and Autoscaling
Enable cloud‑native observability (Prometheus‑VMP) to track GPU utilization, latency, and throughput. Create a ServiceMonitor for the inference service and configure a HorizontalPodAutoscaler that scales the LWS based on the <code>k8s_pod_gpu_prof_sm_active</code> metric.
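A ServiceMonitor sketch is shown below, using the standard Prometheus Operator CRD that VMP-compatible agents consume; the port name, label selector, and scrape interval are assumptions to adapt to your Service definition.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-monitor
spec:
  selector:
    matchLabels:
      app: sglang          # assumption: label on the inference Service
  endpoints:
  - port: metrics          # assumption: named metrics port on the Service
    path: /metrics
    interval: 30s
```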
<code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - type: Pods
    pods:
      metric:
        name: k8s_pod_gpu_prof_sm_active
      target:
        type: AverageValue
        averageValue: "0.3"
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: sglang
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      selectPolicy: Max
      policies:
      - type: Pods
        value: 2
        periodSeconds: 300
      - type: Percent
        value: 5
        periodSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
      selectPolicy: Max
      policies:
      - type: Pods
        value: 2
        periodSeconds: 15
      - type: Percent
        value: 15
        periodSeconds: 15</code>

Storage Acceleration
Integrate Fluid + CFS Runtime to cache model files from TOS, reducing download time by ~30%.
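The TOS source is described to Fluid with a Dataset object. The sketch below uses the open‑source Fluid Dataset CRD; the `tos://` mount scheme and bucket path are assumptions, and on VKE the runtime binding is handled by the managed Fluid/CFS integration rather than a hand‑written runtime object, so check the platform documentation for the exact form.

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-r1
spec:
  mounts:
  - name: model
    # assumption: TOS bucket and path holding the model files
    mountPoint: tos://<bucket>/models/DeepSeek-R1
```

Workloads then consume the cached data through a PVC, opting in via the `vke.volcengine.com/fluid-enable-cfs: "true"` pod label as in the Deployment below.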
<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: users-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        vke.volcengine.com/fluid-enable-cfs: "true"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /path/to/mount
          name: your-volume
      volumes:
      - name: your-volume
        persistentVolumeClaim:
          claimName: your</code>

Conclusion
By combining Volcengine GPU ECS, VKE, Terraform, LWS, and Fluid, you can quickly launch a production‑grade DeepSeek‑R1 inference service, monitor its health, scale elastically, and accelerate model loading, enabling enterprise‑level AI workloads.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.