
Deploy the Full‑Size DeepSeek‑R1 Model on Volcengine Cloud with Terraform and Kubernetes

This guide walks you through two practical ways to deploy the massive DeepSeek‑R1 model on Volcengine Cloud: a Terraform‑based quick two‑node GPU setup, and cloud‑native multi‑node distributed inference on Kubernetes. It covers resource sizing, environment preparation, model download, monitoring, autoscaling, and storage acceleration.

ByteDance Cloud Native

Introduction

Enterprises increasingly need private, high‑performance AI inference services; deploying the full‑size DeepSeek‑R1 model (671B parameters) on Volcengine Cloud addresses data privacy, latency, and scalability challenges.

Deployment Options

Option 1 – Terraform One‑Click Deployment: Use an IaC script to provision two 8‑GPU ECS instances, download the model over the internal network, and launch containers.

Option 2 – Cloud‑Native Multi‑Node Distributed Inference: Deploy a Kubernetes VKE cluster with LeaderWorkerSet (LWS) to manage leader and worker pods, enable RDMA networking, and use SGLang as the inference engine.

Option 1 Steps

1. Install Terraform and initialize the environment (<code>wget https://public-terraform-cn-beijing.tos-cn-beijing.volces.com/models/deepseek/DeepSeek-R1/main.tf</code>).

2. Run <code>terraform init</code>, <code>terraform plan</code>, and <code>terraform apply</code> to create the GPU resources.

3. Verify container logs for successful model loading.

4. Test the service with a curl request.
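As a sketch of that final check, assuming the SGLang container exposes an OpenAI‑compatible API (the endpoint address, port, and model name below are placeholders; take the real address from the Terraform outputs):

```shell
# Hypothetical endpoint: substitute the ECS instance address from the
# Terraform output before running.
ENDPOINT="http://127.0.0.1:30000"

# Build an OpenAI-compatible chat completion request body.
cat > request.json <<'EOF'
{
  "model": "deepseek-r1",
  "messages": [{"role": "user", "content": "Say hello."}],
  "max_tokens": 64
}
EOF

# Send the request; with the placeholder endpoint this fails until
# ENDPOINT points at the real service.
curl -s --max-time 5 "$ENDPOINT/v1/chat/completions" \
  -H "Content-Type: application/json" \
  -d @request.json || echo "request failed: update ENDPOINT first"
```

A successful response returns a JSON body with a `choices` array containing the model's reply.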

Option 2 Steps

Create a High‑Performance Computing (HPC) cluster and a VKE Kubernetes cluster (K8s 1.28, VPC‑CNI).

Install required plugins: nvidia‑device‑plugin, rdma‑device‑plugin, CSI‑TOS, prometheus‑agent.

Configure GPU and RDMA resources (e.g., ecs.ebmhpcpni3l, 8 GPU per node).
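To make the resource configuration concrete, here is a minimal pod‑spec fragment requesting GPUs and RDMA. The RDMA resource name, network annotation, and container name are assumptions based on common VKE conventions; check the resource names actually exposed by the rdma‑device‑plugin in your cluster.

<code>​# Sketch only: resource/annotation names must match your cluster's plugins.
metadata:
  annotations:
    k8s.volcengine.com/pod-networks: |
      [{"cniConf": {"name": "rdma"}}]
spec:
  containers:
  - name: sglang
    resources:
      limits:
        nvidia.com/gpu: "8"
        vke.volcengine.com/rdma: "1"
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]  # commonly required for RDMA memory registration</code>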

Deploy the LeaderWorkerSet CRD (<code>kubectl apply --server-side -f manifest.yaml</code>).

Create the SGLang inference workload using a YAML manifest that defines leader and worker templates, GPU limits, RDMA annotations, and volume mounts for the model.

Expose the service via a LoadBalancer Service.
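A minimal Service sketch for that last step, routing traffic to the LWS leader pods. The role label and serving port are assumptions and must match the leader template in your SGLang manifest; the <code>leaderworkerset.sigs.k8s.io/name</code> label is set by the LWS controller.

<code>apiVersion: v1
kind: Service
metadata:
  name: sglang-leader
spec:
  type: LoadBalancer
  selector:
    leaderworkerset.sigs.k8s.io/name: sglang
    role: leader          # assumed label on the leader template
  ports:
  - name: http
    port: 8000            # assumed SGLang serving port
    targetPort: 8000</code>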

Monitoring and Autoscaling

Enable cloud‑native observability (Prometheus‑VMP) to track GPU utilization, latency, and throughput. Create a ServiceMonitor for the inference service and configure a HorizontalPodAutoscaler that scales the LWS based on the <code>k8s_pod_gpu_prof_sm_active</code> metric.
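A minimal ServiceMonitor sketch for the scrape side; the selector labels, port name, and metrics path are assumptions that must match your inference Service definition.

<code>apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: sglang-monitor
spec:
  selector:
    matchLabels:
      app: sglang       # assumed label on the inference Service
  endpoints:
  - port: http          # assumed port name in the Service
    path: /metrics
    interval: 30s</code>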

<code>apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: sglang-hpa
spec:
  minReplicas: 1
  maxReplicas: 5
  metrics:
  - pods:
      metric:
        name: k8s_pod_gpu_prof_sm_active
      target:
        type: AverageValue
        averageValue: "0.3"
    type: Pods
  scaleTargetRef:
    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    name: sglang
  behavior:
    scaleDown:
      policies:
      - periodSeconds: 300
        type: Pods
        value: 2
      - periodSeconds: 300
        type: Percent
        value: 5
      selectPolicy: Max
      stabilizationWindowSeconds: 300
    scaleUp:
      policies:
      - periodSeconds: 15
        type: Pods
        value: 2
      - periodSeconds: 15
        type: Percent
        value: 15
      selectPolicy: Max
      stabilizationWindowSeconds: 0</code>

Storage Acceleration

Integrate Fluid + CFS Runtime to cache model files from TOS, reducing download time by ~30 %.
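As a sketch of the cache side, a Fluid Dataset can point at the model files in TOS. The bucket path is hypothetical, and the runtime kind to pair with the Dataset depends on the Fluid distribution installed by VKE.

<code>apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: deepseek-r1-model
spec:
  mounts:
  - name: model
    # Hypothetical TOS bucket path; replace with your model location.
    mountPoint: tos://your-bucket/models/DeepSeek-R1</code>

Workloads then opt into the accelerated mount via the <code>vke.volcengine.com/fluid-enable-cfs</code> label, as in the example below.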

<code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-deployment
  namespace: users-namespace
spec:
  replicas: 2
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
        vke.volcengine.com/fluid-enable-cfs: "true"
    spec:
      containers:
      - name: nginx
        image: nginx:1.14.2
        ports:
        - containerPort: 80
        volumeMounts:
        - mountPath: /path/to/mount
          name: your-volume
      volumes:
      - name: your-volume
        persistentVolumeClaim:
          claimName: your</code>

Conclusion

By combining Volcengine GPU ECS, VKE, Terraform, LWS, and Fluid, you can quickly launch a production‑grade DeepSeek‑R1 inference service, monitor its health, scale elastically, and accelerate model loading, enabling enterprise‑level AI workloads.

cloud computing · AI · Model Deployment · Kubernetes · Terraform
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
