
How to Deploy DeepSeek‑R1 671B on AIBrix: Multi‑Node GPU Inference in Hours

This guide explains how to deploy the massive DeepSeek‑R1 671B model across multiple GPU nodes with the AIBrix distributed inference platform. It covers cluster setup, custom vLLM images, storage options, RDMA networking, autoscaling, request handling, and observability, turning a weeks‑long deployment into an hour‑scale process.

ByteDance Cloud Native

DeepSeek‑R1 671B demonstrates strong logical reasoning with 671 billion total parameters, 37 billion parameters activated per token, and a 128K context window, but its size creates demanding deployment challenges.

AIBrix provides container‑orchestrated solutions for multi‑node GPU resource allocation, seamless distributed inference management, RDMA‑based high‑performance networking, and automated elastic scaling, reducing deployment time from weeks to hours.

1. Prerequisites

Download the model weights to object storage or a shared filesystem and prepare a custom container image. The example cluster on Volcano Engine uses two <code>ecs.ebmhpcpni3l.48xlarge</code> instances, each with 8 × 96 GB GPUs, 192 vCPUs, 2048 GiB of RAM, 8 × 400 Gbps RDMA, and local NVMe disks.

1.1 Cluster Configuration

Cloud platform: Volcano Engine

Instance: ecs.ebmhpcpni3l.48xlarge × 2

CPU: 192 vCPU

Memory: 2048 GiB DRAM

GPU: 96 GB × 8

Network: 400 Gbps × 8 RDMA + 96 Gbps

Disk: NVMe 3576 GiB × 4

1.2 vLLM Image

Use the custom image <code>aibrix/vllm-openai:v0.7.3.self.post1</code>. It upgrades <code>nvidia-nccl-cu12==2.25.1</code> to fix NCCL hangs and reinstalls <code>ray[default,adag]==2.40.0</code> to address a Ray regression.

<code>FROM vllm/vllm-openai:v0.7.3
# Reinstall Ray to address a regression in the bundled version
RUN pip3 install -U "ray[default,adag]==2.40.0"
# Pin NCCL to the version that fixes multi-node hangs
RUN pip3 install -U nvidia-nccl-cu12==2.25.1
ENTRYPOINT [""]
</code>
For users in China, prepend <code>aibrix-container-registry-cn-beijing.cr.volces.com/</code> to the image name.

1.3 Model Weights

Four storage options are available: downloading directly from HuggingFace (not recommended for a model of this size), Persistent Volumes backed by a cloud CSI driver, object storage (e.g., S3, GCS), and local NVMe disks.
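If you choose the Persistent Volume route, the weights can be mounted read-only into every pod through a PVC. A minimal sketch, assuming your cloud's CSI driver supports ReadOnlyMany; the claim name and storage class are placeholders, and the size leaves headroom over the roughly 700 GB of FP8 weights:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: deepseek-r1-weights        # hypothetical name
  namespace: default
spec:
  accessModes:
  - ReadOnlyMany                   # head and worker pods all read the same weights
  storageClassName: your-csi-storage-class   # placeholder for your CSI driver
  resources:
    requests:
      storage: 1Ti                 # R1 671B FP8 weights are ~700 GB; leave headroom
```

Mount the claim at the model path that the vLLM launch command expects, so every pod in the Ray cluster sees identical weights.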

1.4 High‑Performance Network

Configure the pod annotation <code>k8s.volcengine.com/pod-networks</code> with the RDMA CNI and request <code>vke.volcengine.com/rdma: "8"</code> devices. Add the <code>IPC_LOCK</code> capability in the container security context.

<code>k8s.volcengine.com/pod-networks: |
  [
    {"cniConf":{"name":"rdma"}},
    ...
  ]
securityContext:
  capabilities:
    add:
    - IPC_LOCK
</code>
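Putting the pieces together, the container spec requests the GPUs and RDMA devices alongside the capability above. A sketch under the following assumptions: the RDMA resource name follows the VKE convention shown earlier, the GPU resource name is the standard NVIDIA device-plugin name, and the container name is hypothetical:

```yaml
containers:
- name: vllm-worker                     # hypothetical name
  securityContext:
    capabilities:
      add:
      - IPC_LOCK                        # required for RDMA memory registration
  resources:
    limits:
      nvidia.com/gpu: "8"               # all 8 GPUs on the node
      vke.volcengine.com/rdma: "8"      # all 8 RDMA NICs
```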

2. Component Installation

Install AIBrix v0.2.1 core and dependencies:

<code>kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-dependency-v0.2.1.yaml
kubectl create -f https://github.com/vllm-project/aibrix/releases/download/v0.2.1/aibrix-core-v0.2.1.yaml
</code>

3. How AIBrix Supports DeepSeek‑R1

AIBrix orchestrates <code>RayClusterFleet</code>, <code>Gateway-Plugin</code>, and <code>Autoscaler</code> components to manage distributed inference, route traffic to the head node, and autoscale based on pod metrics.
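The <code>RayClusterFleet</code> resource wraps a KubeRay-style cluster spec, so one fleet replica corresponds to one multi-node model replica. An illustrative skeleton only, assuming the AIBrix v0.2.x CRD group and label convention; consult the release manifests for the exact schema:

```yaml
apiVersion: orchestration.aibrix.ai/v1alpha1   # AIBrix CRD group (assumed from v0.2.x)
kind: RayClusterFleet
metadata:
  name: deepseek-r1-671b
  labels:
    model.aibrix.ai/name: deepseek-r1-671b     # the gateway routes requests by this label
spec:
  replicas: 1                   # one Ray cluster = one full model replica (2 nodes)
  clusterSpec:                  # embedded KubeRay RayCluster spec
    headGroupSpec:
      template: {}              # head pod: Ray head + vLLM API server (elided)
    workerGroupSpecs:
    - replicas: 1               # second node of the two-node deployment
      template: {}              # worker pod: Ray worker (elided)
```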

4. Model Deployment

Apply the runtime and autoscaling manifests:

<code>kubectl apply -f deepseek-r1-ai-runtime.yaml
kubectl apply -f deepseek-r1-autoscaling.yaml
</code>

Verify that the head pod (e.g., <code>deepseek-r1-671b-...-head-...</code>) and the worker pods are all in the Running state.

5. Sending Requests

Expose the endpoint via LoadBalancer or port‑forwarding and send a chat completion request:

<code># LoadBalancer
LB_IP=$(kubectl get svc/envoy-aibrix-system-aibrix-eg-903790dc -n envoy-gateway-system -o=jsonpath='{.status.loadBalancer.ingress[0].ip}')
ENDPOINT="${LB_IP}:80"

# Port-forward (no LB)
kubectl -n envoy-gateway-system port-forward service/envoy-aibrix-system-aibrix-eg-903790dc 8888:80 &
ENDPOINT="localhost:8888"

curl http://${ENDPOINT}/v1/chat/completions \
    -H "Content-Type: application/json" -H "routing-strategy: least-request" \
    -d '{"model":"deepseek-r1-671b","messages":[{"role":"system","content":"You are a helpful assistant."},{"role":"user","content":"Who won the world series in 2020?"}]}'
</code>
Remove the routing-strategy header to use the default Kubernetes routing.
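The same request can be issued from Python using only the standard library. A sketch assuming the endpoint from the port-forward above; the <code>build_chat_request</code> helper is ours, not part of AIBrix:

```python
import json
import urllib.request

def build_chat_request(model, messages, routing_strategy=None):
    """Build headers and body for AIBrix's OpenAI-compatible chat endpoint.

    routing_strategy is sent as a header; pass None to use default routing.
    """
    headers = {"Content-Type": "application/json"}
    if routing_strategy:
        headers["routing-strategy"] = routing_strategy
    body = {"model": model, "messages": messages}
    return headers, json.dumps(body).encode()

if __name__ == "__main__":
    # Placeholder endpoint: the LoadBalancer IP or the local port-forward address.
    endpoint = "localhost:8888"
    headers, data = build_chat_request(
        "deepseek-r1-671b",
        [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"},
        ],
        routing_strategy="least-request",
    )
    req = urllib.request.Request(
        f"http://{endpoint}/v1/chat/completions", data=data, headers=headers
    )
    with urllib.request.urlopen(req) as resp:
        print(json.loads(resp.read())["choices"][0]["message"]["content"])
```

Keeping payload construction in a helper makes it easy to toggle routing strategies per request without touching the transport code.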

6. Observability

Deploy a <code>ServiceMonitor</code> to collect metrics from the Ray head pod:

<code>apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: deepseek-r1-svc-discover
  namespace: default
  labels:
    volcengine.vmp: "true"
spec:
  endpoints:
  - port: service
  namespaceSelector:
    matchNames:
    - default
  selector:
    matchLabels:
      ray.io/node-type: head
</code>

Import the provided Grafana dashboard (link in the original article) to visualize model performance.
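When debugging scrape issues, it helps to fetch the head pod's metrics endpoint directly and pick out a gauge by hand. A small parser sketch for the Prometheus text exposition format; <code>vllm:num_requests_running</code> is a standard vLLM gauge, but names can vary between vLLM versions:

```python
def parse_metric(exposition_text, metric_name):
    """Return (labels, value) pairs for one metric in Prometheus text format."""
    samples = []
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):      # skip blank/HELP/TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.split("{")[0] == metric_name:
            labels = ""
            if "{" in name_and_labels:
                labels = name_and_labels[name_and_labels.index("{"):]
            samples.append((labels, float(value)))
    return samples

sample = """# HELP vllm:num_requests_running Number of requests currently running.
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="deepseek-r1-671b"} 3.0
"""
print(parse_metric(sample, "vllm:num_requests_running"))
```

If this returns samples but Grafana shows nothing, the problem is in the ServiceMonitor selector rather than in vLLM.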

7. Further Help

For questions, join the AIBrix Slack channel.
