
Deploy DeepSeek‑R1‑Distill on Volcengine CPU Cloud for Low‑Cost AI Inference

This guide walks you through deploying the DeepSeek‑R1‑Distill model on Volcengine CPU ECS instances, covering use‑case scenarios, recommended server types, Docker setup, environment configuration, and verification steps to achieve cost‑effective, high‑compatibility AI inference.

ByteDance Cloud Native

This is the third part of the "Cloud Practice" series, presenting a solution for deploying the DeepSeek‑R1‑Distill model service on Volcengine CPU cloud servers, with advantages in cost, compatibility, maintenance, scalability, and energy consumption.

CPU deployment fits several scenarios:

- Personal trials: modest AI performance needs, where the lower cost of CPU instances is sufficient for a typical experience.

- Enterprise API debugging: CPU deployment avoids GPU driver and CUDA compatibility issues, reducing development and management costs.

- Lightweight model workloads: small-scale tasks (low-frequency calls, small data batches) can be handled by multi-core CPUs, e.g., an internal knowledge-base Q&A system.

Testing on an ecs.c3il.8xlarge instance shows a throughput of 14 tokens/s with bf16 precision, meeting normal usage requirements.
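As a rough sanity check, throughput can be computed from the `completion_tokens` count that OpenAI-compatible servers return in the `usage` field of each response. A minimal sketch (the 14 tokens/s figure above is the measured result quoted in this guide, not reproduced here):

```python
def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Single-request generation throughput: tokens produced per second."""
    return completion_tokens / elapsed_s

# Example: a request that reports 252 completion tokens and takes 18 s
# end to end works out to 14 tokens/s, in line with the figure above.
print(tokens_per_second(252, 18.0))  # → 14.0
```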

Deployment Overview

We recommend different Volcengine CPU ECS types for various model sizes; as a baseline, instance memory must comfortably exceed the memory footprint of the model weights.

[Image: Deployment configuration]
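The sizing rule can be sketched as parameter count times bytes per parameter. A minimal back-of-the-envelope estimate (my own approximation, not taken from Volcengine's sizing table):

```python
def model_memory_gib(params_billions: float, dtype: str = "bf16") -> float:
    """Approximate memory for model weights alone; the runtime also needs
    headroom for the KV cache, activations, and framework overhead."""
    bytes_per_param = {"fp32": 4, "bf16": 2, "fp16": 2, "int8": 1}[dtype]
    return params_billions * 1e9 * bytes_per_param / 2**30

# DeepSeek-R1-Distill-Qwen-7B in bf16 needs roughly 13 GiB for weights alone,
# so pick an instance with comfortably more RAM than that.
print(round(model_memory_gib(7, "bf16"), 1))  # → 13.0
```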

Step 1: Create ECS Instance

Log in to the Volcengine ECS console ( https://console.volcengine.com/ecs ), select a region and availability zone, choose an appropriate instance type, and configure storage. This example uses an instance in the Shanghai region.

Step 2: Deploy Docker Environment and Enable the Model

Install Docker on the instance:

<code>sudo apt update</code>
<code>sudo apt install -y docker.io</code>
<code>docker --version  # verify the installation</code>

Run the Docker container with the DeepSeek‑R1‑Distill model:

<code>docker run -d --network host --privileged --shm-size 15g \
  -v /data00/models:/data00/models \
  -e MODEL_PATH=/data00/models \
  -e PORT=8000 \
  -e MODEL_NAME=DeepSeek-R1-Distill-Qwen-7B \
  -e DTYPE=bf16 \
  -e KV_CACHE_DTYPE=fp16 \
  ai-containers-cn-shanghai.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas \
  bash /llama2/entrypoint.sh</code>

For the Beijing region, use the image <code>ai-containers-cn-beijing.cr.volces.com/deeplearning/xft-vllm:1.8.2.iaas</code> instead.

Environment variable details are illustrated below:

[Image: Environment variables]

Step 3: Test and Verify

Execute a curl request to confirm the service is running:

<code>curl http://127.0.0.1:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
    "model": "xft",
    "messages":[{"role":"user","content":"Hello! Who are you?"}],
    "max_tokens": 256,
    "temperature": 0.6
}'</code>

The response image shows a successful reply:

[Image: Curl test result]
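Beyond curl, the same OpenAI-compatible endpoint can be called from application code. A minimal Python sketch using only the standard library (it assumes the container from Step 2 is listening on port 8000, and reuses the model name "xft" from the curl example above):

```python
# Minimal client for the OpenAI-compatible chat completions endpoint
# exposed by the container started in Step 2.
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 256,
                  temperature: float = 0.6) -> dict:
    # "xft" matches the model name used in the curl example above.
    return {
        "model": "xft",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def ask(prompt: str) -> str:
    """Send one chat completion request and return the assistant's reply."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the service running, `print(ask("Hello! Who are you?"))` should produce a reply comparable to the curl output above.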

Conclusion

The entire process demonstrates how to quickly launch the DeepSeek‑R1‑Distill model service using Volcengine CPU cloud products, offering a low‑cost, high‑compatibility solution for interested users.

Tags: Docker · DeepSeek · AI Model Deployment · Volcengine · CPU inference · LLM serving
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
