Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI
This guide explains how to prepare an Alibaba Cloud GPU instance, install Docker and NVIDIA tools, pull or build a container image, and run the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model using vLLM and OpenWebUI for both offline and online inference.
DeepSeek‑R1 is an open‑source reasoning model comparable to OpenAI o1, and its distilled versions (e.g., Qwen‑32B‑FP8) can run on smaller GPUs. This article demonstrates step‑by‑step deployment of the DeepSeek‑R1‑Distill‑Qwen‑32B‑FP8 model on an Alibaba Cloud ecs.gn8is.2xlarge instance (48 GB GPU memory) using Docker containers.
1. Prepare the runtime environment – Choose a suitable GPU instance, install Ubuntu 20.04, and install CUDA 12.4.1, NVIDIA driver 550.127.08, and cuDNN 9.2.0.82. Ensure at least 200 GB of disk space for model data and container images.
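As a quick sanity check before downloading anything, the 200 GB free-space requirement can be verified from Python (a minimal sketch; the threshold is the figure stated above, and the path to check is an assumption – adjust it to wherever you store model data):

```python
import shutil

REQUIRED_GB = 200  # minimum free space suggested above for model data and images


def has_enough_disk(path="/", required_gb=REQUIRED_GB):
    """Return (free_gb, ok) for the filesystem containing `path`."""
    free_bytes = shutil.disk_usage(path).free
    free_gb = free_bytes / 1024**3
    return free_gb, free_gb >= required_gb


if __name__ == "__main__":
    free_gb, ok = has_enough_disk("/")
    print(f"Free: {free_gb:.1f} GB - {'OK' if ok else 'need more disk space'}")
```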
2. Install Docker – Execute the following commands:
sudo apt-get update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc
echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io
Verify the installation with docker -v.
3. Install NVIDIA container toolkit – Run:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
Enable Docker to start on boot and restart the service:
sudo systemctl enable docker
sudo systemctl restart docker
sudo systemctl status docker
4. Pull or build the model container
Option A – Pull a pre‑installed image:
sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-dsr1distill-qwen32bfp8-vllm0.6.4.post1-pytorch2.5.1-cuda12.4-20250208
Option B – Build from the NVIDIA CUDA base image:
sudo docker pull nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04
After starting the container, install the required packages inside it:
# Install Python 3.11 and pip
apt update && apt install -y python3.11 curl
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
curl -sS https://bootstrap.pypa.io/get-pip.py | python3
# Install vLLM, OpenWebUI and dependencies
pip install vllm==0.6.4.post1 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install open-webui==0.5.10 -i https://pypi.tuna.tsinghua.edu.cn/simple
apt-get install -y libgl1-mesa-glx libglib2.0-0
# Download the FP8 model via git‑lfs
apt install -y git-lfs
git lfs clone https://www.modelscope.cn/okwinds/DeepSeek-R1-Distill-Qwen-32B-FP8.git
5. Run the model
Offline inference uses a Python script (vllm_demo.py) that loads the model with vLLM:
from vllm import LLM, SamplingParams
prompts = ["Hello, my name is", "The president of the United States is", "The capital of France is", "The future of AI is"]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
llm = LLM(model="/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8", max_model_len=18432, gpu_memory_utilization=0.98)
outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")
Online inference via the vLLM OpenAI‑compatible server:
python3 -m vllm.entrypoints.openai.api_server \
--model /workspace/DeepSeek-R1-Distill-Qwen-32B-FP8 \
--trust-remote-code \
--quantization fp8 \
--tensor-parallel-size 1 \
--port 8000 \
--enforce-eager \
--gpu-memory-utilization 0.98 \
    --max-model-len 18432
Test the service with curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
  -d '{"model":"/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8","messages":[{"role":"user","content":"Explain what large-model inference is."}]}'
6. Launch OpenWebUI – Set environment variables and start the UI:
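The same endpoint can also be called from Python using only the standard library (a sketch; it assumes the vLLM server from the previous step is listening on localhost:8000, and the final request only succeeds while that server is running):

```python
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"  # vLLM server started above
MODEL = "/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8"


def build_chat_request(prompt, url=API_URL, model=MODEL):
    """Build an OpenAI-style chat-completions request without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    data = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )


if __name__ == "__main__":
    req = build_chat_request("Explain what large-model inference is.")
    with urllib.request.urlopen(req) as resp:  # requires the server to be up
        body = json.loads(resp.read())
    print(body["choices"][0]["message"]["content"])
```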
export ENABLE_OLLAMA_API=False
export OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
export HF_ENDPOINT=https://hf-mirror.com
export DATA_DIR=./open-webui-data
open-webui serve
Open a browser to http://<server IP>:8080, create an admin account, and begin interacting with the model through the web interface.
The article also covers alternative quantizations (e.g., int4) and how to adjust max_model_len and gpu_memory_utilization for different GPU memory constraints.
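To see why the settings above fit in 48 GB, a rough budget can be sketched. All architecture numbers below (64 layers, 8 KV heads, head dimension 128, FP16 KV cache) are assumptions typical of a Qwen2.5-32B-class model, not values taken from the article, so treat the result as an estimate only:

```python
# Rough GPU-memory budget for the FP8 model on a 48 GB GPU.
GPU_GB = 48
GPU_MEM_UTIL = 0.98      # --gpu-memory-utilization from the launch command
WEIGHTS_GB = 32          # ~32B params at 1 byte/param (FP8) - assumption

LAYERS = 64              # assumed layer count
KV_HEADS = 8             # assumed KV heads (grouped-query attention)
HEAD_DIM = 128           # assumed head dimension
KV_DTYPE_BYTES = 2       # FP16 KV cache


def kv_bytes_per_token():
    # K and V tensors (factor of 2) per layer
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * KV_DTYPE_BYTES


def max_kv_tokens():
    """Tokens of KV cache that fit after weights, ignoring activations."""
    budget_gb = GPU_GB * GPU_MEM_UTIL - WEIGHTS_GB
    return int(budget_gb * 1024**3 / kv_bytes_per_token())


if __name__ == "__main__":
    print(f"KV cache per token: {kv_bytes_per_token() / 1024:.0f} KiB")
    print(f"Token budget (excluding activation overhead): {max_kv_tokens()}")
```

Under these assumptions roughly 15 GB remains for the KV cache, comfortably above what max_model_len=18432 requires; on a GPU with less memory, or with an int4 build whose weights are roughly half the size, the same arithmetic indicates how far to lower max_model_len or gpu_memory_utilization.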
Conclusion – By following these steps, users can deploy the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model on an Alibaba Cloud GPU instance, perform offline tests, or serve it via OpenWebUI for interactive use.