
Deploying DeepSeek‑R1 Distilled Qwen‑32B‑FP8 Model on Alibaba Cloud GPU Instances with Docker and OpenWebUI

This guide explains how to prepare an Alibaba Cloud GPU instance, install Docker and NVIDIA tools, pull or build a container image, and run the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model using vLLM and OpenWebUI for both offline and online inference.


DeepSeek‑R1 is an open‑source reasoning model comparable to OpenAI o1, and its distilled variants (e.g., Qwen‑32B‑FP8) can run on smaller GPUs. This article demonstrates step‑by‑step deployment of the DeepSeek‑R1‑Distill‑Qwen‑32B‑FP8 model on an Alibaba Cloud ecs.gn8is.2xlarge instance (48 GB of GPU memory) using Docker containers.

1. Prepare the runtime environment – Choose a suitable GPU instance running Ubuntu 20.04 with CUDA 12.4.1, GPU driver 550.127.08, and cuDNN 9.2.0.82 installed. Ensure at least 200 GB of disk space for model data and container images.

2. Install Docker – Execute the following commands:

sudo apt-get update
sudo apt-get -y install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

echo "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update
sudo apt-get install -y docker-ce docker-ce-cli containerd.io

Verify the installation with docker -v.

3. Install NVIDIA container toolkit – Run:

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Configure Docker to use the NVIDIA runtime, enable Docker to start on boot, and restart the service:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl enable docker
sudo systemctl restart docker
sudo systemctl status docker

4. Pull or build the model container

Option A – Pull a pre‑installed image:

sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-dsr1distill-qwen32bfp8-vllm0.6.4.post1-pytorch2.5.1-cuda12.4-20250208

Option B – Build from the NVIDIA CUDA base image:

sudo docker pull nvcr.io/nvidia/cuda:12.4.1-runtime-ubuntu22.04

Start a container from the image (passing --gpus all so the GPU is visible inside), then install the required packages within it:

# Install Python 3.11 and pip
apt update && apt install -y python3.11 curl
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.11 1
curl -sS https://bootstrap.pypa.io/get-pip.py | python3

# Install vLLM, OpenWebUI and dependencies
pip install vllm==0.6.4.post1 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install open-webui==0.5.10 -i https://pypi.tuna.tsinghua.edu.cn/simple
apt-get install -y libgl1-mesa-glx libglib2.0-0

# Download the FP8 model via git‑lfs
apt install -y git-lfs
git lfs clone https://www.modelscope.cn/okwinds/DeepSeek-R1-Distill-Qwen-32B-FP8.git
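
Before pointing vLLM at the checkpoint, it is worth confirming that git-lfs actually materialized the weight shards, since an interrupted LFS pull can leave small pointer files behind. A minimal sanity-check sketch, assuming the clone landed at /workspace/DeepSeek-R1-Distill-Qwen-32B-FP8:

```python
# Sanity-check a cloned checkpoint directory before loading it with vLLM.
from pathlib import Path


def checkpoint_summary(model_dir: str) -> dict:
    """Return config presence, shard count, and total weight size in GiB."""
    root = Path(model_dir)
    shards = sorted(root.glob("*.safetensors"))
    total_bytes = sum(f.stat().st_size for f in shards)
    return {
        "has_config": (root / "config.json").exists(),
        "num_shards": len(shards),
        "total_gib": total_bytes / 2**30,
    }


if __name__ == "__main__":
    # An FP8 32B checkpoint should total roughly the model's parameter count
    # in bytes; a few KB here means LFS only fetched pointer files.
    print(checkpoint_summary("/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8"))
```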

5. Run the model

Offline inference uses a Python script (vllm_demo.py) that loads the model with vLLM:

from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# max_model_len and gpu_memory_utilization are tuned for a 48 GB GPU.
llm = LLM(
    model="/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8",
    max_model_len=18432,
    gpu_memory_utilization=0.98,
)

outputs = llm.generate(prompts, sampling_params)
for out in outputs:
    print(f"Prompt: {out.prompt!r}, Generated text: {out.outputs[0].text!r}")

Online inference via the vLLM OpenAI‑compatible server:

python3 -m vllm.entrypoints.openai.api_server \
  --model /workspace/DeepSeek-R1-Distill-Qwen-32B-FP8 \
  --trust-remote-code \
  --quantization fp8 \
  --tensor-parallel-size 1 \
  --port 8000 \
  --enforce-eager \
  --gpu-memory-utilization 0.98 \
  --max-model-len 18432
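
Loading a 32B checkpoint can take several minutes, so a client should wait until the server answers before sending requests. A small stdlib-only polling sketch (the port matches the --port 8000 flag above):

```python
# Poll the vLLM OpenAI-compatible server until the model is loaded.
import json
import time
import urllib.error
import urllib.request


def wait_for_server(base_url: str = "http://localhost:8000",
                    timeout_s: float = 600, interval_s: float = 5) -> list:
    """Block until /v1/models answers, then return the served model IDs."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
                return [m["id"] for m in json.load(resp)["data"]]
        except (urllib.error.URLError, OSError):
            # Server not up yet (connection refused) or still loading weights.
            time.sleep(interval_s)
    raise TimeoutError(f"server at {base_url} not ready after {timeout_s}s")
```

Calling wait_for_server() right after launching the server returns the model ID list once the weights are loaded, e.g. ["/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8"].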

Test the service with curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8","messages":[{"role":"user","content":"Explain what large-model inference is."}]}'
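
The same chat-completion request can be issued from Python. A stdlib-only sketch, assuming the endpoint and model path from the server command above (the temperature value is an illustrative choice, not from the article):

```python
# Minimal Python client for the vLLM OpenAI-compatible chat endpoint.
import json
import urllib.request

API_URL = "http://localhost:8000/v1/chat/completions"
MODEL = "/workspace/DeepSeek-R1-Distill-Qwen-32B-FP8"


def build_payload(user_message: str, temperature: float = 0.6) -> dict:
    """Assemble an OpenAI-style chat-completion request body."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }


def chat(user_message: str) -> str:
    """POST one user turn and return the assistant's reply text."""
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(build_payload(user_message)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=600) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, chat("Explain what large-model inference is.") returns the model's reply as a plain string.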

6. Launch OpenWebUI – Set environment variables and start the UI:

export ENABLE_OLLAMA_API=False
export OPENAI_API_BASE_URL=http://127.0.0.1:8000/v1
export HF_ENDPOINT=https://hf-mirror.com
export DATA_DIR=./open-webui-data
open-webui serve

Open a browser to http://<instance-public-IP>:8080, create an admin account, and begin interacting with the model through the web interface.

The article also covers alternative quantizations (e.g., int4) and how to adjust max_model_len and gpu_memory_utilization for different GPU memory constraints.
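
A rough KV-cache estimate helps reason about those two knobs. The sketch below computes bytes of cache per token; the layer and head counts are assumptions for a Qwen2.5-32B-style architecture (64 layers, 8 KV heads via grouped-query attention, head dimension 128), so check the checkpoint's config.json for the actual values:

```python
# Back-of-envelope KV-cache sizing for choosing max_model_len on a given GPU.
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, bytes_per_value: int) -> int:
    """Both keys and values are cached per layer, hence the factor of 2."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Qwen2.5-32B-style dims; FP8 cache stores 1 byte per value.
per_token = kv_cache_bytes_per_token(num_layers=64, num_kv_heads=8,
                                     head_dim=128, bytes_per_value=1)
context_gib = per_token * 18432 / 2**30
print(f"{per_token} bytes/token, ~{context_gib:.2f} GiB for an 18432-token context")
```

Under these assumptions one 18432-token context needs only a couple of GiB of cache, so on a 48 GB card the weights dominate; with an int4 checkpoint or a smaller max_model_len, more headroom is left and gpu_memory_utilization can be lowered from 0.98.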

Conclusion – By following these steps, users can deploy the FP8‑quantized DeepSeek‑R1‑Distill‑Qwen‑32B model on an Alibaba Cloud GPU instance, perform offline tests, or serve it via OpenWebUI for interactive use.

Written by

Alibaba Cloud Infrastructure
