Artificial Intelligence 39 min read

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.

AIWalker

Feb 27, 2025

Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial

Ollama Deployment

Install Ollama with curl -fsSL https://ollama.com/install.sh | sh, fix network errors if needed, configure the model storage path and listening port, then run ollama run deepseek-r1:1.5b. The 1.5B model occupies 1902 MiB on a single 2080 Ti GPU. Remote deployment can be accessed via a Chatbox client.

vLLM Deployment

Because Ollama lacks production‑grade features (no multi‑node support, limited concurrency, only quantized models), the guide switches to vLLM, one of the four official DeepSeek inference frameworks.

Python Local Installation

Prepare three identical servers (Ubuntu 18.04.6, 8× RTX 2080 Ti, 376 GB RAM, 1 TB SSD). Download CUDA 11.8, the appropriate torch and vllm wheels, and install them in a new Conda environment:

conda create -n LLM python=3.11 -y
conda activate LLM
pip install ./torch-2.5.1+cu118-cp311-cp311-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ./torchvision-0.20.1+cu118-cp311-cp311-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ./vllm-0.7.2+cu118-cp38-abi3-manylinux1_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple

Mount the shared model directory via CIFS to avoid copying large model files to each node:

sudo mount -t cifs //<DATA_SERVER_IP>/<PATH>/DeepSeek /home/ubuntu/DeepSeek -o username=<USER>,password=<PASS>

Launch vLLM on each node with a multi‑GPU command that sets tensor‑parallel size, model length, and dtype:

CUAD_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
  --model ~/DeepSeek/DeepSeek-R1-Distill-Qwen-32B \
  --host 0.0.0.0 --port 11435 \
  --tensor-parallel-size 8 --gpu-memory-utilization 0.9 \
  --max-model-len 8192 --trust-remote-code \
  --enforce_eager --dtype=half

Docker Installation

Update the NVIDIA driver (>=530 for the vLLM‑openai:0.7.2 image) and install nvidia‑container‑toolkit. Configure Docker to use the NVIDIA runtime by editing /etc/docker/daemon.json:

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
  }
}

Pull the official image ( docker pull vllm/vllm-openai:v0.7.2), tag it for the internal registry, and run the container with model mounting:

docker run --runtime nvidia --gpus all \
  -v ~/DeepSeek:/root/DeepSeek -p 11436:8000 \
  <REGISTRY>/vllm-openai:v0.7.2 \
  --model /root/DeepSeek/DeepSeek-R1-Distill-Qwen-1.5B \
  --dtype=half

API Configuration

Expose the model via an OpenAI‑compatible endpoint ( http://<MASTER_IP>:11435/v1) and a local Ollama API ( http://<SERVER_IP>:11434). The guide recommends the Chatbox or the Chrome Page Assist plugin for interactive testing.

Simple Test

Using vLLM 0.7.2 on a three‑node, eight‑GPU‑per‑node cluster, the article runs inference on the 70 B and 671 B variants of DeepSeek‑R1. Resource‑usage graphs are shown, and a sample Q&A demonstrates that the larger model provides more comprehensive answers.

Benchmark Testing

Defines four metrics: TTFT (time to first token), TPOT (time per output token), ITL (inter‑token latency), and Throughput (tokens per second). Test environment includes two hardware configurations (single‑node 8× V100 vs. three‑node 8× RTX 2080 Ti), vLLM 0.7.2, CUDA 11.8, Python 3.11, and the ShareGPT_V3 dataset (100 prompts, various RPS). Server command:

vllm serve ~/DeepSeek/DeepSeek-R1-Distill-Llama-70B \
  --tensor-parallel-size 8 --max-model-len 24576 \
  --trust-remote-code --enforce_eager --dtype=half \
  --host 0.0.0.0 --port 11435 \
  --served_model_name DeepSeek-R1-Distill-Llama-70B

Client command:

python benchmarks/benchmark_serving.py \
  --backend openai-chat --model /home/ubuntu/DeepSeek/DeepSeek-R1-Distill-Llama-70B \
  --served-model-name DeepSeek-R1-Distill-Llama-70B \
  --dataset /home/ubuntu/dataset/LLM/ShareGPT_V3_unfiltered_cleaned_split.json \
  --request-rate 10 --num-prompts 100 \
  --host <SERVER_IP> --port 11435 \
  --endpoint /v1/chat/completions

Results (presented as images) show that TTFT grows sharply with higher RPS, while TPOT, ITL, and Throughput remain relatively stable. Single‑node deployment yields lower TPOT/ITL but higher Throughput compared with multi‑node deployment, indicating communication overhead in distributed settings.

PyTorch Distributed‑Training Example

The article supplies a complete, commented Python script that demonstrates multi‑node, multi‑GPU training with torch.distributed and DistributedDataParallel. It covers argument parsing ( --nodes, --gpus, --nr), environment variable setup ( MASTER_ADDR, MASTER_PORT), rank calculation, NCCL backend initialization, model definition (ResNet‑50 example), custom CustomDataset, DistributedSampler, training loop with epoch‑wise sampler.set_epoch, loss logging only on rank 0, checkpoint saving, and graceful shutdown with dist.destroy_process_group(). Launch commands using torchrun for a 2‑node, 4‑GPU‑per‑node cluster are provided.

Movie Introduction Example

To illustrate the model’s generative capabilities, the guide includes a structured English description of the Chinese animated film "Nezha: The Devil’s Child Causes a Sea‑Rage". The description covers director, release date, plot summary, main characters, artistic style, box‑office performance, awards, and cultural impact, demonstrating how the same pipeline can be used for content generation.

Overall, the article walks the reader through every step—from low‑level driver and CUDA preparation, through Python and Docker installation, to distributed inference, benchmarking, and downstream content generation—while explaining the rationale behind each choice and providing concrete commands, configuration files, and performance data.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

vllm Performance Benchmark DeepSeek-R1 distributed training GPU Optimization LLM deployment

Written by

AIWalker

Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.