Step-by-Step Guide to Deploying, Testing, and Optimizing DeepSeek‑R1: A Complete Tutorial
This article provides a comprehensive, hands‑on guide for installing and configuring DeepSeek‑R1 with Ollama and vLLM, setting up multi‑node multi‑GPU environments, running performance benchmarks, optimizing runtime parameters, and even generating a full PyTorch distributed‑training script.
Ollama Deployment
Install Ollama with curl -fsSL https://ollama.com/install.sh | sh, fix network errors if needed, configure the model storage path and listening port, then run ollama run deepseek-r1:1.5b. The 1.5B model occupies 1902 MiB on a single 2080 Ti GPU. Remote deployment can be accessed via a Chatbox client.
vLLM Deployment
Because Ollama lacks production‑grade features (no multi‑node support, limited concurrency, only quantized models), the guide switches to vLLM, one of the four official DeepSeek inference frameworks.
Python Local Installation
Prepare three identical servers (Ubuntu 18.04.6, 8× RTX 2080 Ti, 376 GB RAM, 1 TB SSD). Download CUDA 11.8, the appropriate torch and vllm wheels, and install them in a new Conda environment:
conda create -n LLM python=3.11 -y
conda activate LLM
pip install ./torch-2.5.1+cu118-cp311-cp311-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ./torchvision-0.20.1+cu118-cp311-cp311-linux_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install ./vllm-0.7.2+cu118-cp38-abi3-manylinux1_x86_64.whl -i https://pypi.tuna.tsinghua.edu.cn/simpleMount the shared model directory via CIFS to avoid copying large model files to each node:
sudo mount -t cifs //<DATA_SERVER_IP>/<PATH>/DeepSeek /home/ubuntu/DeepSeek -o username=<USER>,password=<PASS>Launch vLLM on each node with a multi‑GPU command that sets tensor‑parallel size, model length, and dtype:
CUAD_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 python -m vllm.entrypoints.openai.api_server \
--model ~/DeepSeek/DeepSeek-R1-Distill-Qwen-32B \
--host 0.0.0.0 --port 11435 \
--tensor-parallel-size 8 --gpu-memory-utilization 0.9 \
--max-model-len 8192 --trust-remote-code \
--enforce_eager --dtype=halfDocker Installation
Update the NVIDIA driver (>=530 for the vLLM‑openai:0.7.2 image) and install nvidia‑container‑toolkit. Configure Docker to use the NVIDIA runtime by editing /etc/docker/daemon.json:
{
"default-runtime": "nvidia",
"runtimes": {
"nvidia": {"path": "nvidia-container-runtime", "runtimeArgs": []}
}
}Pull the official image ( docker pull vllm/vllm-openai:v0.7.2), tag it for the internal registry, and run the container with model mounting:
docker run --runtime nvidia --gpus all \
-v ~/DeepSeek:/root/DeepSeek -p 11436:8000 \
<REGISTRY>/vllm-openai:v0.7.2 \
--model /root/DeepSeek/DeepSeek-R1-Distill-Qwen-1.5B \
--dtype=halfAPI Configuration
Expose the model via an OpenAI‑compatible endpoint ( http://<MASTER_IP>:11435/v1) and a local Ollama API ( http://<SERVER_IP>:11434). The guide recommends the Chatbox or the Chrome Page Assist plugin for interactive testing.
Simple Test
Using vLLM 0.7.2 on a three‑node, eight‑GPU‑per‑node cluster, the article runs inference on the 70 B and 671 B variants of DeepSeek‑R1. Resource‑usage graphs are shown, and a sample Q&A demonstrates that the larger model provides more comprehensive answers.
Benchmark Testing
Defines four metrics: TTFT (time to first token), TPOT (time per output token), ITL (inter‑token latency), and Throughput (tokens per second). Test environment includes two hardware configurations (single‑node 8× V100 vs. three‑node 8× RTX 2080 Ti), vLLM 0.7.2, CUDA 11.8, Python 3.11, and the ShareGPT_V3 dataset (100 prompts, various RPS). Server command:
vllm serve ~/DeepSeek/DeepSeek-R1-Distill-Llama-70B \
--tensor-parallel-size 8 --max-model-len 24576 \
--trust-remote-code --enforce_eager --dtype=half \
--host 0.0.0.0 --port 11435 \
--served_model_name DeepSeek-R1-Distill-Llama-70BClient command:
python benchmarks/benchmark_serving.py \
--backend openai-chat --model /home/ubuntu/DeepSeek/DeepSeek-R1-Distill-Llama-70B \
--served-model-name DeepSeek-R1-Distill-Llama-70B \
--dataset /home/ubuntu/dataset/LLM/ShareGPT_V3_unfiltered_cleaned_split.json \
--request-rate 10 --num-prompts 100 \
--host <SERVER_IP> --port 11435 \
--endpoint /v1/chat/completionsResults (presented as images) show that TTFT grows sharply with higher RPS, while TPOT, ITL, and Throughput remain relatively stable. Single‑node deployment yields lower TPOT/ITL but higher Throughput compared with multi‑node deployment, indicating communication overhead in distributed settings.
PyTorch Distributed‑Training Example
The article supplies a complete, commented Python script that demonstrates multi‑node, multi‑GPU training with torch.distributed and DistributedDataParallel. It covers argument parsing ( --nodes, --gpus, --nr), environment variable setup ( MASTER_ADDR, MASTER_PORT), rank calculation, NCCL backend initialization, model definition (ResNet‑50 example), custom CustomDataset, DistributedSampler, training loop with epoch‑wise sampler.set_epoch, loss logging only on rank 0, checkpoint saving, and graceful shutdown with dist.destroy_process_group(). Launch commands using torchrun for a 2‑node, 4‑GPU‑per‑node cluster are provided.
Movie Introduction Example
To illustrate the model’s generative capabilities, the guide includes a structured English description of the Chinese animated film "Nezha: The Devil’s Child Causes a Sea‑Rage". The description covers director, release date, plot summary, main characters, artistic style, box‑office performance, awards, and cultural impact, demonstrating how the same pipeline can be used for content generation.
Overall, the article walks the reader through every step—from low‑level driver and CUDA preparation, through Python and Docker installation, to distributed inference, benchmarking, and downstream content generation—while explaining the rationale behind each choice and providing concrete commands, configuration files, and performance data.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
