Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems
This step‑by‑step tutorial shows how to set up CUDA 12.4, install the required packages, prepare a JSON dataset and a custom reward function, troubleshoot out‑of‑memory errors, and launch DeepSeek R1 training on an 8‑GPU A100 cluster using Accelerate, DeepSpeed ZeRO‑3, and vLLM.
This article provides a comprehensive, hands‑on tutorial for training the DeepSeek R1 large language model locally, focusing on the practical challenges of environment setup, CUDA compatibility, and common pitfalls such as out‑of‑memory (OOM) errors.
1. Environment Setup
GPU driver and CUDA version: open‑r1 requires CUDA 12.4. Verify your NVIDIA driver version (CUDA 12.4 requires driver ≥ 550 on Linux) and upgrade if necessary. Example check:
# Check whether your GPU and CUDA setup are compatible
import torch
print(torch.cuda.is_available())  # True indicates compatibility

Quick installation using Conda:
1. conda create -n openr1 python=3.11
2. pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
3. pip install vllm==0.7.2
4. pip install flash-attn
5. cd open-r1 && pip install -e ".[dev]"

2. Training Pitfalls and OOM Debugging
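Before chasing OOM errors, it is worth ruling out a half-finished install. A small stdlib-only sketch (the package names match the pip installs above) that reports which packages are present:

```python
# Report installed versions of the packages this guide depends on.
# Missing packages are reported rather than raising, so the script is
# safe to run on an incomplete environment.
from importlib import metadata

def report_versions(packages):
    """Return {package: version string, or None if not installed}."""
    versions = {}
    for pkg in packages:
        try:
            versions[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            versions[pkg] = None
    return versions

if __name__ == "__main__":
    for pkg, ver in report_versions(["torch", "vllm", "flash-attn"]).items():
        print(f"{pkg}: {ver or 'NOT INSTALLED'}")
```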
When training the Qwen‑14B model on 8×A100 (40 GB), OOM can arise in two places:
Training phase (7 GPUs) – fix by editing recipes/accelerate_configs/zero3.yaml and enabling optimizer and parameter offload to CPU.
Inference phase (1 GPU) – if using vllm < 0.7.3, lower vllm_gpu_memory_utilization (e.g., 0.2 for 14B, 0.5 for 7B).
Too large vllm_max_model_len – reduce to 4k‑8k depending on prompt+output length.
Identify the source of OOM by checking the GPU index in the error message; the last GPU (e.g., GPU 7) is typically used for inference.
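That triage rule can be written down explicitly. A minimal sketch, assuming the 7-train/1-inference GPU split used throughout this guide (the function name is illustrative, not part of open-r1):

```python
# Map the GPU index reported in a CUDA OOM message to the phase that
# likely caused it: with --num_processes=7 on an 8-GPU node, ranks 0-6
# train and the last GPU (cuda:7) is reserved for vLLM inference.
def oom_source(gpu_index: int, num_gpus: int = 8, num_train_processes: int = 7) -> str:
    if not 0 <= gpu_index < num_gpus:
        raise ValueError(f"GPU index {gpu_index} out of range for {num_gpus} GPUs")
    return "training" if gpu_index < num_train_processes else "inference (vLLM)"

print(oom_source(7))  # inference (vLLM) -> lower vllm_gpu_memory_utilization
print(oom_source(3))  # training -> enable ZeRO-3 CPU offload
```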
3. Data Preparation and Reward Function
Prepare a JSON dataset data.json with fields problem and solution matching the model’s expected schema:
{"problem": "Classify the text into neutral, negative, or positive\nText: I think the food was okay.\nSentiment:\n", "solution": "positive"}
{"problem": "Classify the text into neutral, negative, or positive\nText: I think the food was shit.\nSentiment:\n", "solution": "negative"}

Modify grpo.py to load the offline dataset instead of Hub data:
dataset = load_dataset("json", data_files="XXX/data.json")
dataset = dataset["train"].train_test_split(test_size=0.02)

The custom reward function must keep its argument names consistent with the dataset columns. Example:
import re

def accuracy_reward_ours(completions, solution, **kwargs):
    """Reward function that checks if the completion matches the ground truth."""
    contents = [c["content"] for c in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        gold = sol
        if len(gold) != 0:
            # extract the text between <answer> ... </answer> tags
            answer = re.findall(r"<answer>(.*?)</answer>", content, re.DOTALL)
            if len(answer) > 0:
                reward = float(1 if answer[0].strip() == gold else 0)
            else:
                reward = float(0)
        else:
            # empty/unparseable gold solution: grant full reward to skip the sample
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards

4. Launch Command
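Before committing to a multi-hour run, the reward function can be smoke-tested offline. A self-contained sketch, assuming completions wrap the final answer in `<answer>...</answer>` tags (the tag format is an assumption based on open-r1's completion conventions):

```python
import re

# Compact stand-in for the custom reward above, for offline testing only.
# Assumes the model emits its final answer inside <answer>...</answer> tags.
def accuracy_reward_ours(completions, solution, **kwargs):
    contents = [c["content"] for c in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        if sol:
            found = re.findall(r"<answer>(.*?)</answer>", content, re.DOTALL)
            rewards.append(float(bool(found) and found[0].strip() == sol))
        else:
            rewards.append(1.0)  # empty gold solution: skip the sample
    return rewards

completions = [{"content": "I think it is <answer>positive</answer>"},
               {"content": "<answer>neutral</answer>"}]
print(accuracy_reward_ours(completions, ["positive", "negative"]))  # [1.0, 0.0]
```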
Run the training with Accelerate and the prepared configuration files:
ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
--num_processes=7 src/open_r1/grpo.py \
--config recipes/Qwen2.5-14B-Instruct/grpo/config_simple_rl.yaml \
&> /workspace/user_code/Qwen2.5-14B-Instruct.log

5. Key Configuration Files
recipes/accelerate_configs/zero3.yaml (excerpt):
compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: "cpu"
  offload_param_device: "cpu"
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 8
...

recipes/Qwen2.5-14B-Instruct/grpo/config_simple_rl.yaml (excerpt):
# Model arguments
model_name_or_path: XXX/models/Qwen2.5-14B-Instruct
torch_dtype: bfloat16
attn_implementation: flash_attention_2
# Data arguments
dataset_name: XXX/dataset/data.json
num_processes: 7
# Trainer config
reward_funcs:
- accuracy_ours
- format
bf16: true
use_vllm: true
vllm_device: cuda:7
vllm_gpu_memory_utilization: 0.2
vllm_max_model_len: 8000
...

Following these steps enables smooth, reproducible local training of DeepSeek R1 on custom data using an 8‑GPU A100 cluster.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.