
Step-by-Step Guide to Local Training of DeepSeek R1 on Multi‑GPU A100 Systems

This step‑by‑step tutorial shows how to set up CUDA 12.4, install required packages, prepare a JSON dataset and a custom reward function, troubleshoot out‑of‑memory errors, and launch DeepSeek R1 training on an 8‑GPU A100 cluster using Accelerate, DeepSpeed ZeRO‑3, and vLLM configurations.

Tencent Technical Engineering

This article provides a comprehensive, hands‑on tutorial for training the DeepSeek R1 large language model locally, focusing on the practical challenges of environment setup, CUDA compatibility, and common pitfalls such as out‑of‑memory (OOM) errors.

1. Environment Setup

GPU driver and CUDA version: open‑r1 requires CUDA 12.4. Verify your driver version (≥ 470) and upgrade if necessary. Example check:

# Check whether your GPU driver and CUDA setup are compatible
import torch
print(torch.cuda.is_available())  # True indicates compatibility
print(torch.version.cuda)         # CUDA runtime PyTorch was built against, e.g. "12.4"

Quick installation using Conda:

1. conda create -n openr1 python=3.11
2. pip install torch==2.5.1 torchvision==0.20.1 torchaudio==2.5.1 --index-url https://download.pytorch.org/whl/cu124
3. pip install vllm==0.7.2
4. pip install flash-attn
5. cd open-r1 && pip install -e ".[dev]"
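After installing, it can help to confirm that the pinned versions actually landed before launching a long run. A small stdlib sketch (package names and pins taken from the steps above; extend the dict as needed):

```python
from importlib.metadata import version, PackageNotFoundError

# Pins from the install steps above.
EXPECTED = {"torch": "2.5.1", "vllm": "0.7.2"}

def check_pins(expected):
    """Return (package, found, wanted) tuples for every mismatch or missing package."""
    issues = []
    for pkg, want in expected.items():
        try:
            got = version(pkg)
        except PackageNotFoundError:
            got = None
        if got != want:
            issues.append((pkg, got, want))
    return issues
```

An empty return value means the environment matches the pins; anything else tells you which package to reinstall.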

2. Training Pitfalls and OOM Debugging

When training the Qwen‑14B model on 8×A100 (40 GB), OOM can arise in two places:

Training phase (7 GPUs) – fix by editing recipes/accelerate_configs/zero3.yaml and enabling optimizer and parameter offloading to CPU (offload_optimizer_device / offload_param_device).

Inference phase (1 GPU) – if running vLLM < 0.7.3, lower vllm_gpu_memory_utilization (e.g., 0.2 for the 14B model, 0.5 for 7B).

An overly large vllm_max_model_len also triggers OOM – reduce it to 4k–8k depending on your prompt + output length.

Identify the source of OOM by checking the GPU index in the error message; the last GPU (e.g., GPU 7) is typically used for inference.
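When scanning long logs, the GPU index can also be pulled out programmatically. A small sketch, assuming PyTorch's usual "CUDA out of memory … GPU N" wording:

```python
import re

def oom_gpu_index(log_line):
    """Extract the GPU index from a PyTorch CUDA OOM message, or None if absent."""
    m = re.search(r"CUDA out of memory.*?GPU (\d+)", log_line)
    return int(m.group(1)) if m else None
```

An index of 7 then points at the vLLM inference GPU rather than at the training ranks.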

3. Data Preparation and Reward Function

Prepare a JSON Lines dataset data.json (one object per line) with fields problem and solution matching the model’s expected schema:

{"problem": "Classify the text into neutral, negative, or positive\nText: I think the food was okay.\nSentiment:\n", "solution": "positive"}
{"problem": "Classify the text into neutral, negative, or positive\nText: I think the food was shit.\nSentiment:\n", "solution": "negative"}
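Since a single malformed line makes load_dataset fail, a quick stdlib check of the file can save a debugging round. A sketch, using the field names from the schema above:

```python
import json

# Fields required by the schema above.
REQUIRED = {"problem", "solution"}

def validate_jsonl(lines):
    """Yield (line_no, missing_fields) for every record missing a required field."""
    for i, line in enumerate(lines, 1):
        rec = json.loads(line)
        missing = REQUIRED - rec.keys()
        if missing:
            yield (i, sorted(missing))
```

Run it over the file's lines before training; no output means the dataset is structurally sound.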

Modify grpo.py to load the offline dataset instead of Hub data:

dataset = load_dataset("json", data_files="XXX/data.json")
dataset = dataset["train"].train_test_split(test_size=0.02)

The custom reward function must keep its argument names consistent with the dataset columns (here, solution). Example:

import re

def accuracy_reward_ours(completions, solution, **kwargs):
    """Reward function that checks if the completion matches the ground truth."""
    # Completions arrive in conversational format: a list of message lists.
    contents = [c[0]["content"] for c in completions]
    rewards = []
    for content, sol in zip(contents, solution):
        gold = sol
        if len(gold) != 0:
            # The model is expected to wrap its final answer in <answer>...</answer> tags.
            answer = re.findall(r"<answer>(.*?)</answer>", content, re.DOTALL)
            if len(answer) > 0:
                reward = float(1 if answer[0].strip() == gold else 0)
            else:
                reward = float(0)
        else:
            # No parseable gold answer: skip this sample by granting full reward.
            reward = 1.0
            print("Failed to parse gold solution: ", sol)
        rewards.append(reward)
    return rewards
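The trainer config below also lists a format reward; open-r1 ships its own, but as an illustration of the same calling convention, here is a minimal sketch that checks for the <think>/<answer> wrapper (the exact regex open-r1 uses may differ):

```python
import re

# Expected wrapper: reasoning in <think>...</think>, final answer in <answer>...</answer>.
THINK_ANSWER = re.compile(r"^<think>.*?</think>\s*<answer>.*?</answer>$", re.DOTALL)

def format_reward_sketch(completions, **kwargs):
    """Return 1.0 when a completion follows the <think>/<answer> template, else 0.0."""
    contents = [c[0]["content"] for c in completions]
    return [1.0 if THINK_ANSWER.match(content) else 0.0 for content in contents]
```

Like the accuracy reward, it takes the full completions batch and returns one float per sample.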

4. Launch Command

Run the training with Accelerate and the prepared configuration files:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero3.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/Qwen2.5-14B-Instruct/grpo/config_simple_rl.yaml \
    &> /workspace/user_code/Qwen2.5-14B-Instruct.log

5. Key Configuration Files

recipes/accelerate_configs/zero3.yaml (excerpt):

compute_environment: LOCAL_MACHINE
debug: false
deepspeed_config:
  deepspeed_multinode_launcher: standard
  offload_optimizer_device: "cpu"
  offload_param_device: "cpu"
  zero3_init_flag: true
  zero3_save_16bit_model: true
  zero_stage: 3
distributed_type: DEEPSPEED
mixed_precision: bf16
num_processes: 8  # overridden by --num_processes=7 on the command line; GPU 7 is reserved for vLLM
...

recipes/Qwen2.5-14B-Instruct/grpo/config_simple_rl.yaml (excerpt):

# Model arguments
model_name_or_path: XXX/models/Qwen2.5-14B-Instruct
torch_dtype: bfloat16
attn_implementation: flash_attention_2

# Data arguments
dataset_name: XXX/dataset/data.json
num_processes: 7

# Trainer config
reward_funcs:
- accuracy_ours
- format
bf16: true
use_vllm: true
vllm_device: cuda:7
vllm_gpu_memory_utilization: 0.2
vllm_max_model_len: 8000
...
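The relation between vllm_max_model_len and OOM is simple arithmetic; a hypothetical helper (not part of open-r1) to sanity-check prompt and generation budgets against the value configured above:

```python
def fits_context(prompt_tokens, completion_tokens, vllm_max_model_len=8000):
    """True if prompt plus generated tokens stay within vLLM's context window."""
    return prompt_tokens + completion_tokens <= vllm_max_model_len
```

If typical prompts plus max_completion_length exceed the window, either shorten the prompts or raise vllm_max_model_len (at the cost of more KV-cache memory on the inference GPU).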

Following these steps enables smooth, reproducible local training of DeepSeek R1 on custom data using an 8‑GPU A100 cluster.
