DeepSeek‑R1 Model Performance: Comparing 32B, 70B, and R1

This article evaluates DeepSeek‑R1’s 32B and 70B distilled models alongside the original R1 on a range of reasoning and coding tasks, detailing hardware setup, test methodology, per‑task results, and a comparative analysis of their strengths and weaknesses.


Hardware configuration

Tests were executed on a Windows Subsystem for Linux 2 (WSL2) host equipped with an Intel i7‑14700KF (3.4 GHz), 32 GB RAM and an NVIDIA RTX 4090 GPU.

DeepSeek‑R1‑Distill‑Qwen‑32B runs without system modifications.

DeepSeek‑R1‑Distill‑Llama‑70B requires capping GPU memory at 24 GB (the RTX 4090's full VRAM), and psutil is used to check available memory before execution.
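
The article does not publish the monitoring script itself; a minimal sketch of such a pre‑flight check with psutil might look like the following (the 20 GB free‑RAM floor is an illustrative assumption, not a figure from the original tests):

```python
import psutil

# Hypothetical free-RAM floor for attempting to load the 70B weights;
# the original article does not specify the exact threshold used.
MIN_FREE_RAM_GB = 20

def enough_memory(min_free_gb: float = MIN_FREE_RAM_GB) -> bool:
    """Check available system RAM before loading a large model."""
    free_gb = psutil.virtual_memory().available / 1024**3
    print(f"Available RAM: {free_gb:.1f} GB (need >= {min_free_gb} GB)")
    return free_gb >= min_free_gb

if __name__ == "__main__":
    if not enough_memory():
        raise SystemExit("Not enough free RAM for DeepSeek-R1-Distill-Llama-70B")
```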

Test methodology

Following the benchmark question set popularized by Matthew Berman, the same suite of questions was run on each model. The suite includes:

Letter counting (count the occurrences of the letter “r” in “strawberry”).

Python implementation of a Snake game.

Python implementation of a Tetris game.

Envelope‑size validation (does a 200 mm × 275 mm envelope fall within the allowed range of 14 × 9 cm to 32.4 × 22.9 cm?).

Prompt word count (count the number of words in the prompt).

Four logical‑reasoning puzzles (killer count, marble location, comparative number size, etc.).

Each model’s answer was recorded as correct (✅) or incorrect (❌), and the level of detail was noted; the deterministic questions are verified programmatically in the sketch below.
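
Three of the questions have deterministic ground truths, so the expected answers can be checked directly. The sketch below is my own reconstruction for illustration; the original article does not include a verification script:

```python
# 1) How many times does "r" appear in "strawberry"? Expected: 3.
assert "strawberry".count("r") == 3

# 2) Envelope check: the allowed range is 14 x 9 cm (minimum) to
#    32.4 x 22.9 cm (maximum); the candidate envelope is 200 mm x 275 mm.
def fits(length_cm: float, width_cm: float) -> bool:
    # Orient the envelope so the longer side is compared against
    # [14, 32.4] cm and the shorter side against [9, 22.9] cm.
    long_side, short_side = max(length_cm, width_cm), min(length_cm, width_cm)
    return 14 <= long_side <= 32.4 and 9 <= short_side <= 22.9

assert fits(200 / 10, 275 / 10)  # 27.5 cm and 20 cm both in range -> mailable

# 3) Which is larger, 9.11 or 9.9? Compare fractional parts: 0.9 > 0.11.
assert 9.9 > 9.11
```

This is why “yes” is the correct envelope answer once millimetres are converted to centimetres, which is exactly the unit conversion the 70B model and R1 performed.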

Results

Letter “r” count in “strawberry”: 32B ✅ (detail similar to R1), 70B ✅ (correct but less detailed), R1 ✅ (detailed).

Snake game (Python): 32B ❌ (snake fails to eat the fruit), 70B ✅ (correct behavior, score updates), R1 ✅ (matches 70B).

Tetris game (Python): 32B ❌ (static output), 70B ❌ (partial movement), R1 ✅ (fully functional).

Envelope‑size validation: 32B ❌ (answers “no”), 70B ✅ (answers “yes” with unit conversion), R1 ✅ (answers “yes” with conversion).

Prompt word count: 32B ✅ (reasoning similar to R1), 70B ✅ (concise), R1 ✅ (detailed).

Logic puzzle – remaining killers: 32B ✅ (similar to R1), 70B ✅ (correct but brief), R1 ✅ (highly detailed).

Logic puzzle – marble location: 32B ✅ (similar to R1), 70B ✅ (sufficient), R1 ✅ (detailed).

Logic puzzle – larger number (9.11 vs 9.9): 32B ✅ (detailed), 70B ✅ (concise), R1 ✅ (detailed).

Analysis

The original DeepSeek‑R1 consistently outperforms both distilled variants on the coding tasks (Snake, Tetris) and provides richer reasoning. The 32B model, built on Qwen, generates more elaborate explanations but often fails to produce executable code. The 70B model, built on Llama, achieves higher coding correctness and factual accuracy, though its reasoning is less detailed and its responses are slower.

Conclusion

DeepSeek‑R1 (original) delivers superior performance on both programming and logical‑reasoning benchmarks. The 70B distilled model offers a balance of coding ability and factual correctness at the cost of speed and reasoning depth. The 32B distilled model excels in detailed reasoning but struggles with functional code generation.

GitHub: https://github.com/entzyeung/towardsai/tree/main/Comparing%20DeepSeek-R1%20Models%2032B%20vs%2070B%20vs%20R1
Tags: DeepSeek, model comparison, R1, LLM evaluation, 32B, 70B, reasoning benchmarks
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
