
Deploying DeepSeek R1 671B Model Locally with Ollama: Quantization, Hardware Requirements, and Step‑by‑Step Guide

This article provides a comprehensive tutorial on locally deploying the full‑size DeepSeek R1 671B model using Ollama, covering dynamic quantization options, hardware specifications, detailed installation commands, configuration files, performance observations, and practical recommendations for consumer‑grade systems.

Top Architect

Model Selection

The original DeepSeek R1 671B model weighs about 720 GB, which is impractical for most users. This guide uses Unsloth AI’s dynamic‑quantized versions to shrink the model dramatically, making local deployment feasible.

Dynamic quantization keeps a few critical layers at high-quality 4–6-bit precision while aggressively quantizing the bulk of the MoE layers down to 1–2 bits, shrinking the model to as little as 131 GB (1.58-bit).
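As a rough sanity check on those file sizes, total weight bytes scale with parameters × average bits per weight. The effective bit rates below (1.58 for the smallest dynamic quant, and ~4.8 for Q4_K_M, which is an assumption about its average rate) are approximations, and real files add some metadata overhead:

```shell
# Back-of-envelope size estimate: parameters * bits-per-weight / 8 bits-per-byte.
# 671B parameters at an average of 1.58 bits per weight:
awk 'BEGIN { printf "%.0f GB\n", 671e9 * 1.58 / 8 / 1e9 }'
# At ~4.8 bits per weight (assumed average for Q4_K_M, whose blocks mix precisions):
awk 'BEGIN { printf "%.0f GB\n", 671e9 * 4.8 / 8 / 1e9 }'
```

The first estimate lands at ~133 GB and the second at ~403 GB, close to the published 131 GB and 404 GB figures.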

Two models were tested:

DeepSeek‑R1‑UD‑IQ1_M (671B, 1.73‑bit dynamic quantization, 158 GB, HuggingFace)

DeepSeek‑R1‑Q4_K_M (671B, 4‑bit standard quantization, 404 GB, HuggingFace)

Hardware Requirements

Deploying such large models is limited by combined RAM + VRAM. Recommended configurations:

DeepSeek‑R1‑UD‑IQ1_M: ≥ 200 GB total memory

DeepSeek‑R1‑Q4_K_M: ≥ 500 GB total memory
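Before downloading a multi-hundred-gigabyte model, it is worth checking how much total memory the machine actually has. A minimal pre-flight check on Linux (the `nvidia-smi` call assumes NVIDIA drivers and degrades gracefully without them):

```shell
# Total system RAM in GB.
free -g | awk '/^Mem:/ { print "System RAM: " $2 " GB" }'
# Per-GPU VRAM, if NVIDIA drivers are installed.
nvidia-smi --query-gpu=name,memory.total --format=csv,noheader 2>/dev/null \
  || echo "no NVIDIA GPU detected"
```

Add the two figures together and compare against the thresholds above.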

The author’s test rig consisted of four RTX 4090 GPUs (4 × 24 GB VRAM), four‑channel DDR5‑5600 RAM (4 × 96 GB), and a ThreadRipper 7980X CPU (64 cores).

On this setup, short‑text generation (~500 tokens) achieved:

UD‑IQ1_M: 7‑8 tokens / s (CPU‑only 4‑5 tokens / s)

Q4_K_M: 2‑4 tokens / s

Long‑text generation drops to 1‑2 tokens / s. More cost‑effective alternatives include a single Mac Studio with 192 GB unified memory or cloud GPU instances (e.g., dual NVIDIA H100 80 GB).

Deployment Steps (Linux example)

Download the .gguf model file from HuggingFace (e.g., https://huggingface.co/unsloth/DeepSeek-R1-GGUF ) and merge split parts.
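The download-and-merge step can be sketched as a script like the one below. This is a hedged sketch, not the article's exact commands: it assumes the `huggingface-cli` tool from `huggingface_hub` and a llama.cpp build that provides `llama-gguf-split` (the binary name varies across llama.cpp versions, and the shard filenames shown are illustrative — check the repository for the actual names):

```shell
# Write the sketch to a script and syntax-check it (the real download is ~158 GB).
cat > fetch_and_merge.sh <<'EOF'
#!/usr/bin/env bash
set -euo pipefail

# Fetch only the 1.73-bit variant's shards from the Unsloth repo.
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir ./models

# Merge the shards into a single .gguf: pass the first shard, the tool
# locates the remaining parts (shard names here are illustrative).
llama-gguf-split --merge \
  ./models/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf \
  ./DeepSeek-R1-UD-IQ1_M.gguf
EOF
bash -n fetch_and_merge.sh && echo "script parses"
```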

Install Ollama:

    curl -fsSL https://ollama.com/install.sh | sh

Create a Modelfile for the chosen model. Example for UD‑IQ1_M:

    FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
    PARAMETER num_gpu 28
    PARAMETER num_ctx 2048
    PARAMETER temperature 0.6
    TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"

Adjust the file path, num_gpu, and num_ctx as needed.

Create the Ollama model:

    ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile

Run the model with verbose output to see the tokens-per-second speed:

    ollama run DeepSeek-R1-UD-IQ1_M --verbose
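Beyond the interactive CLI, Ollama also serves the model over its local HTTP API on port 11434, which is convenient for scripting. A minimal sketch, guarded so it degrades gracefully when no server is running (the prompt text is just a placeholder):

```shell
# Probe the API first; only send a generation request if Ollama is reachable.
if curl -sf --max-time 2 http://localhost:11434/api/tags > /dev/null; then
  resp=$(curl -s http://localhost:11434/api/generate \
    -d '{"model": "DeepSeek-R1-UD-IQ1_M", "prompt": "Hello", "stream": false}')
else
  resp="Ollama server not reachable on localhost:11434"
fi
echo "$resp"
```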

Optional: install a web UI (Open WebUI) for easier interaction:

    pip install open-webui
    open-webui serve
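If Ollama runs on a different host than the web UI, Open WebUI can be pointed at it explicitly. A sketch of a launch script, assuming Open WebUI is installed and using its `OLLAMA_BASE_URL` environment variable (adjust host and port to your setup):

```shell
# Write the launch script and syntax-check it without starting the server.
cat > start_webui.sh <<'EOF'
#!/usr/bin/env bash
# Point Open WebUI at the Ollama server (default Ollama port is 11434).
export OLLAMA_BASE_URL="http://127.0.0.1:11434"
exec open-webui serve --port 8080
EOF
bash -n start_webui.sh && echo "script parses"
```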

Empirical Observations

The 1.73‑bit and 4‑bit full‑size models both perform well on classic tasks, but the 1.73‑bit version is more “edgy” and tends to produce more unrestricted responses, while the 4‑bit version is more conservative and polite.

Occasional formatting glitches (e.g., unclosed <think> tags) were noted in the 1.73‑bit output. CPU utilization is near‑full while GPU usage remains low (1‑3 %), indicating the bottleneck lies in CPU and memory bandwidth.
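One way to verify this bottleneck pattern on your own machine is to sample GPU utilization and CPU load while a prompt is generating. A minimal sketch (the `nvidia-smi dmon` call assumes NVIDIA drivers and is guarded; `/proc/loadavg` is Linux-specific):

```shell
# Sample GPU utilization (sm%) three times at one-second intervals.
nvidia-smi dmon -s u -c 3 2>/dev/null || echo "nvidia-smi not available"
# CPU side: 1/5/15-minute load averages; values near the core count mean CPU-bound.
cat /proc/loadavg
```

On the author's rig, GPU utilization would stay in the low single digits while load averages approach the core count.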

Conclusion & Recommendations

If the model cannot fit entirely into GPU memory, the 1.73‑bit dynamic‑quantized version offers better speed and lower resource consumption without a noticeable quality drop compared to the 4‑bit version.

For consumer‑grade hardware, use the model for short, lightweight tasks (single‑turn dialogs, brief text generation). As context length grows, generation speed degrades sharply.

Readers are encouraged to share their deployment experiences and questions in the comments.

Tags: AI, LLM, Quantization, DeepSeek, GPU, Local Deployment, Ollama
Written by Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
