How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide
This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.
DeepSeek has become widely popular, and deploying its 671B MoE model locally enables customized usage. By applying targeted quantization techniques, the model size can be dramatically reduced, allowing deployment on consumer‑grade hardware such as a single Mac Studio.
Model Selection
The original DeepSeek R1 671B model weighs 720 GB, which is impractical for most users. This guide uses Unsloth AI’s dynamic‑quantization versions from HuggingFace to shrink the model size.
Dynamic quantization quantizes a few critical layers to 4‑6 bit and most MoE layers to 1‑2 bit, compressing the full model to as little as 131 GB (1.58‑bit), greatly lowering the deployment barrier.
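As a rough sanity check on these sizes, the compressed footprint can be approximated as parameters × average bits per weight. This is only a lower bound, since dynamic quantization keeps a few critical layers at 4-6 bit, so the real files come out somewhat larger:

```python
# Rough on-disk size estimate for a quantized 671B model: params * avg_bits / 8.
# Lower bound only: dynamic quantization keeps some layers at higher precision,
# so actual files (e.g. 158 GB for the 1.73-bit variant) exceed this estimate.
PARAMS = 671e9

def approx_size_gb(avg_bits: float) -> float:
    """Approximate size in GB for a given average bit width per weight."""
    return PARAMS * avg_bits / 8 / 1e9

for bits in (1.58, 1.73, 2.51):
    print(f"{bits:.2f}-bit: ~{approx_size_gb(bits):.0f} GB")
```

The 1.58-bit estimate (~133 GB) lands close to the reported 131 GB; the higher-bit variants exceed their estimates because more layers are kept at full quantization width.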
Unsloth AI offers four dynamic-quantized variants ranging from 1.58-bit (131 GB) to 2.51-bit (212 GB), allowing flexible hardware selection. This guide focuses on two of them:
DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
DeepSeek-R1-Q4_K_M (671B, standard 4-bit quantization, 404 GB, HuggingFace)
Hardware Requirements
Deploying such large models primarily challenges RAM + VRAM capacity. Recommended configurations are:
DeepSeek‑R1‑UD‑IQ1_M: combined memory ≥ 200 GB
DeepSeek‑R1‑Q4_K_M: combined memory ≥ 500 GB
Test environment: 4×RTX 4090 (24 GB each), 4‑channel DDR5‑5600 (96 GB total), ThreadRipper 7980X (64 cores).
Short‑text generation (~500 tokens) speeds: UD‑IQ1_M 7‑8 tokens/s (CPU‑only 4‑5 tokens/s); Q4_K_M 2‑4 tokens/s. Long‑text generation drops to 1‑2 tokens/s.
More cost‑effective alternatives include high‑memory Mac Studio, DDR5‑4800 servers, or cloud GPU servers with multiple 80 GB H100 GPUs.
Deployment Steps
1. Download model files
Download the .gguf files from HuggingFace (https://huggingface.co/unsloth/DeepSeek-R1-GGUF) and merge the split parts into a single file.
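One possible way to fetch and merge the parts is with the huggingface-cli tool and llama.cpp's llama-gguf-split utility. The local directory and the exact part filenames below are illustrative assumptions; check the repository listing for the actual part count and names of your chosen variant:

```shell
# Fetch only the 1.73-bit variant's split GGUF parts (~158 GB download).
# Assumes the HuggingFace CLI is installed: pip install "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir /home/snowkylin

# Merge the parts into one .gguf with llama.cpp's llama-gguf-split tool
# (built from the llama.cpp repo). Pass the FIRST split as input; the
# part filename here is an example -- use the actual name from the repo.
llama-gguf-split --merge \
  /home/snowkylin/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf \
  /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
```

The merged single file is what the Modelfile in step 3 points at.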
2. Install Ollama
<code>curl -fsSL https://ollama.com/install.sh | sh</code>
3. Create Modelfile
Create a Modelfile for the chosen model, for example:
<code>FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE '<|User|>{{ .Prompt }}<|Assistant|>'</code>
Adjust the file path, num_gpu (number of layers to offload to GPU), and num_ctx (context window length) as needed.
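To choose a sensible num_gpu value, one can estimate how many layers fit in VRAM. A back-of-the-envelope sketch, assuming DeepSeek-R1's 61 transformer layers are roughly equal in size (both the layer count and the equal-size simplification are assumptions, not from the original guide):

```python
# Estimate how many model layers fit into available VRAM.
# Assumptions: 61 transformer layers (DeepSeek-R1 config) of roughly
# equal size -- a simplification for a rough first guess at num_gpu.
MODEL_SIZE_GB = 158   # DeepSeek-R1-UD-IQ1_M merged .gguf on disk
NUM_LAYERS = 61
VRAM_GB = 4 * 24      # 4x RTX 4090 from the test environment

def max_gpu_layers(model_gb: float, vram_gb: float, layers: int = NUM_LAYERS) -> int:
    """Number of whole layers whose weights fit into VRAM."""
    per_layer_gb = model_gb / layers
    return int(vram_gb / per_layer_gb)

print(max_gpu_layers(MODEL_SIZE_GB, VRAM_GB))
```

The guide's num_gpu 28 is more conservative than this weights-only estimate, leaving VRAM headroom for the KV cache and activations; start low and raise it until you hit memory errors.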
4. Create Ollama model
<code>ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile</code>
Ensure Ollama’s model directory has sufficient space.
5. Run the model
<code>ollama run DeepSeek-R1-UD-IQ1_M --verbose</code>
Use --verbose to display the generation speed in tokens per second; if out-of-memory or CUDA errors occur, reduce num_gpu or num_ctx and recreate the model.
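Beyond the interactive prompt, the running model can also be queried programmatically through Ollama's local REST API (port 11434 is Ollama's default; the prompt below is just an example):

```shell
# Query the locally served model via Ollama's REST API.
# Requires the model from step 4 to be created and the Ollama server running.
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

This is the same endpoint that frontends such as Open WebUI talk to.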
6. (Optional) Install Web UI
<code>pip install open-webui
open-webui serve</code>
Observations
Both 1.73‑bit and 4‑bit full‑size models perform well on classic tasks; the 1.73‑bit version tends to produce more “edgy” responses, while the 4‑bit version often refuses provocative prompts.
The 1.73‑bit model occasionally generates slightly malformed markup.
CPU utilization is near full while GPU usage remains low (1‑3 %), indicating the bottleneck lies in CPU and memory bandwidth.
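That bottleneck can be sanity-checked with a rough roofline estimate: each generated token must stream the model's active weights from memory, so tokens per second is bounded by memory bandwidth divided by active-weight bytes. The ~37B active parameters per token for DeepSeek-R1's MoE is an assumption here, not from the original text:

```python
# Rough memory-bandwidth bound on decode speed for a MoE model.
# Assumptions: ~37B active params per token (DeepSeek-R1 MoE), weights
# streamed from system RAM, bandwidth = channels * MT/s * 8 bytes/transfer.
ACTIVE_PARAMS = 37e9
BITS_PER_WEIGHT = 1.73

bandwidth_gbs = 4 * 5600e6 * 8 / 1e9           # 4-channel DDR5-5600 ~= 179 GB/s
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
max_tok_s = bandwidth_gbs * 1e9 / bytes_per_token

print(f"upper bound: {max_tok_s:.1f} tokens/s")
```

The measured 7-8 tokens/s sits well under this ~22 tokens/s ceiling, consistent with memory bandwidth (plus scheduling overhead and non-ideal access patterns) being the limit rather than GPU compute.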
Conclusion and Recommendations
If the model cannot be fully loaded into VRAM, the 1.73‑bit dynamic‑quantized version offers better practicality—faster speed, lower resource consumption, and comparable quality to the 4‑bit version.
On consumer hardware, use the model for short, lightweight tasks; longer contexts dramatically reduce generation speed.