
How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide

This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.


DeepSeek has become widely popular, and deploying its 671B MoE model locally enables customized usage. By applying targeted quantization techniques, the model size can be dramatically reduced, allowing deployment on consumer‑grade hardware such as a single Mac Studio.

[Figure: DeepSeek model diagram]

Model Selection

The original DeepSeek R1 671B model weighs 720 GB, which is impractical for most users. This guide uses Unsloth AI’s dynamic‑quantization versions from HuggingFace to shrink the model size.

Dynamic quantization quantizes a few critical layers to 4‑6 bit and most MoE layers to 1‑2 bit, compressing the full model to as little as 131 GB (1.58‑bit), greatly lowering the deployment barrier.

DeepSeek‑R1‑UD‑IQ1_M (671B, 1.73‑bit, 158 GB, HuggingFace)

DeepSeek‑R1‑Q4_K_M (671B, 4‑bit, 404 GB, HuggingFace)

Unsloth AI offers four dynamic‑quantized variants ranging from 1.58 to 2.51 bit (131‑212 GB) for flexible hardware selection.
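The compression arithmetic can be sanity-checked with a back-of-envelope estimate. The 4.8 bits-per-weight figure for Q4_K_M below is an assumption (K-quants store per-block scales on top of the 4-bit weights), not a number from the article:

```python
# Back-of-envelope size check for the quantized variants.
# The real scheme is mixed-precision (4-6 bit critical layers, 1-2 bit
# MoE layers), so a single average bit width is only an approximation.
PARAMS = 671e9  # 671B parameters

def approx_size_gb(avg_bits: float) -> float:
    """Approximate weight size in GB for a given average bit width."""
    return PARAMS * avg_bits / 8 / 1e9

print(round(approx_size_gb(1.58)))  # ~133 GB, close to the quoted 131 GB
print(round(approx_size_gb(4.8)))   # ~403 GB, matching the quoted 404 GB
                                    # if Q4_K_M averages ~4.8 bits/weight
```

The 1.58-bit estimate lands slightly above the quoted 131 GB because the higher-bit critical layers are a small fraction of the total.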

Hardware Requirements

The main bottleneck when deploying models of this size is combined memory capacity (RAM plus VRAM). Recommended configurations:

DeepSeek‑R1‑UD‑IQ1_M: combined memory ≥ 200 GB

DeepSeek‑R1‑Q4_K_M: combined memory ≥ 500 GB

Test environment: 4×RTX 4090 (24 GB each), 4‑channel DDR5‑5600 (96 GB total), ThreadRipper 7980X (64 cores).

Short‑text generation (~500 tokens) speeds: UD‑IQ1_M 7‑8 tokens/s (CPU‑only 4‑5 tokens/s); Q4_K_M 2‑4 tokens/s. Long‑text generation drops to 1‑2 tokens/s.

More cost‑effective alternatives include high‑memory Mac Studio, DDR5‑4800 servers, or cloud GPU servers with multiple 80 GB H100 GPUs.

Deployment Steps

1. Download model files

Download the .gguf files from HuggingFace (https://huggingface.co/unsloth/DeepSeek‑R1‑GGUF) and merge the split parts.

2. Install Ollama

```shell
curl -fsSL https://ollama.com/install.sh | sh
```

3. Create Modelfile

Create a Modelfile for the chosen model, for example:

```
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
```

Adjust the file path, `num_gpu` (the number of layers to load onto the GPU), and `num_ctx` (the context window size) as needed.
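The `num_gpu` value can be sized by estimating how many layers fit in VRAM. A rough sketch, assuming the 158 GB of weights is spread evenly across DeepSeek-R1's 61 transformer layers (the layer count is an assumption, not stated in the article):

```python
# Rough estimate of how many model layers fit in VRAM, to pick num_gpu.
MODEL_SIZE_GB = 158   # 1.73-bit DeepSeek-R1-UD-IQ1_M
NUM_LAYERS = 61       # assumed transformer layer count
VRAM_GB = 4 * 24      # 4x RTX 4090

per_layer_gb = MODEL_SIZE_GB / NUM_LAYERS      # ~2.6 GB per layer
max_layers = int(VRAM_GB / per_layer_gb)       # layers that fit: 37

print(per_layer_gb, max_layers)
```

The article's value of 28 sits well below this ceiling, which plausibly leaves headroom for the KV cache and CUDA overhead; if you hit out-of-memory errors, lower it further.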

4. Create Ollama model

```shell
ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
```

Ensure Ollama’s model directory has sufficient space.

5. Run the model

```shell
ollama run DeepSeek-R1-UD-IQ1_M --verbose
```

The `--verbose` flag displays generation speed in tokens per second. If out-of-memory or CUDA errors occur, lower `num_gpu` or `num_ctx` and retry.
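Beyond the interactive CLI, the running model can also be queried over Ollama's local REST API (default port 11434). A minimal standard-library sketch; the model name must match the one passed to `ollama create`:

```python
# Query the locally running model via Ollama's REST API.
import json
import urllib.request

payload = {
    "model": "DeepSeek-R1-UD-IQ1_M",   # name given to `ollama create`
    "prompt": "How many 'r's are in the word strawberry?",
    "stream": False,                   # return one complete response
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

try:
    # Generation on CPU-bound rigs is slow, so allow a long timeout.
    with urllib.request.urlopen(req, timeout=600) as resp:
        print(json.loads(resp.read())["response"])
except OSError as exc:
    print(f"Ollama not reachable: {exc}")
```

With `stream` set to `True` instead, the API returns tokens incrementally as newline-delimited JSON, which is friendlier at the 1-8 tokens/s speeds observed here.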

6. (Optional) Install Web UI

```shell
pip install open-webui
open-webui serve
```

Observations

Both 1.73‑bit and 4‑bit full‑size models perform well on classic tasks; the 1.73‑bit version tends to produce more “edgy” responses, while the 4‑bit version often refuses provocative prompts.

The 1.73‑bit model occasionally generates slightly malformed markup.

CPU utilization is near full while GPU usage remains low (1‑3 %), indicating the bottleneck lies in CPU and memory bandwidth.

Conclusion and Recommendations

If the model cannot be fully loaded into VRAM, the 1.73‑bit dynamic‑quantized version offers better practicality—faster speed, lower resource consumption, and comparable quality to the 4‑bit version.

On consumer hardware, use the model for short, lightweight tasks; longer contexts dramatically reduce generation speed.

Tags: GPU inference, DeepSeek, dynamic quantization, Ollama, LLM deployment, AI model optimization
Written by

Data Thinking Notes

Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.
