How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide
This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.
DeepSeek has become widely popular, and deploying its 671B MoE model locally enables customized usage. By applying targeted quantization techniques, the model size can be dramatically reduced, allowing deployment on consumer‑grade hardware such as a single Mac Studio.
Model Selection
The original DeepSeek R1 671B model weighs 720 GB, which is impractical for most users. This guide uses Unsloth AI’s dynamic‑quantization versions from HuggingFace to shrink the model size.
Dynamic quantization quantizes a few critical layers to 4‑6 bit and most MoE layers to 1‑2 bit, compressing the full model to as little as 131 GB (1.58‑bit), greatly lowering the deployment barrier.
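As a rough sanity check on these sizes, the compressed footprint can be approximated as parameters × average bits per weight. This is only a lower bound, since dynamic quantization keeps a few critical layers at 4-6 bit, so the real files come out somewhat larger:

```python
# Rough on-disk size estimate for a quantized 671B model: params * avg_bits / 8.
# Lower bound only: dynamic quantization keeps some layers at higher precision,
# so actual files (e.g. 158 GB for the 1.73-bit variant) exceed this estimate.
PARAMS = 671e9

def approx_size_gb(avg_bits: float) -> float:
    """Approximate size in GB for a given average bit width per weight."""
    return PARAMS * avg_bits / 8 / 1e9

for bits in (1.58, 1.73, 2.51):
    print(f"{bits:.2f}-bit: ~{approx_size_gb(bits):.0f} GB")
```

The 1.58-bit estimate (~133 GB) lands close to the reported 131 GB; the higher-bit variants exceed their estimates because more layers are kept at full quantization width.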
Unsloth AI offers four dynamic-quantized variants ranging from 1.58-bit (131 GB) to 2.51-bit (212 GB), allowing flexible hardware selection. This guide focuses on two of them:
DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
DeepSeek-R1-Q4_K_M (671B, standard 4-bit quantization, 404 GB, HuggingFace)
Hardware Requirements
Deploying such large models primarily challenges RAM + VRAM capacity. Recommended configurations are:
DeepSeek‑R1‑UD‑IQ1_M: combined memory ≥ 200 GB
DeepSeek‑R1‑Q4_K_M: combined memory ≥ 500 GB
Test environment: 4×RTX 4090 (24 GB each), 4‑channel DDR5‑5600 (96 GB total), ThreadRipper 7980X (64 cores).
Short‑text generation (~500 tokens) speeds: UD‑IQ1_M 7‑8 tokens/s (CPU‑only 4‑5 tokens/s); Q4_K_M 2‑4 tokens/s. Long‑text generation drops to 1‑2 tokens/s.
More cost‑effective alternatives include high‑memory Mac Studio, DDR5‑4800 servers, or cloud GPU servers with multiple 80 GB H100 GPUs.
Deployment Steps
1. Download model files
Download the .gguf files from HuggingFace (https://huggingface.co/unsloth/DeepSeek-R1-GGUF) and merge the split parts into a single file.
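One possible way to fetch and merge the parts is with the huggingface-cli tool and llama.cpp's llama-gguf-split utility. The local directory and the exact part filenames below are illustrative assumptions; check the repository listing for the actual part count and names of your chosen variant:

```shell
# Fetch only the 1.73-bit variant's split GGUF parts (~158 GB download).
# Assumes the HuggingFace CLI is installed: pip install "huggingface_hub[cli]"
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" \
  --local-dir /home/snowkylin

# Merge the parts into one .gguf with llama.cpp's llama-gguf-split tool
# (built from the llama.cpp repo). Pass the FIRST split as input; the
# part filename here is an example -- use the actual name from the repo.
llama-gguf-split --merge \
  /home/snowkylin/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00004.gguf \
  /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
```

The merged single file is what the Modelfile in step 3 points at.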
2. Install Ollama
<code>curl -fsSL https://ollama.com/install.sh | sh</code>
3. Create Modelfile
Create a Modelfile for the chosen model, for example:
<code>FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE '<|User|>{{ .Prompt }}<|Assistant|>'</code>
Adjust the file path, num_gpu (number of layers to offload to GPU), and num_ctx (context window length) as needed.
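To choose a sensible num_gpu value, one can estimate how many layers fit in VRAM. A back-of-the-envelope sketch, assuming DeepSeek-R1's 61 transformer layers are roughly equal in size (both the layer count and the equal-size simplification are assumptions, not from the original guide):

```python
# Estimate how many model layers fit into available VRAM.
# Assumptions: 61 transformer layers (DeepSeek-R1 config) of roughly
# equal size -- a simplification for a rough first guess at num_gpu.
MODEL_SIZE_GB = 158   # DeepSeek-R1-UD-IQ1_M merged .gguf on disk
NUM_LAYERS = 61
VRAM_GB = 4 * 24      # 4x RTX 4090 from the test environment

def max_gpu_layers(model_gb: float, vram_gb: float, layers: int = NUM_LAYERS) -> int:
    """Number of whole layers whose weights fit into VRAM."""
    per_layer_gb = model_gb / layers
    return int(vram_gb / per_layer_gb)

print(max_gpu_layers(MODEL_SIZE_GB, VRAM_GB))
```

The guide's num_gpu 28 is more conservative than this weights-only estimate, leaving VRAM headroom for the KV cache and activations; start low and raise it until you hit memory errors.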
4. Create Ollama model
<code>ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile</code>
Ensure Ollama’s model directory has sufficient space.
5. Run the model
<code>ollama run DeepSeek-R1-UD-IQ1_M --verbose</code>
Use --verbose to display the generation speed in tokens per second; if out-of-memory or CUDA errors occur, reduce num_gpu or num_ctx and recreate the model.
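Beyond the interactive prompt, the running model can also be queried programmatically through Ollama's local REST API (port 11434 is Ollama's default; the prompt below is just an example):

```shell
# Query the locally served model via Ollama's REST API.
# Requires the model from step 4 to be created and the Ollama server running.
curl http://localhost:11434/api/generate -d '{
  "model": "DeepSeek-R1-UD-IQ1_M",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
```

This is the same endpoint that frontends such as Open WebUI talk to.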
6. (Optional) Install Web UI
<code>pip install open-webui
open-webui serve</code>
Observations
Both 1.73‑bit and 4‑bit full‑size models perform well on classic tasks; the 1.73‑bit version tends to produce more “edgy” responses, while the 4‑bit version often refuses provocative prompts.
The 1.73‑bit model occasionally generates slightly malformed markup.
CPU utilization is near full while GPU usage remains low (1‑3 %), indicating the bottleneck lies in CPU and memory bandwidth.
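That bottleneck can be sanity-checked with a rough roofline estimate: each generated token must stream the model's active weights from memory, so tokens per second is bounded by memory bandwidth divided by active-weight bytes. The ~37B active parameters per token for DeepSeek-R1's MoE is an assumption here, not from the original text:

```python
# Rough memory-bandwidth bound on decode speed for a MoE model.
# Assumptions: ~37B active params per token (DeepSeek-R1 MoE), weights
# streamed from system RAM, bandwidth = channels * MT/s * 8 bytes/transfer.
ACTIVE_PARAMS = 37e9
BITS_PER_WEIGHT = 1.73

bandwidth_gbs = 4 * 5600e6 * 8 / 1e9           # 4-channel DDR5-5600 ~= 179 GB/s
bytes_per_token = ACTIVE_PARAMS * BITS_PER_WEIGHT / 8
max_tok_s = bandwidth_gbs * 1e9 / bytes_per_token

print(f"upper bound: {max_tok_s:.1f} tokens/s")
```

The measured 7-8 tokens/s sits well under this ~22 tokens/s ceiling, consistent with memory bandwidth (plus scheduling overhead and non-ideal access patterns) being the limit rather than GPU compute.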
Conclusion and Recommendations
If the model cannot be fully loaded into VRAM, the 1.73‑bit dynamic‑quantized version offers better practicality—faster speed, lower resource consumption, and comparable quality to the 4‑bit version.
On consumer hardware, use the model for short, lightweight tasks; longer contexts dramatically reduce generation speed.