How to Deploy DeepSeek R1 671B Model Locally with Ollama: A Step‑by‑Step Guide
This article provides a comprehensive tutorial on locally deploying the 671‑billion‑parameter DeepSeek R1 model using Ollama, covering model selection, hardware requirements, dynamic quantization, detailed installation steps, performance observations, and practical recommendations for consumer‑grade hardware.
DeepSeek has become widely popular, and deploying its 671B MoE model locally enables customized usage. By applying targeted quantization techniques, the model size can be dramatically reduced, allowing deployment on consumer‑grade hardware such as a single Mac Studio.
Model Selection
The original DeepSeek R1 671B model weighs 720 GB, which is impractical for most users. This guide uses Unsloth AI’s dynamic‑quantization versions from HuggingFace to shrink the model size.
Dynamic quantization quantizes a few critical layers to 4‑6 bit and most MoE layers to 1‑2 bit, compressing the full model to as little as 131 GB (1.58‑bit), greatly lowering the deployment barrier.
The two models tested in this guide are:
DeepSeek-R1-UD-IQ1_M (671B, 1.73-bit dynamic quantization, 158 GB, HuggingFace)
DeepSeek-R1-Q4_K_M (671B, standard 4-bit quantization, 404 GB, HuggingFace)
Unsloth AI offers four dynamic‑quantized variants ranging from 1.58 to 2.51 bit (131‑212 GB) for flexible hardware selection.
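As a rough sanity check on these sizes, the average number of bits stored per weight can be estimated by dividing file size by parameter count. The sketch below assumes 1 GB = 10^9 bytes and exactly 671 billion parameters; both are simplifications, and the helper name is illustrative rather than part of any tool.

```python
# Rough estimate of average bits per weight from a model's file size.
# Assumptions (not from the article): 1 GB = 1e9 bytes, 671e9 parameters,
# and negligible file metadata overhead.

PARAMS = 671e9  # parameter count of DeepSeek R1 671B

def bits_per_weight(size_gb: float, params: float = PARAMS) -> float:
    """Average number of bits stored per parameter."""
    return size_gb * 1e9 * 8 / params

# File sizes quoted in this article, in GB.
sizes_gb = {
    "original": 720,
    "Q4_K_M": 404,
    "UD-IQ1_M": 158,
    "smallest dynamic quant": 131,
}

for name, size in sizes_gb.items():
    print(f"{name:>24}: {bits_per_weight(size):.2f} bits/weight, "
          f"{sizes_gb['original'] / size:.1f}x smaller than the original")
```

The averages will not match the nominal bit widths exactly, since different layers are quantized to different precisions and the quoted sizes are rounded.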
Hardware Requirements
The main constraint when deploying a model of this size is combined RAM + VRAM capacity. Recommended configurations are:
DeepSeek‑R1‑UD‑IQ1_M: combined memory ≥ 200 GB
DeepSeek‑R1‑Q4_K_M: combined memory ≥ 500 GB
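A quick way to apply these thresholds is to add up system RAM and total VRAM and compare the sum against the recommendation. The following is a minimal sketch: the function name and the example machines are hypothetical, and only the 200 GB and 500 GB thresholds come from the list above.

```python
# Check whether a machine meets the combined-memory recommendation.
# Thresholds come from the article; everything else is illustrative.

RECOMMENDED_GB = {
    "DeepSeek-R1-UD-IQ1_M": 200,
    "DeepSeek-R1-Q4_K_M": 500,
}

def meets_recommendation(model: str, ram_gb: float,
                         vram_gb_per_gpu: float = 0, num_gpus: int = 0) -> bool:
    """Return True if RAM + total VRAM reaches the recommended minimum."""
    combined_gb = ram_gb + vram_gb_per_gpu * num_gpus
    return combined_gb >= RECOMMENDED_GB[model]

# Hypothetical machines, for illustration only.
small_ok = meets_recommendation("DeepSeek-R1-UD-IQ1_M", ram_gb=192,
                                vram_gb_per_gpu=24, num_gpus=1)   # 216 GB combined
large_ok = meets_recommendation("DeepSeek-R1-Q4_K_M", ram_gb=256,
                                vram_gb_per_gpu=24, num_gpus=2)   # 304 GB combined
print(small_ok, large_ok)
```

On a unified-memory machine, the whole pool can be passed as ram_gb with the GPU arguments left at zero.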
Test environment: 4×RTX 4090 (4×24 GB VRAM), 4-channel DDR5-5600 (4×96 GB RAM), Threadripper 7980X (64 cores).
Generation speed for short outputs (about 500 tokens): UD-IQ1_M reaches 7-8 tokens/s (4-5 tokens/s when running on CPU only), while Q4_K_M reaches 2-4 tokens/s. For long outputs, speed drops to 1-2 tokens/s.
More cost‑effective alternatives include high‑memory Mac Studio, DDR5‑4800 servers, or cloud GPU servers with multiple 80 GB H100 GPUs.
Deployment Steps
1. Download model files
Download the .gguf files from HuggingFace (https://huggingface.co/unsloth/DeepSeek-R1-GGUF) and merge the split parts into a single file.
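Before merging, it can help to confirm that every split part was downloaded. The sketch below assumes the parts follow llama.cpp's split naming convention (prefix-00001-of-0000N.gguf) and that the merge is done with llama.cpp's llama-gguf-split tool; the article does not name the merge tool, and the helper functions and example file names are illustrative.

```python
import re
from pathlib import Path

# Assumed naming convention for split GGUF files: <prefix>-00001-of-00004.gguf
SHARD_RE = re.compile(r"^(?P<prefix>.+)-(?P<index>\d{5})-of-(?P<count>\d{5})\.gguf$")

def expected_shards(first_shard: str) -> list[str]:
    """List every shard file name implied by the first shard's name."""
    match = SHARD_RE.match(Path(first_shard).name)
    if match is None:
        raise ValueError(f"not a split GGUF file name: {first_shard}")
    count = int(match["count"])
    return [f"{match['prefix']}-{i:05d}-of-{count:05d}.gguf"
            for i in range(1, count + 1)]

def missing_shards(first_shard: str) -> list[str]:
    """Return the shard names that are not present next to the first shard."""
    folder = Path(first_shard).parent
    return [name for name in expected_shards(first_shard)
            if not (folder / name).exists()]

def merge_command(first_shard: str, output: str) -> list[str]:
    """Build a llama-gguf-split invocation that merges the shards (assumed tool)."""
    return ["llama-gguf-split", "--merge", first_shard, output]
```

If missing_shards returns an empty list, the command from merge_command can be passed to subprocess.run with check=True to produce the single .gguf file referenced in the Modelfile.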
2. Install Ollama
    curl -fsSL https://ollama.com/install.sh | sh
3. Create Modelfile
Create a Modelfile for the chosen model, for example:
    FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
    PARAMETER num_gpu 28
    PARAMETER num_ctx 2048
    PARAMETER temperature 0.6
    TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
Adjust the file path, num_gpu (the number of layers to offload to the GPUs), and num_ctx (the context window size) as needed.
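One way to pick a starting value for num_gpu is to divide the model size by its layer count and see how many layers fit on each GPU. The sketch below is a heuristic, not Ollama's own logic: it assumes the model has 61 layers of roughly equal size (a figure not stated in this article) and reserves a hypothetical 4 GB per GPU for the context cache and runtime buffers.

```python
import math

def estimate_num_gpu(model_size_gb: float, num_layers: int,
                     vram_gb_per_gpu: float, num_gpus: int,
                     reserve_gb_per_gpu: float) -> int:
    """Estimate how many layers fit in VRAM, assuming equally sized layers."""
    layer_gb = model_size_gb / num_layers
    usable_gb = max(vram_gb_per_gpu - reserve_gb_per_gpu, 0)
    layers_per_gpu = math.floor(usable_gb / layer_gb)
    # Never report more layers than the model has.
    return min(layers_per_gpu * num_gpus, num_layers)

# 158 GB model, 61 layers (assumed), four 24 GB GPUs, 4 GB reserved per GPU (assumed).
starting_value = estimate_num_gpu(158, 61, 24, 4, reserve_gb_per_gpu=4)
print(starting_value)
```

Under these assumptions the estimate is 28, which agrees with the Modelfile example; treat it as a starting point and lower it if the model fails to load.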
4. Create Ollama model
    ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile
Ensure Ollama’s model directory has sufficient space.
5. Run the model
    ollama run DeepSeek-R1-UD-IQ1_M --verbose
Use --verbose to display token-per-second speed; adjust parameters if memory or CUDA errors occur.
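The same speed figure can also be collected programmatically through Ollama's local HTTP API. The sketch below assumes Ollama is serving on its default port 11434 and that the reply carries eval_count and eval_duration (in nanoseconds); the sample reply values are made up for illustration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumed default local endpoint

def generate(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generation request and return the parsed JSON reply."""
    payload = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    request = urllib.request.Request(url, data=payload,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return json.load(response)

def tokens_per_second(reply: dict) -> float:
    """Generation speed from eval_count and eval_duration (nanoseconds)."""
    return reply["eval_count"] / (reply["eval_duration"] / 1e9)

# Made-up reply: 500 tokens generated in 62.5 seconds.
sample_reply = {"eval_count": 500, "eval_duration": 62_500_000_000}
print(f"{tokens_per_second(sample_reply):.1f} tokens/s")
```

With a running server, tokens_per_second(generate("DeepSeek-R1-UD-IQ1_M", "Hello")) would report the speed for a real request.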
6. (Optional) Install Web UI
    pip install open-webui
    open-webui serve
Observations
Both 1.73‑bit and 4‑bit full‑size models perform well on classic tasks; the 1.73‑bit version tends to produce more “edgy” responses, while the 4‑bit version often refuses provocative prompts.
The 1.73‑bit model occasionally generates slightly malformed markup.
CPU utilization is near full while GPU usage remains low (1‑3 %), indicating the bottleneck lies in CPU and memory bandwidth.
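The memory-bandwidth bottleneck can be illustrated with a back-of-the-envelope bound: if every generated token must read each active weight once from RAM, speed cannot exceed bandwidth divided by the bytes read per token. The sketch below assumes theoretical peak DDR bandwidth (8 bytes per transfer per channel) and that roughly 37 billion of the 671 billion parameters are active per token; neither figure is stated in this article, and the function names are illustrative.

```python
# Idealized upper bound on CPU-only generation speed when memory bandwidth
# is the limit. All inputs below are assumptions for illustration.

def ddr_bandwidth_gb_s(channels: int, mega_transfers_per_s: float,
                       bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth of a DDR memory system in GB/s."""
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float,
                          active_params: float, total_params: float) -> float:
    """Upper bound if each token reads every active weight once from RAM."""
    gb_read_per_token = model_size_gb * active_params / total_params
    return bandwidth_gb_s / gb_read_per_token

bandwidth = ddr_bandwidth_gb_s(channels=4, mega_transfers_per_s=5600)  # 4-channel DDR5-5600
# Assumed: about 37e9 of the 671e9 parameters are active per token (MoE routing).
bound = max_tokens_per_second(bandwidth, model_size_gb=158,
                              active_params=37e9, total_params=671e9)
print(f"peak bandwidth {bandwidth:.1f} GB/s, upper bound {bound:.1f} tokens/s")
```

Measured CPU-only speeds are expected to fall well below such a bound, since sustained bandwidth is lower than the theoretical peak and the estimate ignores compute cost and cache effects.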
Conclusion and Recommendations
If the model cannot be fully loaded into VRAM, the 1.73-bit dynamic-quantized version is the more practical choice: it is faster, uses fewer resources, and delivers quality comparable to the 4-bit version.
On consumer hardware, use the model for short, lightweight tasks; longer contexts dramatically reduce generation speed.
