Deploying DeepSeek R1 671B Model Locally with Ollama and Dynamic Quantization
This guide explains how to deploy the full 671B-parameter DeepSeek R1 model on local hardware with Ollama, using dynamic quantization to shrink the model to a manageable size. It covers hardware requirements, step-by-step installation, configuration, performance observations, and practical recommendations.
DeepSeek R1 surged in popularity over the Chinese New Year holiday; what follows summarizes how to run the full 671B model locally with Ollama, relying on dynamic quantization to bring its footprint within reach of a single workstation.
Model selection: the original ~720 GB model is impractical to run locally, so the author uses Unsloth AI's dynamically quantized versions, 1.73-bit (158 GB) and 4-bit (404 GB), available on HuggingFace.
Hardware requirements: combined system RAM plus GPU VRAM of at least ~200 GB for the 1.73-bit model and ~500 GB for the 4-bit model; the author's test rig has four RTX 4090 GPUs, 560 GB of DDR5 RAM, and a 64-core Threadripper.
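The sizing rule above (quantized model size must fit within combined RAM + VRAM, with some headroom for the KV cache and the OS) can be sketched with shell arithmetic. The 20 GB headroom figure and the 24 GB-per-4090 VRAM breakdown are assumptions for illustration, not figures from the article:

```shell
# Feasibility check, all sizes in GB.
model_gb=158       # 1.73-bit dynamic quant (use 404 for the 4-bit version)
ram_gb=560         # system DDR5
vram_gb=96         # assumed: four RTX 4090s at 24 GB each
headroom_gb=20     # assumed slack for KV cache and OS

total=$((ram_gb + vram_gb))
needed=$((model_gb + headroom_gb))
if [ "$total" -ge "$needed" ]; then
  echo "fits: ${total} GB available, ${needed} GB needed"
else
  echo "too small: ${total} GB available, ${needed} GB needed"
fi
```

On the author's rig this comfortably clears the 1.73-bit requirement and would also clear the ~500 GB needed for the 4-bit model.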
Deployment steps (Linux):
Download the .gguf shard files from HuggingFace and merge them into a single file.
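A sketch of this step, assuming the Unsloth GGUF repository on HuggingFace; the repo name, quant directory, and shard count are illustrative and should be checked against the model page for the quant you pick:

```shell
# Fetch the shards for one quantization (requires huggingface_hub's CLI).
pip install huggingface_hub
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" --local-dir ./DeepSeek-R1-GGUF

# Merge the shards into a single .gguf; llama-gguf-split ships with llama.cpp.
# Point it at the first shard and name the merged output file.
llama-gguf-split --merge \
  ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00003.gguf \
  ./DeepSeek-R1-UD-IQ1_M.gguf
```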
Install Ollama with curl -fsSL https://ollama.com/install.sh | sh.
Create a Modelfile (example for the 1.73-bit model) with: FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf; PARAMETER num_gpu 28; PARAMETER num_ctx 2048; PARAMETER temperature 0.6; TEMPLATE '<|User|>{{ .Prompt }}<|Assistant|>'.
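Written out as a file, one directive per line, the Modelfile described above is:

```
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
```

Here num_gpu is the number of transformer layers offloaded to the GPUs and num_ctx is the context window; both can be lowered if the model fails to load.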
Build the model: ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile.
Run with ollama run DeepSeek-R1-UD-IQ1_M --verbose, and reduce num_gpu or num_ctx if out-of-memory errors occur.
Optional: install Open WebUI with pip install open-webui, then start it with open-webui serve.
Observations: the 1.73-bit model is faster and uses fewer resources than the 4-bit version, and both outperform the distilled 8B–70B models, but the CPU remains the bottleneck; token generation drops to 1–2 tokens/s on long contexts.
Conclusion: on consumer-grade hardware, the 1.73-bit dynamically quantized model offers the best trade-off between speed and quality and is well suited to short-text generation, while larger quantizations demand more memory or cloud GPUs.
Additional notes cover merging shards with llama-gguf-split --merge …, changing Ollama's model directory, enabling Flash Attention, and extending swap space.
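The last three notes map to Ollama environment variables and standard Linux swap commands. The variable names below match current Ollama releases but should be verified against the docs for your version, and the storage path is a placeholder:

```shell
# Relocate Ollama's model store to a larger disk (placeholder path).
export OLLAMA_MODELS=/mnt/bigdisk/ollama-models

# Enable Flash Attention in Ollama's llama.cpp backend.
export OLLAMA_FLASH_ATTENTION=1

echo "models dir: $OLLAMA_MODELS, flash attention: $OLLAMA_FLASH_ATTENTION"

# Extending swap requires root and is typically done along these lines:
#   sudo fallocate -l 200G /swapfile
#   sudo chmod 600 /swapfile
#   sudo mkswap /swapfile && sudo swapon /swapfile
```

Restart the Ollama server after exporting these so the daemon picks them up.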
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.