Deploying DeepSeek R1 671B Model Locally with Ollama and Dynamic Quantization
This guide explains how to deploy the full 671B-parameter DeepSeek R1 model on local hardware with Ollama, using dynamic quantization to shrink the model to a manageable size. It covers hardware requirements, step-by-step installation, configuration, performance observations, and practical recommendations.
DeepSeek R1 surged in popularity over the Chinese New Year holiday; what follows summarizes how to run the full 671B model locally with Ollama, relying on dynamic quantization to bring its footprint within reach of a single workstation.
Model selection: the original ~720 GB model is impractical to run locally, so the author uses Unsloth AI's dynamically quantized versions, 1.73-bit (158 GB) and 4-bit (404 GB), available on HuggingFace.
Hardware requirements: combined system RAM plus GPU VRAM of at least ~200 GB for the 1.73-bit model and ~500 GB for the 4-bit model; the author's test rig has four RTX 4090 GPUs, 560 GB of DDR5 RAM, and a 64-core Threadripper.
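The sizing rule above (quantized model size must fit within combined RAM + VRAM, with some headroom for the KV cache and the OS) can be sketched with shell arithmetic. The 20 GB headroom figure and the 24 GB-per-4090 VRAM breakdown are assumptions for illustration, not figures from the article:

```shell
# Feasibility check, all sizes in GB.
model_gb=158       # 1.73-bit dynamic quant (use 404 for the 4-bit version)
ram_gb=560         # system DDR5
vram_gb=96         # assumed: four RTX 4090s at 24 GB each
headroom_gb=20     # assumed slack for KV cache and OS

total=$((ram_gb + vram_gb))
needed=$((model_gb + headroom_gb))
if [ "$total" -ge "$needed" ]; then
  echo "fits: ${total} GB available, ${needed} GB needed"
else
  echo "too small: ${total} GB available, ${needed} GB needed"
fi
```

On the author's rig this comfortably clears the 1.73-bit requirement and would also clear the ~500 GB needed for the 4-bit model.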
Deployment steps (Linux):
Download the .gguf shard files from HuggingFace and merge them into a single file.
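A sketch of this step, assuming the Unsloth GGUF repository on HuggingFace; the repo name, quant directory, and shard count are illustrative and should be checked against the model page for the quant you pick:

```shell
# Fetch the shards for one quantization (requires huggingface_hub's CLI).
pip install huggingface_hub
huggingface-cli download unsloth/DeepSeek-R1-GGUF \
  --include "DeepSeek-R1-UD-IQ1_M/*" --local-dir ./DeepSeek-R1-GGUF

# Merge the shards into a single .gguf; llama-gguf-split ships with llama.cpp.
# Point it at the first shard and name the merged output file.
llama-gguf-split --merge \
  ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_M/DeepSeek-R1-UD-IQ1_M-00001-of-00003.gguf \
  ./DeepSeek-R1-UD-IQ1_M.gguf
```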
Install Ollama with curl -fsSL https://ollama.com/install.sh | sh.
Create a Modelfile (example for the 1.73-bit model) with: FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf; PARAMETER num_gpu 28; PARAMETER num_ctx 2048; PARAMETER temperature 0.6; TEMPLATE '<|User|>{{ .Prompt }}<|Assistant|>'.
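Written out as a file, one directive per line, the Modelfile described above is:

```
FROM /home/snowkylin/DeepSeek-R1-UD-IQ1_M.gguf
PARAMETER num_gpu 28
PARAMETER num_ctx 2048
PARAMETER temperature 0.6
TEMPLATE "<|User|>{{ .Prompt }}<|Assistant|>"
```

Here num_gpu is the number of transformer layers offloaded to the GPUs and num_ctx is the context window; both can be lowered if the model fails to load.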
Build the model: ollama create DeepSeek-R1-UD-IQ1_M -f DeepSeekQ1_Modelfile.
Run with ollama run DeepSeek-R1-UD-IQ1_M --verbose, and reduce num_gpu or num_ctx if out-of-memory errors occur.
Optional: install Open WebUI with pip install open-webui, then start it with open-webui serve.
Observations: the 1.73-bit model is faster and uses fewer resources than the 4-bit version, and both outperform the distilled 8B–70B models, but the CPU remains the bottleneck; token generation drops to 1–2 tokens/s on long contexts.
Conclusion: on consumer-grade hardware, the 1.73-bit dynamically quantized model offers the best trade-off between speed and quality and is well suited to short-text generation, while larger quantizations demand more memory or cloud GPUs.
Additional notes cover merging shards with llama-gguf-split --merge …, changing Ollama's model directory, enabling Flash Attention, and extending swap space.
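The last three notes map to Ollama environment variables and standard Linux swap commands. The variable names below match current Ollama releases but should be verified against the docs for your version, and the storage path is a placeholder:

```shell
# Relocate Ollama's model store to a larger disk (placeholder path).
export OLLAMA_MODELS=/mnt/bigdisk/ollama-models

# Enable Flash Attention in Ollama's llama.cpp backend.
export OLLAMA_FLASH_ATTENTION=1

echo "models dir: $OLLAMA_MODELS, flash attention: $OLLAMA_FLASH_ATTENTION"

# Extending swap requires root and is typically done along these lines:
#   sudo fallocate -l 200G /swapfile
#   sudo chmod 600 /swapfile
#   sudo mkswap /swapfile && sudo swapon /swapfile
```

Restart the Ollama server after exporting these so the daemon picks them up.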
Architecture Digest
Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.