How to Deploy Qwen3-8B on WSL2 with 4‑Bit Quantization and Resource Limits
This article is a step‑by‑step guide to deploying the Qwen3‑8B large language model on a Windows 11 system using WSL2, covering the hardware and software setup, CUDA configuration, 4‑bit quantization with BitsAndBytes, SDPA attention optimization, CPU offload, and the resource limits that keep inference smooth while the host stays responsive.
Host Configuration
Hardware
CPU: AMD Ryzen 9 7900X (12 cores, 24 threads, 4.7 GHz base, 5.4 GHz boost)
Memory: 32 GB DDR5 6000 MHz (2 × 16 GB)
GPU: NVIDIA GeForce RTX 4060 with 16 GB GDDR6
Storage: 1 TB NVMe SSD (PCIe 4.0)
Software Environment
OS: Windows 11 Pro 23H2
WSL2: Ubuntu 22.04.4 LTS (kernel 5.15.146.1‑microsoft‑standard‑WSL2)
CUDA: 12.8 (bundled with the PyTorch cu128 wheel; no separate CUDA Toolkit install is needed inside WSL2)
Python: 3.10.12
Model: Qwen/Qwen3‑8B (≈8.03 B parameters)
WSL2 Resource Limiting
Create a .wslconfig file in the Windows user directory (C:\Users\<username>\.wslconfig) to restrict the CPU cores, memory, and swap available to WSL2:
[wsl2]
# Limit WSL2 to 8 CPU cores to avoid system stalls during AI inference
processors=8
# Allocate 16 GB RAM to WSL2
memory=16GB
# Provide 4 GB swap space
swap=4GB
Apply the changes by restarting WSL2 from a Windows terminal:
wsl --shutdown
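Once WSL2 comes back up, the limits can be verified from inside the distro. A minimal check (the reported memory will sit slightly below the configured 16 GB because WSL2 reserves a little for itself):

import os

# Should roughly match the .wslconfig limits (8 cores, ~16 GB RAM, 4 GB swap)
print(f"CPU cores visible to WSL2: {os.cpu_count()}")
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("MemTotal", "SwapTotal")):
            print(line.strip())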
Environment Preparation
Dependency Management
torch 2.10.0+cu128 – deep‑learning framework with CUDA support
transformers 4.43.3 – HuggingFace model loading and inference
accelerate 0.30.1 – distributed model loading and acceleration
bitsandbytes 0.42.0 – 4‑bit quantization support
chainlit 2.9.6 – web UI for interaction
psutil 5.10.0 – system resource monitoring
sentencepiece 0.1.99 – tokenizer support
protobuf 3.20.3 – data serialization
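For reference, the list maps onto a requirements.txt roughly as follows (plain version pins are shown here; the CUDA‑specific +cu128 torch build usually comes from the PyTorch wheel index rather than a PyPI mirror):

torch==2.10.0
transformers==4.43.3
accelerate==0.30.1
bitsandbytes==0.42.0
chainlit==2.9.6
psutil==5.10.0
sentencepiece==0.1.99
protobuf==3.20.3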
Install the dependencies using the Tsinghua mirror for speed:
pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
Model Deployment and Optimization
SDPA Attention Optimization
PyTorch 2.0’s built‑in Scaled Dot‑Product Attention (SDPA) automatically selects the most efficient implementation, reduces VRAM usage, and speeds up inference without extra dependencies. Enable it via the model arguments:
model_kwargs = {
    "trust_remote_code": True,
    "quantization_config": quant_config,
    "device_map": "auto",
    "attn_implementation": "sdpa",
}
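These arguments are passed straight through when the model is loaded. A minimal loading sketch, assuming quant_config has been defined as shown in the next section:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)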
4‑Bit Quantization
Use BitsAndBytes to quantize the model to 4‑bit, dramatically lowering VRAM consumption while preserving accuracy with NF4 and double quantization:
from transformers import BitsAndBytesConfig
import torch

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, better for normally‑distributed weights
    bnb_4bit_use_double_quant=True,         # Double quantization for extra savings
    bnb_4bit_compute_dtype=torch.float16,   # Run compute in FP16 during inference
)
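To confirm the quantization actually took effect, the loaded model's footprint can be printed once it has been loaded with the kwargs above. A quick check (the 5–6 GB figure is an expectation for an 8B model in NF4, not a value measured on this setup):

# Rough sanity check after loading; an 8B model in NF4 typically lands around 5-6 GB
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.2f} GiB")
if torch.cuda.is_available():
    print(f"VRAM allocated:  {torch.cuda.memory_allocated() / 1024**3:.2f} GiB")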
CUDA Memory Fragmentation Mitigation
Enable PyTorch’s expandable‑segments allocator to reduce out‑of‑memory errors caused by fragmentation. Set the variable before the first CUDA allocation, i.e. before the model is loaded:
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"
GPU Memory Reservation and CPU Resource Limits
Reserve a portion of GPU memory and limit CPU usage to keep the host responsive:
# GPU memory limit (use at most 95% of VRAM)
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.95)

# CPU thread limit (use at most 80% of cores)
cpu_count = os.cpu_count() or 1
max_threads = max(1, int(cpu_count * 0.8))
torch.set_num_threads(max_threads)
CPU Offload
Offload part of the model to system RAM, further reducing GPU pressure. The device_map="auto" setting lets accelerate distribute layers between CPU and GPU automatically:
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,  # Enable CPU offload
)
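If the automatic split leaves too little VRAM headroom, an explicit max_memory budget can be passed alongside device_map="auto" to control how much stays on the GPU. A sketch where the 7 GiB and 12 GiB caps are illustrative placeholders rather than values from this setup:

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=quant_config,
    device_map="auto",
    # Illustrative caps: leave headroom on the GPU and stay inside the 16 GB granted to WSL2
    max_memory={0: "7GiB", "cpu": "12GiB"},
    attn_implementation="sdpa",
    trust_remote_code=True,
)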
CPU Affinity
Bind the inference process to a subset of cores to reduce context switches:
import psutil

p = psutil.Process()
available_cpus = p.cpu_affinity()            # Cores the process is currently allowed to run on
cpus_to_use = available_cpus[:max_threads]   # max_threads computed in the previous section
p.cpu_affinity(cpus_to_use)
Performance Results
Overall inference speed: 15–25 tokens/s (depending on input length)
Thinking phase: 10–18 tokens/s
Answering phase: 18–28 tokens/s
Host remains responsive; no noticeable lag during AI inference
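For anyone wanting to reproduce the numbers, a rough timing sketch (assumes the model and tokenizer are loaded as above; results vary with prompt length and sampling settings):

import time

prompt = "Explain in one paragraph what WSL2 is."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

start = time.time()
output = model.generate(**inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs["input_ids"].shape[-1]
print(f"Generated {new_tokens} tokens in {elapsed:.1f} s ({new_tokens / elapsed:.1f} tokens/s)")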
Repository
https://github.com/jxd134/qwen3-local-chat