How to Deploy Qwen3-8B on WSL2 with 4‑Bit Quantization and Resource Limits

This article is a step‑by‑step guide to deploying the Qwen3‑8B large language model on Windows 11 via WSL2, covering the hardware setup, CUDA configuration, 4‑bit quantization with BitsAndBytes, SDPA attention optimization, CPU offload, and resource limits that keep inference smooth while the host stays responsive.

Host Configuration

Hardware

CPU: AMD Ryzen 9 7900X (12 cores, 24 threads, 4.7 GHz base, 5.4 GHz boost)

Memory: 32 GB DDR5 6000 MHz (2 × 16 GB)

GPU: NVIDIA GeForce RTX 4060 with 16 GB GDDR6

Storage: 1 TB NVMe SSD (PCIe 4.0)

Software Environment

OS: Windows 11 Pro 23H2

WSL2: Ubuntu 22.04.4 LTS (kernel 5.15.146.1‑microsoft‑standard‑WSL2)

CUDA: 12.1 (bundled with the PyTorch build)

Python: 3.10.12

Model: Qwen/Qwen3‑8B (≈8.03 B parameters)

WSL2 Resource Limiting

Create a .wslconfig file in the Windows user directory (%UserProfile%\.wslconfig, e.g. C:\Users\<username>\.wslconfig) to restrict the CPU cores, memory, and swap available to WSL2:

[wsl2]
# Limit WSL2 to 8 CPU cores to avoid system stalls during AI inference
processors=8
# Allocate 16 GB RAM to WSL2
memory=16GB
# Provide 4 GB swap space
swap=4GB

Apply the changes by restarting WSL2:

wsl --shutdown
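
After the restart, you can confirm the limits took effect from inside the Ubuntu distro. A minimal check using only the Python standard library might look like this (the reported total RAM is usually slightly below the configured 16 GB because of kernel overhead):

import os

# Cores and RAM as seen inside WSL2; should reflect processors=8 and memory=16GB
print("CPUs visible to WSL2:", os.cpu_count())
with open("/proc/meminfo") as f:
    mem_kib = int(f.readline().split()[1])  # first line: "MemTotal: <n> kB"
print(f"RAM visible to WSL2: {mem_kib / 1024**2:.1f} GB")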

Environment Preparation

Dependency Management

torch 2.10.0+cu128 – deep‑learning framework with CUDA support

transformers 4.43.3 – HuggingFace model loading and inference

accelerate 0.30.1 – distributed model loading and acceleration

bitsandbytes 0.42.0 – 4‑bit quantization support

chainlit 2.9.6 – web UI for interaction

psutil 5.10.0 – system resource monitoring

sentencepiece 0.1.99 – tokenizer support

protobuf 3.20.3 – data serialization

Install the dependencies, using the Tsinghua PyPI mirror for faster downloads when the default index is slow:

pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple

Model Deployment and Optimization

SDPA Attention Optimization

PyTorch 2.x’s built‑in Scaled Dot‑Product Attention (SDPA) automatically dispatches to the most efficient available backend (FlashAttention, memory‑efficient attention, or the math fallback), reducing VRAM usage and speeding up inference without extra dependencies. Enable it via the model arguments:

model_kwargs = {
    "trust_remote_code": True,
    "quantization_config": quant_config,
    "device_map": "auto",
    "attn_implementation": "sdpa",
}
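
As an optional sanity check (not part of the original setup), you can ask PyTorch which SDPA backends are enabled on your installation; SDPA then picks among these per call based on shapes, dtypes, and hardware:

import torch

# Which SDPA backends PyTorch may dispatch to on this machine
print("flash attention  :", torch.backends.cuda.flash_sdp_enabled())
print("memory-efficient :", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math fallback    :", torch.backends.cuda.math_sdp_enabled())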

4‑Bit Quantization

Use bitsandbytes to quantize the model to 4 bits, dramatically lowering VRAM consumption; the NF4 data type and double quantization help preserve accuracy:

import torch
from transformers import BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4, better suited to normally distributed weights
    bnb_4bit_use_double_quant=True,  # double quantization for extra savings
    bnb_4bit_compute_dtype=torch.float16,  # compute in FP16 while weights stay 4-bit
)
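
Putting the pieces together, a minimal loading sketch might look like the following (the model ID comes from the host configuration above; quant_config and model_kwargs are the objects defined in this section, and the memory figure is only a rough expectation):

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, **model_kwargs)

# Quantized weights should occupy roughly 5-6 GB instead of ~16 GB in FP16
print(f"Model footprint: {model.get_memory_footprint() / 1024**3:.1f} GB")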

CUDA Memory Fragmentation Mitigation

Enable PyTorch’s expandable‑segments allocator to reduce out‑of‑memory errors caused by VRAM fragmentation. The variable is read when the CUDA allocator initializes, so set it before the first CUDA allocation, ideally at the very top of the script:

import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

GPU Memory Reservation and CPU Resource Limits

Cap the process at 95% of VRAM (leaving headroom for the Windows desktop) and limit the CPU threads PyTorch uses so the host stays responsive:

# GPU memory limit (use at most 95% of VRAM)
if torch.cuda.is_available():
    torch.cuda.set_per_process_memory_fraction(0.95)

# CPU thread limit (use at most 80% of cores)
cpu_count = os.cpu_count() or 1
max_threads = max(1, int(cpu_count * 0.8))
torch.set_num_threads(max_threads)
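
During inference, psutil (from the dependency list) and torch.cuda.mem_get_info can be used to verify the limits are being respected; a quick, illustrative check:

import psutil
import torch

free_vram, total_vram = torch.cuda.mem_get_info()  # bytes on the current device
print(f"VRAM in use: {(total_vram - free_vram) / 1024**3:.1f} / {total_vram / 1024**3:.1f} GB")
print(f"CPU usage  : {psutil.cpu_percent(interval=1.0):.0f} %")
print(f"RAM usage  : {psutil.virtual_memory().percent:.0f} %")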

CPU Offload

Offload part of the model to system RAM, further reducing GPU pressure. The device_map="auto" setting lets accelerate distribute layers between CPU and GPU automatically:

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
    llm_int8_enable_fp32_cpu_offload=True,  # Enable CPU offload
)
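
If you want finer control over how much of the model lands on the GPU versus system RAM, from_pretrained also accepts a per-device max_memory budget; the figures below are illustrative only, not taken from the original setup:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-8B",
    quantization_config=quant_config,
    device_map="auto",
    max_memory={0: "14GiB", "cpu": "12GiB"},  # GPU 0 budget first, spill the rest to RAM
    attn_implementation="sdpa",
    trust_remote_code=True,
)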

CPU Affinity

Bind the inference process to a subset of cores to reduce context switches:

import os
import psutil

p = psutil.Process()
available_cpus = list(range(os.cpu_count() or 1))  # logical cores visible to WSL2
cpus_to_use = available_cpus[:max_threads]  # max_threads from the CPU limit above
p.cpu_affinity(cpus_to_use)

Performance Results

Inference speed: 15–25 tokens/s (depends on input length)

Thinking phase speed: 10–18 tokens/s

Answering phase speed: 18–28 tokens/s

Host remains responsive; no noticeable lag during AI inference
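
For reference, figures like these can be reproduced by timing a single call to model.generate and dividing the number of newly generated tokens by the elapsed time (this sketch reuses the model and tokenizer loaded earlier; the prompt is only an example):

import time
import torch

messages = [{"role": "user", "content": "Briefly explain what WSL2 is."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)

start = time.time()
with torch.no_grad():
    output = model.generate(inputs, max_new_tokens=256)
elapsed = time.time() - start

new_tokens = output.shape[-1] - inputs.shape[-1]
print(f"{new_tokens / elapsed:.1f} tokens/s")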

Repository

https://github.com/jxd134/qwen3-local-chat
Tags: PyTorch, WSL2, CUDA optimization, 4-bit quantization, Qwen3-8B, SDPA