
Running AI/ML Models on WSL with CUDA Acceleration: A PyTorch Hands‑On Guide

This guide shows how to enable NVIDIA GPU passthrough in WSL 2, install the CUDA toolkit, set up a PyTorch GPU environment, verify GPU visibility, and run real‑world AI/ML workloads such as LLM inference, YOLO object detection, and Jupyter monitoring, while providing performance comparisons, optimization tips, and troubleshooting FAQs.

Ubuntu

Jun 15, 2026


WSL 2 GPU Support Overview

✅ NVIDIA GPUs: GeForce GTX 16 series, RTX 20/30/40/50 series, Tesla data‑center cards

✅ Supported frameworks: PyTorch (full), TensorFlow (full), JAX (partial), TensorFlow Lite, ONNX Runtime

✅ Typical applications: deep‑learning training & inference, large language models (e.g., Llama, Qwen), computer‑vision (YOLO, Stable Diffusion), data‑science computing

❌ AMD GPUs cannot use the CUDA path described in this guide; ROCm is a separate stack with its own setup requirements.

Performance Comparison

CUDA compilation speed: identical to native Linux.

GPU inference throughput: 98‑102 % of native Linux performance.

GPU memory utilization: 93‑97 % in WSL 2, versus 96 %+ on native Linux.

Multi‑GPU training: works out‑of‑the‑box in WSL 2, while Windows native requires extra configuration.

Docker GPU support: native in WSL 2, same as native Linux.

Step 1 – Pre‑flight Checks

In PowerShell run nvidia-smi to verify driver version (≥ 510.x) and the CUDA version reported by the driver. Example output shows driver version 550.00 and CUDA 12.4.

Driver version must be ≥ 510.x (latest recommended).

CUDA version displayed is the highest version the driver supports.

GPU memory size determines the maximum model size that can be run.
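These checks can also be scripted. The sketch below is a minimal example that assumes nvidia-smi's CSV query mode; the helper names (parse_gpu_query, driver_ok, query_gpus) and the sample line are illustrative and not part of any NVIDIA tool.

```python
import subprocess

MIN_DRIVER_MAJOR = 510  # minimum driver branch named in this guide

QUERY = ["nvidia-smi", "--query-gpu=name,driver_version,memory.total",
         "--format=csv,noheader,nounits"]

def parse_gpu_query(text):
    """Parse 'name, driver_version, memory_in_MiB' CSV lines into dicts."""
    gpus = []
    for line in text.strip().splitlines():
        name, driver, mem_mib = [field.strip() for field in line.split(",")]
        gpus.append({"name": name, "driver": driver,
                     "memory_gib": int(mem_mib) / 1024})
    return gpus

def driver_ok(driver_version):
    """True if the driver's major version meets the minimum."""
    return int(driver_version.split(".")[0]) >= MIN_DRIVER_MAJOR

def query_gpus():
    """Run nvidia-smi and parse its output (call this on a machine with a GPU)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_query(out)

# Illustrative sample line, not captured from a real machine.
sample = "NVIDIA GeForce RTX 4060 Laptop GPU, 550.00, 8188"
for gpu in parse_gpu_query(sample):
    status = "OK" if driver_ok(gpu["driver"]) else "too old"
    print(f"{gpu['name']}: driver {gpu['driver']} ({status}), {gpu['memory_gib']:.1f} GiB")
```

On a real system, replace the sample with query_gpus(); the same script works from Windows or from inside WSL once nvidia-smi is on the PATH.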

Verify GPU Visibility Inside WSL

# In WSL
nvidia-smi
# If the full GPU table appears, passthrough is active ✅
# If "command not found", install or update the NVIDIA driver on Windows;
# do not install a Linux NVIDIA driver inside WSL.
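When nvidia-smi is missing, it helps to know which piece is absent. The following is a rough sketch that assumes the usual WSL 2 layout, where the GPU is exposed as /dev/dxg and the Windows driver places its user-mode files under /usr/lib/wsl/lib. The function name and its root parameter (useful for pointing at a test directory) are illustrative.

```python
from pathlib import Path

# Paths that WSL 2 GPU passthrough normally provides (assumed layout).
EXPECTED = {
    "gpu device": "dev/dxg",
    "CUDA driver library": "usr/lib/wsl/lib/libcuda.so.1",
    "nvidia-smi binary": "usr/lib/wsl/lib/nvidia-smi",
}

def wsl_gpu_report(root="/"):
    """Return {description: bool} telling which expected paths exist under root."""
    base = Path(root)
    return {label: (base / rel).exists() for label, rel in EXPECTED.items()}

for label, present in wsl_gpu_report().items():
    print(f"{'found  ' if present else 'missing'}  {label}")
```

If the device is missing, the distribution is probably running under WSL 1 or the Windows driver is too old; if only the binary is missing from the PATH, the files may still be present under /usr/lib/wsl/lib.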

Step 2 – Install CUDA Toolkit

Why the Toolkit Is Needed

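nvidia-smi only needs the driver.
The CUDA Toolkit provides:
- nvcc (compiler)
- cuBLAS / cuFFT and other GPU libraries
cuDNN (deep‑learning acceleration) is a separate NVIDIA download rather than part of the toolkit. The PyTorch wheels installed in Step 3 bundle their own CUDA runtime and cuDNN, so the toolkit mainly matters when you compile CUDA code or extensions yourself.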

Installation Methods

# Method 1: apt (Ubuntu recommended)
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt update
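# Install only the toolkit; the cuda and cuda-drivers meta-packages would pull in a Linux driver, which WSL must not have
sudo apt install cuda-toolkit-12-4   # adjust version as needed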

# Method 2: runfile (more flexible)
# Download from https://developer.nvidia.com/cuda-downloads (WSL‑Ubuntu, runfile)
# Execute the installer script.

# Verify installation
nvcc --version

Configure Environment Variables

echo 'export PATH=/usr/local/cuda/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
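To confirm that the PATH change took effect, the toolkit release can be read back programmatically. This is a small sketch, assuming the "release X.Y" wording that nvcc --version prints; the function names and the build number in the sample line are illustrative.

```python
import re
import shutil
import subprocess

def parse_nvcc_release(text):
    """Extract (major, minor) from `nvcc --version` output, or None if absent."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    return (int(match.group(1)), int(match.group(2))) if match else None

def installed_toolkit():
    """Return the toolkit release found on PATH, or None if nvcc is not there."""
    if shutil.which("nvcc") is None:
        return None
    out = subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout
    return parse_nvcc_release(out)

# Sample line in the format nvcc prints; the build number is illustrative.
sample = "Cuda compilation tools, release 12.4, V12.4.131"
print("parsed sample:", parse_nvcc_release(sample))
print("nvcc on PATH:", installed_toolkit())
```

A result of None for the second line usually means the shell has not re-read ~/.bashrc or the toolkit was installed to a different prefix.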

Step 3 – Set Up a PyTorch GPU Environment

Create a Dedicated Virtual Environment

# Create AI/ML environment
python3 -m venv ~/ml-env
source ~/ml-env/bin/activate
pip install --upgrade pip

# Install PyTorch (GPU version)
# See https://pytorch.org/get-started/locally/ for the latest command
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Install common ML packages
pip install numpy pandas matplotlib seaborn plotly jupyterlab scikit-learn
pip install transformers accelerate bitsandbytes
pip install ultralytics   # YOLO
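Choosing the wheel index (cu124 above) comes down to comparing it with the CUDA version the driver reports. The sketch below applies a deliberately conservative rule: the wheel's CUDA version should not exceed the driver's. NVIDIA's compatibility rules can be more permissive than this, and the helper names and example tags are assumptions made for illustration.

```python
def cuda_tag_to_version(tag):
    """Convert a wheel tag such as 'cu124' to (12, 4)."""
    digits = tag.removeprefix("cu")
    return int(digits[:-1]), int(digits[-1])

def wheel_fits_driver(tag, driver_cuda):
    """True if the wheel's CUDA version does not exceed the driver's."""
    major, minor = (int(part) for part in driver_cuda.split("."))
    return cuda_tag_to_version(tag) <= (major, minor)

# "12.4" stands in for the CUDA version shown in the nvidia-smi header.
for tag in ("cu118", "cu124", "cu128"):
    verdict = "ok" if wheel_fits_driver(tag, "12.4") else "needs a newer driver"
    print(f"{tag}: {verdict}")
```

Check the PyTorch install page for the tags that are actually published for your PyTorch version before relying on any of them.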

Verify PyTorch GPU Support

import torch
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA version: {torch.version.cuda}")
    print(f"GPU count: {torch.cuda.device_count()}")
    print(f"Current GPU: {torch.cuda.get_device_name(0)}")
    print(f"GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
    # Simple compute test
    x = torch.randn(1000, 1000, device='cuda')
    y = torch.matmul(x, x.t())
    print(f"GPU matrix multiply: shape = {y.shape} ✅")
else:
    print("❌ CUDA not available!")

Running the script prints the PyTorch version, confirms CUDA availability, shows the GPU name (e.g., NVIDIA GeForce RTX 4060 Laptop GPU), memory (8 GB), and a successful matrix multiplication.

Step 4 – Real‑World Application Scenarios

Scenario 1 – Running a Small LLM

# Install Hugging Face Transformers
pip install transformers accelerate sentencepiece

# run-llm.py
from transformers import pipeline
import time
print("🔄 Loading model…")
start = time.time()
generator = pipeline('text-generation', model='Qwen/Qwen2.5-1.5B-Instruct', device_map='cuda')
load_time = time.time() - start
print(f"✅ Model loaded in {load_time:.1f}s")
prompt = "Explain what WSL is in simple terms."
result = generator(prompt, max_new_tokens=200, do_sample=True)[0]['generated_text']
print(result)

Memory requirement: ~3 GB for a 1.5 B‑parameter model and ~14 GB for a 7 B‑parameter model, counting 16‑bit weights only (2 bytes per parameter); loading in 32‑bit doubles these figures. 4‑bit or 8‑bit quantization via bitsandbytes reduces them.
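The arithmetic behind these figures is parameters times bytes per parameter. The helper below is a back-of-the-envelope sketch (the function name is illustrative) that covers the weights only; activations, the KV cache, and the CUDA context add to the total.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_param / 8

for params in (1.5, 7.0):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{params:>4} B params @ {bits:>2}-bit: {gb:5.2f} GB")
```

By this estimate a 7 B model quantized to 4 bits needs about 3.5 GB for its weights, which is why it can fit on an 8 GB laptop GPU where the 16‑bit version cannot.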

Scenario 2 – YOLO Object Detection

# Install YOLO
pip install ultralytics opencv-python-headless

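# detect.py
from ultralytics import YOLO

print("🔄 Loading YOLOv8n…")
model = YOLO('yolov8n.pt')   # Nano version (~6 MB), downloaded on first use
print("📸 Running inference (GPU)…")
# A webcam (source=0) and a preview window (show=True) are usually unavailable in WSL
# with the headless OpenCV build, so run on an image and save the annotated result.
# The URL is the Ultralytics sample image; replace it with your own file path.
results = model.predict(source='https://ultralytics.com/images/bus.jpg', save=True, conf=0.5, device=0)
print("✅ Detection finished! Results saved in runs/detect/")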

Scenario 3 – Jupyter Notebook GPU Monitoring

# Launch Jupyter Lab, then open the printed localhost:8888 URL in a Windows browser (WSL 2 forwards localhost by default)
jupyter lab --ip=0.0.0.0 --port=8888 --no-browser

# In a notebook cell
!nvidia-smi
import torch
if torch.cuda.is_available():
    allocated = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"Allocated: {allocated:.2f} GB | Reserved: {reserved:.2f} GB")

Step 5 – Performance‑Optimization Tips

Tip 1 – CUDA Memory Strategy

import os
# Set this before importing torch so it is in place before the first CUDA allocation
os.environ['PYTORCH_CUDA_ALLOC_CONF'] = 'expandable_segments:True'   # lets CUDA memory segments grow, reducing fragmentation
import torch
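The allocator setting is a comma-separated list of key:value pairs, so several options can share one variable. The helper below is an illustrative sketch of composing that string; max_split_size_mb is included only to show the separator, and whether any given combination suits a workload is something to verify by measurement.

```python
import os

def build_alloc_conf(options):
    """Join allocator options into the 'key:value,key:value' form."""
    return ",".join(f"{key}:{value}" for key, value in options.items())

conf = build_alloc_conf({"expandable_segments": True, "max_split_size_mb": 128})

# setdefault keeps any value already exported in the shell.
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", conf)
print(os.environ["PYTORCH_CUDA_ALLOC_CONF"])
```

As with the single-option form, this has to run before torch is imported to be sure it takes effect.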

Tip 2 – Mixed‑Precision Training (Up to ≈ 50 % Activation‑Memory Savings)

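# Assumes model, optimizer, criterion, input_data, and target are already defined and on the GPU
from torch import amp
scaler = amp.GradScaler('cuda')
optimizer.zero_grad()
with amp.autocast('cuda'):
    output = model(input_data)
    loss = criterion(output, target)
scaler.scale(loss).backward()   # scale the loss so small fp16 gradients do not underflow
scaler.step(optimizer)          # unscales gradients and skips the step if they contain inf/NaN
scaler.update()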

Tip 3 – Multi‑GPU Scaling

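# Simple DataParallel (single process, all visible GPUs)
model = torch.nn.DataParallel(model).cuda()

# DistributedDataParallel for large‑scale training
# Launch with: torchrun --nproc_per_node=<num_gpus> train.py   (train.py is a placeholder script name)
import os
import torch.distributed as dist
dist.init_process_group(backend='nccl')
local_rank = int(os.environ['LOCAL_RANK'])   # set by torchrun, one process per GPU
torch.cuda.set_device(local_rank)
model = torch.nn.parallel.DistributedDataParallel(model.to(local_rank), device_ids=[local_rank])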
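Under torchrun, each process learns its place from the RANK, LOCAL_RANK, and WORLD_SIZE environment variables. The sketch below reads them with single-process defaults and shows a strided split of sample indices, similar in spirit to what a distributed sampler does. The helper names are illustrative and no GPU is needed to try it.

```python
import os

def read_dist_env(environ=os.environ):
    """Read the rank variables torchrun exports, with single-process defaults."""
    return {
        "rank": int(environ.get("RANK", 0)),
        "local_rank": int(environ.get("LOCAL_RANK", 0)),
        "world_size": int(environ.get("WORLD_SIZE", 1)),
    }

def shard_indices(num_samples, rank, world_size):
    """Indices of the samples this process should handle (strided split)."""
    return list(range(rank, num_samples, world_size))

env = read_dist_env()
print(env, shard_indices(10, env["rank"], env["world_size"]))
```

Run directly, the script reports rank 0 of 1 and keeps every index; under torchrun each process would print its own disjoint slice.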

.wslconfig GPU Optimizations

[wsl2]
# AI tasks often need more RAM; keep these values below what the Windows host actually has
memory=16GB
processors=10
swap=8GB
vmIdleTimeout=-1
# No special GPU config needed; passthrough works out‑of‑the‑box
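If the file is generated per machine, the standard library can render it. This is a minimal sketch using configparser; the function name and its parameters are illustrative, and only the three sizing keys from the example above are written.

```python
import configparser
import io

def render_wslconfig(memory_gb, processors, swap_gb):
    """Render a [wsl2] section as text in key=value form."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # preserve key case instead of lowercasing
    parser["wsl2"] = {
        "memory": f"{memory_gb}GB",
        "processors": str(processors),
        "swap": f"{swap_gb}GB",
    }
    buffer = io.StringIO()
    parser.write(buffer, space_around_delimiters=False)
    return buffer.getvalue()

print(render_wslconfig(memory_gb=16, processors=10, swap_gb=8))
```

Save the output as .wslconfig in your Windows user profile folder and run wsl --shutdown so the new limits apply on the next start.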

Common FAQ

Q: nvidia-smi not found in WSL

1. Ensure Windows has NVIDIA driver ≥ 510.
2. Update WSL: wsl --update
3. Restart WSL: wsl --shutdown
4. Verify you are using WSL 2 (wsl --list -v shows VERSION).

Q: RuntimeError: CUDA out of memory

1. Reduce batch size.
2. Use gradient accumulation.
3. Enable mixed‑precision (AMP).
4. Clear cache: torch.cuda.empty_cache().
5. Close other GPU‑using programs.
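Gradient accumulation (item 2) trades time for memory: several small batches are processed in turn and their gradients are summed before one optimizer step. The sketch below uses a toy linear model with arbitrary sizes and checks that four accumulated micro-batches give the same gradients as one full batch. It runs on the CPU; the names are illustrative.

```python
import torch
from torch import nn

torch.manual_seed(0)
inputs, targets = torch.randn(32, 8), torch.randn(32, 1)
criterion = nn.MSELoss()

def grads_full_batch(model):
    """Gradients from one backward pass over the whole batch."""
    model.zero_grad()
    criterion(model(inputs), targets).backward()
    return [p.grad.clone() for p in model.parameters()]

def grads_accumulated(model, accum_steps):
    """Gradients summed over equal-sized micro-batches."""
    model.zero_grad()
    for x, y in zip(inputs.chunk(accum_steps), targets.chunk(accum_steps)):
        # Dividing by accum_steps makes the summed gradients match the full-batch mean.
        (criterion(model(x), y) / accum_steps).backward()
    return [p.grad.clone() for p in model.parameters()]

model = nn.Linear(8, 1)
full = grads_full_batch(model)
accumulated = grads_accumulated(model, accum_steps=4)
match = all(torch.allclose(a, b, atol=1e-6) for a, b in zip(full, accumulated))
print("gradients match:", match)
```

In a real training loop, optimizer.step() and optimizer.zero_grad() are called once after every accum_steps micro-batches, so peak activation memory corresponds to the micro-batch size rather than the effective batch size.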

Q: GPU in WSL slower than native Windows?

Performance is normally comparable. If slower:
1. Ensure model and data reside on CUDA.
2. Avoid accidental CPU fallback.
3. Disable WSLg GUI to free resources.
4. Stop unnecessary background services.

WSL 2 provides near‑native Linux GPU performance for AI/ML development on Windows, enabling high GPU utilization and flexible workflow integration.

