How to Set Up WSL2 GPU Acceleration and Profile CUDA on Windows 11

This guide walks through configuring Windows 11 with WSL2 and Ubuntu 22.04 for GPU‑accelerated CUDA development, installing NVIDIA drivers and CUDA libraries, setting up SSH and firewall rules, running a CUDA stress‑test program, and using Nsight Systems, Nsight Compute, and NVIDIA DCGM for performance profiling and monitoring.


Environment Information

Windows 11 (recommended) provides the GPU device and driver

WSL2 Ubuntu 22.04 provides the CUDA runtime

macOS is used as the programming and performance‑analysis client

Enable WSL and Virtual Machine Platform

Open PowerShell as Administrator

Install WSL (https://learn.microsoft.com/en-us/windows/wsl/install)

Enable the WSL feature

dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart

Enable the Virtual Machine Platform feature

dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart

Restart the computer
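On up‑to‑date Windows 11 builds, the two dism feature commands above can usually be replaced by a single command in the same elevated PowerShell, which enables both features and installs the default Ubuntu distribution in one step:

```
wsl --install
```

A restart is still required afterwards; the manual dism route remains useful when you want to enable the features without installing a distribution.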

Install Ubuntu on Win11

Open Microsoft Store, search for Ubuntu and install the latest LTS version

Launch Ubuntu from the Start menu; the first run may prompt a WSL version update and ask for a username and password

Open the WSL settings (Start → WSL Settings) to adjust CPU, memory, network, and disk configuration

Configure WSL Network for LAN Access

By default WSL2 uses NAT networking (the distribution gets a 172.x.x.x address). For SSH access from the LAN, switch to Mirrored mode so the Ubuntu IP matches the host IP, allowing direct LAN access. Check the current address from inside Ubuntu:

$ ip a
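Mirrored mode is a per‑host setting in the `.wslconfig` file under the Windows user profile (it requires Windows 11 22H2 or later); a minimal sketch:

```
[wsl2]
networkingMode=mirrored
```

After editing, run `wsl --shutdown` from PowerShell and relaunch Ubuntu; `ip a` should then show the host's LAN address instead of a 172.x one.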

Configure Ubuntu SSH

Remove the initial ssh server, refresh the package index, then reinstall:

sudo apt remove openssh-server
sudo apt update
sudo apt install openssh-server

Edit /etc/ssh/sshd_config to change the port and enable root login

Port 2223
PermitRootLogin yes
PasswordAuthentication yes

Restart and enable the ssh service

sudo service ssh restart
sudo systemctl enable ssh

Open Win11 Firewall for SSH

Add a port‑proxy (only needed for NAT mode)

netsh interface portproxy add v4tov4 listenport=2223 listenaddress=0.0.0.0 connectport=2223 connectaddress=<WSL2_IP_Address>

Create an inbound firewall rule

netsh advfirewall firewall add rule name=WSL2 dir=in action=allow protocol=TCP localport=2223
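To confirm that both entries took effect, the standard netsh queries below can be run from the same elevated PowerShell (the rule name WSL2 matches the one created above):

```
netsh interface portproxy show all
netsh advfirewall firewall show rule name=WSL2
```

Note that in NAT mode the WSL address can change after a reboot; in that case delete the stale proxy with `netsh interface portproxy delete v4tov4 listenport=2223 listenaddress=0.0.0.0` and re‑add it with the new address.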

Linux CUDA on a Windows GPU Machine

Microsoft has provided GPU‑accelerated WSL2 since May 2020. The GPU driver lives in Windows, while the CUDA runtime libraries are installed inside the Linux distribution.

CUDA stack on WSL2

Install NVIDIA Driver on Win11

Download and install the NVIDIA App (Game Ready or Studio). The driver is automatically exposed to WSL as libcuda.so.

Install CUDA Libraries in Ubuntu

# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
# Add key
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/3bf863cc.pub
# Add repo
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/ /"
# Install CUDA
sudo apt update
sudo apt install -y cuda

Set environment variables:

export CUDA_HOME=/usr/local/cuda-13.0
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
export PATH="$PATH:/usr/lib/wsl/lib"
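These exports only last for the current session; to persist them, one option is appending them to ~/.bashrc (a sketch, assuming the apt package placed CUDA 13.0 under /usr/local):

```shell
# Persist the CUDA environment variables for future shells
cat >> "$HOME/.bashrc" <<'EOF'
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
export PATH="$PATH:/usr/lib/wsl/lib"
EOF
# Sanity check: count the lines referencing CUDA_HOME that were just added
grep -c 'CUDA_HOME' "$HOME/.bashrc"
```

Open a new shell (or `source ~/.bashrc`) for the variables to take effect.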

Verify Installation

nvidia-smi
nvcc --version

CUDA Stress‑Test Example

The following cuda_stress_test.cu performs matrix multiplication, vector addition and reduction while measuring GFLOP/s for two minutes.
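For reference, one N×N matrix multiply performs 2·N³ floating‑point operations (a multiply and an add per output term), which is how the program converts kernel time into GFLOP/s. A quick sketch of the arithmetic, using a purely hypothetical kernel time of 0.5 ms:

```shell
# N is the matrix dimension; MS is a made-up kernel time in milliseconds
N=1024; MS=0.5
# GFLOP/s = total FLOPs / seconds / 1e9
awk -v n="$N" -v ms="$MS" 'BEGIN { printf "%.1f GFLOP/s\n", (2*n*n*n) / (ms/1000) / 1e9 }'
```

Substitute the measured per‑kernel time reported by the test to reproduce its figures.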

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <chrono>
#include <thread>

#define MATRIX_SIZE 1024
#define BLOCK_SIZE 16

__global__ void matrixMultiply(float* A, float* B, float* C, int size) { /* ... */ }
__global__ void vectorAdd(float* A, float* B, float* C, int size) { /* ... */ }
__global__ void reduceSum(float* input, float* output, int size) { /* ... */ }

#define CHECK_CUDA_ERROR(err) \
    if ((err) != cudaSuccess) { \
        printf("CUDA Error: %s\n", cudaGetErrorString(err)); \
        exit(EXIT_FAILURE); \
    }

void printGPUInfo() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Number of CUDA devices: %d\n", deviceCount);
    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("\nDevice %d: %s\n"
               "  Compute capability: %d.%d\n"
               "  Total global memory: %.2f GB\n"
               "  Multiprocessors: %d\n"
               "  Max threads per block: %d\n"
               "  Max threads per multiprocessor: %d\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / 1024.0 / 1024.0 / 1024.0,
               prop.multiProcessorCount, prop.maxThreadsPerBlock,
               prop.maxThreadsPerMultiProcessor);
    }
}

int main() {
    printf("Starting CUDA Stress Test for 2 minutes...\n");
    printGPUInfo();
    // allocate, initialize, copy data, launch kernels, measure time, report GFLOP/s
    // (full code omitted for brevity)
    return 0;
}

Build and Run

# compile.sh
#!/bin/bash
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
nvcc -o cuda_stress_test cuda_stress_test.cu -O3 -arch=native -Xcompiler -fopenmp -lgomp -DMATRIX_SIZE=1024 -std=c++17
# run_test.sh
#!/bin/bash
echo "Starting CUDA stress test for 2 minutes..."
export CUDA_VISIBLE_DEVICES=0
./cuda_stress_test

Nsight Systems vs. Nsight Compute

Nsight Systems provides system‑wide timeline profiling (CPU‑GPU interaction, memory transfers, kernel launch latency) with low overhead (<5%). Nsight Compute offers kernel‑level metric analysis (SM utilization, memory bandwidth, register pressure, warp divergence).

Nsight Systems CLI

nsys profile --stats=true -o report_name ./your_program

Additional options allow tracing specific APIs, setting capture ranges, and remote GUI usage.
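As a concrete sketch for the stress test built above (the flag set assumes a recent Nsight Systems CLI; adjust to your version):

```
# Trace the CUDA API and OS runtime calls, write stress_report.nsys-rep,
# and print summary statistics when the run finishes
nsys profile -t cuda,osrt --stats=true -o stress_report ./cuda_stress_test
```

The resulting .nsys-rep file can be opened in the Nsight Systems GUI on another machine, such as the macOS client mentioned earlier.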

Nsight Compute CLI

ncu --set full --target-processes all -f -o my_report ./your_program

Use --kernel-name to focus on a specific kernel or --metrics to collect custom counters.
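For example, to restrict profiling to the matrixMultiply kernel from the stress test with a lighter metric set than full (options assumed available in recent ncu versions):

```
# Profile only matrixMultiply launches, collect the "detailed" section set,
# overwrite any existing report, and write matmul_report.ncu-rep
ncu --kernel-name matrixMultiply --set detailed -f -o matmul_report ./cuda_stress_test
```

The smaller section set reduces replay overhead, which matters because ncu re‑runs each profiled kernel several times.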

NVIDIA DCGM

DCGM (Data Center GPU Manager) collects device‑level metrics (GPU utilization, memory usage, temperature, power, PCIe/NVLink throughput) with virtually no impact on the running application. It is suited for continuous monitoring and integration with Prometheus/Grafana, but it cannot attribute metrics to individual CUDA processes.

DCGM overview

Key Features

Transparent hardware‑level data collection without modifying applications

Continuous, real‑time monitoring across GPU clusters

Limitations

Cannot distinguish resource usage per CUDA process when multiple applications share a GPU

Does not capture CPU‑side scheduling or host‑GPU interaction details

--- End of guide ---

Tags: performance profiling, Linux, CUDA, GPU, WSL, Nsight
Written by AI Cyberspace