How to Set Up WSL2 GPU Acceleration and Profile CUDA on Windows 11
This guide walks through configuring Windows 11 with WSL2 and Ubuntu 22.04 for GPU‑accelerated CUDA development, installing NVIDIA drivers and CUDA libraries, setting up SSH and firewall rules, running a CUDA stress‑test program, and using Nsight Systems, Nsight Compute, and NVIDIA DCGM for performance profiling and monitoring.
Environment Information
Windows 11 (recommended) provides the GPU device
WSL2 Ubuntu 22.04 provides the CUDA runtime
macOS is used as the programming and performance-analysis client
Enable WSL and Virtual Machine Platform
Open PowerShell as Administrator
Install WSL (https://learn.microsoft.com/en-us/windows/wsl/install)
Enable the WSL feature
dism.exe /online /enable-feature /featurename:Microsoft-Windows-Subsystem-Linux /all /norestart
Enable the Virtual Machine Platform feature
dism.exe /online /enable-feature /featurename:VirtualMachinePlatform /all /norestart
Restart the computer
Install Ubuntu on Win11
Open Microsoft Store, search for Ubuntu and install the latest LTS version
Launch Ubuntu from the Start menu; the first run may prompt a WSL version update and ask for a username and password
Open the WSL settings app (Start → WSL Settings) to adjust CPU, memory, network, and disk configuration
Configure WSL Network for LAN Access
By default WSL uses NAT (IP in 172.x.x.x). For LAN SSH you can switch to Mirrored mode so the Ubuntu IP matches the host IP, allowing direct LAN access.
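Mirrored mode is enabled through the per-user `.wslconfig` file on the Windows side (a sketch; mirrored networking requires Windows 11 22H2 or later):

```ini
# %UserProfile%\.wslconfig on the Windows host
[wsl2]
networkingMode=mirrored
```

Run `wsl --shutdown` from PowerShell and relaunch Ubuntu for the change to take effect, then confirm the address: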
$ ip a
Configure Ubuntu SSH
Remove the preinstalled SSH server
sudo apt remove openssh-server
Update the package index
sudo apt update
Reinstall the SSH server
sudo apt install openssh-server
Edit /etc/ssh/sshd_config to change the port and enable root login
Port 2223
PermitRootLogin yes
PasswordAuthentication yes
Restart and enable the ssh service
sudo service ssh restart
sudo systemctl enable ssh
Open Win11 Firewall for SSH
Add a port‑proxy (only needed for NAT mode)
netsh interface portproxy add v4tov4 listenport=2223 listenaddress=0.0.0.0 connectport=2223 connectaddress=<WSL2_IP_Address>
Create an inbound firewall rule
netsh advfirewall firewall add rule name=WSL2 dir=in action=allow protocol=TCP localport=2223
Linux CUDA on a Windows GPU Machine
Since May 2020, Microsoft has provided GPU-accelerated WSL2: the GPU driver lives in Windows, while the CUDA runtime libraries are installed inside the Linux distribution.
Install NVIDIA Driver on Win11
Download and install the latest NVIDIA driver (Game Ready or Studio) via the NVIDIA App. The Windows driver is automatically exposed inside WSL as libcuda.so.
Install CUDA Libraries in Ubuntu
# Add NVIDIA repository
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-wsl-ubuntu.pin
sudo mv cuda-wsl-ubuntu.pin /etc/apt/preferences.d/cuda-repository-pin-600
# Add key
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/3bf863cc.pub
# Add repo
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/ /"
# Install CUDA
sudo apt update
sudo apt install -y cuda
Set environment variables:
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
export PATH="$PATH:/usr/lib/wsl/lib"
Verify Installation
nvidia-smi
nvcc --version
CUDA Stress‑Test Example
The following cuda_stress_test.cu performs matrix multiplication, vector addition and reduction while measuring GFLOP/s for two minutes.
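The GFLOP/s figure follows from the classic matrix‑multiply operation count: a naive N×N multiply performs 2·N³ floating‑point operations (one multiply plus one add per inner‑loop step). A quick sanity check of that arithmetic, using a hypothetical 5 ms kernel time:

```shell
# Hypothetical numbers for illustration: N=1024, 5.0 ms measured kernel time
N=1024
ELAPSED_MS=5.0
awk -v n="$N" -v ms="$ELAPSED_MS" 'BEGIN {
    flops = 2.0 * n * n * n                      # 2*N^3 FLOPs per multiply
    printf "%.2f GFLOP/s\n", flops / (ms / 1000.0) / 1e9
}'
# prints 429.50 GFLOP/s for these example inputs
```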
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <chrono>
#include <thread>
#define MATRIX_SIZE 1024
#define BLOCK_SIZE 16
__global__ void matrixMultiply(float* A, float* B, float* C, int size) { /* ... */ }
__global__ void vectorAdd(float* A, float* B, float* C, int size) { /* ... */ }
__global__ void reduceSum(float* input, float* output, int size) { /* ... */ }
#define CHECK_CUDA_ERROR(err) if (err != cudaSuccess) { printf("CUDA Error: %s\n", cudaGetErrorString(err)); exit(EXIT_FAILURE); }
void printGPUInfo() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    printf("Number of CUDA devices: %d\n", deviceCount);
    for (int i = 0; i < deviceCount; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("\nDevice %d: %s\n"
               "  Compute capability: %d.%d\n"
               "  Total global memory: %.2f GB\n"
               "  Multiprocessors: %d\n"
               "  Max threads per block: %d\n"
               "  Max threads per multiprocessor: %d\n",
               i, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem / 1024.0 / 1024.0 / 1024.0,
               prop.multiProcessorCount, prop.maxThreadsPerBlock,
               prop.maxThreadsPerMultiProcessor);
    }
}
int main() {
    printf("Starting CUDA Stress Test for 2 minutes...\n");
    printGPUInfo();
    // allocate, initialize, copy data, launch kernels, measure time, report GFLOP/s
    // (full code omitted for brevity)
    return 0;
}
Build and Run
# compile.sh
#!/bin/bash
export CUDA_HOME=/usr/local/cuda-13.0
export PATH=$PATH:$CUDA_HOME/bin
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA_HOME/lib64
nvcc -o cuda_stress_test cuda_stress_test.cu -O3 -arch=native -Xcompiler -fopenmp -lgomp -DMATRIX_SIZE=1024 -std=c++17
# run_test.sh
#!/bin/bash
echo "Starting CUDA stress test for 2 minutes..."
export CUDA_VISIBLE_DEVICES=0
./cuda_stress_test
Nsight Systems vs. Nsight Compute
Nsight Systems provides system‑wide timeline profiling (CPU‑GPU interaction, memory transfers, kernel launch latency) with low overhead (<5%). Nsight Compute offers kernel‑level metric analysis (SM utilization, memory bandwidth, register pressure, warp divergence).
Nsight Systems CLI
nsys profile --stats=true -o report_name ./your_program
Additional options allow tracing specific APIs, setting capture ranges, and remote GUI usage.
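For example, to trace the CUDA and OS runtime APIs and restrict capture to a region bracketed by cudaProfilerStart()/cudaProfilerStop() (a sketch; `./cuda_stress_test` is the binary built above):

```shell
# Trace CUDA, OS-runtime, and NVTX activity; only record between
# cudaProfilerStart() and cudaProfilerStop() calls in the application
nsys profile --trace=cuda,osrt,nvtx \
     --capture-range=cudaProfilerApi \
     --stats=true -o stress_report ./cuda_stress_test
```

The resulting stress_report.nsys-rep file can be opened in the Nsight Systems GUI on another machine, such as the macOS client mentioned above.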
Nsight Compute CLI
ncu --set full --target-processes all -f -o my_report ./your_program
Use --kernel-name to focus on a specific kernel or --metrics to collect custom counters.
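For instance, to profile a single launch of the matrix‑multiply kernel from the stress test and see how close it runs to peak compute and memory throughput (a sketch; the kernel name comes from the example above, and the metric names follow Nsight Compute's metric naming scheme):

```shell
# Profile one launch of matrixMultiply; report SM and DRAM throughput
# as a percentage of the sustained peak
ncu --kernel-name matrixMultiply --launch-count 1 \
    --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,dram__throughput.avg.pct_of_peak_sustained_elapsed \
    -o matmul_report ./cuda_stress_test
```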
NVIDIA DCGM
DCGM (Data Center GPU Manager) collects device‑level metrics (GPU utilization, memory usage, temperature, power, PCIe/NVLink throughput) with virtually no impact on the running application. It is suited for continuous monitoring and integration with Prometheus/Grafana, but it cannot attribute metrics to individual CUDA processes.
Key Features
Transparent hardware‑level data collection without modifying applications
Continuous, real‑time monitoring across GPU clusters
Limitations
Cannot distinguish resource usage per CUDA process when multiple applications share a GPU
Does not capture CPU‑side scheduling or host‑GPU interaction details
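The monitoring workflow above can be tried from the command line with the dcgmi client (a sketch; field IDs 155 and 203 correspond to power draw and GPU utilization in the DCGM field identifier list):

```shell
# List the GPUs DCGM can see
dcgmi discovery -l
# Stream power draw (field 155) and GPU utilization (field 203) once per second
dcgmi dmon -e 155,203 -d 1000
```

For long-term monitoring, the same fields are typically scraped via dcgm-exporter into Prometheus rather than watched interactively.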
--- End of guide ---