How to Write CUDA Kernels in Python – Insights from Nvidia GTC 2025
The article reviews Nvidia GTC 2025’s session on writing CUDA kernels with Python, compares tools such as Numba, CuPy, PyTorch extensions and cuda‑python, demonstrates a segmented reduction example with C++ and Python code, explains the underlying CUDA concepts, and shows how to install and use cuda‑python to simplify kernel development.
Background
Nvidia GTC 2025 attracted over 25,000 participants and featured a session titled “1,001 Ways to Write CUDA Kernels in Python (S72449)”, which explored multiple approaches for Python developers to leverage GPU acceleration.
Key Tools for Writing CUDA Kernels
Numba – JIT‑compiles Python functions into CUDA kernels for rapid development (see the sketch after this list).
CuPy – Provides a NumPy‑like API and supports custom element‑wise and reduction kernels.
PyTorch CUDA extensions – Allows custom kernels written in C++/CUDA to be integrated into PyTorch models.
CUDA Python – NVIDIA’s new initiative offering a standard Python interface that may simplify CUDA development.
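To make the first item concrete, here is a minimal Numba sketch (my illustration, not from the session): an element‑wise add kernel JIT‑compiled from plain Python.
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)           # absolute index of this thread across the grid
    if i < x.size:             # guard against out-of-bounds threads
        out[i] = x[i] + y[i]

x = np.arange(1024, dtype=np.float32)
y = np.ones_like(x)
out = np.zeros_like(x)
threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba handles host<->device copies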
Case Study: Segmented Reduction
The article presents a segmented reduction problem: the numbers 1–12 are split into three segments (1–4, 5–8, 9–12), and each segment is summed in parallel, yielding 10, 26, and 42.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include "segmented_reduction.cuh"
int main(){
    ...
    // Ensure each block processes one segment
    segmented_reduction_kernel<float, int, block_size, items_per_thread><<<n_segments, block_size>>>(
        thrust::raw_pointer_cast(in_arr.data()),
        thrust::raw_pointer_cast(out_arr.data())
    );
    ...
    return 0;
}
The corresponding kernel implementation is shown below:
template <typename InputT, typename OffsetT, unsigned int ThreadsPerBlock, unsigned int ItemsPerThread>
__global__ void segmented_reduction_kernel(const InputT* in_arr, InputT* out_arr){
    constexpr unsigned int SegmentSize = ThreadsPerBlock * ItemsPerThread;
    __shared__ InputT shared[ThreadsPerBlock];
    OffsetT tid = threadIdx.x;
    OffsetT offset = blockIdx.x * SegmentSize;  // start of this block's segment
    // Phase 1: each thread sums ItemsPerThread consecutive elements into a register.
    InputT partial_aggregate = 0;
    #pragma unroll
    for (auto i = 0; i < ItemsPerThread; ++i){
        if(tid * ItemsPerThread + i < SegmentSize){
            partial_aggregate += in_arr[offset + tid * ItemsPerThread + i];
        }
    }
    shared[tid] = partial_aggregate;
    __syncthreads();
    // Phase 2: tree reduction in shared memory, halving the active threads each step.
    #pragma unroll
    for (int stride = ThreadsPerBlock/2; stride > 0; stride >>= 1){
        if(tid < stride){ shared[tid] += shared[tid + stride]; }
        __syncthreads();
    }
    // Thread 0 holds the segment's total and writes it to global memory.
    if(tid == 0){ out_arr[blockIdx.x] = shared[0]; }
}
Compiling the C++ code uses the command:
nvcc -std=c++17 my_demo.cu -o my_demo
Understanding the Underlying Concepts
To grasp the code, readers need to know:
CUDA thread hierarchy (threads, warps, blocks, grids, etc.).
CUDA memory hierarchy (registers, shared memory, global memory, etc.).
CUDA programming model (asynchronous execution, streams, synchronization, etc.).
GPU architecture details (SMs, occupancy, coalesced access, tensor cores, etc.).
C++ mechanisms for expressing algorithms (templates, iterators, type traits, etc.).
The offline compilation workflow that turns C++ code into .ptx or .cubin files (illustrated below).
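For instance, the offline workflow in the last item corresponds to nvcc invocations like these (illustrative):
$ nvcc -std=c++17 -ptx my_demo.cu -o my_demo.ptx     # portable PTX assembly
$ nvcc -std=c++17 -cubin my_demo.cu -o my_demo.cubin # binary for a specific GPU architecture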
Making Kernel Development Easier with Python
Instead of writing kernels in C++, the article suggests using Python, which hides many of these low‑level details: JIT compilation generates the .ptx or .cubin at runtime, and the cuLaunchKernel driver API launches the kernel. A sketch of that runtime workflow follows.
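A minimal sketch assuming cuda.core's experimental API (covered below); that API is still evolving, so the names and signatures here are best‑effort assumptions rather than a definitive recipe.
import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, launch

# CUDA C++ source compiled at runtime (JIT) instead of offline with nvcc.
code = r"""
extern "C" __global__ void scale(float* x, float factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

prog = Program(code, code_type="c++")
module = prog.compile("cubin")        # JIT: source -> .cubin at runtime
kernel = module.get_kernel("scale")

n = 1 << 20
x = cp.ones(n, dtype=cp.float32)
block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
launch(stream, config, kernel, x.data.ptr, cp.float32(2.0), cp.uint64(n))  # wraps cuLaunchKernel
stream.sync()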
Historically, kernels were compiled offline with nvcc. In Python, a library such as CuPy can deliver the same functionality from a single array expression, with kernel compilation and launch handled behind the scenes.
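A minimal sketch (my illustration) of the article's segmented reduction as one CuPy expression:
import cupy as cp

data = cp.arange(1, 13, dtype=cp.float32)   # 1..12 on the GPU
sums = data.reshape(3, 4).sum(axis=1)       # per-segment sums
print(sums)                                 # [10. 26. 42.]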
Introducing cuda‑python
The cuda‑python project (https://github.com/NVIDIA/cuda-python) provides Pythonic access to the CUDA runtime and includes several sub‑packages:
cuda.core – Pythonic access to the core CUDA runtime: devices, streams, runtime compilation, and kernel launch.
cuda.bindings – Low‑level Python bindings to the CUDA C API.
cuda.cooperative – CUB‑style block‑level and warp‑level cooperative primitives for use inside Numba kernels.
cuda.parallel – Device‑level parallel algorithms such as sort, scan, reduction, and transform (a reduction sketch follows below).
It also integrates with nvmath‑python, which exposes NVIDIA’s math libraries to Python.
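As a taste of the device‑level algorithms, here is a minimal full‑array sum based on cuda.parallel's experimental interface as documented around the time of the session; the module path and the CUB‑style two‑phase calling convention are assumptions that may have changed since.
import cupy as cp
import numpy as np
from cuda.parallel.experimental import algorithms

def add_op(a, b):                            # user-defined binary operator
    return a + b

d_in = cp.arange(1, 13, dtype=np.int32)      # 1..12 on the GPU
d_out = cp.empty(1, dtype=np.int32)
h_init = np.zeros(1, dtype=np.int32)         # initial value of the reduction

reducer = algorithms.reduce_into(d_in, d_out, add_op, h_init)

# CUB-style two-phase call: first query temporary storage size, then run.
temp_size = reducer(None, d_in, d_out, len(d_in), h_init)
d_temp = cp.empty(temp_size, dtype=np.uint8)
reducer(d_temp, d_in, d_out, len(d_in), h_init)
print(d_out)                                 # [78], the sum of 1..12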
Installation
Via pip:
$ pip install -U cuda-python
With all optional dependencies:
$ pip install -U cuda-python[all]
Or via Conda:
$ conda install -c conda-forge cuda-python
Using cuda‑python to Rewrite the Kernel
The following excerpt rewrites the segmented reduction kernel using Numba's @cuda.jit decorator. Helpers such as MM (apparently a device‑side matmul from nvmath‑python, linked in via link=MM.files), load_to_shared_2d, set_smem, and store_from_shared, along with the constants k, chunk, dtype, and a_dim, are defined elsewhere in the session's code:
@cuda.jit(link=MM.files)
def segmented_reduce_kernel(in_arr, out_arr):
    # Shared-memory staging buffers (k, chunk, and dtype defined elsewhere).
    smem_a = cuda.shared.array(shape=(k, chunk), dtype=dtype)
    smem_b = cuda.shared.array(shape=k, dtype=dtype)
    smem_c = cuda.shared.array(shape=chunk, dtype=dtype)
    bid = cuda.blockIdx.x
    a = in_arr[bid*chunk:(bid+1)*chunk, :]
    load_to_shared_2d(a, smem_a, a_dim)
    # Fill smem_b with ones: multiplying by a vector of ones sums each
    # segment, so the matmul below performs the reduction.
    set_smem(smem_b, k, 1.)
    cuda.syncthreads()
    MM(1., smem_a, smem_b, 0., smem_c)  # smem_c = 1.0 * smem_a @ smem_b + 0.0 * smem_c
    cuda.syncthreads()
    c = out_arr[bid*chunk:(bid+1)*chunk]
    store_from_shared(smem_c, c, c, chunk)
The article concludes by promising a deeper analysis of cuda‑python’s internals in a future post.