How to Write CUDA Kernels in Python – Insights from Nvidia GTC 2025
The article reviews Nvidia GTC 2025’s session on writing CUDA kernels with Python, compares tools such as Numba, CuPy, PyTorch extensions and cuda‑python, demonstrates a segmented reduction example with C++ and Python code, explains the underlying CUDA concepts, and shows how to install and use cuda‑python to simplify kernel development.
Background
Nvidia GTC 2025 attracted over 25,000 participants and featured a session titled “1,001 Ways to Write CUDA Kernels in Python (S72449)”, which explored multiple approaches for Python developers to leverage GPU acceleration.
Key Tools for Writing CUDA Kernels
Numba – JIT‑compiles Python functions into CUDA kernels for rapid development (see the sketch after this list).
CuPy – Provides a NumPy‑like API and supports custom element‑wise and reduction kernels.
PyTorch CUDA extensions – Allows custom kernels written in C++/CUDA to be integrated into PyTorch models.
CUDA Python – NVIDIA’s new initiative offering a standard Python interface that may simplify CUDA development.
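To make the first item concrete, here is a minimal Numba sketch (my illustration, not from the session): an element‑wise add kernel JIT‑compiled from plain Python.
from numba import cuda
import numpy as np

@cuda.jit
def add_kernel(x, y, out):
    i = cuda.grid(1)           # absolute index of this thread across the grid
    if i < x.size:             # guard against out-of-bounds threads
        out[i] = x[i] + y[i]

x = np.arange(1024, dtype=np.float32)
y = np.ones_like(x)
out = np.zeros_like(x)
threads_per_block = 128
blocks = (x.size + threads_per_block - 1) // threads_per_block
add_kernel[blocks, threads_per_block](x, y, out)  # Numba handles host<->device copies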
Case Study: Segmented Reduction
The article presents a segmented reduction problem: the numbers 1–12 are split into three segments (1–4, 5–8, 9–12), and each segment is summed in parallel, yielding 10, 26, and 42.
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include "segmented_reduction.cuh"
int main(){
    ...
    // Ensure each block processes one segment
    segmented_reduction_kernel<float, int, block_size, items_per_thread><<<n_segments, block_size>>>(
        thrust::raw_pointer_cast(in_arr.data()),
        thrust::raw_pointer_cast(out_arr.data())
    );
    ...
    return 0;
}
The corresponding kernel implementation is shown below:
template <typename InputT, typename OffsetT, unsigned int ThreadsPerBlock, unsigned int ItemsPerThread>
__global__ void segmented_reduction_kernel(const InputT* in_arr, InputT* out_arr){
    constexpr unsigned int SegmentSize = ThreadsPerBlock * ItemsPerThread;
    __shared__ InputT shared[ThreadsPerBlock];
    OffsetT tid = threadIdx.x;
    OffsetT offset = blockIdx.x * SegmentSize;  // start of this block's segment
    // Phase 1: each thread sums ItemsPerThread consecutive elements into a register.
    InputT partial_aggregate = 0;
    #pragma unroll
    for (auto i = 0; i < ItemsPerThread; ++i){
        if(tid * ItemsPerThread + i < SegmentSize){
            partial_aggregate += in_arr[offset + tid * ItemsPerThread + i];
        }
    }
    shared[tid] = partial_aggregate;
    __syncthreads();
    // Phase 2: tree reduction in shared memory, halving the active threads each step.
    #pragma unroll
    for (int stride = ThreadsPerBlock/2; stride > 0; stride >>= 1){
        if(tid < stride){ shared[tid] += shared[tid + stride]; }
        __syncthreads();
    }
    // Thread 0 holds the segment's total and writes it to global memory.
    if(tid == 0){ out_arr[blockIdx.x] = shared[0]; }
}
Compiling the C++ code uses the command:
nvcc -std=c++17 my_demo.cu -o my_demo
Understanding the Underlying Concepts
To grasp the code, readers need to know:
CUDA thread hierarchy (threads, warps, blocks, grids, etc.).
CUDA memory hierarchy (registers, shared memory, global memory, etc.).
CUDA programming model (asynchronous execution, streams, synchronization, etc.).
GPU architecture details (SMs, occupancy, coalesced access, tensor cores, etc.).
C++ mechanisms for expressing algorithms (templates, iterators, type traits, etc.).
The offline compilation workflow that turns C++ code into .ptx or .cubin files (illustrated below).
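For instance, the offline workflow in the last item corresponds to nvcc invocations like these (illustrative):
$ nvcc -std=c++17 -ptx my_demo.cu -o my_demo.ptx     # portable PTX assembly
$ nvcc -std=c++17 -cubin my_demo.cu -o my_demo.cubin # binary for a specific GPU architecture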
Making Kernel Development Easier with Python
Instead of writing kernels in C++, the article suggests using Python, which hides many of these low‑level details: JIT compilation generates the .ptx or .cubin at runtime, and the cuLaunchKernel driver API launches the kernel. A sketch of that runtime workflow follows.
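A minimal sketch assuming cuda.core's experimental API (covered below); that API is still evolving, so the names and signatures here are best‑effort assumptions rather than a definitive recipe.
import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, launch

# CUDA C++ source compiled at runtime (JIT) instead of offline with nvcc.
code = r"""
extern "C" __global__ void scale(float* x, float factor, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}
"""

dev = Device()
dev.set_current()
stream = dev.create_stream()

prog = Program(code, code_type="c++")
module = prog.compile("cubin")        # JIT: source -> .cubin at runtime
kernel = module.get_kernel("scale")

n = 1 << 20
x = cp.ones(n, dtype=cp.float32)
block = 256
grid = (n + block - 1) // block
config = LaunchConfig(grid=grid, block=block)
launch(stream, config, kernel, x.data.ptr, cp.float32(2.0), cp.uint64(n))  # wraps cuLaunchKernel
stream.sync()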
Historically, kernels were compiled offline with nvcc. In Python, a library such as CuPy can deliver the same functionality from a single array expression, with kernel compilation and launch handled behind the scenes.
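A minimal sketch (my illustration) of the article's segmented reduction as one CuPy expression:
import cupy as cp

data = cp.arange(1, 13, dtype=cp.float32)   # 1..12 on the GPU
sums = data.reshape(3, 4).sum(axis=1)       # per-segment sums
print(sums)                                 # [10. 26. 42.]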
Introducing cuda‑python
The cuda‑python project (https://github.com/NVIDIA/cuda-python) provides Pythonic access to the CUDA runtime and includes several sub‑packages:
cuda.core – Pythonic access to the core CUDA runtime: devices, streams, runtime compilation, and kernel launch.
cuda.bindings – Low‑level Python bindings to the CUDA C API.
cuda.cooperative – CUB‑style block‑level and warp‑level cooperative primitives for use inside Numba kernels.
cuda.parallel – Device‑level parallel algorithms such as sort, scan, reduction, and transform (a reduction sketch follows below).
It also integrates with nvmath‑python, which exposes NVIDIA’s math libraries to Python.
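As a taste of the device‑level algorithms, here is a minimal full‑array sum based on cuda.parallel's experimental interface as documented around the time of the session; the module path and the CUB‑style two‑phase calling convention are assumptions that may have changed since.
import cupy as cp
import numpy as np
from cuda.parallel.experimental import algorithms

def add_op(a, b):                            # user-defined binary operator
    return a + b

d_in = cp.arange(1, 13, dtype=np.int32)      # 1..12 on the GPU
d_out = cp.empty(1, dtype=np.int32)
h_init = np.zeros(1, dtype=np.int32)         # initial value of the reduction

reducer = algorithms.reduce_into(d_in, d_out, add_op, h_init)

# CUB-style two-phase call: first query temporary storage size, then run.
temp_size = reducer(None, d_in, d_out, len(d_in), h_init)
d_temp = cp.empty(temp_size, dtype=np.uint8)
reducer(d_temp, d_in, d_out, len(d_in), h_init)
print(d_out)                                 # [78], the sum of 1..12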
Installation
Via pip:
$ pip install -U cuda-python
With all optional dependencies:
$ pip install -U cuda-python[all]
Or via Conda:
$ conda install -c conda-forge cuda-python
Using cuda‑python to Rewrite the Kernel
The following excerpt rewrites the segmented reduction kernel using Numba's @cuda.jit decorator. Helpers such as MM (apparently a device‑side matmul from nvmath‑python, linked in via link=MM.files), load_to_shared_2d, set_smem, and store_from_shared, along with the constants k, chunk, dtype, and a_dim, are defined elsewhere in the session's code:
@cuda.jit(link=MM.files)
def segmented_reduce_kernel(in_arr, out_arr):
    # Shared-memory staging buffers (k, chunk, and dtype defined elsewhere).
    smem_a = cuda.shared.array(shape=(k, chunk), dtype=dtype)
    smem_b = cuda.shared.array(shape=k, dtype=dtype)
    smem_c = cuda.shared.array(shape=chunk, dtype=dtype)
    bid = cuda.blockIdx.x
    a = in_arr[bid*chunk:(bid+1)*chunk, :]
    load_to_shared_2d(a, smem_a, a_dim)
    # Fill smem_b with ones: multiplying by a vector of ones sums each
    # segment, so the matmul below performs the reduction.
    set_smem(smem_b, k, 1.)
    cuda.syncthreads()
    MM(1., smem_a, smem_b, 0., smem_c)  # smem_c = 1.0 * smem_a @ smem_b + 0.0 * smem_c
    cuda.syncthreads()
    c = out_arr[bid*chunk:(bid+1)*chunk]
    store_from_shared(smem_c, c, c, chunk)
The article concludes by promising a deeper analysis of cuda‑python’s internals in a future post.