Getting Started with cuda‑python and an Introduction to cuTile
This article explains the cuda‑python ecosystem: its core packages, installation via pip or conda, the experimental cuda.core API, a full Python‑to‑CUDA workflow with NVRTC compilation, a performance comparison against C++, the APIs the bindings cover, and an overview of NVIDIA's new cuTile programming model.
cuda‑python provides Python access to the NVIDIA CUDA platform and consists of four main packages: cuda.core for pythonic access to the CUDA runtime and other core functionality, cuda.bindings for low‑level bindings to the CUDA C APIs, cuda.cooperative, which exposes CUB's reusable block‑ and warp‑level cooperative primitives for use inside Numba kernels, and cuda.parallel for host‑side parallel algorithms such as sort, scan, reduce, and transform.
Installation
Install the runtime package with pip:
$ pip install cuda-core[cu12]   # use cu11 for CUDA 11

Or with conda:

$ conda install -c conda-forge cuda-core cuda-version=12

Experimental API
The cuda.core.experimental API simplifies interoperation with other Python GPU libraries, but everything in it is still experimental and may change or move out of the experimental namespace as the API stabilises, so production use should be cautious.
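As a minimal sketch of what this looks like, the following compiles and launches a SAXPY kernel using the pattern from the cuda.core 0.1.x examples. Because the API is experimental, names and signatures (Program, LaunchConfig, launch, and the argument order of launch in particular) may differ in other releases, and the CuPy arrays here are only one convenient way to obtain device pointers:

import numpy as np
import cupy as cp
from cuda.core.experimental import Device, LaunchConfig, Program, launch

code = """
extern "C" __global__ void saxpy(float a, const float *x,
                                 const float *y, float *out, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = a * x[i] + y[i];
}
"""

dev = Device()
dev.set_current()
s = dev.create_stream()

# Compile the C++ source for the current device's architecture.
arch = "".join(str(i) for i in dev.compute_capability)
prog = Program(code, code_type="c++")
mod = prog.compile("cubin", options=(f"-arch=sm_{arch}",))
ker = mod.get_kernel("saxpy")

# One thread per element: 512 threads per block, 32768 blocks.
size = 512 * 32768
a = np.float32(2.0)
x = cp.random.rand(size, dtype=cp.float32)
y = cp.random.rand(size, dtype=cp.float32)
out = cp.empty_like(x)

config = LaunchConfig(grid=32768, block=512, stream=s)
launch(ker, config, a, x.data.ptr, y.data.ptr, out.data.ptr, np.uint64(size))
s.sync()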
cuda‑bindings
cuda‑bindings cannot be installed independently; it is pulled in via the cuda‑python meta‑package. Installing with the [all] extra brings in optional dependencies such as nvidia-cuda-nvrtc-cu12, nvidia-nvjitlink-cu12>=12.3, and nvidia-cuda-nvcc-cu12:
$ pip install -U cuda-python[all]

Python‑to‑CUDA Workflow
Because Python is interpreted, device code must be compiled to PTX at runtime using NVRTC, then loaded via the NVIDIA driver API. The typical steps are:
1. Import driver and nvrtc from cuda.bindings.
2. Write the kernel as a C‑style string and create an NVRTC program.
3. Compile the program with appropriate architecture flags (e.g., --gpu-architecture=compute_80).
4. Retrieve the PTX binary.
5. Initialize the CUDA driver, obtain a device handle, and create a context.
6. Load the PTX as a module and get the kernel function.
7. Allocate device memory with cuMemAlloc and copy host data using cuMemcpyHtoDAsync.
8. Prepare kernel arguments (device pointers) and launch the kernel with cuLaunchKernel, specifying grid and block dimensions.
9. Copy results back with cuMemcpyDtoHAsync and synchronize the stream.
10. Validate the output against a NumPy reference and clean up resources.
The example implements a SAXPY kernel (out = a*x + y) with 512 threads per block and 32768 blocks, demonstrating memory allocation, asynchronous copies, and kernel launch.
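A condensed sketch of that example follows, adapted from the pattern in NVIDIA's cuda-python documentation. For brevity every call's error code is captured but not checked; real code should verify each err against CUDA_SUCCESS (or NVRTC_SUCCESS). The compute_80 target assumes an Ampere-class GPU, as in the flag above:

import numpy as np
from cuda.bindings import driver, nvrtc

saxpy = """
extern "C" __global__ void saxpy(float a, float *x, float *y, float *out, size_t n) {
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) out[tid] = a * x[tid] + y[tid];
}
"""

# Steps 1-4: create an NVRTC program, compile it, and retrieve the PTX.
err, prog = nvrtc.nvrtcCreateProgram(str.encode(saxpy), b"saxpy.cu", 0, [], [])
opts = [b"--fmad=false", b"--gpu-architecture=compute_80"]
err, = nvrtc.nvrtcCompileProgram(prog, len(opts), opts)
err, ptx_size = nvrtc.nvrtcGetPTXSize(prog)
ptx = b" " * ptx_size
err, = nvrtc.nvrtcGetPTX(prog, ptx)

# Steps 5-6: initialize the driver, create a context, load the module.
err, = driver.cuInit(0)
err, device = driver.cuDeviceGet(0)
err, context = driver.cuCtxCreate(0, device)
ptx = np.char.array(ptx)
err, module = driver.cuModuleLoadData(ptx.ctypes.data)
err, kernel = driver.cuModuleGetFunction(module, b"saxpy")

NUM_THREADS = 512   # threads per block
NUM_BLOCKS = 32768  # blocks per grid
a = np.array([2.0], dtype=np.float32)
n = np.array(NUM_THREADS * NUM_BLOCKS, dtype=np.uint64)
buffer_size = int(n) * np.float32().itemsize
h_x = np.random.rand(int(n)).astype(np.float32)
h_y = np.random.rand(int(n)).astype(np.float32)
h_out = np.zeros(int(n), dtype=np.float32)

# Step 7: allocate device buffers and copy inputs asynchronously.
err, stream = driver.cuStreamCreate(0)
err, d_x = driver.cuMemAlloc(buffer_size)
err, d_y = driver.cuMemAlloc(buffer_size)
err, d_out = driver.cuMemAlloc(buffer_size)
err, = driver.cuMemcpyHtoDAsync(d_x, h_x.ctypes.data, buffer_size, stream)
err, = driver.cuMemcpyHtoDAsync(d_y, h_y.ctypes.data, buffer_size, stream)

# Step 8: pack arguments as an array of pointers and launch.
arg_list = [a,
            np.array([int(d_x)], dtype=np.uint64),
            np.array([int(d_y)], dtype=np.uint64),
            np.array([int(d_out)], dtype=np.uint64),
            n]
args = np.array([arg.ctypes.data for arg in arg_list], dtype=np.uint64)
err, = driver.cuLaunchKernel(kernel,
                             NUM_BLOCKS, 1, 1,   # grid dimensions
                             NUM_THREADS, 1, 1,  # block dimensions
                             0, stream,          # shared memory, stream
                             args.ctypes.data, 0)

# Steps 9-10: copy the result back, synchronize, validate, clean up.
err, = driver.cuMemcpyDtoHAsync(h_out.ctypes.data, d_out, buffer_size, stream)
err, = driver.cuStreamSynchronize(stream)
assert np.allclose(h_out, a * h_x + h_y)
err, = driver.cuStreamDestroy(stream)
for d in (d_x, d_y, d_out):
    err, = driver.cuMemFree(d)
err, = driver.cuModuleUnload(module)
err, = driver.cuCtxDestroy(context)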
Performance
The Python implementation achieves nearly identical kernel execution time to the native C++ version (≈352 µs) and comparable overall application runtime (≈1.08 s vs. 1.076 s). Profiling can be performed with nsys profile -s none -t cuda --stats=true <executable>.
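For example, assuming the Python version lives in a file called saxpy.py (a hypothetical name), the invocation would be:

$ nsys profile -s none -t cuda --stats=true python saxpy.py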
Covered APIs
CUDA driver API
CUDA runtime API
NVRTC
nvJitLink
NVVM (libNVVM)
cuTile Overview
cuTile is NVIDIA’s new programming paradigm that shifts from the traditional SIMT (single‑instruction, multiple‑thread) model to a tile‑level abstraction. Instead of mapping individual data elements to individual threads, tile programming operates on whole arrays or tensors (tiles), simplifying the mapping logic and allowing developers to focus on block‑level computation.
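To make the contrast concrete, here is SAXPY in today's SIMT style using Numba's stable cuda.jit API, followed by a deliberately hypothetical tile-style version. The source does not give cuTile's actual syntax, so the second kernel is illustrative pseudocode only:

from numba import cuda

# SIMT style: each thread computes one element and guards its own index.
@cuda.jit
def saxpy_simt(a, x, y, out):
    i = cuda.grid(1)
    if i < out.size:
        out[i] = a * x[i] + y[i]

# Hypothetical tile style (illustrative only; not a real cuTile API):
# the kernel operates on whole tiles of the arrays, so there is no
# per-thread index arithmetic or bounds checking in user code.
#
# @tile.kernel
# def saxpy_tile(a, x, y, out):
#     tile.store(out, a * tile.load(x) + tile.load(y))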
Overall, the article provides a step‑by‑step guide to using cuda‑python, demonstrates that its performance matches native CUDA C++, and introduces cuTile as a higher‑level, tile‑centric abstraction for GPU programming.