Getting Started with cuda‑python and an Introduction to cuTile

This article explains the cuda‑python ecosystem, including its core packages, installation via pip or conda, the experimental cuda.core API, a full Python‑to‑CUDA workflow with NVRTC compilation, a performance comparison with C++, the covered APIs, and an overview of NVIDIA's new cuTile programming model.

Infra Learning Club

cuda‑python provides Python access to the NVIDIA CUDA platform and consists of four main packages:

cuda.core: Pythonic access to the CUDA runtime and core functionality.

cuda.bindings: low‑level bindings to the CUDA C APIs.

cuda.parallel: host‑side parallel algorithms such as sort, scan, reduce, and transform.

cuda.cooperative: device‑side cooperative primitives built on CUB, for use inside Numba kernels.

Installation

Install the runtime package with pip:

$ pip install cuda-core[cu12]  # use cu11 for CUDA 11

Or with conda:

$ conda install -c conda-forge cuda-core cuda-version=12

Experimental API

The cuda.core.experimental API simplifies interoperation with other Python GPU libraries, but everything in it is still experimental: interfaces may change or be removed before the API stabilises, so use it cautiously in production.
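As a sketch of what this looks like, the fragment below compiles a small kernel with the experimental Device, Program, ProgramOptions, and LaunchConfig objects. These names reflect cuda.core 0.x at the time of writing and may change while the API is experimental; since running it needs a CUDA GPU, the GPU-touching code is wrapped in a function with a lazy import. The kernel name scale is illustrative.

```python
kernel_source = """\
extern "C" __global__
void scale(float *data, float factor, size_t n)
{
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        data[tid] *= factor;
    }
}
"""

def compile_scale_kernel(block_size=256, n=1024):
    # Imported lazily so this sketch can be read without a GPU or the
    # experimental package installed.
    from cuda.core.experimental import Device, LaunchConfig, Program, ProgramOptions

    dev = Device()
    dev.set_current()
    stream = dev.create_stream()

    # Compile the C++ source for the current device's architecture,
    # e.g. compute capability (8, 0) -> "sm_80".
    arch = "".join(str(digit) for digit in dev.compute_capability)
    prog = Program(kernel_source, code_type="c++",
                   options=ProgramOptions(arch=f"sm_{arch}"))
    kernel = prog.compile("cubin").get_kernel("scale")

    # Launch geometry; cuda.core.experimental.launch(stream, config, kernel,
    # *args) would then run it (device-buffer setup is omitted here).
    config = LaunchConfig(grid=(n + block_size - 1) // block_size, block=block_size)
    return kernel, config, stream
```

Note how compilation, module loading, and kernel lookup collapse into a few object-oriented calls compared with the raw driver workflow shown later.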

cuda‑bindings

cuda‑bindings cannot be installed independently; it is pulled in via the cuda‑python meta‑package. Installing with the all extra brings in optional dependencies such as nvidia-cuda-nvrtc-cu12, nvidia-nvjitlink-cu12>=12.3, and nvidia-cuda-nvcc-cu12:

$ pip install -U cuda-python[all]

Python‑to‑CUDA Workflow

Because Python is interpreted, device code must be compiled to PTX at runtime using NVRTC, then loaded via the NVIDIA driver API. The typical steps are:

Import driver and nvrtc from cuda.bindings.

Write the kernel as a C‑style string and create an NVRTC program.

Compile the program with appropriate architecture flags (e.g., --gpu-architecture=compute_80).

Retrieve the PTX binary.

Initialize the CUDA driver, obtain a device handle, and create a context.

Load the PTX as a module and get the kernel function.

Allocate device memory with cuMemAlloc and copy host data using cuMemcpyHtoDAsync.

Prepare kernel arguments (device pointers) and launch the kernel with cuLaunchKernel, specifying grid and block dimensions.

Copy results back with cuMemcpyDtoHAsync and synchronize the stream.

Validate the output against a NumPy reference and clean up resources.
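The steps above can be sketched end to end as follows. This is a condensed adaptation of the pattern used in NVIDIA's cuda-python examples, with error handling reduced to a single check helper (each binding call returns a tuple of an error code plus any results). The compute_80 architecture flag is illustrative; adjust it to your GPU. The GPU-touching code lives inside a function with a lazy import so the NumPy reference can be exercised without a device.

```python
import numpy as np

# Device code as a C-style string (step 2).
saxpy_source = """\
extern "C" __global__
void saxpy(float a, float *x, float *y, float *out, size_t n)
{
    size_t tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        out[tid] = a * x[tid] + y[tid];
    }
}
"""

def saxpy_reference(a, x, y):
    # NumPy reference used to validate the GPU output (step 10).
    return a * x + y

def run_saxpy(num_threads=512, num_blocks=32768):
    # Step 1: import the bindings (done lazily so this file can be read,
    # and the reference tested, on machines without a GPU).
    from cuda.bindings import driver, nvrtc

    def check(result):
        # Every binding call returns a tuple (error, *values).
        err, *rest = result
        if err.value != 0:
            raise RuntimeError(f"CUDA/NVRTC call failed with {err}")
        if not rest:
            return None
        return rest[0] if len(rest) == 1 else rest

    # Steps 2-4: create an NVRTC program, compile it, retrieve the PTX.
    prog = check(nvrtc.nvrtcCreateProgram(saxpy_source.encode(), b"saxpy.cu", 0, [], []))
    opts = [b"--gpu-architecture=compute_80"]  # adjust to your GPU
    check(nvrtc.nvrtcCompileProgram(prog, len(opts), opts))
    ptx = b" " * check(nvrtc.nvrtcGetPTXSize(prog))
    check(nvrtc.nvrtcGetPTX(prog, ptx))

    # Steps 5-6: initialize the driver, create a context, load the module.
    check(driver.cuInit(0))
    device = check(driver.cuDeviceGet(0))
    context = check(driver.cuCtxCreate(0, device))
    ptx_buf = np.char.array(ptx)
    module = check(driver.cuModuleLoadData(ptx_buf.ctypes.data))
    kernel = check(driver.cuModuleGetFunction(module, b"saxpy"))

    # Step 7: host data, device allocations, asynchronous H2D copies.
    n = num_threads * num_blocks
    a = np.array([2.0], dtype=np.float32)
    n_arg = np.array(n, dtype=np.uint64)  # matches the kernel's size_t
    h_x = np.random.rand(n).astype(np.float32)
    h_y = np.random.rand(n).astype(np.float32)
    h_out = np.zeros(n, dtype=np.float32)
    stream = check(driver.cuStreamCreate(0))
    nbytes = h_x.nbytes
    d_x, d_y, d_out = (check(driver.cuMemAlloc(nbytes)) for _ in range(3))
    check(driver.cuMemcpyHtoDAsync(d_x, h_x.ctypes.data, nbytes, stream))
    check(driver.cuMemcpyHtoDAsync(d_y, h_y.ctypes.data, nbytes, stream))

    # Step 8: pack pointers/scalars into an argument array and launch.
    ptr_args = [np.array([int(p)], dtype=np.uint64) for p in (d_x, d_y, d_out)]
    args = np.array([arg.ctypes.data for arg in (a, *ptr_args, n_arg)], dtype=np.uint64)
    check(driver.cuLaunchKernel(kernel,
                                num_blocks, 1, 1,    # grid dimensions
                                num_threads, 1, 1,   # block dimensions
                                0, stream,           # shared memory, stream
                                args.ctypes.data, 0))

    # Steps 9-10: copy back, synchronize, validate, clean up.
    check(driver.cuMemcpyDtoHAsync(h_out.ctypes.data, d_out, nbytes, stream))
    check(driver.cuStreamSynchronize(stream))
    assert np.allclose(h_out, saxpy_reference(a[0], h_x, h_y))
    for d in (d_x, d_y, d_out):
        check(driver.cuMemFree(d))
    check(driver.cuStreamDestroy(stream))
    check(driver.cuModuleUnload(module))
    check(driver.cuCtxDestroy(context))
```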

The example implements a SAXPY kernel (out = a*x + y) with 512 threads per block and 32768 blocks, demonstrating memory allocation, asynchronous copies, and kernel launch.

Performance

The Python implementation achieves nearly identical kernel execution time to the native C++ version (≈352 µs) and comparable overall application runtime (≈1.08 s vs. 1.076 s). Profiling can be performed with nsys profile -s none -t cuda --stats=true <executable>.

Covered APIs

CUDA driver API

CUDA runtime API

NVRTC

nvJitLink

NVVM (libNVVM)

cuTile Overview

cuTile is NVIDIA’s new programming paradigm that shifts from the traditional SIMT (single‑instruction‑multiple‑thread) model to a tile‑level abstraction. Instead of mapping individual data elements to threads, tile programming operates on whole arrays or tensors, simplifying the mapping logic and allowing developers to focus on block‑level computation.
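The contrast can be illustrated without any tile-specific syntax (none of which this article covers): in the SIMT style each logical thread computes one element, while a tile-style formulation expresses the same computation over whole arrays and leaves the element-to-thread mapping to the runtime. A NumPy analogy, using the article's SAXPY as the workload:

```python
import numpy as np

def saxpy_simt_style(a, x, y):
    # SIMT mindset: an explicit per-element loop, where each iteration
    # plays the role of one GPU thread handling one data element.
    out = np.empty_like(x)
    for tid in range(x.size):          # "thread index"
        out[tid] = a * x[tid] + y[tid]
    return out

def saxpy_tile_style(a, x, y):
    # Tile mindset: the same computation expressed on whole arrays;
    # no per-element index bookkeeping appears in user code.
    return a * x + y

x = np.arange(4, dtype=np.float32)
y = np.ones(4, dtype=np.float32)
assert np.array_equal(saxpy_simt_style(2.0, x, y), saxpy_tile_style(2.0, x, y))
```

Both functions compute the same values; the tile-style version simply states what is computed rather than how elements map to threads, which is the shift the tile model makes on the GPU.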

[Figures: cuTile diagram; Tile vs SIMT; Tile programming model; tensor granularity]

Overall, the article provides a step‑by‑step guide to using cuda‑python, demonstrates that its performance matches native CUDA C++, and introduces cuTile as a higher‑level abstraction for tile‑centric GPU programming.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: Python, CUDA, GPU, NVIDIA, cuda-python, cuTile, NVRTC
Written by

Infra Learning Club

Infra Learning Club shares study notes, cutting-edge technology, and career discussions.