How to Build High‑Performance GEMM with NVIDIA CUTLASS

The article explains why standard GEMM libraries may fall short for special matrix shapes, introduces NVIDIA’s open‑source CUTLASS library, details its hierarchical tiling architecture, and walks through a complete device‑API example that customizes tile sizes and data layouts to achieve near‑hand‑written kernel performance on modern GPUs.


Introduction: GEMM (general matrix-matrix multiplication) is a core operation in deep learning and high-performance computing. Standard libraries such as cuBLAS, and frameworks like PyTorch that build on them, may not fully utilize the GPU for unusual matrix shapes or for fused operators.

Why CUTLASS: NVIDIA's open-source CUTLASS provides modular C++ templates that let developers write custom GEMM kernels whose performance approaches that of hand-written kernels.

GPU architecture and tiling necessity: GPUs consist of multiple SMs, each with warps and a memory hierarchy (global, shared, registers). A naïve triple‑loop GEMM suffers from low warp efficiency and excessive global memory traffic, so computation must be tiled across blocks, warps, and instructions.
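To make the cost concrete, here is a hedged sketch of such a naïve kernel (the kernel name and thread mapping are illustrative, not CUTLASS code): each thread computes one element of C and streams A and B directly from global memory, so every element of A is re-fetched N times and every element of B M times.

__global__ void naive_sgemm(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    // One thread per output element; no data reuse through shared memory.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];  // two global loads per FMA
        C[row * N + col] = acc;
    }
}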

CUTLASS overview: CUTLASS is a template library that achieves cuBLAS-level performance through hierarchical tiling: Block level (e.g., 128×128 tiles staged in shared memory), Warp level (e.g., 64×64 sub-tiles), and Instruction level (e.g., 16×16×16 MMA operations on Tensor Cores).

Computation flow:

1. Each block loads A/B sub-tiles from global memory into shared memory.

2. Warps load the tile data from shared memory into registers.

3. MMA instructions compute the per-warp sub-matrices.

4. Results are accumulated over multiple rounds and written back to C.

This tiling, combined with a double-buffered pipeline that overlaps the next tile's loads with the current tile's math, maximizes bandwidth and compute utilization, as the sketch below illustrates.
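Below is a simplified sketch of this flow (illustrative only: it uses plain FMA instead of Tensor Core MMA instructions, a single shared-memory buffer instead of a double-buffered pipeline, row-major A and B, and an arbitrary 32×32 tile):

constexpr int TILE = 32;

__global__ void tiled_sgemm(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];  // A sub-tile staged in shared memory
    __shared__ float Bs[TILE][TILE];  // B sub-tile staged in shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;  // per-thread accumulator kept in registers

    // Main loop: march along K one tile at a time
    for (int k0 = 0; k0 < K; k0 += TILE) {
        // Step 1: the block cooperatively loads A/B sub-tiles into shared memory
        As[threadIdx.y][threadIdx.x] = (row < M && k0 + threadIdx.x < K)
            ? A[row * K + k0 + threadIdx.x] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (k0 + threadIdx.y < K && col < N)
            ? B[(k0 + threadIdx.y) * N + col] : 0.0f;
        __syncthreads();

        // Steps 2-3: read tiles from shared memory into registers and accumulate
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    // Step 4: write the accumulated result back to C in global memory
    if (row < M && col < N)
        C[row * N + col] = acc;
}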

CUTLASS hierarchical API: The C++ template API mirrors the tiling hierarchy, allowing flexible configuration. The Device layer (cutlass::gemm::device::Gemm) is the entry point; the Kernel layer implements the main loop; the Collective layer defines data-load and MMA strategies; the Atom layer wraps a single MMA instruction.

Simple example: The following code shows a minimal CUTLASS device-API GEMM that specifies the data types, layouts, target architecture (Sm80, i.e., Ampere), and tile shapes. Users can tune these template parameters to achieve optimal performance for their scenario.

#include <cutlass/gemm/device/gemm.h>

using Gemm = cutlass::gemm::device::Gemm<
    // Input/output matrix data types and layouts
    float, cutlass::layout::RowMajor,     // A
    float, cutlass::layout::ColumnMajor,  // B
    float, cutlass::layout::RowMajor,     // C
    float,                                // accumulator type
    // Underlying MMA instruction class: Tensor Cores (TF32 for float inputs)
    cutlass::arch::OpClassTensorOp,
    // Target architecture: Ampere (SM80)
    cutlass::arch::Sm80,
    // Tile sizes for the Block, Warp, and Instruction levels
    cutlass::gemm::GemmShape<128, 128, 16>,  // threadblock tile
    cutlass::gemm::GemmShape<64, 64, 16>,    // warp tile
    cutlass::gemm::GemmShape<16, 8, 8>       // TF32 Tensor Core MMA shape on SM80
>;
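As a minimal usage sketch (the device pointers, problem sizes, and scalars below are illustrative placeholders, not part of the original example), the kernel is launched by filling the Arguments struct and calling the functor:

// Assumes d_A, d_B, d_C are allocated, initialized device buffers; M, N, K are
// the problem dimensions; alpha and beta scale the epilogue D = alpha*A*B + beta*C.
Gemm gemm_op;
Gemm::Arguments args({M, N, K},      // problem size
                     {d_A, K},       // A: row-major,    leading dimension K
                     {d_B, K},       // B: column-major, leading dimension K
                     {d_C, N},       // C: row-major,    leading dimension N
                     {d_C, N},       // D (output), written over C here
                     {alpha, beta}); // linear-combination epilogue scalars
cutlass::Status status = gemm_op(args);  // launches the GEMM on the default stream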

By adjusting the template arguments (data types, layouts, tile dimensions), developers can tailor the GEMM kernel to their workload while leveraging CUTLASS's template-based code generation.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: High-Performance Computing, CUDA, GPU, Matrix Multiplication, CUTLASS, GEMM
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
