Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE
This article introduces Triton, a Python‑based GPU programming language. It covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding (RoPE), compares performance with PyTorch, and offers debugging tips for high‑performance deep‑learning workloads.
Introduction
Since Nvidia released CUDA in 2006, GPU programming has become central to AI, high‑performance computing, and cloud workloads. Triton, an OpenAI‑backed Python‑centric language and compiler, aims to simplify and accelerate GPU kernel development while lowering the barrier compared to raw CUDA.
GPU Basics
A GPU consists of multiple Streaming Multiprocessors (SMs), each containing many Streaming Processors (SPs) with private registers and local memory. Threads are grouped into warps of 32 threads and further into blocks, which are scheduled onto SMs. Understanding the memory and compute hierarchy, off‑chip DRAM, on‑chip SRAM (caches and shared memory), and the ALUs themselves, is essential for writing efficient kernels.
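As a quick way to see these numbers for your own card, PyTorch exposes a few of them (device index 0 is assumed here):

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                   # GPU model
print(props.multi_processor_count)  # number of SMs
print(props.total_memory)           # DRAM capacity in bytes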
Triton Overview
Triton targets the same GPUs as CUDA but abstracts away low‑level details such as memory coalescing and shared‑memory management, offering a "CUDA‑free" programming experience. It supports multiple back‑ends (Nvidia CUDA, AMD ROCm, Intel CPUs) and compiles kernels just‑in‑time via the @triton.jit decorator.
Example 1: Vector Addition
The classic "Hello World" for Triton adds two vectors. The kernel defines program IDs, computes offsets, applies a mask, loads data, performs addition, and stores the result.
import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

A wrapper launches the kernel with a one‑dimensional grid:
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
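A quick correctness check against PyTorch (the tensor size is arbitrary):

x = torch.rand(98432, device='cuda')
y = torch.rand(98432, device='cuda')
print(torch.max(torch.abs(add(x, y) - (x + y))))  # expect 0.0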
Example 2: Matrix Multiplication
A tiled matrix‑multiply kernel uses a two‑dimensional grid (pid_m, pid_n) and iterates over the K dimension in blocks. The kernel loads tiles of A and B, performs a dot product, accumulates results, and stores the output tile.
@triton.jit
def matmul_kernel(A, B, C, M, N, K, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offsets_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offsets_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_m = offsets_m < M
    mask_n = offsets_n < N
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for start_k in range(0, K, BLOCK_K):
        offsets_k = start_k + tl.arange(0, BLOCK_K)
        mask_k = offsets_k < K
        a_ptrs = A + offsets_m[:, None] * K + offsets_k[None, :]
        b_ptrs = B + offsets_k[:, None] * N + offsets_n[None, :]
        a = tl.load(a_ptrs, mask=mask_m[:, None] & mask_k[None, :], other=0)
        b = tl.load(b_ptrs, mask=mask_k[:, None] & mask_n[None, :], other=0)
        acc += tl.dot(a, b)
    c_ptrs = C + offsets_m[:, None] * N + offsets_n[None, :]
    mask_c = mask_m[:, None] & mask_n[None, :]
    tl.store(c_ptrs, acc, mask=mask_c)
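The article shows only the kernel, so here is a minimal launcher sketch. It assumes contiguous row‑major inputs; the block sizes are illustrative, untuned, and must be at least 16 for tl.dot:

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.is_cuda and b.is_cuda and a.is_contiguous() and b.is_contiguous()
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, 'inner dimensions must match'
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    # one program instance per (BLOCK_M, BLOCK_N) output tile
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c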
Example 3: Rotary Position Encoding (RoPE)
RoPE is a key operation in modern Transformers. The Triton kernel computes q * cos + rotate_half(q) * sin, where rotate_half maps the concatenated halves (q1, q2) of each head to (-q2, q1), so the first half becomes q1 * cos - q2 * sin and the second half becomes q2 * cos + q1 * sin. It loads the cosine/sine tables once per row and applies them to a group of heads per program.
@triton.jit
def _rope_embedding(Q, Q_row_stride, cos, cos_row_stride, sin, sin_row_stride, seqlen, head_dim: tl.constexpr, n_heads: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    GROUP_SIZE = 4
    row = tl.program_id(0)
    group = tl.program_id(1)
    col = tl.arange(0, BLOCK_SIZE)
    half = head_dim // 2
    mask = col < half
    sin1 = tl.load(sin + (row % seqlen) * sin_row_stride + col, mask=mask, other=0)
    cos1 = tl.load(cos + (row % seqlen) * cos_row_stride + col, mask=mask, other=0)
    head_start = group * GROUP_SIZE
    head_end = tl.minimum(head_start + GROUP_SIZE, n_heads)
    for k in range(head_start, head_end):
        offs_q1 = row * Q_row_stride + k * head_dim + col
        offs_q2 = offs_q1 + half
        q1 = tl.load(Q + offs_q1, mask=mask, other=0).to(sin1.dtype)
        q2 = tl.load(Q + offs_q2, mask=mask, other=0).to(sin1.dtype)
        tl.store(Q + offs_q1, q1 * cos1 - q2 * sin1, mask=mask)
        tl.store(Q + offs_q2, q2 * cos1 + q1 * sin1, mask=mask)
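The kernel updates Q in place; a launcher might look like the sketch below. The (batch, seq_len, n_heads, head_dim) layout and the wrapper name are assumptions inferred from the kernel's indexing:

def rope_embedding(Q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Q: (batch, seq_len, n_heads, head_dim), contiguous
    # cos/sin: (seq_len, head_dim // 2)
    batch, seq_len, n_heads, head_dim = Q.shape
    assert Q.is_cuda and Q.is_contiguous()
    Q = Q.view(batch * seq_len, n_heads * head_dim)  # one program per row
    BLOCK_SIZE = triton.next_power_of_2(head_dim // 2)
    GROUP_SIZE = 4  # must match the constant hard-coded in the kernel
    grid = (Q.shape[0], triton.cdiv(n_heads, GROUP_SIZE))
    _rope_embedding[grid](
        Q, Q.stride(0), cos, cos.stride(0), sin, sin.stride(0),
        seq_len, head_dim, n_heads, BLOCK_SIZE=BLOCK_SIZE,
    )
    return Q.view(batch, seq_len, n_heads, head_dim)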
Performance Testing
Triton provides a @triton.testing.perf_report decorator to benchmark custom kernels against PyTorch baselines across a range of input sizes. Results show that Triton outperforms PyTorch for large vector sizes and matches cuBLAS performance for matrix multiplication.
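Following the pattern of Triton's official vector‑add tutorial, a benchmark for the add wrapper above might look like this (the size range is illustrative):

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],                    # x-axis: number of elements
        x_vals=[2 ** i for i in range(12, 28)],
        x_log=True,
        line_arg='provider',                 # one plotted line per provider
        line_vals=['triton', 'torch'],
        line_names=['Triton', 'Torch'],
        ylabel='GB/s',
        plot_name='vector-add-performance',
        args={},
    )
)
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    fn = (lambda: add(x, y)) if provider == 'triton' else (lambda: x + y)
    ms = triton.testing.do_bench(fn)  # median runtime in milliseconds
    # 2 reads + 1 write of 4-byte floats per element
    return 3 * size * 4 * 1e-9 / (ms * 1e-3)

benchmark.run(print_data=True)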
Debugging
Recent Triton releases support standard Python debugging of kernels, via pdb or IDE breakpoints, by running in interpreter mode: set the environment variable TRITON_INTERPRET=1.
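The variable must be set before triton is imported, for example:

import os
os.environ['TRITON_INTERPRET'] = '1'  # enable interpreter mode

import triton
import triton.language as tl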
Conclusion
By leveraging Triton, developers can write concise, high‑performance kernels for common deep‑learning operators such as vector add, tiled matrix multiply, and RoPE, achieving performance comparable to hand‑tuned CUDA while enjoying a lower learning curve.