Unlock GPU Power: A Hands‑On Triton Guide for Vector Add, Matrix Multiply & RoPE
This article introduces Triton, a Python‑based GPU programming language. It covers essential GPU architecture, walks through practical kernels for vector addition, matrix multiplication, and rotary position encoding (RoPE), compares performance with PyTorch, and offers debugging tips for high‑performance deep‑learning workloads.
Introduction
Since Nvidia released CUDA in 2006, GPU programming has become central to AI, high‑performance computing, and cloud workloads. Triton, an OpenAI‑backed Python‑centric language and compiler, aims to simplify and accelerate GPU kernel development while lowering the barrier compared to raw CUDA.
GPU Basics
A GPU consists of multiple Streaming Multiprocessors (SMs), each containing many Streaming Processors (SPs) with private registers and local memory. Threads are grouped into warps of 32 threads and further into blocks, which are scheduled onto SMs. Understanding the memory and compute hierarchy, off‑chip DRAM, on‑chip SRAM (caches and shared memory), and the ALUs themselves, is essential for writing efficient kernels.
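As a quick way to see these numbers for your own card, PyTorch exposes a few of them (device index 0 is assumed here):

import torch

props = torch.cuda.get_device_properties(0)
print(props.name)                   # GPU model
print(props.multi_processor_count)  # number of SMs
print(props.total_memory)           # DRAM capacity in bytes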
Triton Overview
Triton targets the same GPUs as CUDA but abstracts away low‑level details such as memory coalescing and shared‑memory management, offering a "CUDA‑free" programming experience. It supports multiple back‑ends (Nvidia CUDA, AMD ROCm, Intel CPUs) and compiles kernels just‑in‑time via the @triton.jit decorator.
Example 1: Vector Addition
The classic "Hello World" for Triton adds two vectors. The kernel defines program IDs, computes offsets, applies a mask, loads data, performs addition, and stores the result.
import torch
import triton
import triton.language as tl
@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    output = x + y
    tl.store(output_ptr + offsets, output, mask=mask)

A wrapper launches the kernel with a one‑dimensional grid:
def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    assert x.is_cuda and y.is_cuda and output.is_cuda
    n_elements = output.numel()
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
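A quick correctness check against PyTorch (the tensor size is arbitrary):

x = torch.rand(98432, device='cuda')
y = torch.rand(98432, device='cuda')
print(torch.max(torch.abs(add(x, y) - (x + y))))  # expect 0.0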
Example 2: Matrix Multiplication
A tiled matrix‑multiply kernel uses a two‑dimensional grid (pid_m, pid_n) and iterates over the K dimension in blocks. The kernel loads tiles of A and B, performs a dot product, accumulates results, and stores the output tile.
@triton.jit
def matmul_kernel(A, B, C, M, N, K, BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offsets_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offsets_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    mask_m = offsets_m < M
    mask_n = offsets_n < N
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for start_k in range(0, K, BLOCK_K):
        offsets_k = start_k + tl.arange(0, BLOCK_K)
        mask_k = offsets_k < K
        a_ptrs = A + offsets_m[:, None] * K + offsets_k[None, :]
        b_ptrs = B + offsets_k[:, None] * N + offsets_n[None, :]
        a = tl.load(a_ptrs, mask=mask_m[:, None] & mask_k[None, :], other=0)
        b = tl.load(b_ptrs, mask=mask_k[:, None] & mask_n[None, :], other=0)
        acc += tl.dot(a, b)
    c_ptrs = C + offsets_m[:, None] * N + offsets_n[None, :]
    mask_c = mask_m[:, None] & mask_n[None, :]
    tl.store(c_ptrs, acc, mask=mask_c)
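The article shows only the kernel, so here is a minimal launcher sketch. It assumes contiguous row‑major inputs; the block sizes are illustrative, untuned, and must be at least 16 for tl.dot:

def matmul(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    assert a.is_cuda and b.is_cuda and a.is_contiguous() and b.is_contiguous()
    M, K = a.shape
    K2, N = b.shape
    assert K == K2, 'inner dimensions must match'
    c = torch.empty((M, N), device=a.device, dtype=torch.float32)
    # one program instance per (BLOCK_M, BLOCK_N) output tile
    grid = (triton.cdiv(M, 64), triton.cdiv(N, 64))
    matmul_kernel[grid](a, b, c, M, N, K, BLOCK_M=64, BLOCK_N=64, BLOCK_K=32)
    return c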
Example 3: Rotary Position Encoding (RoPE)
RoPE is a key operation in modern Transformers. The Triton kernel computes q * cos + rotate_half(q) * sin, where rotate_half maps the concatenated halves (q1, q2) of each head to (-q2, q1), so the first half becomes q1 * cos - q2 * sin and the second half becomes q2 * cos + q1 * sin. It loads the cosine/sine tables once per row and applies them to a group of heads per program.
@triton.jit
def _rope_embedding(Q, Q_row_stride, cos, cos_row_stride, sin, sin_row_stride, seqlen, head_dim: tl.constexpr, n_heads: tl.constexpr, BLOCK_SIZE: tl.constexpr):
    GROUP_SIZE = 4
    row = tl.program_id(0)
    group = tl.program_id(1)
    col = tl.arange(0, BLOCK_SIZE)
    half = head_dim // 2
    mask = col < half
    sin1 = tl.load(sin + (row % seqlen) * sin_row_stride + col, mask=mask, other=0)
    cos1 = tl.load(cos + (row % seqlen) * cos_row_stride + col, mask=mask, other=0)
    head_start = group * GROUP_SIZE
    head_end = tl.minimum(head_start + GROUP_SIZE, n_heads)
    for k in range(head_start, head_end):
        offs_q1 = row * Q_row_stride + k * head_dim + col
        offs_q2 = offs_q1 + half
        q1 = tl.load(Q + offs_q1, mask=mask, other=0).to(sin1.dtype)
        q2 = tl.load(Q + offs_q2, mask=mask, other=0).to(sin1.dtype)
        tl.store(Q + offs_q1, q1 * cos1 - q2 * sin1, mask=mask)
        tl.store(Q + offs_q2, q2 * cos1 + q1 * sin1, mask=mask)
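The kernel updates Q in place; a launcher might look like the sketch below. The (batch, seq_len, n_heads, head_dim) layout and the wrapper name are assumptions inferred from the kernel's indexing:

def rope_embedding(Q: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    # Q: (batch, seq_len, n_heads, head_dim), contiguous
    # cos/sin: (seq_len, head_dim // 2)
    batch, seq_len, n_heads, head_dim = Q.shape
    assert Q.is_cuda and Q.is_contiguous()
    Q = Q.view(batch * seq_len, n_heads * head_dim)  # one program per row
    BLOCK_SIZE = triton.next_power_of_2(head_dim // 2)
    GROUP_SIZE = 4  # must match the constant hard-coded in the kernel
    grid = (Q.shape[0], triton.cdiv(n_heads, GROUP_SIZE))
    _rope_embedding[grid](
        Q, Q.stride(0), cos, cos.stride(0), sin, sin.stride(0),
        seq_len, head_dim, n_heads, BLOCK_SIZE=BLOCK_SIZE,
    )
    return Q.view(batch, seq_len, n_heads, head_dim)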
Performance Testing
Triton provides a @triton.testing.perf_report decorator to benchmark custom kernels against PyTorch baselines across a range of input sizes. Results show that Triton outperforms PyTorch for large vector sizes and matches cuBLAS performance for matrix multiplication.
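Following the pattern of Triton's official vector‑add tutorial, a benchmark for the add wrapper above might look like this (the size range is illustrative):

@triton.testing.perf_report(
    triton.testing.Benchmark(
        x_names=['size'],                    # x-axis: number of elements
        x_vals=[2 ** i for i in range(12, 28)],
        x_log=True,
        line_arg='provider',                 # one plotted line per provider
        line_vals=['triton', 'torch'],
        line_names=['Triton', 'Torch'],
        ylabel='GB/s',
        plot_name='vector-add-performance',
        args={},
    )
)
def benchmark(size, provider):
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    fn = (lambda: add(x, y)) if provider == 'triton' else (lambda: x + y)
    ms = triton.testing.do_bench(fn)  # median runtime in milliseconds
    # 2 reads + 1 write of 4-byte floats per element
    return 3 * size * 4 * 1e-9 / (ms * 1e-3)

benchmark.run(print_data=True)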
Debugging
Recent Triton releases support standard Python debugging of kernels, via pdb or IDE breakpoints, by running in interpreter mode: set the environment variable TRITON_INTERPRET=1.
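The variable must be set before triton is imported, for example:

import os
os.environ['TRITON_INTERPRET'] = '1'  # enable interpreter mode

import triton
import triton.language as tl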
Conclusion
By leveraging Triton, developers can write concise, high‑performance kernels for common deep‑learning operators such as vector add, tiled matrix multiply, and RoPE, achieving performance comparable to hand‑tuned CUDA while enjoying a lower learning curve.