DeepHub IMBA
Jun 4, 2026 · Artificial Intelligence
Hand‑Writing a Triton Softmax Kernel: Program Instances, Block Size, Masking & Pointer Arithmetic
This article walks through implementing a row‑wise softmax kernel in Triton, explaining program‑instance mapping, block‑size selection, mask handling, pointer arithmetic, and resource‑usage analysis, and closes with an RTX 5090 benchmark that reveals performance cliffs relative to PyTorch.
CUDA · GPU kernel · Python
9 min read
