How to Write High‑Performance GPU Code with OpenAI Triton
This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.
OpenAI Triton is an open‑source programming language and compiler designed for deep‑learning and high‑performance computing on GPUs, aiming to provide higher productivity than CUDA while delivering comparable or better performance.
Background and motivation: Traditional CUDA programming is complex, and existing DSLs (e.g., Tiramisu, Halide, TVM) often lag behind hand‑written kernels such as cuBLAS, cuDNN, or TensorRT. Triton seeks to lower the difficulty of GPU programming while improving operator efficiency.
Relationship with CUDA: Unlike CUDA's single‑program‑multiple‑data (SPMD) model, which programs at the level of individual threads, Triton adopts a block‑wise paradigm. In matrix multiplication, for example, a Triton program iterates over tiles of the output, creating an iteration space that offers more flexibility for sparse operations and lets the compiler aggressively optimize data locality and parallelism.
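The contrast can be illustrated in plain NumPy. A CUDA‑style SPMD program conceptually assigns one scalar element to each thread, while a Triton‑style program handles a whole block per program instance and masks the ragged final block. This is an illustrative sketch of the two mental models, not actual CUDA or Triton code:

```python
import numpy as np

x = np.arange(10, dtype=np.float32)
y = np.arange(10, dtype=np.float32)

# CUDA-style SPMD: each "thread" i handles exactly one element.
out_scalar = np.empty_like(x)
for i in range(x.size):
    out_scalar[i] = x[i] + y[i]

# Triton-style: each "program" handles a whole block of elements,
# with a mask guarding the out-of-bounds tail of the last block.
BLOCK = 4
out_block = np.empty_like(x)
for pid in range((x.size + BLOCK - 1) // BLOCK):  # ceil-divide into blocks
    offs = pid * BLOCK + np.arange(BLOCK)
    valid = offs[offs < x.size]
    out_block[valid] = x[valid] + y[valid]

assert np.array_equal(out_scalar, out_block)
```

Because the unit of work is a block rather than a thread, the compiler is free to decide how the block maps onto hardware threads and shared memory.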
Example 1 – Vector addition kernel:
Demonstrates Triton’s basic programming style and the use of the @triton.jit decorator to define a JIT‑compiled kernel.
The kernel receives pointers x_ptr, y_ptr, and output_ptr, a compile‑time constant BLOCK_SIZE, and computes element‑wise addition using tl.load, arithmetic, and tl.store with masking to handle non‑multiple‑of‑block sizes.
A helper Python function allocates tensors on the GPU, calculates the grid size with triton.cdiv, launches the kernel via add_kernel[grid], and returns the result asynchronously.
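A minimal version of this kernel, closely following Triton's official vector‑addition tutorial, looks like the following. The names `add_kernel` and `add` and the choice of `BLOCK_SIZE=1024` are illustrative:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                 # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                 # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)


def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    n_elements = output.numel()
    # one program instance per block of BLOCK_SIZE elements
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output
```

The launch returns before the kernel finishes; like other CUDA work in PyTorch, the result is synchronized automatically when it is first read back.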
Example 2 – Fused softmax kernel:
Motivation: A naïve PyTorch softmax materializes intermediates, performing roughly 5MN + 2M element reads and 3MN + 2M writes for an M×N input, whereas a fused kernel reads the MN input elements once and writes the MN outputs once (≈2MN total traffic), for a theoretical speedup of about (8MN + 4M) / 2MN ≈ 4×.
The Triton kernel assigns one program to each row of the input matrix, normalizes the row, and writes the result. Because block dimensions must be a power of two, the block size is rounded up to the next power of two above the column count, and the padded out‑of‑range columns must be masked carefully.
Helper functions compute triton.next_power_of_2 for block sizing and launch the kernel with dynamic grid configuration.
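Putting these pieces together, a sketch of the fused kernel in the style of Triton's softmax tutorial follows; the names `softmax_kernel` and `softmax` are illustrative, and the row‑per‑program scheme assumes the whole row fits in one block:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def softmax_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride,
                   n_cols, BLOCK_SIZE: tl.constexpr):
    row_idx = tl.program_id(0)                  # one program per row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    col_offsets = tl.arange(0, BLOCK_SIZE)      # BLOCK_SIZE: power of two >= n_cols
    mask = col_offsets < n_cols
    # pad out-of-bounds columns with -inf so they contribute 0 after exp()
    row = tl.load(row_start_ptr + col_offsets, mask=mask, other=-float('inf'))
    row_minus_max = row - tl.max(row, axis=0)   # subtract max for numerical stability
    numerator = tl.exp(row_minus_max)
    denominator = tl.sum(numerator, axis=0)
    out_row_ptr = output_ptr + row_idx * output_row_stride
    tl.store(out_row_ptr + col_offsets, numerator / denominator, mask=mask)


def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    BLOCK_SIZE = triton.next_power_of_2(n_cols)  # round columns up to a power of two
    y = torch.empty_like(x)
    softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0),
                              n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return y
```

The whole row lives in registers for the duration of the program, which is exactly the fusion that eliminates the intermediate reads and writes counted above.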
Performance testing:
Benchmarks were run on increasing vector sizes, measuring throughput in GB/s.
Results show Triton achieving up to 4× higher throughput than Torch JIT for the vector‑addition kernel and a clear advantage over native torch.softmax for the fused softmax, while noting that PyTorch’s softmax remains more general.
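A benchmark of this kind can be sketched with Triton's built‑in timing helper `triton.testing.do_bench`; the function name `bench_add` and the throughput accounting are illustrative assumptions:

```python
import torch
import triton


def bench_add(size: int) -> float:
    """Return throughput in GB/s for element-wise add at a given vector size."""
    x = torch.rand(size, device='cuda', dtype=torch.float32)
    y = torch.rand(size, device='cuda', dtype=torch.float32)
    ms = triton.testing.do_bench(lambda: x + y)   # median runtime in milliseconds
    # two tensors read + one tensor written, 4 bytes per float32 element
    return 3 * size * 4 * 1e-9 / (ms * 1e-3)
```

Running the same harness over the Triton kernel, `torch.add`, and a `torch.jit.script` version at geometrically increasing sizes produces the GB/s curves the article reports.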
Conclusion:
Triton provides a Python‑like interface that simplifies GPU kernel development without requiring deep CUDA expertise.
Understanding basic GPU architecture and parallel‑computing principles remains important.
Through automatic configuration optimization, Triton can match or exceed hand‑written CUDA performance on various hardware.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Network Intelligence Research Center (NIRC)
NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
