
How to Write High‑Performance GPU Code with OpenAI Triton

This article introduces OpenAI's Triton language, compares its block‑wise programming model to traditional CUDA, walks through vector‑addition and fused‑softmax kernel implementations, and presents benchmark results that demonstrate significant speedups over native PyTorch operations.

Network Intelligence Research Center (NIRC)

OpenAI Triton is an open‑source programming language and compiler designed for deep‑learning and high‑performance computing on GPUs, aiming to provide higher productivity than CUDA while delivering comparable or better performance.

Background and motivation: Traditional CUDA programming is complex, and existing DSLs (e.g., Tiramisu, Halide, TVM) often lag behind hand‑tuned libraries such as cuBLAS, cuDNN, and TensorRT. Triton seeks to lower the difficulty of GPU programming while improving operator efficiency.

Relationship with CUDA: Unlike CUDA's single‑program‑multiple‑data (SPMD) model, which programs at the level of individual threads, Triton adopts a block‑wise paradigm. In matrix multiplication, for example, the program iterates over tiles of the output, creating an iteration space that offers more flexibility for sparse operations and lets the compiler aggressively optimize data locality and parallelism.
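To make the block‑wise paradigm concrete, here is a minimal sketch of the tile‑level iteration pattern in plain NumPy (not Triton code): each outer iteration corresponds to what would be one Triton "program" owning one output tile, with an inner loop over the K dimension. The function name and tile sizes are illustrative, not from the source.

```python
import numpy as np

def blocked_matmul(A, B, BM=32, BN=32, BK=32):
    """Block-wise matmul sketch: one (BM, BN) output tile per 'program'."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for m0 in range(0, M, BM):          # each (m0, n0) pair ~ one Triton program
        for n0 in range(0, N, BN):
            acc = np.zeros((min(BM, M - m0), min(BN, N - n0)), dtype=A.dtype)
            for k0 in range(0, K, BK):  # inner loop over K-sized chunks
                a = A[m0:m0 + BM, k0:k0 + BK]
                b = B[k0:k0 + BK, n0:n0 + BN]
                acc += a @ b            # in Triton this would be tl.dot
            C[m0:m0 + BM, n0:n0 + BN] = acc
    return C
```

Because the unit of work is a tile rather than a thread, the compiler (or here, NumPy's vectorized `@`) is free to decide how the elements inside each tile are processed.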

Example 1 – Vector addition kernel:

Demonstrates Triton’s basic programming style and the use of the @triton.jit decorator to define a JIT‑compiled kernel.

The kernel receives pointers x_ptr, y_ptr, and output_ptr, the element count, and a compile‑time constant BLOCK_SIZE; it computes element‑wise addition using tl.load, arithmetic, and tl.store, with masking to handle sizes that are not a multiple of the block size.

A helper Python function allocates tensors on the GPU, calculates the grid size with triton.cdiv, launches the kernel asynchronously via add_kernel[grid], and returns the result.
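A sketch of the kernel and helper described above, closely following the official Triton vector‑addition tutorial (names such as add_kernel and add come from that tutorial, not this article); it assumes an NVIDIA GPU with torch and triton installed.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements,
               BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                  # each program handles one block
    block_start = pid * BLOCK_SIZE
    offsets = block_start + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    output = torch.empty_like(x)
    n_elements = output.numel()
    # one program per BLOCK_SIZE-sized chunk, rounded up with cdiv
    grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']),)
    add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=1024)
    return output  # the launch is asynchronous; torch synchronizes on use
```

Note that the mask makes the kernel correct for any vector length, not just multiples of 1024.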

Vector addition kernel

Example 2 – Fused softmax kernel:

Motivation: for an M×N input, a naïve PyTorch softmax moves far more data (≈5MN + 2M element reads and ≈3MN + 2M writes) than a fused kernel that reads and writes each element roughly once (≈MN + M in each direction), theoretically offering a ~4× speedup.
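The ~4× figure follows directly from the traffic counts above; a quick arithmetic check, using exactly the counts stated in the text:

```python
def speedup(M, N):
    """Ratio of naive to fused memory traffic for an M x N softmax,
    using the element counts quoted in the article."""
    naive = (5 * M * N + 2 * M) + (3 * M * N + 2 * M)  # reads + writes
    fused = (M * N + M) + (M * N + M)                   # one pass each way
    return naive / fused

# As N grows, the ratio (8MN + 4M) / (2MN + 2M) approaches 8/2 = 4.
```

The bound is asymptotic: for small N the lower‑order M terms pull the ratio slightly below 4.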

The Triton kernel assigns one program to each row of the input matrix, normalizes the row, and writes the result, with the constraint that the block size must be a power of two, which requires padding and careful masking.

Helper functions compute triton.next_power_of_2 for block sizing and launch the kernel with dynamic grid configuration.
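A sketch of the fused kernel and its launcher, modeled on the official Triton fused‑softmax tutorial (the identifiers softmax_kernel and softmax follow that tutorial); it assumes an NVIDIA GPU and that each row fits in one block.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(output_ptr, input_ptr, input_row_stride, output_row_stride,
                   n_cols, BLOCK_SIZE: tl.constexpr):
    row_idx = tl.program_id(0)                   # one program per row
    row_start_ptr = input_ptr + row_idx * input_row_stride
    col_offsets = tl.arange(0, BLOCK_SIZE)       # BLOCK_SIZE: power of two
    mask = col_offsets < n_cols
    # pad out-of-bounds columns with -inf so they vanish after exp()
    row = tl.load(row_start_ptr + col_offsets, mask=mask, other=-float('inf'))
    row_minus_max = row - tl.max(row, axis=0)    # subtract max for stability
    numerator = tl.exp(row_minus_max)
    denominator = tl.sum(numerator, axis=0)
    softmax_output = numerator / denominator
    out_row_start_ptr = output_ptr + row_idx * output_row_stride
    tl.store(out_row_start_ptr + col_offsets, softmax_output, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    n_rows, n_cols = x.shape
    # round the row length up to the next power of two for tl.arange
    BLOCK_SIZE = triton.next_power_of_2(n_cols)
    y = torch.empty_like(x)
    softmax_kernel[(n_rows,)](y, x, x.stride(0), y.stride(0),
                              n_cols, BLOCK_SIZE=BLOCK_SIZE)
    return y
```

Each row is read once, reduced entirely on‑chip, and written once, which is the source of the fused kernel's bandwidth advantage.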

Fused softmax kernel

Performance testing:

Benchmarks were run on increasing vector sizes, measuring throughput in GB/s.

Results show Triton achieving up to 4× higher throughput than Torch JIT for the vector‑addition kernel and a clear advantage over native torch.softmax for the fused softmax, while noting that PyTorch’s softmax remains more general.
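For reference, a minimal sketch of how such GB/s numbers can be obtained with triton.testing.do_bench, shown here for the native torch.softmax baseline (the function name softmax_gbps and the two‑pass traffic model are illustrative assumptions, not from the source):

```python
import torch
import triton

def softmax_gbps(M, N):
    """Median-time throughput of torch.softmax on an M x N float32 matrix."""
    x = torch.randn(M, N, device='cuda', dtype=torch.float32)
    ms = triton.testing.do_bench(lambda: torch.softmax(x, dim=-1))
    # count one read + one write of the matrix as a lower bound on traffic
    bytes_moved = 2 * x.numel() * x.element_size()
    return bytes_moved * 1e-9 / (ms * 1e-3)
```

Running the same harness over a sweep of sizes for both the Triton and torch implementations yields throughput curves like those summarized above; absolute numbers depend on the GPU.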

Performance comparison

Conclusion:

Triton provides a Python‑like interface that simplifies GPU kernel development without requiring deep CUDA expertise.

Understanding basic GPU architecture and parallel‑computing principles remains important.

Through automatic configuration optimization, Triton can match or exceed hand‑written CUDA performance on various hardware.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

CUDA · PyTorch · GPU programming · kernel fusion · Triton · performance benchmarking
Written by

Network Intelligence Research Center (NIRC)

NIRC is based on the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology matrix across four AI domains—intelligent cloud networking, natural language processing, computer vision, and machine learning systems—dedicated to solving real‑world problems, creating top‑tier systems, publishing high‑impact papers, and contributing significantly to the rapid advancement of China's network technology.
