DeepGEMM vs Cutlass vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?
This article presents a comprehensive benchmark of DeepGEMM, Cutlass, and Triton on NVIDIA H20 and H800 GPUs, analyzing TFLOPS, bandwidth, latency, and speedup across various matrix sizes, and concludes which library is optimal for different workload scenarios.
1. Background
DeepGEMM is a lightweight FP8 GEMM library built with CUDA and JIT compilation, targeting NVIDIA Hopper tensor cores. It avoids heavy dependencies on CUTLASS and CuTe, offering a concise core kernel of about 300 lines that is easy to study and extend.
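To make the FP8 angle concrete, here is a minimal pure-Python sketch of the fine-grained group scaling that block-scaled FP8 GEMM kernels rely on. The 128-element group size and the E4M3 dynamic-range limit of 448 are assumptions based on common FP8-E4M3 practice, not details taken from this article or the DeepGEMM source.

```python
# Illustrative sketch of fine-grained FP8 scaling as used by block-scaled
# FP8 GEMM kernels. The 128-element group size and the E4M3 limit of 448
# are assumptions (standard FP8-E4M3 practice), not measured details.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
GROUP = 128       # per-group scaling granularity (assumed)

def quantize_groups(x):
    """Scale each 128-value group into the E4M3 range; return (q, scales)."""
    q, scales = [], []
    for i in range(0, len(x), GROUP):
        block = x[i:i + GROUP]
        amax = max(abs(v) for v in block) or 1.0
        s = amax / E4M3_MAX
        scales.append(s)
        # A real kernel would also round to the nearest representable FP8
        # value; here we only apply the scale to show the bookkeeping.
        q.extend(v / s for v in block)
    return q, scales

def dequantize_groups(q, scales):
    return [v * scales[i // GROUP] for i, v in enumerate(q)]

vals = [0.5, -3.0, 120.0, -0.01] * 64          # 256 values -> 2 groups
q, s = quantize_groups(vals)
restored = dequantize_groups(q, s)
max_err = max(abs(a - b) for a, b in zip(vals, restored))
print(max_err)  # near-lossless here; real FP8 rounding would add error
```

The point of the per-group scale is that one outlier only compresses the dynamic range of its own 128-value group rather than the whole tensor, which is why fine-grained scaling preserves accuracy at FP8 precision.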
2. Validation Runs
2.1 H20 Results
We evaluated all dense matrix shapes that may appear in DeepSeek‑V3/R1 inference on the H20 GPU, comparing DeepGEMM, vLLM Triton, and vLLM Cutlass.
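Before the results, a note on how metrics like these are typically derived from a measured kernel time. The sketch below assumes FP8 (1-byte) inputs and a BF16 (2-byte) output and ignores the small scale tensors; the shape and timing are made-up illustration values, not measurements from this benchmark.

```python
# Minimal sketch of deriving TFLOPS and effective bandwidth from a
# measured GEMM latency. Byte counts assume FP8 (1-byte) A and B and a
# BF16 (2-byte) C, ignoring scale tensors; numbers below are illustrative.

def gemm_metrics(m, n, k, seconds):
    flops = 2 * m * n * k                      # one multiply + one add per MAC
    bytes_moved = m * k + k * n + 2 * m * n    # A (fp8) + B (fp8) + C (bf16)
    return {
        "tflops": flops / seconds / 1e12,
        "gb_per_s": bytes_moved / seconds / 1e9,
    }

stats = gemm_metrics(m=4096, n=7168, k=2048, seconds=1.2e-4)
print(round(stats["tflops"], 1), round(stats["gb_per_s"], 1))
```

Small-m shapes have a low arithmetic intensity (few FLOPs per byte moved), so they tend to be bandwidth-bound, while large-m shapes are compute-bound; this is the usual reason the winner flips between libraries as m grows.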
Performance Comparison – DeepGEMM vs Cutlass
Overall, Cutlass performance fluctuates between 0.49× and 3.31× the speed of DeepGEMM.
For small matrices (m ≤ 128), Cutlass can be up to 3.31× faster, but for large k values it falls behind, reaching only 0.49×–0.78× of DeepGEMM's speed.
For medium matrices (256 ≤ m ≤ 1024), Cutlass is typically 1.0×–1.78× faster.
For large matrices (m = 4096), Cutlass performance is comparable to DeepGEMM (≈1.01×) but can be slower for certain k values.
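The article does not spell out how its "N×" figures are computed; a common convention, assumed here, is the ratio of the baseline kernel's latency to the candidate's. The latencies below are made-up illustration values.

```python
# Assumed convention for the "N x faster" figures: ratio of baseline
# latency to candidate latency. Above 1.0 the candidate (here Cutlass)
# is faster; below 1.0 it is slower. Latencies are illustrative only.

def speedup(t_baseline_ms, t_candidate_ms):
    return t_baseline_ms / t_candidate_ms

print(speedup(0.33, 0.10))  # small-matrix regime: candidate ~3.3x faster
print(speedup(0.16, 0.21))  # large-k regime: ratio < 1, candidate slower
```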
Performance Comparison – DeepGEMM vs Triton
DeepGEMM consistently outperforms Triton, with speedups ranging from 1.38× to 1.95× across matrix sizes.
For small matrices, DeepGEMM leads Triton by up to 1.95×, especially when k is large.
For medium matrices, DeepGEMM remains faster, though the gap narrows in some cases.
Conclusion (H20)
DeepGEMM is competitive with the expert‑tuned libraries across most shapes and pulls ahead on large‑k GEMMs.
Cutlass excels on small matrices but loses advantage on very large GEMMs.
Triton is the slowest implementation in all cases.
2.2 H800 Results
We repeated the same set of benchmarks on the higher‑compute H800 GPU.
Cutlass vs DeepGEMM
For small matrices (m, n, k ≤ 256), Cutlass is 2–5× faster than DeepGEMM.
For medium matrices (512 ≤ m, n, k ≤ 2048), Cutlass remains 1.0×–3.5× faster, but the gap narrows as matrix size grows.
For large matrices (m, n, k ≥ 4096), DeepGEMM overtakes Cutlass, achieving up to 1.74× higher performance on the largest shapes.
Triton vs DeepGEMM
Triton is 2–3× slower than DeepGEMM in all cases.
Even where Cutlass loses to DeepGEMM on large matrices, DeepGEMM still far outperforms Triton.
Conclusion (H800)
DeepGEMM provides the best performance on large‑scale GEMMs on H800.
Cutlass is optimal for small matrices but is surpassed by DeepGEMM on very large workloads.
Triton remains the least efficient implementation.
3. Algorithm Comparison
3.1 DeepGEMM
Pros: High compute performance, outperforming Triton on both H20 and H800; runs on multiple GPU models; easy to integrate.
Cons: Slower than Cutlass on H800 for small and medium shapes in both TFLOPS and bandwidth.
3.2 vLLM Triton
Pros: Flexible kernel development; supports dynamic shapes.
Cons: Lowest compute efficiency; significantly slower than DeepGEMM and Cutlass, especially on H800.
3.3 vLLM Cutlass
Pros: Highest TFLOPS and bandwidth on H800; shortest execution latency (0.16 ms); ideal for high‑throughput, large‑batch inference.
Cons: Limited performance gains on lower‑compute GPUs like the H20; less suited to dynamic workloads.
4. Summary
DeepGEMM: Versatile, strong performance across GPUs; best for general‑purpose and large‑scale GEMM workloads.
vLLM Triton: Good for research and dynamic‑shape experiments, but not for production‑level speed.
vLLM Cutlass: Top choice on high‑end GPUs (H800) for maximum throughput; less benefit on lower‑compute hardware like the H20.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.