DeepGEMM vs Cutlass vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?
This article presents a comprehensive benchmark of DeepGEMM, Cutlass, and Triton on NVIDIA H20 and H800 GPUs, analyzing TFLOPS, bandwidth, latency, and speedup across various matrix sizes, and concludes which library is optimal for different workload scenarios.
1. Background
DeepGEMM is a lightweight FP8 GEMM library built with CUDA and JIT compilation, targeting NVIDIA Hopper tensor cores. It avoids heavy dependencies on CUTLASS and CuTe, offering a concise core kernel of about 300 lines that is easy to study and extend.
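To make the FP8 angle concrete, here is a minimal pure-Python sketch of the fine-grained group scaling that block-scaled FP8 GEMM kernels rely on. The 128-element group size and the E4M3 dynamic-range limit of 448 are assumptions based on common FP8-E4M3 practice, not details taken from this article or the DeepGEMM source.

```python
# Illustrative sketch of fine-grained FP8 scaling as used by block-scaled
# FP8 GEMM kernels. The 128-element group size and the E4M3 limit of 448
# are assumptions (standard FP8-E4M3 practice), not measured details.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3
GROUP = 128       # per-group scaling granularity (assumed)

def quantize_groups(x):
    """Scale each 128-value group into the E4M3 range; return (q, scales)."""
    q, scales = [], []
    for i in range(0, len(x), GROUP):
        block = x[i:i + GROUP]
        amax = max(abs(v) for v in block) or 1.0
        s = amax / E4M3_MAX
        scales.append(s)
        # A real kernel would also round to the nearest representable FP8
        # value; here we only apply the scale to show the bookkeeping.
        q.extend(v / s for v in block)
    return q, scales

def dequantize_groups(q, scales):
    return [v * scales[i // GROUP] for i, v in enumerate(q)]

vals = [0.5, -3.0, 120.0, -0.01] * 64          # 256 values -> 2 groups
q, s = quantize_groups(vals)
restored = dequantize_groups(q, s)
max_err = max(abs(a - b) for a, b in zip(vals, restored))
print(max_err)  # near-lossless here; real FP8 rounding would add error
```

The point of the per-group scale is that one outlier only compresses the dynamic range of its own 128-value group rather than the whole tensor, which is why fine-grained scaling preserves accuracy at FP8 precision.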
2. Validation Runs
2.1 H20 Results
We evaluated all dense matrix shapes that may appear in DeepSeek‑V3/R1 inference on the H20 GPU, comparing DeepGEMM, vLLM Triton, and vLLM Cutlass.
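Before the results, a note on how metrics like these are typically derived from a measured kernel time. The sketch below assumes FP8 (1-byte) inputs and a BF16 (2-byte) output and ignores the small scale tensors; the shape and timing are made-up illustration values, not measurements from this benchmark.

```python
# Minimal sketch of deriving TFLOPS and effective bandwidth from a
# measured GEMM latency. Byte counts assume FP8 (1-byte) A and B and a
# BF16 (2-byte) C, ignoring scale tensors; numbers below are illustrative.

def gemm_metrics(m, n, k, seconds):
    flops = 2 * m * n * k                      # one multiply + one add per MAC
    bytes_moved = m * k + k * n + 2 * m * n    # A (fp8) + B (fp8) + C (bf16)
    return {
        "tflops": flops / seconds / 1e12,
        "gb_per_s": bytes_moved / seconds / 1e9,
    }

stats = gemm_metrics(m=4096, n=7168, k=2048, seconds=1.2e-4)
print(round(stats["tflops"], 1), round(stats["gb_per_s"], 1))
```

Small-m shapes have a low arithmetic intensity (few FLOPs per byte moved), so they tend to be bandwidth-bound, while large-m shapes are compute-bound; this is the usual reason the winner flips between libraries as m grows.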
Performance Comparison – DeepGEMM vs Cutlass
Overall, Cutlass performance fluctuates between 0.49× and 3.31× the speed of DeepGEMM.
For small matrices (m ≤ 128), Cutlass can be up to 3.31× faster, but for large k values it falls behind, reaching only 0.49×–0.78× of DeepGEMM's speed.
For medium matrices (256 ≤ m ≤ 1024), Cutlass is typically 1.0×–1.78× faster.
For large matrices (m = 4096), Cutlass performance is comparable to DeepGEMM (≈1.01×) but can be slower for certain k values.
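The article does not spell out how its "N×" figures are computed; a common convention, assumed here, is the ratio of the baseline kernel's latency to the candidate's. The latencies below are made-up illustration values.

```python
# Assumed convention for the "N x faster" figures: ratio of baseline
# latency to candidate latency. Above 1.0 the candidate (here Cutlass)
# is faster; below 1.0 it is slower. Latencies are illustrative only.

def speedup(t_baseline_ms, t_candidate_ms):
    return t_baseline_ms / t_candidate_ms

print(speedup(0.33, 0.10))  # small-matrix regime: candidate ~3.3x faster
print(speedup(0.16, 0.21))  # large-k regime: ratio < 1, candidate slower
```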
Performance Comparison – DeepGEMM vs Triton
DeepGEMM consistently outperforms Triton, with speedups ranging from 1.38× to 1.95× across matrix sizes.
For small matrices, DeepGEMM leads Triton by up to 1.95×, especially when k is large.
For medium matrices, DeepGEMM remains faster, though the gap narrows in some cases.
Conclusion (H20)
DeepGEMM is competitive with the expert‑tuned libraries across most shapes and pulls ahead on large‑k GEMMs.
Cutlass excels on small matrices but loses advantage on very large GEMMs.
Triton is the slowest implementation in all cases.
2.2 H800 Results
We repeated the same set of benchmarks on the higher‑compute H800 GPU.
Cutlass vs DeepGEMM
For small matrices (m, n, k ≤ 256), Cutlass is 2–5× faster than DeepGEMM.
For medium matrices (512 ≤ m, n, k ≤ 2048), Cutlass remains 1.0×–3.5× faster, but the gap narrows as matrix size grows.
For large matrices (m, n, k ≥ 4096), DeepGEMM overtakes Cutlass, achieving up to 1.74× higher performance on the largest shapes.
Triton vs DeepGEMM
Triton is 2–3× slower than DeepGEMM in all cases.
Even where Cutlass loses to DeepGEMM on large matrices, DeepGEMM still far outperforms Triton.
Conclusion (H800)
DeepGEMM provides the best performance on large‑scale GEMMs on H800.
Cutlass is optimal for small matrices but is surpassed by DeepGEMM on very large workloads.
Triton remains the least efficient implementation.
3. Algorithm Comparison
3.1 DeepGEMM
Pros: High compute performance, outperforming Triton on both H20 and H800; runs on multiple GPU models; easy to integrate.
Cons: Slower than Cutlass on H800 for small and medium shapes in both TFLOPS and bandwidth.
3.2 vLLM Triton
Pros: Flexible kernel development; supports dynamic shapes.
Cons: Lowest compute efficiency; significantly slower than DeepGEMM and Cutlass, especially on H800.
3.3 vLLM Cutlass
Pros: Highest TFLOPS and bandwidth on H800; shortest execution latency (0.16 ms); ideal for high‑throughput, large‑batch inference.
Cons: Limited performance gains on lower‑compute GPUs like the H20; less suited to dynamic workloads.
4. Summary
DeepGEMM: Versatile, strong performance across GPUs; best for general‑purpose and large‑scale GEMM workloads.
vLLM Triton: Good for research and dynamic‑shape experiments, but not for production‑level speed.
vLLM Cutlass: Top choice on high‑end GPUs (H800) for maximum throughput; less benefit on lower‑compute hardware like the H20.
360 Zhihui Cloud Developer
360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.