DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference
DeepGEMM is an open‑source FP8 GEMM library that delivers over 1350 TFLOPS on NVIDIA Hopper GPUs. Its JIT‑compiled core is only about 300 lines of code, supports both dense and MoE matrix multiplication, deploys without an ahead‑of‑time build step, is tunable through environment variables, and outperforms optimized CUTLASS kernels on the matrix shapes common in large AI models.
DeepGEMM is a high‑performance computing library focused on FP8 general matrix multiplication (GEMM), designed to accelerate AI models such as DeepSeek‑V3 and R1, especially large mixture‑of‑experts (MoE) models that require massive matrix operations.
Key Innovations
Runs on NVIDIA Hopper GPUs (e.g., H100) achieving more than 1350 TFLOPS of FP8 compute by exploiting Tensor Core and Tensor Memory Accelerator (TMA) features.
Implements a two‑stage accumulation technique, in which partial sums produced by the FP8 Tensor Cores are periodically promoted into a higher‑precision accumulator, to preserve accuracy despite the low precision of FP8.
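To see why staging the accumulation matters, here is a minimal pure‑Python simulation of the idea (not DeepGEMM's CUDA kernel): a single low‑precision running sum eventually stalls because small addends round away, while short low‑precision partial sums promoted into a high‑precision total stay accurate. IEEE half precision stands in for the limited‑precision hardware accumulator.

```python
import struct

def to_half(x: float) -> float:
    # Round x to IEEE-754 half precision, our stand-in for a
    # low-precision hardware accumulator.
    return struct.unpack('<e', struct.pack('<e', x))[0]

def naive_sum(values):
    # One low-precision accumulator: once the running total grows,
    # each small addend rounds away entirely and the sum stalls.
    acc = 0.0
    for v in values:
        acc = to_half(acc + v)
    return acc

def two_stage_sum(values, block=16):
    # Two-stage scheme: short low-precision partial sums are
    # promoted into a high-precision (float64) master accumulator.
    total = 0.0
    for i in range(0, len(values), block):
        partial = 0.0
        for v in values[i:i + block]:
            partial = to_half(partial + v)
        total += partial
    return total

values = [0.1] * 4096  # true sum is 409.6
# naive_sum stalls far below the true total; two_stage_sum stays close.
print(naive_sum(values), two_stage_sum(values))
```

The same trade applies at FP8: the inner loop runs at hardware speed in low precision, and the promotion step bounds how far rounding error can compound.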
Supports both dense matrix layouts and two MoE layouts (contiguous and masked‑grouped), with specialized grouped‑GEMM interfaces such as m_grouped_gemm_fp8_fp8_bf16_nt_contiguous and m_grouped_gemm_fp8_fp8_bf16_nt_masked.
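A rough sketch of what the contiguous grouped layout implies, using a hypothetical helper rather than DeepGEMM's API: each expert's token block is padded up to an M alignment so that all groups can be concatenated into one tall matrix and dispatched as a single grouped GEMM. The alignment value 128 is assumed for illustration only; the library exposes the real value via deep_gemm.get_m_alignment_for_contiguous_layout.

```python
def build_contiguous_groups(group_sizes, m_alignment=128):
    # Hypothetical sketch of the contiguous MoE layout idea:
    # pad each expert's token count (its M dimension) up to
    # m_alignment, then concatenate all groups row-wise.
    # Returns each group's starting row and the total padded M.
    offsets, total = [], 0
    for g in group_sizes:
        offsets.append(total)
        padded = (g + m_alignment - 1) // m_alignment * m_alignment
        total += padded
    return offsets, total

# Two experts receiving 100 and 300 tokens, padded to multiples of 128:
offsets, total_m = build_contiguous_groups([100, 300])
print(offsets, total_m)
```

The masked‑grouped layout serves the same goal differently: group shapes are fixed up front and a mask marks the valid rows, which suits inference where token counts per expert vary step to step.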
The core logic is only about 300 lines of code, written as a teaching‑style example with no heavy dependencies, and is fully JIT‑compiled at runtime.
Performance
Compared with an optimized CUTLASS 3.6 implementation, DeepGEMM delivers 1.1×‑2.7× speed‑ups across a variety of matrix sizes, with certain shapes (e.g., M=64, N=2112, K=7168) reaching the upper end of that range.
Technical Parameters
FP8 formats: E4M3 and E5M2, selected based on hardware support.
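For intuition about the two formats: E4M3 (1 sign, 4 exponent, 3 mantissa bits) trades range for precision, while E5M2 (5 exponent, 2 mantissa bits) does the reverse. Below is a small decoder for the common NVIDIA/OCP E4M3 variant, written here as an illustration rather than taken from DeepGEMM.

```python
def decode_e4m3(byte: int) -> float:
    # Decode one FP8 E4M3 byte: 1 sign, 4 exponent, 3 mantissa bits,
    # exponent bias 7. The NVIDIA/OCP E4M3 variant has no infinities;
    # S.1111.111 encodes NaN, and the largest finite magnitude is 448.
    sign = -1.0 if byte & 0x80 else 1.0
    exp = (byte >> 3) & 0xF
    man = byte & 0x7
    if exp == 0xF and man == 0x7:
        return float("nan")
    if exp == 0:
        return sign * (man / 8.0) * 2.0 ** -6   # subnormals
    return sign * (1.0 + man / 8.0) * 2.0 ** (exp - 7)

print(decode_e4m3(0x38))  # encodes 1.0
print(decode_e4m3(0x7E))  # largest finite E4M3 value, 448.0
```

Because every FP8 value carries so few mantissa bits, per‑block scaling factors accompany the raw tensors; the GEMM consumes FP8 inputs plus their scales and produces BF16 output, which is what the _fp8_fp8_bf16_ naming reflects.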
Hardware requirement: NVIDIA Hopper architecture (sm_90a), CUDA toolkit 12.8 or newer.
Environment variables such as DG_CACHE_DIR, DG_DISABLE_FFMA_INTERLEAVE, DG_JIT_PRINT_NVCC_COMMAND, and DG_JIT_DEBUG control caching, optimization toggles, and debugging output.
Utility functions like deep_gemm.set_num_sms and deep_gemm.get_m_alignment_for_contiguous_layout help fine‑tune performance.
Deployment Steps
Ensure a Hopper‑class GPU is installed and the latest CUDA toolchain is available.
Clone the DeepGEMM repository from GitHub; no separate build step is needed because kernels are JIT‑compiled at runtime.
Configure optional environment variables (e.g., DG_CACHE_DIR, DG_NVCC_COMPILER) as needed.
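As a sketch, such a configuration might look like the following. The variable names come from the DeepGEMM documentation; the paths and values are placeholders, not recommendations.

```shell
# Placeholder values for illustration only.
export DG_CACHE_DIR="$HOME/.deep_gemm_cache"              # where JIT-compiled kernels are cached
export DG_NVCC_COMPILER="/usr/local/cuda-12.8/bin/nvcc"   # pin a specific nvcc binary
export DG_JIT_PRINT_NVCC_COMMAND=1                        # echo the nvcc command for each kernel
```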
Invoke the appropriate API: deep_gemm.gemm_fp8_fp8_bf16_nt for dense GEMM, m_grouped_gemm_fp8_fp8_bf16_nt_contiguous for contiguous MoE layout, or m_grouped_gemm_fp8_fp8_bf16_nt_masked for masked‑grouped layout.
Adjust performance parameters based on matrix dimensions and use helper functions such as deep_gemm.get_tma_aligned_size to ensure proper data alignment.
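The alignment requirement can be pictured as rounding a dimension up to a byte boundary. The following is a hypothetical re‑implementation of that rounding logic, not the library's actual helper, under the assumption of a 16‑byte TMA alignment.

```python
def tma_aligned_size(m: int, element_size_bytes: int) -> int:
    # Hypothetical sketch of the rounding idea behind
    # deep_gemm.get_tma_aligned_size: TMA transfers want the leading
    # dimension padded so each row starts on an aligned byte boundary.
    TMA_ALIGNMENT_BYTES = 16  # assumed alignment for this sketch
    assert TMA_ALIGNMENT_BYTES % element_size_bytes == 0
    elems = TMA_ALIGNMENT_BYTES // element_size_bytes
    return (m + elems - 1) // elems * elems

# 100 BF16 elements (2 bytes each) pad up to the next multiple of 8:
print(tma_aligned_size(100, 2))
```

Padding a tensor's stride this way costs a few wasted elements per row but lets the TMA engine issue full aligned transfers instead of falling back to slower paths.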
Enable debugging variables (DG_JIT_DEBUG, DG_PRINT_REG_REUSE) to monitor JIT compilation and register usage.
Conclusion
DeepGEMM provides a compact yet powerful solution for FP8‑based matrix multiplication, delivering significant speed‑ups for large AI and MoE models while keeping the codebase minimal and flexible through JIT compilation. Its reliance on Hopper‑specific hardware features means it excels on that platform but may be limited on other GPU architectures.
GitHub: https://github.com/deepseek-ai/DeepGEMM
DataFunTalk