Tagged articles

DeepGEMM

4 articles

Apr 17, 2026 · Artificial Intelligence

DeepSeek Introduces Mega MoE and FP4 Indexer – Inside the New GPU Fusion Kernel

DeepSeek's latest DeepGEMM update adds Mega MoE, a fused GPU kernel that collapses the entire Mixture‑of‑Experts pipeline and overlaps computation with NVLink communication. The update also unveils an FP4 indexer and FP8×FP4 precision experiments, signaling a push toward more efficient large‑scale AI training.

DeepGEMM · DeepSeek · FP4 Indexer

5 min read

360 Zhihui Cloud Developer

Apr 1, 2025 · Artificial Intelligence

DeepGEMM vs CUTLASS vs Triton: Which GPU GEMM Library Delivers the Best FP8 Performance?

This article benchmarks DeepGEMM, CUTLASS, and Triton on NVIDIA H20 and H800 GPUs, comparing TFLOPS, bandwidth, latency, and speedup across a range of matrix sizes, and identifies which library is best suited to each workload scenario.

Benchmark · CUDA · DeepGEMM

15 min read

NewBeeNLP

Feb 27, 2025 · Industry Insights

How DeepSeek’s Open‑Source Tools Exploit China‑Specific H800 GPUs to Boost AI Performance

The article analyzes DeepSeek’s three open‑source projects (FlashMLA, DeepEP, and DeepGEMM) and shows how they are optimized for the China‑only NVIDIA H800 GPU. It contrasts this with the abundant hardware resources of Western AI firms and highlights the growing demand for talent that masters both AI models and GPU hardware.

AI hardware · DeepEP · DeepGEMM

7 min read

DataFunTalk

Feb 26, 2025 · Artificial Intelligence

DeepGEMM: An Open‑Source FP8 GEMM Library for Efficient AI Model Training and Inference

DeepGEMM is an open‑source FP8 GEMM library that delivers up to 1350 TFLOPS on NVIDIA Hopper GPUs. Its lightweight, JIT‑compiled code (~300 lines) handles dense and MoE matrix multiplication, and the library offers easy deployment, configurable environment variables, and performance advantages over CUTLASS for large AI models.

AI acceleration · DeepGEMM · FP8

7 min read
