Flash-MLA: Boosting LLM Inference Speed on Nvidia Hopper GPUs

Flash-MLA is an open-source GPU decoding kernel for Nvidia Hopper GPUs that compresses the multi-head-attention KV cache into a compact latent representation, cutting memory usage by up to 93.3% and sustaining 580 TFLOPS of compute, thereby dramatically accelerating large-language-model inference while lowering cost.

On Monday, DeepSeek kicked off its open-source week by releasing Flash-MLA, a GPU kernel built for Nvidia Hopper GPUs (e.g., the H800) that targets variable-length sequence decoding in large-language-model (LLM) inference.

MLA (Multi-head Latent Attention) compresses each token's keys and values into a single latent vector c_t, shrinking the per-sequence cache from seq_len × 2 × n_h × d_h elements (keys and values across n_h heads of dimension d_h) down to seq_len × d_c, where d_c is the latent dimension. DeepSeek-V2 measurements show a memory-usage reduction of up to 93.3%.
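
To make the arithmetic concrete, the short Python sketch below compares the per-sequence cache size of standard multi-head attention with MLA's latent cache. The dimensions (n_h, d_h, d_c) are illustrative placeholders, not DeepSeek-V2's actual configuration, so the printed reduction will differ from the paper's 93.3% figure.

```python
# Illustrative sketch: KV-cache size of standard multi-head attention vs.
# MLA's compressed latent cache. Dimensions below are placeholders, not
# the actual DeepSeek-V2 configuration.

BYTES_PER_ELEM = 2          # BF16
N_H, D_H = 128, 128         # assumed number of heads / per-head dimension
D_C = 512                   # assumed latent (compressed) dimension

def cache_bytes(seq_len: int) -> tuple[int, int]:
    standard = seq_len * 2 * N_H * D_H * BYTES_PER_ELEM   # keys and values, all heads
    mla = seq_len * D_C * BYTES_PER_ELEM                  # one latent vector c_t per token
    return standard, mla

std, mla = cache_bytes(seq_len=128 * 1024)
print(f"standard MHA cache: {std / 2**30:.2f} GiB")
print(f"MLA latent cache:   {mla / 2**30:.3f} GiB")
print(f"reduction:          {1 - mla / std:.1%}")
```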

Flash-MLA implements this scheme in BF16 precision with a paged KV cache that uses a block size of 64 tokens (a conceptual sketch of paging follows below). It requires a Hopper GPU, CUDA 12.3+, and PyTorch 2.0+, and it draws inspiration from FlashAttention 2/3 and Nvidia's CUTLASS library.
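
To show what a paged KV cache with 64-token blocks means in practice, here is a minimal pure-Python sketch of a block table. It is a conceptual illustration only, not the FlashMLA API; the class and variable names are invented for this example.

```python
# Conceptual sketch (not the FlashMLA API): a paged KV cache maps logical
# token positions to fixed-size physical blocks, so variable-length
# sequences can share one memory pool without worst-case allocation.

BLOCK_SIZE = 64  # Flash-MLA uses 64-token blocks

class PagedKVCache:
    def __init__(self):
        self.blocks = []        # each physical block holds up to 64 latent vectors
        self.block_table = {}   # sequence id -> list of physical block indices
        self.lengths = {}       # sequence id -> number of cached tokens

    def append(self, seq_id, latent_vec):
        """Append one compressed latent c_t for the next token of seq_id."""
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:                               # current block full (or first token)
            self.blocks.append([])                            # allocate a new physical block
            self.block_table.setdefault(seq_id, []).append(len(self.blocks) - 1)
        block_idx = self.block_table[seq_id][n // BLOCK_SIZE]
        self.blocks[block_idx].append(latent_vec)
        self.lengths[seq_id] = n + 1

    def get(self, seq_id, pos):
        """Fetch the cached latent for token position pos of seq_id."""
        block_idx = self.block_table[seq_id][pos // BLOCK_SIZE]
        return self.blocks[block_idx][pos % BLOCK_SIZE]

cache = PagedKVCache()
for _ in range(130):                                 # 130 tokens -> 3 blocks (64 + 64 + 2)
    cache.append(seq_id=0, latent_vec=[0.0] * 512)   # 512 is a placeholder d_c
print(len(cache.block_table[0]))                     # -> 3
```

Because blocks are allocated on demand, sequences of very different lengths in the same batch only consume memory proportional to the tokens they actually cache.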

The kernel reaches roughly 3000 GB/s of memory bandwidth in memory-bound configurations (near the H800's 3350 GB/s peak) and 580 TFLOPS in compute-bound configurations, a figure the author says far exceeds the 260 TFLOPS theoretical BF16 peak cited for the same hardware. The author attributes this headroom to aggressive tensor-core utilization and custom MLA-specific kernel optimizations.
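
Quoting both a bandwidth and a compute figure makes sense in roofline terms: small decode batches are memory-bound, while larger ones become compute-bound. The back-of-envelope sketch below uses the numbers quoted above (taken as rough achieved peaks, not official specs) to estimate the arithmetic intensity at which the crossover happens.

```python
# Back-of-envelope roofline estimate using the figures quoted above.
# Both numbers are treated as rough achieved peaks, not official specs.

mem_bw_bytes_per_s = 3000e9     # ~3000 GB/s in memory-bound configurations
compute_flops_per_s = 580e12    # ~580 TFLOPS in compute-bound configurations

# Arithmetic intensity (FLOPs per byte of HBM traffic) at the crossover:
# below this the kernel is limited by memory bandwidth, above it by
# tensor-core throughput.
ridge_point = compute_flops_per_s / mem_bw_bytes_per_s
print(f"crossover ≈ {ridge_point:.0f} FLOPs per byte")   # ≈ 193
```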

In DeepSeek-V2 (236 B parameters, 128 K context), MLA reduces the KV cache by 93.3%, cuts training cost by 42.5%, and raises generation throughput by 5.76×. DeepSeek-V3 (671 B parameters) demonstrates MLA's scalability, completing training in about two months for roughly $5.58 M and underscoring the economic benefits.

Adoption challenges remain, since many large-model providers still rely on grouped-query attention (GQA). Generalization across tasks and hardware needs further validation, and future work may explore optimizations for other GPU architectures or lower-precision formats such as FP8 to push performance beyond the current 580 TFLOPS.

Overall, Flash-MLA represents a key breakthrough for efficient LLM inference on Hopper GPUs, delivering near-peak memory bandwidth and strong compute efficiency while lowering inference cost, and it is poised to become a foundational component of high-performance AI deployment.

Tags: DeepSeek, LLM inference, GPU Optimization, MLA, Flash-MLA, Nvidia Hopper
Written by

AI Algorithm Path

A public account focused on deep learning, computer vision, and autonomous driving perception algorithms, covering visual CV, neural networks, pattern recognition, related hardware and software configurations, and open-source projects.
