FlashMLA vs FlashInfer: DeepSeek Inference Performance Benchmarks Revealed

The author benchmarks DeepSeek's FlashMLA against FlashInfer and several Triton-based implementations, detailing setup challenges, decode‑only bandwidth results, and observations that the official DeepSeek version leads while Triton optimizations show mixed performance across different head sizes.


Benchmark Overview

A quick performance comparison was conducted between the newly released FlashMLA and several other multi-head latent attention (MLA) inference implementations.
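For reference, FlashMLA's decode entry point follows the pattern below, adapted from the repository's README at the time of release (treat the exact signature as an assumption that may have drifted since; the shapes use DeepSeek's MLA dimensions, 576 for the compressed Q/K head, 512 for the value head, and 64-token KV pages):

```python
import torch
from flash_mla import get_mla_metadata, flash_mla_with_kvcache

b, s_q, h_q, h_kv = 32, 1, 64, 1   # decode: one new query token; MLA uses a single latent KV head
d_qk, d_v = 576, 512               # 512-dim latent + 64-dim RoPE part for Q/K; 512-dim values
block_size, max_seqlen = 64, 4096
blocks_per_seq = max_seqlen // block_size

cache_seqlens = torch.full((b,), max_seqlen, dtype=torch.int32, device="cuda")
block_table = torch.arange(b * blocks_per_seq, dtype=torch.int32,
                           device="cuda").view(b, blocks_per_seq)
q = torch.randn(b, s_q, h_q, d_qk, dtype=torch.bfloat16, device="cuda")
kvcache = torch.randn(b * blocks_per_seq, block_size, h_kv, d_qk,
                      dtype=torch.bfloat16, device="cuda")

# Plan the tile schedule once per decode step, then run the paged-KV kernel.
tile_scheduler_metadata, num_splits = get_mla_metadata(
    cache_seqlens, s_q * h_q // h_kv, h_kv)
o, lse = flash_mla_with_kvcache(q, kvcache, block_table, cache_seqlens, d_v,
                                tile_scheduler_metadata, num_splits, causal=True)
```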

Tested Implementations

FlashInfer (completed)

TileLang (pending)

SGLang Triton MLA implementation (pending)

Triton MLA implementation from Prof. Zhai's team at Tsinghua (completed)

Benchmark Script

The benchmark code is available in the following pull request: https://github.com/deepseek-ai/FlashMLA/pull/35

During script development, two issues were encountered and fixed:

Inconsistent input‑shape definitions between FlashInfer and FlashMLA prevented correctness checks.

A bug in FlashInfer's handling of the pin_memory flag caused failures when torch.set_default_device was used.
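The second issue is easy to reproduce in plain PyTorch, independent of FlashInfer's internals: pin_memory=True only applies to host (CPU) tensors, so any allocation that relies on the default device breaks once torch.set_default_device points at CUDA. A minimal illustration, assuming PyTorch's standard behavior:

```python
import torch

torch.set_default_device("cuda")

# Fails: with a CUDA default device this factory call tries to create a
# pinned *GPU* tensor, which PyTorch rejects (only CPU memory can be pinned).
try:
    staging = torch.empty(1024, pin_memory=True)
except RuntimeError as e:
    print(f"allocation failed: {e}")

# Works: request the CPU explicitly, regardless of the process-wide default.
staging = torch.empty(1024, pin_memory=True, device="cpu")
```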

After resolving these problems, measurement was restricted to decode-only bandwidth, to focus on inference throughput.
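Decode bandwidth is the usual proxy metric for this workload: decode is memory-bound, so the bytes of KV cache a step must stream from HBM, divided by kernel time, closely tracks achieved memory bandwidth. A rough sketch of the arithmetic (the PR's script is authoritative; its byte accounting may differ in detail):

```python
def decode_bandwidth_gbs(batch, seqlen, h_kv, head_dim, dtype_bytes, time_ms):
    """KV-cache bytes read per decode step divided by kernel time, in GB/s."""
    kv_bytes = batch * seqlen * h_kv * head_dim * dtype_bytes
    return kv_bytes / (time_ms * 1e-3) / 1e9

# Example: batch 64, 4K context, one 576-dim latent KV head, bf16, 0.5 ms/step
print(decode_bandwidth_gbs(64, 4096, 1, 576, 2, 0.5))  # ~604 GB/s
```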

Results

Performance was evaluated across batch sizes (1, 32, 64, 128) and query-head counts (16, 32, 64, 128); the figures below show the measured decode bandwidth (higher is better).
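A sweep of this shape can be driven by a standard CUDA-event harness; the sketch below shows the structure (the actual script lives in the PR above; kernel_for is a hypothetical placeholder for building inputs and a closure around whichever implementation is under test):

```python
import itertools
import torch

def bench_ms(fn, warmup=10, iters=100):
    # Average kernel time in milliseconds via CUDA events, after warm-up.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

def kernel_for(batch, q_heads):
    # Hypothetical stand-in: replace the matmul with a FlashMLA / FlashInfer
    # decode call built for this (batch, q_heads) configuration.
    x = torch.randn(batch, q_heads, 576, device="cuda", dtype=torch.bfloat16)
    return lambda: x @ x.transpose(-1, -2)

for batch, q_heads in itertools.product([1, 32, 64, 128], [16, 32, 64, 128]):
    print(batch, q_heads, bench_ms(kernel_for(batch, q_heads)))
```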

[Figures: measured decode bandwidth for q_head ∈ {16, 32, 64} at batch ∈ {1, 32, 64}, and q_head ∈ {16, 32, 64, 128} at batch = 128]

Key observations:

The official DeepSeek implementation achieved the highest decode bandwidth across all configurations.

FlashInfer was the second‑best performer.

The Triton-based implementation incorporating the optimization from Prof. Zhai's team ("optimize mla decode triton kernel") performed competitively only at a query-head count of 16; performance dropped sharply for larger head counts.

The performance gap is likely due to two factors: (a) the benchmark used the newer FA3 backend for FlashInfer, whereas earlier comparisons used FA2; (b) the Triton kernel was not extensively tuned for the tested configurations.
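On point (b), tuning coverage matters a great deal in Triton: block shapes and warp counts that suit 16 query heads are rarely the right choice at 64 or 128, and an autotuner only helps if its config list spans the larger sizes. A toy illustration of the mechanism, not the actual MLA kernel:

```python
import torch
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 128}, num_warps=4),
        triton.Config({"BLOCK": 256}, num_warps=8),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["h_q"],  # re-tune whenever the query-head count changes
)
@triton.jit
def toy_kernel(x_ptr, y_ptr, n, h_q, BLOCK: tl.constexpr):
    # Trivial elementwise kernel; only the autotuning plumbing matters here.
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    tl.store(y_ptr + offs, tl.load(x_ptr + offs, mask=mask) * 2.0, mask=mask)

n, h_q = 1 << 20, 64
x = torch.randn(n, device="cuda")
y = torch.empty_like(x)
toy_kernel[lambda meta: (triton.cdiv(n, meta["BLOCK"]),)](x, y, n, h_q)
```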

Relevant Optimization Commit

The Triton kernel improvement referenced in the benchmark is documented in the following commit:

https://github.com/monellz/vllm/commit/feebaa7c063be6bfb590a876741aeef1c5f58cf8