FlashMLA vs FlashInfer: DeepSeek Inference Performance Benchmarks Revealed
The author benchmarks DeepSeek's FlashMLA against FlashInfer and several Triton-based implementations, detailing setup challenges, decode‑only bandwidth results, and observations that the official DeepSeek version leads while Triton optimizations show mixed performance across different head sizes.
Benchmark Overview
A quick performance comparison was conducted between the newly released FlashMLA and several other Multi-head Latent Attention (MLA) inference implementations.
Tested Implementations
FlashInfer (completed)
TileLang (pending)
SGLang Triton MLA implementation (pending)
Triton MLA implementation from Prof. Zhai's team at Tsinghua (completed)
Benchmark Script
The benchmark code is available in the following pull request: https://github.com/deepseek-ai/FlashMLA/pull/35

During script development, two issues were encountered and fixed:
Inconsistent input‑shape definitions between FlashInfer and FlashMLA prevented correctness checks.
A bug in FlashInfer’s pin_memory flag caused failures when torch.set_default_device was used.
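The second issue can be illustrated with a short sketch. The helper name is hypothetical and this is not FlashInfer's actual allocation code; it only reproduces the failure mode: pinned (page-locked) buffers must be dense CPU tensors, so after `torch.set_default_device("cuda")` an allocation that passes `pin_memory=True` without an explicit device lands on the GPU and raises a `RuntimeError`.

```python
import torch

def make_pinned_workspace(nbytes: int) -> torch.Tensor:
    # Hypothetical helper. Forcing device="cpu" keeps the allocation
    # pinnable regardless of the process-wide default device set via
    # torch.set_default_device("cuda").
    return torch.empty(nbytes, dtype=torch.uint8, device="cpu", pin_memory=True)
```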
After resolving these problems, only decode‑only bandwidth was measured to focus on inference throughput.
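As a rough sanity check on the numbers, decode bandwidth can be estimated as bytes of KV cache streamed per decode step divided by kernel latency. This is a simplified model with illustrative names, assuming a standard K/V cache; MLA compresses the cache into a single latent head, which shrinks the byte count accordingly.

```python
def decode_bandwidth_gbps(batch_size: int, seq_len: int, num_kv_heads: int,
                          head_dim: int, dtype_bytes: int, latency_s: float) -> float:
    # One decode step streams the whole KV cache once; query and output
    # traffic is negligible by comparison, so KV reads dominate.
    # Factor of 2 accounts for separate K and V tensors (for MLA's single
    # latent cache, drop the factor and use the latent dimension instead).
    kv_bytes = 2 * batch_size * seq_len * num_kv_heads * head_dim * dtype_bytes
    return kv_bytes / latency_s / 1e9
```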
Results
Performance was evaluated across batch sizes (1, 32, 64, 128) and query-head counts (16, 32, 64, 128). The figures below show the measured decode bandwidth (higher is better):
Key observations:
The official DeepSeek implementation achieved the highest decode bandwidth across all configurations.
FlashInfer was the second‑best performer.
The Triton-based implementation incorporating the optimization from Prof. Zhai's team (optimize mla decode triton kernel) was competitive only at a query-head count of 16; performance dropped sharply for larger head counts.
The performance gap is likely due to two factors: (a) the benchmark used the newer FA3 backend for FlashInfer, whereas earlier comparisons used FA2; (b) the Triton kernel was not extensively tuned for the tested configurations.
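The tuning point can be illustrated with a toy sweep: Triton kernels are typically autotuned over tile sizes and warp counts, and a kernel tuned only at one head count can pick poor configurations at others. The harness below is a hypothetical sketch of that idea, not the actual kernel's autotuner:

```python
import itertools
from typing import Callable, Dict, Iterable, Tuple

def sweep_configs(measure: Callable[[Dict[str, int]], float],
                  block_ms: Iterable[int],
                  warp_counts: Iterable[int]) -> Tuple[Dict[str, int], float]:
    # Exhaustively try every (BLOCK_M, num_warps) pair and keep the config
    # with the lowest measured latency -- the same idea behind
    # triton.autotune, minus its caching and pruning.
    best_cfg, best_lat = None, float("inf")
    for block_m, warps in itertools.product(block_ms, warp_counts):
        lat = measure({"BLOCK_M": block_m, "num_warps": warps})
        if lat < best_lat:
            best_cfg, best_lat = {"BLOCK_M": block_m, "num_warps": warps}, lat
    return best_cfg, best_lat
```

Re-running such a sweep per (batch size, head count) pair is what "extensive tuning" amounts to; a fixed config baked in for one shape explains the sharp drop elsewhere.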
Relevant Optimization Commit
The Triton kernel improvement referenced in the benchmark is documented in the following commit:
https://github.com/monellz/vllm/commit/feebaa7c063be6bfb590a876741aeef1c5f58cf8

Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
