Can TransMLA Turn GQA into a More Powerful MLA? A Deep Dive into DeepSeek Models

This article presents a theoretical and experimental analysis of converting Group Query Attention (GQA) models to DeepSeek-style Multi-Head Latent Attention (MLA) with the TransMLA method, demonstrating greater expressiveness and better downstream performance while keeping the KV-Cache size unchanged.


Introduction

Mainstream large language models rely on Group Query Attention (GQA) to reduce KV-Cache memory, but this design limits the expressive power of the attention mechanism. The DeepSeek models instead use Multi-Head Latent Attention (MLA), introduced in DeepSeek-V2 and retained in later versions. By analysing this MLA structure, the TransMLA work shows that GQA can be replaced with an MLA design that keeps the same KV-Cache size while providing greater representational capacity.

Relevant resources:

https://huggingface.co/papers/2502.07864
https://github.com/fxmeng/TransMLA

TransMLA Method

Theorem 1

Statement: When the KV‑Cache size is fixed, MLA has strictly greater expressive capacity than GQA.
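
A formal reading of the statement, with notation assumed here rather than taken from the paper: write $\mathcal{F}_{\mathrm{GQA}}(c)$ and $\mathcal{F}_{\mathrm{MLA}}(c)$ for the sets of attention maps expressible with a per-token KV-Cache of width $c$. The theorem then says

$$\mathcal{F}_{\mathrm{GQA}}(c) \subsetneq \mathcal{F}_{\mathrm{MLA}}(c) \quad \text{for every fixed } c.$$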

Proof Sketch

1. Any GQA layer can be rewritten by replicating its key-value projection matrices across all query heads in each group, so that every query head has its own copy of the keys and values.

2. The replication can be moved from the runtime computation into the weights themselves, yielding a mathematically equivalent multi-head attention (MHA) formulation (see the numerical check after this list).

3. Applying a low-rank singular value decomposition (SVD) to the replicated matrices factors them exactly, since their rank is at most the original KV-Cache width, into an MLA form with the same cache size; MLA therefore has at least the degrees of freedom of GQA while allowing richer representations.

4. There exist configurations (e.g., orthogonal channel interactions across heads) that MLA can express but GQA cannot, because GQA forces identical keys and values within each query group.

Together, these steps show that MLA strictly dominates GQA under an equal KV-Cache budget.
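
To make steps 1 and 2 concrete, here is a minimal numerical check in PyTorch (toy dimensions chosen for illustration, not Qwen's actual sizes) that a GQA layer and an MHA layer whose K/V weights are replicated per query group compute identical outputs:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n_q_heads, n_kv_heads, head_dim, d_model, seq_len = 8, 2, 16, 128, 10
group = n_q_heads // n_kv_heads  # query heads per KV head

W_q = torch.randn(d_model, n_q_heads * head_dim)
W_k = torch.randn(d_model, n_kv_heads * head_dim)
W_v = torch.randn(d_model, n_kv_heads * head_dim)
x = torch.randn(seq_len, d_model)

def attn(q, k, v):
    # q, k, v: (heads, seq, head_dim)
    scores = q @ k.transpose(-1, -2) / head_dim ** 0.5
    return F.softmax(scores, dim=-1) @ v

q = (x @ W_q).view(seq_len, n_q_heads, head_dim).transpose(0, 1)

# GQA: project to n_kv_heads, then share each KV head across its group at runtime.
k = (x @ W_k).view(seq_len, n_kv_heads, head_dim).transpose(0, 1)
v = (x @ W_v).view(seq_len, n_kv_heads, head_dim).transpose(0, 1)
out_gqa = attn(q, k.repeat_interleave(group, dim=0), v.repeat_interleave(group, dim=0))

# Equivalent MHA: move the replication into the projection weights instead.
def replicate(W):
    return (W.view(d_model, n_kv_heads, head_dim)
             .repeat_interleave(group, dim=1)
             .reshape(d_model, n_q_heads * head_dim))

k_rep = (x @ replicate(W_k)).view(seq_len, n_q_heads, head_dim).transpose(0, 1)
v_rep = (x @ replicate(W_v)).view(seq_len, n_q_heads, head_dim).transpose(0, 1)
out_mha = attn(q, k_rep, v_rep)

print(torch.allclose(out_gqa, out_mha, atol=1e-5))  # True: GQA == replicated-weight MHA
```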

TransMLA Construction

The conversion from a GQA checkpoint to an MLA model proceeds as follows (a weight-level sketch follows the list):

1. Duplicate each key-value projection matrix s times, where s is the ratio of query heads to key/value heads, and concatenate the copies so that the key-value width matches the number of query heads.

2. Perform an orthogonal decomposition of the concatenated matrix, separating it into a low-rank component, whose per-token output is what gets cached, and a residual. This preserves the original KV-Cache dimensions.

3. Introduce a small set of additional parameters, approximately one-eighth of the original matrix size, to implement the orthogonal transformation without significantly increasing the total model size.
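
The following is a weight-level sketch of the factorization, using a plain truncated SVD in place of the paper's orthogonal decomposition (same toy dimensions as above; an illustration of the idea, not the released TransMLA code). Because the replicated matrix has rank at most the original KV width, the truncation is exact and the cached latent keeps the GQA cache size:

```python
import torch

torch.manual_seed(0)
d_model, n_q_heads, n_kv_heads, head_dim = 128, 8, 2, 16
group = n_q_heads // n_kv_heads
r = n_kv_heads * head_dim  # rank budget = original per-token K width (32 here)

W_k = torch.randn(d_model, n_kv_heads * head_dim)
W_k_rep = (W_k.view(d_model, n_kv_heads, head_dim)
               .repeat_interleave(group, dim=1)
               .reshape(d_model, n_q_heads * head_dim))

# Truncated SVD. W_k_rep has rank <= r by construction, so truncation is exact.
U, S, Vh = torch.linalg.svd(W_k_rep, full_matrices=False)
W_down = U[:, :r] * S[:r]  # d_model -> r: produces the latent that is cached
W_up = Vh[:r, :]           # r -> n_q_heads*head_dim: expanded per query head

assert torch.allclose(W_down @ W_up, W_k_rep, atol=1e-4)

# Inference view: cache the r-dimensional latent (same width as GQA's K cache);
# fine-tuning W_up afterwards lets the per-head keys diverge, which GQA cannot do.
x = torch.randn(10, d_model)
latent = x @ W_down   # stored in the KV-Cache
k_full = latent @ W_up
```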

Experimental Evaluation

Base models: Qwen‑2.5‑7B (28 query heads, 4 key/value heads, head dimension 128, per-token KV‑Cache dimension 1024) and Qwen‑2.5‑14B (40 query heads, 8 key/value heads, per-token KV‑Cache dimension 2048). After applying TransMLA, the output dimension of each head is unified to 512 while the KV‑Cache size remains unchanged.
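
As a quick sanity check, the quoted per-token cache widths follow from 2 (keys plus values) times the number of KV heads times the head dimension:

```python
# Per-token KV-Cache width = 2 (K and V) * n_kv_heads * head_dim.
def kv_cache_dim(n_kv_heads: int, head_dim: int) -> int:
    return 2 * n_kv_heads * head_dim

assert kv_cache_dim(4, 128) == 1024  # Qwen2.5-7B figures quoted above
assert kv_cache_dim(8, 128) == 2048  # Qwen2.5-14B (head dimension is also 128)
```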

Fine‑tuning setup:

Dataset: SmolTalk instruction fine-tuning set (includes math and code tasks).
Framework: torchtune.
Batch size: 16.
Learning rate: 2e-5.
Epochs: 2.
Training scope: only the KV layers are updated. For the original GQA model, the key and value matrices are fine-tuned; for the TransMLA model, the newly added orthogonal matrices plus the original KV matrices are fine-tuned (see the sketch below).
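
A minimal PyTorch sketch of this training scope, assuming Hugging Face-style parameter names (k_proj, v_proj) and a hypothetical name substring for TransMLA's added orthogonal matrices; the actual module names depend on the released implementation:

```python
# Freeze everything except the KV path. The name substrings below are
# assumptions: "k_proj"/"v_proj" follow common Hugging Face conventions,
# and "orthogonal" is a hypothetical name for TransMLA's added matrices.
TRAINABLE_KEYS = ("k_proj", "v_proj", "orthogonal")  # GQA baseline: drop "orthogonal"

def freeze_all_but_kv(model, keys=TRAINABLE_KEYS):
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in keys)

# Usage, e.g. with a transformers checkpoint:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B")
# freeze_all_but_kv(model)
```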

Results:

TransMLA achieves lower training loss than the GQA baseline, indicating a better fit to the training data.

On downstream benchmarks such as GSM8K, the MLA‑converted models obtain higher accuracy (e.g., 82.11% vs. 81.96% for the 7B model).

An ablation in which the orthogonal decomposition is replaced by an identity initialization yields only a marginal gain (0.15%), confirming that the orthogonal transformation, rather than the extra parameters alone, is the primary source of the improvement.

Parameter impact: the additional matrices increase the total parameter count by roughly one-eighth of the original KV matrices, raising the 7.6 B model to about 7.7 B parameters, a negligible overhead.

[Figure: DeepSeek's lower cost compared with OpenAI]
[Figure: Group Query Attention (GQA)]
[Figure: Multi-Head Attention (MHA)]
[Figure: Training loss and test accuracy]

Conclusion

The analysis demonstrates that any GQA-based large language model can be converted into an MLA-based model without increasing KV-Cache memory. The TransMLA conversion adds only a modest number of parameters while delivering measurable gains in training stability and downstream task accuracy. Current limitations include the absence of query compression, the lack of decoupled RoPE, and the use of independent latent vectors for key and value rather than a single shared latent as in DeepSeek's MLA. Future work will explore these extensions and release the training code and converted models.

Tags: large language models, Attention, DeepSeek, model conversion, MLA, GQA, TransMLA
Written by Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.