Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining

The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention LLMs to DeepSeek’s Multi‑Head Latent Attention architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.


Background and Motivation

Large language models (LLMs) with standard multi‑head attention (MHA) incur high training and inference costs, especially because the key‑value (KV) cache grows linearly with sequence length and becomes a bottleneck during generation. DeepSeek's Multi‑Head Latent Attention (MLA) compresses the KV cache into low‑rank latent vectors, but existing MHA‑based models have no direct migration path to it short of retraining.

Research Question

How can existing MHA‑based LLMs be adapted to the MLA architecture quickly and economically, preserving performance while reducing inference cost?

Related Work

Grouped‑Query Attention (GQA) and Multi‑Query Attention (MQA) share KV across heads to shrink the cache but degrade accuracy. Linear Transformers, RWKV, and Mamba replace softmax attention with linear or state‑space models, yet they underperform on autoregressive generation. Prior studies on Partial‑RoPE show that not all RoPE dimensions equally affect attention scores, suggesting that some can be removed without harming performance.

Core Idea: MHA2MLA Framework

The proposed MHA2MLA framework performs data‑efficient fine‑tuning to align a pre‑trained MHA model with the MLA architecture. It consists of two main components:

Partial‑RoPE: Identify low‑impact RoPE dimensions and convert them to NoPE (no position embedding) dimensions, enabling alignment with MLA.

Low‑Rank Approximation: Apply joint singular value decomposition (SVD) on the NoPE‑filtered KV matrices to obtain a compact low‑rank representation, further shrinking the cache.

Technical Details

Partial‑RoPE dimension selection: Four strategies are evaluated—high‑frequency preservation, low‑frequency preservation, uniform sampling, and head‑wise 2‑norm contribution—to keep the top‑k most influential RoPE dimensions.
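
To make the head‑wise 2‑norm strategy concrete, here is a minimal PyTorch sketch. The function name, tensor shapes, and the token‑averaged scoring rule are assumptions for illustration rather than the paper's exact procedure: each 2D RoPE rotation pair is scored by the product of its query and key norms, and only the top‑k pairs per head keep their positional encoding.

```python
import torch

def select_rope_dims(q: torch.Tensor, k: torch.Tensor, top_k: int) -> torch.Tensor:
    """q, k: sampled query/key activations, shape (tokens, heads, head_dim).
    RoPE rotates consecutive (even, odd) dimension pairs, so pairs are scored jointly."""
    q_pairs = q.reshape(*q.shape[:-1], -1, 2)  # (tokens, heads, pairs, 2)
    k_pairs = k.reshape(*k.shape[:-1], -1, 2)
    # Head-wise contribution of each frequency pair: product of the query and key
    # 2-norms, averaged over the sampled tokens.
    score = (q_pairs.norm(dim=-1) * k_pairs.norm(dim=-1)).mean(dim=0)  # (heads, pairs)
    # Indices of the top-k most influential pairs per head; the rest become NoPE dims.
    return score.topk(top_k, dim=-1).indices  # (heads, top_k)
```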

RoPE → NoPE conversion: Unselected dimensions are stripped of positional encoding, becoming neutral dimensions compatible with MLA.
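
A sketch of how the conversion might look at runtime, assuming the interleaved‑pair RoPE convention and the keep_idx produced by the selection sketch above; both the convention and the helper names are assumptions. Pairs listed in keep_idx are rotated as usual, while every other pair passes through unrotated, i.e. as a NoPE dimension.

```python
import torch

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor,
                       keep_idx: torch.Tensor) -> torch.Tensor:
    """x: query or key pairs, shape (tokens, heads, pairs, 2);
    cos, sin: rotary tables, shape (tokens, pairs);
    keep_idx: pairs that keep RoPE, shape (heads, top_k)."""
    x1, x2 = x[..., 0], x[..., 1]
    cos, sin = cos.unsqueeze(1), sin.unsqueeze(1)  # broadcast over the head axis
    rotated = torch.stack((x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos), dim=-1)  # standard RoPE rotation
    # Mark which pairs retain their rotary encoding; all others stay unrotated (NoPE).
    keep = torch.zeros(x.shape[1], x.shape[2], dtype=torch.bool, device=x.device)
    keep[torch.arange(x.shape[1], device=x.device).unsqueeze(1), keep_idx] = True
    return torch.where(keep.view(1, x.shape[1], x.shape[2], 1), rotated, x)
```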

Low‑Rank Approximation: Joint SVD is performed on the concatenated Key and Value matrices after RoPE removal. The top‑k singular values and vectors are retained to reconstruct low‑rank KV matrices.
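
A hedged sketch of the factorization step, here applied to the key/value projection weight matrices; whether the weights or sampled activations are factorized, and the exact shapes and names, are assumptions of this sketch rather than the paper's specification.

```python
import torch

def joint_svd_kv(w_k_nope: torch.Tensor, w_v: torch.Tensor, rank: int):
    """w_k_nope: key projection restricted to NoPE dimensions, shape (hidden, d_k_nope);
    w_v: value projection, shape (hidden, d_v). Shapes and names are illustrative."""
    w_kv = torch.cat([w_k_nope, w_v], dim=-1)        # factorize keys and values jointly
    u, s, vh = torch.linalg.svd(w_kv, full_matrices=False)
    # Down-projection into the shared latent space; its per-token output is what gets cached.
    w_down = u[:, :rank] * s[:rank]                  # (hidden, rank)
    # Up-projections that reconstruct keys and values from the cached latent vector.
    w_up_k, w_up_v = vh[:rank].split([w_k_nope.shape[-1], w_v.shape[-1]], dim=-1)
    return w_down, w_up_k, w_up_v
```

Because the latent down‑ and up‑projections are initialized from the pre‑trained weights rather than at random, only a light fine‑tuning pass is needed afterwards, which is where the reported data efficiency comes from.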

Experiments

Experiments were conducted on four LLM sizes (135M, 360M, 1.7B, and Llama‑2‑7B), pre‑trained with either MHA or GQA and fine‑tuned on the SmolLM pre‑training corpus. Evaluation covered commonsense‑reasoning benchmarks and long‑context generation tasks. Baselines comprised the original LLMs and KV‑cache‑quantized variants, and ablation studies examined the impact of different Partial‑RoPE selection strategies and SVD configurations.

Results

Economic breakthrough: MHA2MLA achieves 92.19% KV‑cache compression on Llama‑2‑7B with only 0.3%–0.6% of the original pre‑training data, while long‑context performance drops ≤0.5%.
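
To put the headline number in perspective, a back‑of‑envelope calculation using the standard Llama‑2‑7B attention dimensions and fp16 cache entries (illustrative only, not figures from the paper):

```python
# Per-token KV-cache footprint of Llama-2-7B: 32 layers, 32 KV heads, head_dim 128,
# keys + values, 2 bytes per element in fp16.
layers, kv_heads, head_dim, bytes_per_elem = 32, 32, 128, 2
per_token_kv = 2 * layers * kv_heads * head_dim * bytes_per_elem

print(per_token_kv / 1024)                 # 512.0 -> ~512 KiB of cache per token
print(per_token_kv * 4096 / 2**30)         # 2.0   -> ~2 GiB for a 4K-token context
print(per_token_kv * (1 - 0.9219) / 1024)  # ~40   -> ~40 KiB per token after 92.19% compression
```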

Synergistic design: The combination of contribution‑aware Partial‑RoPE and joint SVD preserves critical positional information and reduces knowledge loss.

Scalability: Consistent compression‑to‑accuracy trade‑offs were observed across model sizes, suggesting applicability to larger models (e.g., 70B).

Contributions and Limitations

The work contributes a practical migration pathway for existing MHA LLMs to MLA without full retraining, demonstrates compatibility with KV‑cache quantization, and provides extensive ablation insights. Limitations include validation only up to 7B parameters and limited exploration of parameter‑efficient fine‑tuning (e.g., freezing feed‑forward networks).

Reference

Ji, T., Gui, T., et al. (2025). “MHA2MLA: Towards Economical Inference – Enabling DeepSeek’s Multi‑Head Latent Attention in Any Transformer‑based LLM.” arXiv preprint arXiv:2502.14837.


Tags: LLM, model compression, multi-head attention, Multi-Head Latent Attention, inference efficiency, Low-Rank Approximation, Partial-RoPE
Written by

Network Intelligence Research Center (NIRC)

NIRC is based at the National Key Laboratory of Network and Switching Technology at Beijing University of Posts and Telecommunications. It has built a technology portfolio spanning four AI domains (intelligent cloud networking, natural language processing, computer vision, and machine learning systems) and is dedicated to solving real‑world problems, building top‑tier systems, publishing high‑impact papers, and contributing to the rapid advancement of China's network technology.
