Network Intelligence Research Center (NIRC)
Mar 26, 2025 · Artificial Intelligence
Enable Traditional LLMs to Use DeepSeek’s Multi‑Head Latent Attention Without Retraining
The paper introduces MHA2MLA, a data‑efficient fine‑tuning framework that converts pre‑trained multi‑head attention (MHA) LLMs to DeepSeek’s Multi‑Head Latent Attention (MLA) architecture, achieving up to 92% KV‑cache compression with less than 0.5% performance loss on long‑context tasks.
Inference efficiency · LLM · Low-Rank Approximation
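As a rough illustration of the MLA idea the summary refers to, the sketch below caches one shared low‑rank latent vector per token instead of full per‑head keys and values, then up‑projects it at attention time. This is a minimal conceptual sketch, not the paper’s exact method: all module names and dimensions (`d_model`, `d_latent`, etc.) are illustrative assumptions, and details such as RoPE handling are omitted.

```python
# Minimal sketch of latent (low-rank) KV caching, assuming illustrative sizes.
# Causal masking is omitted for brevity.
import torch
import torch.nn as nn

class LatentKVAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_latent=64):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-projection: one small latent per token replaces full K/V.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-projections recover per-head keys/values from the latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, latent_cache=None):
        B, T, _ = x.shape
        c = self.kv_down(x)  # (B, T, d_latent) -- this is all we cache
        if latent_cache is not None:
            c = torch.cat([latent_cache, c], dim=1)
        S = c.shape[1]
        q = self.q_proj(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(c).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(c).view(B, S, self.n_heads, self.d_head).transpose(1, 2)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.d_head**0.5, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(B, T, -1)
        return self.out(y), c  # return the latent cache instead of (K, V)
```

With these illustrative sizes, the cache holds 64 floats per token rather than the 2 × 512 a standard MHA K/V cache would store, roughly a 94% reduction; the paper’s reported figure of up to 92% reflects the same principle, though its actual construction differs.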
