How MoSLoRA Reinvents Low‑Rank Adaptation with Mixer Matrices
This article analyzes the Mixture‑of‑Subspaces in Low‑Rank Adaptation (MoSLoRA) paper, explaining its motivation, the design choices that replace an MoE‑style gate inside LoRA with a mixer matrix, connections to multi‑head attention, experimental findings on LLaMA‑3 fine‑tuning, and theoretical proofs of its re‑parameterization properties.
Background and Motivation
The author introduces the paper "Mixture‑of‑Subspaces in Low‑Rank Adaptation" (MoSLoRA) and provides the arXiv link (https://arxiv.org/pdf/2406.11909) and the GitHub repository (https://github.com/wutaiqiang/MoSLoRA). Traditional LoRA is extended by inserting a Mixer matrix to combine information from different subspaces.
Initial Idea
Earlier works attempted to embed LoRA inside Mixture‑of‑Experts (MoE) by treating LoRA as an expert, which suffered from lack of motivation, reduced mergeability, and slower training. The author wondered what would happen if MoE were placed inside LoRA, i.e., using a gate + multiple experts for LoRA's lora_A and lora_B.
The most straightforward design is to embed a MoE gate inside LoRA, but this introduces a gate that couples with the input x, preventing the LoRA weights from being merged back into the original model and causing inference latency.
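To make the mergeability problem concrete, here is a minimal NumPy sketch of a gated design. The sizes, the softmax gate, and the names Wg, gated_forward, and effective_weight are illustrative assumptions, not taken from the paper. It shows that the weight which reproduces the gated output changes from one input to the next, so no single merged matrix exists.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, k = 8, 6, 4, 3            # hypothetical sizes; k experts of rank r

W0 = rng.normal(size=(d_in, d_out))       # frozen base weight
A = rng.normal(size=(k, d_in, r))         # lora_A of each expert
B = rng.normal(size=(k, r, d_out))        # lora_B of each expert
Wg = rng.normal(size=(d_in, k))           # gate projection (assumed softmax gate)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def gated_forward(x):
    """Base output plus a gate-weighted sum of expert outputs."""
    g = softmax(x @ Wg)                                  # (n, k), depends on x
    experts = np.einsum("nd,kdr,kro->nko", x, A, B)      # (n, k, d_out)
    return x @ W0 + np.einsum("nk,nko->no", g, experts)

def effective_weight(x_row):
    """The single weight matrix that reproduces gated_forward for one input row."""
    g = softmax(x_row @ Wg)                              # (k,)
    return W0 + np.einsum("k,kdr,kro->do", g, A, B)

x1, x2 = rng.normal(size=(2, d_in))
W_eff_1 = effective_weight(x1)
W_eff_2 = effective_weight(x2)
same_weight = np.allclose(W_eff_1, W_eff_2)   # False: the "merged" weight is per-input
```

Because the effective weight has to be recomputed for every input, the gate and the experts must stay in the forward path at inference time, which is where the extra latency comes from.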
Removing the Gate
To retain mergeability, the gate is removed and all experts are used simultaneously, effectively splitting the LoRA rank r into r/k for each expert (e.g., r/3 in the illustrated case). This yields the two‑subspace‑mixing method described in the paper.
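A small NumPy check illustrates why dropping the gate restores mergeability. The sizes below are made up for illustration. Summing k gate-free experts of rank r/k gives the same result as one rank-r LoRA update whose lora_A and lora_B are the concatenated experts, so the update folds into the base weight.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, k = 8, 6, 6, 3      # hypothetical sizes; each expert has rank r // k
sub = r // k

W0 = rng.normal(size=(d_in, d_out)) # frozen base weight
A = rng.normal(size=(d_in, r))      # all experts' lora_A, concatenated column-wise
B = rng.normal(size=(r, d_out))     # all experts' lora_B, concatenated row-wise
x = rng.normal(size=(5, d_in))

# Expert i owns columns [i*sub, (i+1)*sub) of A and the matching rows of B.
expert_sum = sum(
    x @ A[:, i * sub:(i + 1) * sub] @ B[i * sub:(i + 1) * sub, :]
    for i in range(k)
)

lora_out = x @ A @ B                # one ordinary rank-r LoRA update
W_merged = W0 + A @ B               # input-independent, so it can be merged
```

Here expert_sum and lora_out agree up to floating-point error, and x @ W_merged reproduces the adapted output with no extra layers.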
Multi‑Head Attention Perspective
When each expert has size r/k, the structure resembles multi‑head attention: dimension splitting, parallel processing, and final merging. The author examines two ways to split: (i) splitting the input dimension d and (ii) splitting the rank r. Visualizations show that splitting r into two sub‑blocks is more elegant.
By further combining the parallel branches ("twist‑the‑braid" operation), the author derives a formulation where (A1+A2)(B1+B2) = A1B1 + A2B2 + A1B2 + A2B1, adding two cross terms.
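The expansion can be checked numerically. The sketch below uses arbitrary sizes and assumes the convention that the update is A times B, with A of shape (d_in, r) and B of shape (r, d_out). It also shows that the same result is obtained by placing a fixed r×r block matrix of identities between A and B, which is one way to read the cross terms as a mixer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r = 8, 6, 4            # hypothetical sizes; r is split into two halves
h = r // 2

A = rng.normal(size=(d_in, r))
B = rng.normal(size=(r, d_out))
A1, A2 = A[:, :h], A[:, h:]         # split along the rank dimension
B1, B2 = B[:h, :], B[h:, :]

mixed = (A1 + A2) @ (B1 + B2)
expanded = A1 @ B1 + A2 @ B2 + A1 @ B2 + A2 @ B1   # two cross terms added

# Same update written as A @ mixer @ B with a fixed mixer of identity blocks.
I = np.eye(h)
mixer = np.block([[I, I], [I, I]])
via_mixer = A @ mixer @ B
```

With the mixer set to the r×r identity instead, A @ mixer @ B reduces to A1 @ B1 + A2 @ B2, which is the plain LoRA update without cross terms.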
Experimental Results
Experiments fine‑tune LLaMA‑3 on commonsense reasoning tasks, showing modest improvements. However, the naive implementation is computationally inefficient because each expert is run through its own forward pass, one after another. The author suggests a multi‑head‑attention‑style implementation instead: concatenate the linear layers, run a single forward pass, and then split the output vector.
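The difference between the two implementations can be sketched in NumPy as follows. The function names sequential and fused and all sizes are hypothetical, and this is only an illustration of the idea, not the repository's code. The fused version multiplies by the concatenated lora_A once, splits the result, and mixes the pieces before a single multiplication by the summed lora_B.

```python
import numpy as np

def sequential(x, As, Bs):
    """Naive version: one small forward pass per (A_i, B_j) pair."""
    out = 0
    for Ai in As:
        for Bj in Bs:
            out = out + x @ Ai @ Bj
    return out

def fused(x, As, Bs):
    """Attention-style version: concatenate, forward once, then split."""
    A_cat = np.concatenate(As, axis=1)        # (d_in, r)
    hidden = x @ A_cat                        # single forward through lora_A
    parts = np.split(hidden, len(As), axis=1) # one (n, r/k) chunk per expert
    mixed = sum(parts)                        # combine the subspaces
    return mixed @ sum(Bs)                    # single forward through lora_B

rng = np.random.default_rng(0)
d_in, d_out, sub, k = 8, 6, 2, 2              # hypothetical sizes; k experts of rank sub
As = [rng.normal(size=(d_in, sub)) for _ in range(k)]
Bs = [rng.normal(size=(sub, d_out)) for _ in range(k)]
x = rng.normal(size=(5, d_in))

y_seq = sequential(x, As, Bs)
y_fused = fused(x, As, Bs)
```

Both functions compute the sum of x @ A_i @ B_j over all pairs, but the fused version replaces k×k small matrix products with two larger ones.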
Proof of Concept
The author provides a supplemental PDF proof (https://github.com/wutaiqiang/MoSLoRA/blob/main/MoSLoRA_proof.pdf) and additional diagrams. The proof addresses whether the Mixer could simply be absorbed into lora_A: training with an intermediate matrix W is equivalent to training vanilla LoRA only when W is a fixed orthogonal matrix; otherwise, for example when W is learnable, the optimization trajectories differ.
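The role of orthogonality can be seen from a single update step. The derivation below is a sketch under stated assumptions (plain gradient descent with learning rate \eta, W held fixed, update written as A W B); the linked PDF may set up the argument differently.

```latex
% Setup: the adapter update is A W B with W fixed; define A' = A W.
% Chain rule for the loss L:
\frac{\partial L}{\partial A} = \frac{\partial L}{\partial A'} \, W^{\top}

% One gradient step on A, then re-expressed in terms of A' = A W:
A_{\text{new}} W
  = \Bigl( A - \eta \, \frac{\partial L}{\partial A'} \, W^{\top} \Bigr) W
  = A' - \eta \, \frac{\partial L}{\partial A'} \, W^{\top} W

% Vanilla LoRA trained directly on A' would instead take the step:
A'_{\text{new}} = A' - \eta \, \frac{\partial L}{\partial A'}

% The two steps coincide for every gradient if and only if
W^{\top} W = I
```

So with a fixed orthogonal W the two parameterizations follow the same trajectory, while a non-orthogonal or learnable W changes the effective update.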
Introducing the Mixer Matrix
Replacing the fixed identity Mixer in vanilla LoRA with a learnable matrix (the "Mixer" matrix) yields the MoSLoRA method. The original LoRA uses an identity Mixer; the two‑subspace‑mixing approach inserts a fixed butterfly factor; MoSLoRA makes the entire Mixer learnable.
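As a rough illustration of the resulting layer, here is a forward-only NumPy sketch. The class name MoSLoRALinear, the scale argument, and the merged_weight helper are hypothetical, and the identity initialization of the mixer is chosen here only so the layer starts out as vanilla LoRA; the paper's actual initialization may differ.

```python
import numpy as np

class MoSLoRALinear:
    """Toy layer computing y = x W0 + scale * x A M B (illustrative names)."""

    def __init__(self, W0, r, scale=1.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W0.shape
        self.W0 = W0                                     # frozen base weight
        self.A = rng.normal(scale=0.02, size=(d_in, r))  # lora_A
        self.mixer = np.eye(r)                           # r x r Mixer (assumed identity init)
        self.B = np.zeros((r, d_out))                    # lora_B, zero so the update starts at 0
        self.scale = scale

    def forward(self, x):
        return x @ self.W0 + self.scale * (x @ self.A @ self.mixer @ self.B)

    def merged_weight(self):
        # The Mixer does not depend on x, so the whole update folds into W0.
        return self.W0 + self.scale * (self.A @ self.mixer @ self.B)

rng = np.random.default_rng(1)
W0 = rng.normal(size=(8, 6))
layer = MoSLoRALinear(W0, r=4)
layer.B = rng.normal(size=(4, 6))       # stand-in for a trained lora_B
layer.mixer = rng.normal(size=(4, 4))   # stand-in for a trained Mixer
x = rng.normal(size=(5, 8))

y_adapter = layer.forward(x)            # adapter kept as separate matrices
y_merged = x @ layer.merged_weight()    # adapter merged into one weight
```

The only extra parameters compared with LoRA are the r×r entries of the mixer, and after merging the layer is a single matrix multiplication again.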
Note 1: This form resembles AdaLoRA, which uses an SVD‑based factor with orthogonal constraints.
Note 2: A concurrent work, FLoRA, approaches the problem via Tucker decomposition (see the FLoRA paper for details).
Back to the MoE Viewpoint
From the MoE perspective, the Mixer can be seen as the weight a gate would generate, with properties that contrast with conventional LoRA+MoE designs (input‑dependent gates, sparse expert selection, and non‑mergeable weights):
The weight does not depend on the input x, so the adapter can still be merged into the original model.
The weight is dense: every expert contributes, rather than only a sparse subset.
Vanilla LoRA is the special case in which the Mixer is fixed to the identity matrix.
The author concludes that the presented reasoning clarifies the entire thought process behind MoSLoRA, highlighting how a seemingly complex design simplifies to a learnable Mixer that preserves mergeability and zero‑latency inference.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
