SAME: Stabilizing MoE to Reduce Dual Forgetting in Multimodal Continual Instruction Tuning

The paper identifies routing drift and expert drift as the two main causes of forgetting in multimodal continual instruction tuning (MCIT) and proposes SAME, which combines spectral‑aware routing, curvature‑aware scaling, and adaptive expert activation to keep MoE models stable, efficient, and less forgetful across long task sequences.

Machine Learning Algorithms & Natural Language Processing

Jul 1, 2026

Multimodal large language models (MLLMs) acquire strong vision‑language abilities through instruction tuning, but in real deployments they must keep learning new tasks, domains, and answer formats. This setting is known as Multimodal Continual Instruction Tuning (MCIT). Existing MoE‑based MCIT methods assume that task‑specific routing will naturally prevent interference, yet the authors' diagnostic experiments reveal two core drift problems:

Routing drift: after learning subsequent tasks, the router assigns old‑task samples to different experts.

Expert drift: even if the router is restored, the experts themselves have been overwritten by new‑task updates.
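Routing drift as described above can be quantified in a few lines. The following is a minimal sketch, not the paper's diagnostic: it assumes router logits for the same old-task samples are available from two checkpoints (before and after later tasks), and the function names `top1_reassignment_rate`, `expert_usage`, and `js_divergence` are hypothetical helpers introduced here for illustration.

```python
import numpy as np

def top1_reassignment_rate(logits_before, logits_after):
    """Fraction of samples whose top-1 expert differs between two router snapshots."""
    before = np.argmax(logits_before, axis=1)
    after = np.argmax(logits_after, axis=1)
    return float(np.mean(before != after))

def expert_usage(logits, num_experts):
    """Normalized histogram of top-1 expert choices."""
    counts = np.bincount(np.argmax(logits, axis=1), minlength=num_experts)
    return counts / counts.sum()

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (natural log) between two usage distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return float(np.sum(a * np.log(a / b)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Toy router logits for four old-task samples and three experts (made-up numbers).
logits_before = np.array([[2.0, 0.1, 0.0],
                          [1.5, 0.2, 0.1],
                          [0.0, 3.0, 0.2],
                          [0.1, 0.0, 2.5]])
logits_after = np.array([[0.1, 2.0, 0.0],
                         [1.5, 0.2, 0.1],
                         [0.0, 3.0, 0.2],
                         [0.1, 2.0, 0.5]])

rate = top1_reassignment_rate(logits_before, logits_after)
drift = js_divergence(expert_usage(logits_before, 3), expert_usage(logits_after, 3))
print(rate)  # 0.5: samples 0 and 3 moved to a different expert
```

A reassignment rate near zero with falling accuracy would point to expert drift rather than routing drift, which is the distinction the two definitions draw.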

To address both issues, the authors introduce SAME (Stabilized Mixture‑of‑Experts), which stabilizes MoE continual learning from three angles:

Spectral‑aware Routing: each MoE layer maintains a running covariance of router inputs, extracts the dominant eigen‑directions, and constrains router gradient updates to the subspace spanned by these directions. This prevents old samples from being reassigned to unrelated experts.

Curvature‑aware Scaling: using the same historical covariance, the method computes a Riemannian scaling factor for LoRA expert weights, shrinking updates along high‑energy directions (important for past tasks) while allowing larger changes on low‑energy directions.

Adaptive Expert Activation: during training on the current task, experts with low current‑task utilization but high historical importance are temporarily frozen, reducing unnecessary updates, training time, and GPU memory usage.
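The two covariance-based modules can be illustrated with a small NumPy sketch. It is a reading of the description above under stated assumptions, not the authors' implementation: the "covariance" is taken to be an uncentered second moment, the projection keeps only the gradient component inside the span of the top-k eigenvectors (following the wording above literally), and the scaling rule `eps / (eigenvalue + eps)` is a stand-in for the paper's Riemannian factor, whose exact form this summary does not give. All names are hypothetical.

```python
import numpy as np

class RunningCovariance:
    """Running uncentered second moment of router inputs (assumed form)."""

    def __init__(self, dim):
        self.cov = np.zeros((dim, dim))
        self.count = 0

    def update(self, x):
        # x: (batch, dim) router inputs from one step
        n = x.shape[0]
        total = self.count + n
        self.cov = (self.count * self.cov + x.T @ x) / total
        self.count = total

def top_directions(cov, k):
    """Return the k largest eigenvalues and their eigenvectors as columns."""
    eigvals, eigvecs = np.linalg.eigh(cov)  # ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:k]
    return eigvals[order], eigvecs[:, order]

def project_router_grad(grad, basis):
    """Keep only the part of grad (num_experts, dim) inside span(basis)."""
    return grad @ basis @ basis.T

def curvature_scale(grad, cov, eps=1e-3):
    """Shrink grad (rank, dim) along high-energy eigen-directions of cov.

    Assumed rule: per-direction factor eps / (eigenvalue + eps), which is
    close to 1 for low-energy directions and close to 0 for high-energy ones.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    scale = eps / (eigvals + eps)
    return ((grad @ eigvecs) * scale) @ eigvecs.T
```

In this sketch `RunningCovariance.update` would be called on each batch of router inputs while a task is trained, and the two gradient transforms would be applied before the optimizer step on later tasks. Sharing one covariance estimate between both transforms mirrors the "same historical covariance" phrasing above.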

The paper contributes a new long‑sequence benchmark, TriGap, comprising ten heterogeneous tasks (document VQA, chart QA, chemistry VQA, etc.) with varying data scales (10K–40K samples) and distinct instruction formats, to better expose forgetting in MCIT.

Experiments on TriGap, CoIN (eight tasks) and UCIT (six tasks) using LLaVA‑v1.5‑7B with LoRA rank 8 show that SAME consistently outperforms MoE‑LoRA and other strong baselines. Notable results include:

TriGap average accuracy 46.53 % vs. 44.45 % for MoE‑LoRA (+2.08 pp).

CoIN average accuracy 66.82 % vs. 63.95 % for HiDe‑LLaVA (+2.87 pp).

UCIT average accuracy 67.12 % vs. 65.52 % for ModalPrompt (+1.60 pp).

Detailed analyses show that spectral‑aware routing reduces the divergence of expert activation distributions for old tasks, while curvature‑aware scaling mitigates the functional degradation of experts, so accuracy is higher once old‑task samples are routed back to their original experts. Adaptive expert activation saves ~32 min of training time and ~2.3 GiB of GPU memory per task.
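The freezing rule behind adaptive expert activation (low current-task utilization combined with high historical importance) can be sketched as a simple mask. This is an illustration only: the thresholds, the normalization of both scores to [0, 1], and the helper names `select_frozen_experts` and `mask_frozen_gradients` are assumptions, not values or APIs from the paper.

```python
import numpy as np

def select_frozen_experts(current_usage, historical_importance,
                          usage_threshold=0.05, importance_threshold=0.5):
    """Boolean mask of experts to freeze for the current task.

    An expert is frozen when it is rarely used by the current task
    AND was important for earlier tasks (both thresholds are hypothetical).
    """
    usage = np.asarray(current_usage, dtype=float)
    importance = np.asarray(historical_importance, dtype=float)
    return (usage < usage_threshold) & (importance > importance_threshold)

def mask_frozen_gradients(expert_grads, frozen):
    """Zero the gradient rows of frozen experts; expert_grads is (num_experts, ...)."""
    out = np.array(expert_grads, dtype=float, copy=True)
    out[frozen] = 0.0
    return out

# Made-up scores for four experts.
usage = [0.40, 0.02, 0.55, 0.03]
importance = [0.9, 0.8, 0.1, 0.2]
frozen = select_frozen_experts(usage, importance)
grads = mask_frozen_gradients(np.ones((4, 2)), frozen)
print(frozen.tolist())  # [False, True, False, False]
```

Only expert 1 is frozen here: expert 3 is also rarely used, but it carries little historical importance, so it stays trainable. Zeroing gradients alone does not save time or memory; savings of the kind reported above would require actually skipping the backward pass and optimizer state for frozen experts.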

The authors also study format‑induced forgetting: after learning TextVQA, a baseline model outputs lowercase answers for ScienceQA (e.g., "a" instead of "A"), causing large accuracy drops. SAME substantially lowers this mismatch.
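A toy calculation shows why a casing change alone can cost accuracy. The sketch below assumes answers are scored by case-sensitive exact match, which the lowercase example implies but this summary does not state outright; the predictions and gold labels are invented, and the function names are hypothetical.

```python
def strict_accuracy(preds, golds):
    """Case-sensitive exact match."""
    hits = sum(p.strip() == g.strip() for p, g in zip(preds, golds))
    return hits / len(golds)

def relaxed_accuracy(preds, golds):
    """Exact match after case folding, which ignores pure casing mismatches."""
    hits = sum(p.strip().casefold() == g.strip().casefold()
               for p, g in zip(preds, golds))
    return hits / len(golds)

golds = ["A", "B", "C", "A"]
preds = ["a", "B", "c", "D"]  # two casing mismatches, one genuine error

strict = strict_accuracy(preds, golds)
relaxed = relaxed_accuracy(preds, golds)
format_gap = relaxed - strict
print(strict, relaxed, format_gap)  # 0.25 0.75 0.5
```

The gap between the two scores isolates the share of errors caused by answer format rather than by wrong content.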

Qualitative case studies show that SAME preserves both the answer format and semantic correctness of earlier tasks, whereas the baseline drifts (e.g., changing "Lady" to "man" in a GQA example). The overall conclusion is that forgetting in MoE‑based MCIT stems from both routing and expert drift, and that the three SAME modules jointly provide a practical, low‑overhead solution for stable multimodal continual learning.

Mixture of Experts Continual Learning Multimodal Learning Instruction Tuning ICML 2026 SAME
