Model Ability Gets Squeezed Out in Multi‑Task Learning—How ESM Preserves It (CVPR 2026)

The paper reveals that multi‑task models suffer performance drops because tasks compete for the same internal subspace, and introduces Essential Subspace Merging (ESM), which separates each task’s critical directions and uses Polarized Scaling to keep multiple abilities stable, achieving significantly lower degradation than traditional baselines.


Problem: Subspace Interference in Multi‑Task Models

When a model that originally excels at a single task is extended with additional tasks, its performance becomes unstable: some abilities decline and results fluctuate. The root cause is that all tasks share the same internal representation space, so they compete for the most important subspace locations, leading to "capacity squeezing".
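
To see the squeezing effect concretely, here is a toy numpy illustration of our own (not from the paper): when two task updates write into the same top direction, naive averaging largely cancels them, while disjoint directions survive intact.

```python
# Toy demo of "capacity squeezing" under naive averaging. All matrices are
# synthetic; the point is the geometry, not the numbers.
import numpy as np

rng = np.random.default_rng(0)
d = 64
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # shared orthonormal basis

# Overlapping case: both tasks write into the SAME direction U[:, 0].
A = 10.0 * np.outer(U[:, 0], U[:, 0])
B = -8.0 * np.outer(U[:, 0], U[:, 0])

# Disjoint case: each task owns its own direction.
A2 = 10.0 * np.outer(U[:, 0], U[:, 0])
B2 = -8.0 * np.outer(U[:, 1], U[:, 1])

print(np.linalg.norm(0.5 * (A + B)))    # ~1.0: the signals mostly cancel
print(np.linalg.norm(0.5 * (A2 + B2)))  # ~6.4: both signals survive
```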

Essential Subspace Merging (ESM) Idea

Geng Xin’s team reframes model merging as a problem of locating and protecting the essential subspace where the most valuable information resides. Instead of complex parameter‑level fusion, they propose to (1) separate the important directions of each task to avoid overlap, and (2) retain the crucial information while suppressing less important components, enabling stable coexistence of multiple tasks within a single model.
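
A rough sketch of this two-step recipe (the function names, the SVD-based direction extraction, and the QR-based de-correlation are our illustrative assumptions; the paper’s actual ESD and merging procedure may differ):

```python
# Hypothetical sketch of the ESM idea: (1) separate each task's important
# directions so they stop overlapping, (2) keep only those essential
# directions when rebuilding the merged update.
import numpy as np

def essential_directions(task_update: np.ndarray, k: int) -> np.ndarray:
    """Top-k left singular directions of one task's weight update."""
    U, _, _ = np.linalg.svd(task_update, full_matrices=False)
    return U[:, :k]

def merge_tasks(task_updates: list[np.ndarray], k: int) -> np.ndarray:
    # Step 1: stack per-task essential bases and de-correlate them via QR.
    bases = [essential_directions(W, k) for W in task_updates]
    Q, _ = np.linalg.qr(np.concatenate(bases, axis=1))
    # Step 2: project each update onto the protected subspace, suppressing
    # everything outside it, then sum.
    return sum(Q @ (Q.T @ W) for W in task_updates)

rng = np.random.default_rng(1)
updates = [rng.standard_normal((64, 64)) for _ in range(3)]
print(merge_tasks(updates, k=8).shape)  # (64, 64)
```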

Experimental Setup

The authors evaluate three core variables:

Subspace construction: compare Singular Value Decomposition (SVD) on the parameter space with Essential Subspace Decomposition (ESD) on the output space.

Fusion method: direct concatenation versus orthogonalization to reduce inter‑task correlation.

Weight allocation: unweighted merging versus norm‑based scaling, where scaling ∝ (norm / mean)² is applied at the task, dimension, and hierarchy levels (see the sketch after this list).
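
A minimal sketch of the norm‑based scaling rule from the last item, shown at the task level only with synthetic updates (the dimension- and hierarchy-level applications are omitted):

```python
# Norm-based scaling: each task's merge weight is proportional to
# (norm / mean norm)^2, so stronger updates are amplified quadratically.
import numpy as np

def norm_based_weights(task_updates: list[np.ndarray]) -> np.ndarray:
    norms = np.array([np.linalg.norm(W) for W in task_updates])
    return (norms / norms.mean()) ** 2

rng = np.random.default_rng(1)
updates = [s * rng.standard_normal((8, 8)) for s in (0.5, 1.0, 2.0)]
w = norm_based_weights(updates)
merged = sum(wi * Wi for wi, Wi in zip(w, updates))
print(w.round(2))  # weaker updates are damped, stronger ones boosted
```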

Tasks span highly heterogeneous domains (Cars, SUN397, SST2, MNIST) to amplify interference. Each task receives an equal rank allocation k = total dimension / number of tasks, ensuring fair representation capacity. Proxy data are deliberately limited to 32 unlabeled samples per task to test whether the subspace originates from the model itself rather than data statistics.
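
This setup can be sketched as follows, using an SVD of layer outputs as an assumed stand‑in for the paper’s ESD. Note that 32 proxy samples cap the output‑space rank at 32; in this toy configuration that happens to match the per‑task allocation k:

```python
# Sketch: equal rank per task (k = total dim / #tasks) and an output-space
# basis estimated from only 32 unlabeled proxy samples. The real ESD may
# differ; this is our illustrative stand-in.
import numpy as np

def output_space_basis(layer_fn, proxy_inputs: np.ndarray, k: int) -> np.ndarray:
    Y = layer_fn(proxy_inputs)                 # (n_proxy, d_out) activations
    U, _, _ = np.linalg.svd(Y.T, full_matrices=False)
    return U[:, :k]

d, num_tasks = 128, 4
k = d // num_tasks                             # equal rank allocation: 32
rng = np.random.default_rng(2)
W = rng.standard_normal((d, d)) / np.sqrt(d)   # stand-in for one layer
proxy = rng.standard_normal((32, d))           # 32 unlabeled proxy samples
print(output_space_basis(lambda x: x @ W, proxy, k).shape)  # (128, 32)
```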

Results

Across increasing numbers of tasks, traditional baselines lose 8 %–9 % of performance, whereas ESM’s loss is markedly smaller, reducing overall degradation by roughly 20 %. On larger models that already exceed 90 % accuracy, ESM’s advantage narrows to 0.3 %–0.5 % but remains consistent.

Upper‑bound comparison: the pretrained model without fine‑tuning scores 50 %–65 %; a single‑task expert exceeds 90 %; ESM reaches 81 %–91 %, approaching the ideal of preserving single‑task performance after merging.

Ablation Studies

Replacing SVD with ESD improves performance from 89.0 % to 90.9 % (+1.9 %). Adding Polarized Scaling further raises it to 91.8 % (+0.9 %). The ESD component mainly mitigates information loss, while Polarized Scaling addresses competition between strong and weak signals.

Internal Mechanism Analysis

ESD retains more effective information even when only a small fraction of components is kept; with just 5 % of components, the fused model shows higher feature consistency with the expert model than SVD‑based fusion. This suggests that critical task knowledge concentrates in a few high‑impact directions rather than being uniformly distributed.
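
A toy version of this consistency check, with synthetic features whose spectrum decays quickly (numbers are illustrative, not the paper’s):

```python
# Measure mean cosine similarity between features and their projection onto
# only the top `frac` fraction of subspace directions.
import numpy as np

def feature_consistency(F: np.ndarray, basis: np.ndarray, frac: float) -> float:
    k = max(1, int(frac * basis.shape[1]))
    B = basis[:, :k]
    F_rec = (F @ B) @ B.T                      # rank-k reconstruction
    cos = (F * F_rec).sum(axis=1) / (
        np.linalg.norm(F, axis=1) * np.linalg.norm(F_rec, axis=1))
    return float(cos.mean())

rng = np.random.default_rng(3)
n, d = 512, 256
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
spectrum = 1.0 / (1.0 + np.arange(d))          # energy concentrated up front
F = (rng.standard_normal((n, d)) * spectrum) @ U.T
_, _, Vt = np.linalg.svd(F, full_matrices=False)
print(round(feature_consistency(F, Vt.T, 0.05), 3))  # high (~0.97+) at 5%
```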

Data‑dependency experiments show that performance is robust to sampling strategy: using a single sample already outperforms the baseline, four samples approach optimal performance, and increasing to 32 samples yields convergence, confirming the low‑dimensional nature of the task subspace.

Implications

ESM demonstrates that multi‑task fusion can move from naïve parameter averaging to a principled re‑organization of knowledge structures, enabling models to acquire new abilities without eroding existing ones. This has practical relevance for building general‑purpose AI assistants that remain stable as new functions are added, and for reducing deployment costs by avoiding repeated full‑scale retraining.

Tags: Multi-Task Learning, ESM, model merging, ESD, essential subspace, polarized scaling