OrthoReg: Simple Orthogonal Regularization to Eliminate Model Merging Conflicts

The paper introduces OrthoReg, a lightweight orthogonal regularization term added during fine‑tuning that encourages orthogonal weight updates, thereby resolving conflicts in model merging and offering a theoretical explanation for the success of task arithmetic.

Machine Heart

In the era of large models, fine‑tuning adapts a pretrained backbone to individual downstream tasks; covering many tasks traditionally requires costly joint multi‑task training, and naively combining several fine‑tuned expert models causes severe interference.

Task arithmetic offers a cheap alternative: subtracting the pretrained weights from a fine‑tuned model's weights yields a task vector, and simply adding several such vectors back to the backbone produces a single model that can handle multiple tasks.
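As a concrete illustration, the sketch below shows task‑vector extraction and merging over PyTorch state dicts; it is our minimal reconstruction, and the helper names and the scaling coefficient `alpha` are illustrative rather than taken from the paper.

```python
import torch

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """Task vector = fine-tuned weights minus pretrained weights."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def merge(pretrained: dict, task_vectors: list, alpha: float = 0.3) -> dict:
    """Task arithmetic: add the scaled sum of all task vectors to the backbone."""
    merged = {k: v.clone() for k, v in pretrained.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += alpha * tv[k]  # alpha is an illustrative merging coefficient
    return merged
```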

Previous work (the NeurIPS 2023 paper on tangent‑space task arithmetic) explained this phenomenon through weight disentanglement: when the weight updates for different tasks do not interfere with one another, task arithmetic succeeds. However, it left open which intrinsic properties of the pretrained model make this disentanglement possible.

The authors propose the Task‑Feature Specialization (TFS) hypothesis: an ideal pretrained model allocates distinct internal features (represented by column vectors of weight matrices) to each task, making the features for, e.g., car detection and digit recognition mutually independent. Under the NTK linearisation assumption they prove:

Theorem 1: TFS is a sufficient condition for weight disentanglement.

Corollary 1: TFS implies an observable geometric property—Weight Vector Orthogonality (WVO)—where weight matrices exhibit block‑ or column‑wise orthogonal structure.
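In our notation (the paper's exact statement may differ), WVO says that weight vectors allocated to different tasks are mutually orthogonal:

```latex
% Weight Vector Orthogonality (WVO): weight (column) vectors allocated to
% different tasks s != t are mutually orthogonal.
\langle w_i^{(s)},\, w_j^{(t)} \rangle \approx 0
\qquad \text{for all } s \neq t .
```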

Empirically, CLIP vision models (ViT‑B/16, ViT‑B/32, ViT‑L/14) show that the angles between weight vectors in core projection layers concentrate around 90°, confirming the orthogonality prediction.
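This measurement is easy to reproduce: normalise the columns of a weight matrix and compute pairwise angles. The sketch below is our code, and the CLIP module path in the comment is hypothetical.

```python
import torch

def column_angles(W: torch.Tensor) -> torch.Tensor:
    """Pairwise angles (in degrees) between the column vectors of a weight matrix."""
    cols = W / W.norm(dim=0, keepdim=True)   # unit-normalise each column
    cos = (cols.T @ cols).clamp(-1.0, 1.0)   # cosine-similarity matrix
    deg = torch.rad2deg(torch.acos(cos))
    i, j = torch.triu_indices(W.shape[1], W.shape[1], offset=1)  # distinct pairs only
    return deg[i, j]

# Hypothetical usage on a CLIP ViT attention output projection:
# angles = column_angles(model.visual.transformer.resblocks[0].attn.out_proj.weight)
# print(angles.mean(), angles.std())  # expected to concentrate around 90 degrees
```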

In practice, perfect TFS rarely holds because the data for different tasks often share overlapping features. The authors therefore intervene before merging: rather than enforcing functional specialization itself, they directly regularise its observable geometric consequence, orthogonality, during fine‑tuning.

The proposed method, OrthoReg, adds a single orthogonal regularisation term to the standard loss, penalising deviations of the weight‑update matrix ΔW from orthogonality (driving its Gram matrix ΔWᵀΔW toward the identity) so that the columns of each updated linear layer stay mutually orthogonal. The term is lightweight (one extra hyper‑parameter) and requires no architectural changes.
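Read as a soft‑orthogonality constraint, the penalty can be sketched as follows. This is our reconstruction under that assumption, not necessarily the paper's exact formulation; `lambda_ortho` stands in for the single extra hyper‑parameter.

```python
import torch

def orthoreg_penalty(delta_w: torch.Tensor) -> torch.Tensor:
    """Assumed soft-orthogonality penalty on a weight update: ||dW^T dW - I||_F^2."""
    gram = delta_w.T @ delta_w
    eye = torch.eye(gram.shape[0], device=delta_w.device, dtype=delta_w.dtype)
    return ((gram - eye) ** 2).sum()

# Inside a training step (sketch):
# delta_w = layer.weight - pretrained_weight          # this layer's current update
# loss = task_loss + lambda_ortho * orthoreg_penalty(delta_w)
```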

Compared with tangent‑space fine‑tuning (TTA), which relies on expensive Jacobian computations, OrthoReg incurs negligible extra computation while pursuing the same theoretical goal of orthogonal task vectors.

Experiments on eight diverse image‑classification datasets using Vision Transformers (ViT‑B/32, ViT‑B/16, ViT‑L/14) evaluate three fine‑tuning regimes: full‑parameter (Non‑lin. FT), tangent‑space (TTA), and parameter‑efficient (ATT‑FT, LoRA). Adding OrthoReg consistently improves performance. For example, on ViT‑L/14, OrthoReg raises full‑parameter fine‑tuning accuracy from 84.07% to 88.23% (+4.16 pp), and ATT‑FT + OrthoReg reaches a new high of 90.41%.

Task‑negation experiments show that OrthoReg enables cleaner removal of a target task: the accuracy on the removed task drops more sharply while preserving zero‑shot generalisation on a control task such as ImageNet.
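Reusing the helpers from the first sketch, task negation is the same arithmetic with a minus sign; `alpha` is again an illustrative removal strength.

```python
def negate_task(pretrained: dict, tv: dict, alpha: float = 0.5) -> dict:
    """Task negation: subtract (rather than add) the target task's vector."""
    return {k: pretrained[k] - alpha * tv[k] for k in pretrained}
```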

Visualization of cosine‑similarity heatmaps of task vectors demonstrates that baseline methods produce highly correlated vectors (bright off‑diagonal blocks), whereas OrthoReg yields darker off‑diagonal regions, confirming that the regulariser drives vectors toward orthogonality.
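Such heatmaps can be reproduced by flattening each task vector and computing pairwise cosine similarities, as in this sketch (our code, not the paper's):

```python
import torch

def flatten_tv(tv: dict) -> torch.Tensor:
    """Concatenate every parameter update of a task vector into one 1-D tensor."""
    return torch.cat([v.flatten() for v in tv.values()])

def task_cosine_matrix(task_vectors: list) -> torch.Tensor:
    """Pairwise cosine-similarity matrix between flattened task vectors."""
    mat = torch.stack([flatten_tv(tv) for tv in task_vectors])
    mat = mat / mat.norm(dim=1, keepdim=True)
    return mat @ mat.T  # near-zero off-diagonal entries indicate orthogonal task vectors
```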

In summary, the work provides a new theoretical link—TFS → weight disentanglement → orthogonal weight vectors—and translates it into a practical, plug‑and‑play regulariser that improves multi‑task model merging across architectures and fine‑tuning paradigms. Future directions include richer orthogonal constraints for more complex multi‑task settings and extending the geometric disentanglement perspective to large language and multimodal models.
