Merging Large Language Models Without GPUs: Task Vector, SLERP, TIES & DARE Explained
This article introduces four advanced model‑merging algorithms—Task Vector, SLERP, TIES, and DARE—explains their underlying principles, compares their strengths, and demonstrates a practical merge of Mistral‑7B, WizardMath‑7B and CodeLlama‑7B using the open‑source MergeKit toolkit.
Model merging overview
Model merging combines multiple pretrained language models into a single model, preserving quality while adding new capabilities. The process can be performed on CPU only, without additional fine‑tuning.
Task Vector
A task vector is a direction in a model’s weight space that encodes improved performance on a specific task; it is obtained by subtracting the pretrained weights from the weights of a model fine‑tuned on that task. Adding or subtracting task vectors edits model behavior efficiently, enabling performance gains, bias reduction, and knowledge injection without full fine‑tuning, as sketched below.
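A minimal sketch of the idea, assuming two ordinary PyTorch state dicts (the toy tensors and the scaling factor are placeholders, not values from the paper):

import torch

def build_task_vector(base_state, finetuned_state):
    # Task vector: element-wise difference between fine-tuned and pretrained weights.
    return {k: finetuned_state[k] - base_state[k] for k in base_state}

def apply_task_vector(base_state, task_vector, scale=1.0):
    # Adding the (scaled) task vector injects the task; a negative scale removes it.
    return {k: base_state[k] + scale * task_vector[k] for k in base_state}

# Toy example with random tensors standing in for real checkpoints.
base = {"w": torch.randn(4, 4)}
finetuned = {"w": base["w"] + 0.1 * torch.randn(4, 4)}
task_vector = build_task_vector(base, finetuned)
edited = apply_task_vector(base, task_vector, scale=0.7)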
Paper:
https://arxiv.org/abs/2212.04089
SLERP
SLERP (Spherical Linear Interpolation) interpolates between two model weight vectors on the unit sphere, preserving each parent’s unique features and curvature.
Smooth transition between parameters.
Feature preservation for both models.
Geometric‑aware mixing that respects vector rotation.
SLERP workflow (see the sketch after this list):
Normalize input vectors to unit length, focusing on direction.
Compute the angle between the vectors from their dot product, then derive the two spherical weights from the interpolation coefficient.
Weight and sum the original vectors with the scaling factor to obtain the interpolated vector.
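The three steps map directly onto code. Below is a minimal sketch for two flattened weight tensors, assuming non‑zero inputs and falling back to plain linear interpolation when the vectors are nearly parallel (the tensor sizes and the interpolation factor t are illustrative only):

import torch

def slerp(v0, v1, t, eps=1e-8):
    # Step 1: normalize both vectors so only direction matters.
    v0_u = v0 / (v0.norm() + eps)
    v1_u = v1 / (v1.norm() + eps)
    # Step 2: angle between the vectors from their dot product.
    dot = torch.clamp(torch.dot(v0_u, v1_u), -1.0, 1.0)
    omega = torch.arccos(dot)
    if omega.abs() < 1e-4:
        # Nearly parallel: spherical and linear interpolation coincide.
        return (1 - t) * v0 + t * v1
    # Step 3: weight and sum the original vectors with the spherical coefficients.
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * v0 + (torch.sin(t * omega) / sin_omega) * v1

a = torch.randn(8)
b = torch.randn(8)
mixed = slerp(a, b, t=0.5)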
Code repository:
https://github.com/Digitous/LLM-SLERP-Merge
TIES
TIES mitigates parameter interference that degrades performance when merging many models. It performs three operations (sketched in code after this list):
Reset parameters that changed only slightly during fine‑tuning, reducing redundancy.
Resolve sign conflicts across models.
Merge only parameters whose signs agree with the consensus.
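A simplified single‑tensor sketch of the three operations, assuming a shared base tensor and a list of fine‑tuned tensors (the density value and toy data are placeholders; the real MergeKit implementation works over full checkpoints and handles more edge cases):

import torch

def ties_merge(base, finetuned_list, density=0.5):
    # Step 1 (trim): keep only the largest-magnitude deltas, reset the rest to zero.
    deltas = []
    for ft in finetuned_list:
        delta = ft - base
        k = max(1, int(density * delta.numel()))
        threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
        deltas.append(torch.where(delta.abs() >= threshold, delta, torch.zeros_like(delta)))
    stacked = torch.stack(deltas)
    # Step 2 (elect sign): the consensus sign is the sign of the summed deltas.
    consensus = torch.sign(stacked.sum(dim=0))
    # Step 3 (disjoint merge): average only deltas whose sign agrees with the consensus.
    agree = (torch.sign(stacked) == consensus).float()
    counts = agree.sum(dim=0).clamp(min=1)
    merged_delta = (stacked * agree).sum(dim=0) / counts
    return base + merged_delta

base = torch.zeros(6)
merged = ties_merge(base, [torch.randn(6), torch.randn(6)], density=0.5)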
Paper:
https://arxiv.org/abs/2306.01708
DARE
DARE (Drop And REscale) extends the TIES approach, merging models without extra training or GPU usage. It introduces two operations:
Delta‑parameter pruning: randomly set the majority of delta parameters (differences between fine‑tuned and pretrained weights) to zero, with minimal impact on performance.
Weight re‑scaling: adjust merged weights to keep output expectations roughly unchanged.
DARE workflow (see the sketch after this list):
Randomly drop most delta parameters, resetting the corresponding fine‑tuned weights to their pretrained values.
Average parameters from multiple models to create a unified model.
Re‑scale the merged weights to preserve expected performance.
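A minimal sketch of drop‑and‑rescale, assuming a drop rate p applied independently to each delta parameter and a rescale by 1/(1 - p) so the expected delta is preserved (the drop rate and toy tensors are illustrative assumptions):

import torch

def dare_delta(base, finetuned, drop_rate=0.9):
    # Delta parameters: differences between fine-tuned and pretrained weights.
    delta = finetuned - base
    # Drop: randomly reset most delta parameters back to the pretrained values.
    mask = (torch.rand_like(delta) > drop_rate).float()
    # Rescale: divide surviving deltas by (1 - drop_rate) to keep the expected delta unchanged.
    return delta * mask / (1.0 - drop_rate)

def dare_merge(base, finetuned_list, drop_rate=0.9):
    # Average the sparsified, rescaled deltas across models and add them back to the base.
    deltas = [dare_delta(base, ft, drop_rate) for ft in finetuned_list]
    return base + torch.stack(deltas).mean(dim=0)

base = torch.zeros(6)
merged = dare_merge(base, [torch.randn(6), torch.randn(6)], drop_rate=0.9)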
Paper:
https://arxiv.org/abs/2311.03099
Merge demonstration with MergeKit
Installation
python3 -m pip install --upgrade pip
git clone https://github.com/cg123/mergekit.git
cd mergekit && pip install -q -e .
YAML configuration (saved here as ultra_llm_merged.yaml) for merging Mistral‑7B, WizardMath‑7B, and CodeLlama‑7B using the TIES method
models:
  - model: mistralai/Mistral-7B-v0.1
  - model: WizardLM/WizardMath-7B-V1.0
    parameters:
      density: 0.5
      weight:
        - filter: mlp
          value: 0.5
        - value: 0
  - model: codellama/CodeLlama-7b-Instruct-hf
    parameters:
      density: 0.5
      weight: 0.5
merge_method: ties
base_model: mistralai/Mistral-7B-v0.1
parameters:
  normalize: true
  int8_mask: true
dtype: float16
Running the merge
mergekit-yaml ultra_llm_merged.yaml output_folder \
--allow-crimes \
--copy-tokenizer \
--out-shard-size 1B \
--low-cpu-memory \
--write-model-card \
--lazy-unpickle
Resource usage on a 30‑vCPU machine (values may vary with model size):
Download: ~5 minutes
Merge: ~7 minutes
Peak memory: 30 GB
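Once the merge finishes, the output folder is a standard Hugging Face checkpoint. A minimal smoke‑test sketch, assuming the transformers library is installed and the folder is named output_folder as in the command above (the prompt is just an example):

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the merged checkpoint produced by mergekit-yaml.
tokenizer = AutoTokenizer.from_pretrained("output_folder")
model = AutoModelForCausalLM.from_pretrained("output_folder", torch_dtype="auto")

# Quick generation check on a sample prompt.
inputs = tokenizer("Write a Python function that adds two numbers.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))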
MergeKit repository:
https://github.com/cg123/mergekit
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.