Unlocking Large Model Power: 5 Effective Model Fusion Techniques Explained
This article examines why ensemble methods matter for large language models, outlines five core fusion strategies (model integration, probability integration, graft learning, crowdsourced voting, and Mixture of Experts), provides implementation details and pseudo‑code, and discusses practical challenges and recent research advances.
Why Model Fusion Matters for LLMs
Ensemble techniques improve discriminative models, but applying them to generative large language models (LLMs) is challenging because decoding introduces token‑level dependencies. As model parameter counts grow, classic ensemble methods such as stacking or boosting become impractical, requiring adaptations that respect vocabulary alignment and massive parameter scales.
Five Fundamental Fusion Approaches
1. Model Integration (Exchange‑of‑Thought)
Model integration concatenates the textual outputs of several LLMs (e.g., three different LLaMA variants) and feeds the combined text as a prompt to a fourth model. The "Exchange‑of‑Thought" (EoT) framework formalizes this cross‑model communication, allowing models to absorb each other's reasoning steps to improve collective problem solving.
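The concatenation step can be sketched in a few lines. This is a minimal single-pass illustration, not the EoT framework itself: the `Generator` type, the prompt wording, and the stub models are all assumptions made for the example, and any real LLM call could stand in for the lambdas.

```python
from typing import Callable, List

# Hypothetical type: any function that maps a prompt string to a completion.
Generator = Callable[[str], str]


def build_fusion_prompt(question: str, answers: List[str]) -> str:
    """Concatenate peer answers into one prompt for the aggregating model."""
    parts = [f"Question: {question}", ""]
    for i, answer in enumerate(answers, start=1):
        parts.append(f"Answer from model {i}:\n{answer.strip()}")
        parts.append("")
    parts.append("Considering the answers above, give the single best final answer.")
    return "\n".join(parts)


def fuse(question: str, peers: List[Generator], aggregator: Generator) -> str:
    """Query every peer model, then let the aggregator read all of their outputs."""
    answers = [peer(question) for peer in peers]
    return aggregator(build_fusion_prompt(question, answers))


# Stub "models" so the sketch runs without any real LLM.
peers = [lambda q: "The answer is 4.", lambda q: "2 + 2 = 4", lambda q: "I think 5."]
echo_aggregator = lambda prompt: prompt  # returns the prompt it received, for inspection
fused_prompt = fuse("What is 2 + 2?", peers, echo_aggregator)
```

Printing `fused_prompt` shows exactly what the fourth model would receive, which makes it easy to check that prompt length stays within the aggregator's context window.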
2. Probability Integration
Analogous to traditional ensemble averaging, probability integration averages the logits (or probability distributions) of multiple models. All participating models must share the same vocabulary so that logits are comparable.
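A toy example may make the shared-vocabulary requirement concrete. The three-token vocabulary and the logit values below are invented for illustration; the sketch averages probability distributions (rather than raw logits) and refuses to proceed if the vocabularies differ.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_probs(logits_per_model: Sequence[Sequence[float]],
                  vocabs: Sequence[Sequence[str]]) -> List[float]:
    """Average next-token distributions; index i must mean the same token in every model."""
    if any(list(v) != list(vocabs[0]) for v in vocabs[1:]):
        raise ValueError("models do not share a vocabulary; distributions are not comparable")
    dists = [softmax(logits) for logits in logits_per_model]
    return [sum(column) / len(dists) for column in zip(*dists)]


# Invented toy data: two models scoring the same three-token vocabulary.
vocab = ["yes", "no", "maybe"]
logits_a = [2.0, 1.0, 0.0]   # model A prefers "yes"
logits_b = [0.0, 3.0, 0.0]   # model B strongly prefers "no"
avg = average_probs([logits_a, logits_b], [vocab, vocab])
next_token = vocab[max(range(len(avg)), key=avg.__getitem__)]
```

Here model B's stronger confidence outweighs model A's preference, so the ensemble picks "no" even though the two models disagree on their individual top choice.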
Simple pseudo‑code:
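import torch

# Assumes `models` holds two causal LMs that share one vocabulary, and that
# `new_token` (the starting token id) and `eos_token_id` are already defined.
kv_cache = [None, None]  # one key/value cache per model
generated = []
while True:
    input_ids = torch.tensor([[new_token]], dtype=torch.long, device='cuda')
    output1 = models[0](input_ids=input_ids, past_key_values=kv_cache[0], use_cache=True)
    output2 = models[1](input_ids=input_ids, past_key_values=kv_cache[1], use_cache=True)
    kv_cache = [output1.past_key_values, output2.past_key_values]
    # Average the two next-token probability distributions, then decode greedily
    prob = (output1.logits[0, -1].softmax(dim=-1) + output2.logits[0, -1].softmax(dim=-1)) / 2
    new_token = torch.argmax(prob, dim=-1).item()
    if new_token == eos_token_id:
        break
    generated.append(new_token)
3. Graft Learning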
Graft learning refers to the depth up‑scaling (DUS) method introduced in the SOLAR paper ("SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up‑Scaling"), which grafts the layers of a base model onto a copy of itself to form a deeper model and then continues pre‑training. Concretely, a base model with n layers is duplicated; the last m layers are removed from the original and the first m layers from the duplicate, leaving two (n‑m)‑layer sub‑models. These are concatenated into a 2·(n‑m)‑layer model (SOLAR uses n = 32 and m = 8, giving 48 layers), which requires less compute than training a model of that depth from scratch. Because the seam between the two halves initially degrades quality, the combined model is pre‑trained further to recover performance and then aligned via instruction fine‑tuning and Direct Preference Optimization (DPO).
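The layer arithmetic can be sketched with plain Python lists. This follows the SOLAR paper's description, in which the original copy loses its last m layers and the duplicate loses its first m layers; the `depth_up_scale` helper and the string "layers" are hypothetical stand-ins for real Transformer blocks.

```python
import copy
from typing import List, TypeVar

Layer = TypeVar("Layer")


def depth_up_scale(layers: List[Layer], m: int) -> List[Layer]:
    """Build a 2*(n-m)-layer stack from an n-layer stack (hypothetical helper)."""
    n = len(layers)
    if not 0 <= m < n:
        raise ValueError("m must satisfy 0 <= m < number of layers")
    top = copy.deepcopy(layers[: n - m])   # original copy without its last m layers
    bottom = copy.deepcopy(layers[m:])     # duplicate without its first m layers
    return top + bottom                    # the seam sits between index n-m-1 and n-m


# Strings stand in for Transformer blocks; n = 32 and m = 8 as in SOLAR.
base = [f"layer_{i}" for i in range(32)]
scaled = depth_up_scale(base, m=8)
seam = (scaled[23], scaled[24])  # last layer of the top half, first layer of the bottom half
```

The deep copies matter in a real model: the two halves start from identical weights but must be separate parameters so that continued pre‑training can update them independently.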
4. Crowdsourced Voting
In the winning solution of the 2024 WSDM Cup, a crowdsourced voting scheme selects the generated sentence that is most similar to all other model outputs as the consensus answer. Similarity can be measured with embedding cosine similarity, word‑level ROUGE‑L, or character‑level ROUGE‑L. The aggregated similarity score serves as a quality metric for final selection. Code repository: https://github.com/zhangzhao219/WSDM-Cup-2024/tree/main
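The selection rule can be sketched as follows. This is not the code from the linked repository: it assumes a simplified word‑level ROUGE‑L (an F1 over the longest common subsequence) as the only similarity measure, and the candidate sentences are invented for illustration.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """Simplified word-level ROUGE-L: F1 of LCS-based precision and recall."""
    c, r = candidate.split(), reference.split()
    if not c or not r:
        return 0.0
    lcs = lcs_length(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)


def select_consensus(candidates: List[str]) -> int:
    """Return the index of the candidate with the highest mean similarity to the others."""
    scores = []
    for i, cand in enumerate(candidates):
        others = [rouge_l_f1(cand, o) for j, o in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    return max(range(len(candidates)), key=scores.__getitem__)


# Invented outputs from four hypothetical models.
candidates = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a cat sat on the mat",
    "dogs bark loudly",
]
winner = candidates[select_consensus(candidates)]
```

Swapping `rouge_l_f1` for an embedding cosine similarity or a character‑level variant only changes the pairwise scoring function; the voting logic stays the same.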
5. Mixture of Experts (MoE)
MoE combines multiple expert sub‑models with a gating network that dynamically routes each token to a subset of experts. In large‑scale Transformers the dense feed‑forward network is replaced by a gated MoE layer; GShard, for example, swaps every other feed‑forward layer for a Top‑2 gated MoE layer, letting the total parameter count grow far beyond that of a dense model with comparable per‑token inference cost. Key techniques in GShard:
Auxiliary load‑balancing loss: penalizes imbalanced token distribution across experts.
Random routing: after selecting the top‑1 expert, the second expert is used only with a probability proportional to its gate weight.
Expert capacity limits: caps the number of tokens an expert can process; overflow tokens skip the expert and reach the next layer via the residual connection, or are dropped.
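The three techniques above can be sketched together in plain Python. This is an illustrative reading rather than GShard's implementation: the function name, the per-token sequential loop, the `2 * g2 > random` acceptance rule, and the `(1/E) * sum(c_e/S * m_e)` form of the auxiliary loss are stated here as assumptions, and scaling conventions differ between papers.

```python
import math
import random
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_route(gate_logits: List[List[float]], capacity: int,
               rng: random.Random) -> Tuple[List[List[Tuple[int, float]]], float]:
    """Route each token to at most two experts; return (assignments, aux_loss)."""
    num_tokens, num_experts = len(gate_logits), len(gate_logits[0])
    load = [0] * num_experts         # tokens accepted by each expert so far
    top1_count = [0] * num_experts   # c_e: tokens whose first choice is expert e
    gate_sum = [0.0] * num_experts   # running sum used for the mean gate m_e
    assignments = []
    for logits in gate_logits:
        gates = softmax(logits)
        for e, g in enumerate(gates):
            gate_sum[e] += g
        order = sorted(range(num_experts), key=gates.__getitem__, reverse=True)
        e1, e2 = order[0], order[1]
        norm = gates[e1] + gates[e2]
        g1, g2 = gates[e1] / norm, gates[e2] / norm
        top1_count[e1] += 1
        chosen = []
        if load[e1] < capacity:                           # expert capacity limit
            load[e1] += 1
            chosen.append((e1, g1))
        if load[e2] < capacity and 2 * g2 > rng.random():  # random routing
            load[e2] += 1
            chosen.append((e2, g2))
        assignments.append(chosen)   # empty list: token overflowed, residual path only
    aux_loss = sum((top1_count[e] / num_tokens) * (gate_sum[e] / num_tokens)
                   for e in range(num_experts)) / num_experts
    return assignments, aux_loss


rng = random.Random(0)
# Four tokens that all prefer expert 0, with room for only two tokens per expert.
skewed_assignments, skewed_loss = top2_route([[5.0, 0.0, 0.0]] * 4, capacity=2, rng=rng)
# Three tokens spread evenly over three experts.
even_logits = [[5.0, 0.0, 0.0], [0.0, 5.0, 0.0], [0.0, 0.0, 5.0]]
_, even_loss = top2_route(even_logits, capacity=2, rng=rng)
```

In the skewed case only the first two tokens are accepted by expert 0, and the auxiliary loss is larger than in the evenly spread case, which is the signal that pushes the gate toward balanced routing during training.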
During inference only a subset of experts is activated; shared components such as self‑attention remain dense, allowing a 47B‑parameter MoE model to run with roughly the compute of a 12B dense model.
Generic MoE layer pseudo‑code:
# Dimension letters: S = tokens, M = model width, E = experts, C = expert capacity, H = expert hidden width
# Wg, Wi and Wo are the gating, expert-input and expert-output weights; Top2Gating is a helper not shown here
M = input.shape[-1]
reshaped_input = input.reshape(-1, M)
# Compute gating probabilities
gates = softmax(einsum("SM, ME -> SE", reshaped_input, Wg))
combine_weights, dispatch_mask = Top2Gating(gates)
# Dispatch inputs to experts
dispatched_expert_input = einsum("SEC, SM -> ECM", dispatch_mask, reshaped_input)
# Expert forward passes
h = einsum("ECM, EMH -> ECH", dispatched_expert_input, Wi)
h = relu(h)
expert_outputs = einsum("ECH, EHM -> ECM", h, Wo)
# Combine expert outputs
outputs = einsum("SEC, ECM -> SM", combine_weights, expert_outputs)
outputs_reshape = outputs.reshape(input.shape)
Later variants simplify gating: Switch Transformers use a Top‑1 strategy, while DeepSeek MoE introduces a shared expert that is always active, so that common knowledge is available to every token regardless of routing.
Effective model fusion for LLMs therefore adapts classic ensemble ideas to token‑level operations, ensures vocabulary alignment, manages massive parameter counts, and leverages dynamic gating mechanisms to balance performance and efficiency.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.