Unlocking Large Model Power: 5 Effective Model Fusion Techniques Explained

This article examines why ensemble methods are crucial for large language models, outlines five core fusion strategies—including model integration, probability integration, graft learning, crowdsourced voting, and Mixture of Experts—provides implementation details, pseudo‑code, and discusses practical challenges and recent research advances.

Baobao Algorithm Notes

Mar 10, 2024

Why Model Fusion Matters for LLMs

Ensemble techniques reliably improve discriminative models, but applying them to generative large language models (LLMs) is harder: an LLM emits one token at a time, each conditioned on the tokens before it, so complete outputs cannot simply be averaged after the fact. As parameter counts grow, classic ensemble methods such as stacking or boosting also become impractical, because training and serving many full-size models is too expensive. Fusion methods for LLMs therefore have to respect vocabulary alignment and work at massive parameter scales.

Five Fundamental Fusion Approaches

1. Model Integration (Exchange‑of‑Thought)

Model integration concatenates the textual outputs of several LLMs (e.g., three different LLaMA variants) and feeds the combined text as a prompt to a fourth model. The "Exchange‑of‑Thought" (EoT) framework formalizes this cross‑model communication, allowing models to absorb each other's reasoning steps to improve collective problem solving.
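The concatenation step described above can be sketched as follows. This is a minimal illustration, not the EoT protocol itself: the prompt wording, the function names `build_fusion_prompt` and `fuse`, and the stand-in "models" (plain Python functions) are all assumptions made so the example runs without any LLM.

```python
from typing import Callable, Sequence

def build_fusion_prompt(question: str, drafts: Sequence[str]) -> str:
    """Concatenate several models' answers into one prompt for a final model."""
    parts = [f"Question: {question}", ""]
    for i, draft in enumerate(drafts, start=1):
        parts.append(f"Answer from model {i}:")
        parts.append(draft.strip())
        parts.append("")
    parts.append("Considering the answers above, give a single final answer.")
    return "\n".join(parts)

def fuse(question: str,
         drafters: Sequence[Callable[[str], str]],
         finalizer: Callable[[str], str]) -> str:
    """Ask every drafter model, then hand the combined text to the final model."""
    drafts = [draft(question) for draft in drafters]
    return finalizer(build_fusion_prompt(question, drafts))

# Hypothetical stand-ins for three drafting models.
drafters = [lambda q: "4", lambda q: "2 + 2 = 4", lambda q: "five"]
prompt = build_fusion_prompt("What is 2 + 2?", [d("What is 2 + 2?") for d in drafters])
print(prompt)
```

In practice each callable would wrap a real generation call, and the final model sees every draft, including the wrong ones, so it can weigh them against each other.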

2. Probability Integration

Analogous to traditional ensemble averaging, probability integration averages the logits (or probability distributions) of multiple models. All participating models must share the same vocabulary so that logits are comparable.

Simple pseudo‑code:

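import torch

# Assumed to be defined by the caller: `models` (two causal LMs sharing one
# vocabulary), `new_token` (id of the starting token), `eos_token_id`, `max_new_tokens`.
kv_cache = [None, None]  # one key/value cache per model
generated = []
with torch.no_grad():
    for _ in range(max_new_tokens):
        input_ids = torch.tensor([[new_token]], dtype=torch.long, device='cuda')
        output1 = models[0](input_ids=input_ids, past_key_values=kv_cache[0], use_cache=True)
        output2 = models[1](input_ids=input_ids, past_key_values=kv_cache[1], use_cache=True)
        kv_cache = [output1.past_key_values, output2.past_key_values]
        # Average the next-token logits of the two models, then decode greedily
        avg_logits = (output1.logits[:, -1, :] + output2.logits[:, -1, :]) / 2
        new_token = torch.argmax(avg_logits, dim=-1).item()
        if new_token == eos_token_id:
            break
        generated.append(new_token)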

3. Graft Learning

Graft learning here refers to the depth up-scaling (DUS) method from the SOLAR paper ("SOLAR 10.7B: Scaling Large Language Models with Simple yet Effective Depth Up-Scaling"), which grafts two copies of a base model into one deeper model and then continues pre-training. Concretely, a base model with n layers is duplicated; the last m layers are removed from one copy and the first m layers from the other, leaving two (n-m)-layer sub-models. These are concatenated to form a 2·(n-m)-layer model; SOLAR uses n = 32 and m = 8, giving 48 layers. Because the layers start from trained weights, this requires less compute than training a model of that depth from scratch. Continued pre-training recovers the performance lost at the seam, after which the model is aligned with instruction fine-tuning and Direct Preference Optimization (DPO).
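The layer bookkeeping can be sketched in a few lines. The example follows the SOLAR paper's recipe, in which the final m layers are dropped from one copy and the first m layers from the other; the function name `depth_up_scale` is hypothetical, and the code only computes which base-model layers end up where, without touching any weights.

```python
from typing import List

def depth_up_scale(n: int, m: int) -> List[int]:
    """Return base-model layer indices, in order, for the up-scaled model."""
    if not 0 <= m < n:
        raise ValueError("need 0 <= m < n")
    first_copy = list(range(0, n - m))   # copy 1: drop its last m layers
    second_copy = list(range(m, n))      # copy 2: drop its first m layers
    return first_copy + second_copy

layers = depth_up_scale(n=32, m=8)
print(len(layers))   # 2 * (32 - 8) = 48
print(layers[22:26]) # the seam: base layers 22, 23 followed by 8, 9
```

The seam is where layer 23 of the first copy feeds layer 8 of the second, a pairing the base model never saw, which is why continued pre-training is needed afterwards.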

4. Crowdsourced Voting

In the winning solution of the 2024 WSDM Cup, a crowdsourced voting scheme selects the generated sentence that is most similar to all other model outputs as the consensus answer. Similarity can be measured with embedding cosine similarity, word‑level ROUGE‑L, or character‑level ROUGE‑L. The aggregated similarity score serves as a quality metric for final selection. Code repository: https://github.com/zhangzhao219/WSDM-Cup-2024/tree/main
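A minimal sketch of the voting idea, assuming a plain LCS-based ROUGE-L F1 as the similarity measure. It is not the code from the linked repository; the function names and the example sentences are made up for illustration. Passing `tokenize=list` switches from word-level to character-level comparison.

```python
from typing import Callable, List, Sequence, Tuple

def lcs_length(a: Sequence, b: Sequence) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_f1(a: Sequence, b: Sequence) -> float:
    """Symmetric ROUGE-L score (F1 of LCS precision and recall)."""
    if not a or not b:
        return 0.0
    lcs = lcs_length(a, b)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

def vote(candidates: List[str],
         tokenize: Callable[[str], Sequence] = str.split) -> Tuple[str, List[float]]:
    """Pick the candidate with the highest mean similarity to all the others."""
    tokens = [tokenize(c) for c in candidates]
    scores = []
    for i, ti in enumerate(tokens):
        others = [rouge_l_f1(ti, tj) for j, tj in enumerate(tokens) if j != i]
        scores.append(sum(others) / len(others) if others else 0.0)
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best], scores

candidates = [
    "the cat sat on the mat",
    "a cat sat on the mat",
    "the cat sat on a mat",
    "dogs bark loudly",
]
winner, scores = vote(candidates)
print(winner, [round(s, 3) for s in scores])
```

The outlier sentence shares nothing with the rest and scores zero, while the sentence closest to the group's consensus wins. Swapping `rouge_l_f1` for embedding cosine similarity changes only the pairwise score, not the selection logic.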

5. Mixture of Experts (MoE)

MoE combines multiple expert sub-models with a gating network that dynamically routes each token to a small subset of experts. Large-scale Transformers such as GShard replace feed-forward layers (every other one, in GShard's case) with Top-2 gated MoE layers, which lets the total parameter count grow to hundreds of billions while per-token compute stays comparable to that of a much smaller dense model. Key techniques in GShard:

Auxiliary load‑balancing loss : penalizes imbalanced token distribution across experts.

Random routing : after selecting the top‑1 expert, the second expert is chosen probabilistically based on its weight.

Expert capacity limits : caps the number of tokens an expert can process; overflow tokens are passed to the next layer via residual connections or dropped.
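The three techniques above can be sketched together in plain Python. This is an illustrative approximation in the spirit of GShard rather than its actual implementation: the function name `top2_route`, the rule of keeping the second expert with probability 2·g2, and the exact form of the auxiliary loss (fraction of top-1 assignments times mean gate probability, averaged over experts) are assumptions of this sketch.

```python
import random
from typing import List

def top2_route(gates: List[List[float]], capacity: int, seed: int = 0):
    """gates[s][e]: softmax gate probability of token s for expert e (needs >= 2 experts).

    Returns (assignments, aux_loss); assignments[s] is a list of (expert, weight).
    """
    rng = random.Random(seed)
    num_tokens, num_experts = len(gates), len(gates[0])
    load = [0] * num_experts        # tokens accepted so far by each expert
    top1_count = [0] * num_experts  # top-1 choices per expert, used by the aux loss
    assignments = []
    for probs in gates:
        order = sorted(range(num_experts), key=probs.__getitem__, reverse=True)
        first, second = order[0], order[1]
        top1_count[first] += 1
        norm = probs[first] + probs[second]
        g1, g2 = probs[first] / norm, probs[second] / norm
        chosen = []
        # Expert capacity limit: a full expert rejects the token.
        if load[first] < capacity:
            load[first] += 1
            chosen.append((first, g1))
        # Random routing: keep the second expert with probability tied to its weight.
        if rng.random() < 2 * g2 and load[second] < capacity:
            load[second] += 1
            chosen.append((second, g2))
        # An empty list means overflow: the token only flows through the residual path.
        assignments.append(chosen)
    mean_gate = [sum(p[e] for p in gates) / num_tokens for e in range(num_experts)]
    # Auxiliary load-balancing loss: large when one expert attracts most tokens.
    aux_loss = sum((top1_count[e] / num_tokens) * mean_gate[e]
                   for e in range(num_experts)) / num_experts
    return assignments, aux_loss

skewed = [[0.9, 0.05, 0.05]] * 4
balanced = [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3], [0.3, 0.1, 0.6]]
skewed_routes, skewed_loss = top2_route(skewed, capacity=2)
_, balanced_loss = top2_route(balanced, capacity=2)
print(skewed_routes)
print(round(skewed_loss, 3), round(balanced_loss, 3))
```

With every token preferring expert 0 and a capacity of 2, only the first two tokens reach that expert, and the auxiliary loss is noticeably higher than in the balanced case, which is the signal the router is trained to reduce.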

During inference only a subset of experts is activated; shared components such as self‑attention remain dense, allowing a 47B‑parameter MoE model to run with roughly the compute of a 12B dense model. Generic MoE layer pseudo‑code:

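# Dimension letters: S = tokens, M = model dim, E = experts,
# C = expert capacity, H = expert hidden dim.
# Wg, Wi, Wo and Top2Gating are assumed to be defined elsewhere.
M = input.shape[-1]
reshaped_input = input.reshape(-1, M)  # [S, M]
# Compute gating probabilities
gates = softmax(einsum("SM, ME -> SE", reshaped_input, Wg))
# combine_weights and dispatch_mask both have shape [S, E, C]
combine_weights, dispatch_mask = Top2Gating(gates)
# Dispatch inputs to experts
dispatched_expert_input = einsum("SEC, SM -> ECM", dispatch_mask, reshaped_input)
# Expert forward passes (a two-layer feed-forward network per expert)
h = einsum("ECM, EMH -> ECH", dispatched_expert_input, Wi)
h = relu(h)
expert_outputs = einsum("ECH, EHM -> ECM", h, Wo)
# Combine expert outputs, weighted by the gate values
outputs = einsum("SEC, ECM -> SM", combine_weights, expert_outputs)
outputs_reshape = outputs.reshape(input.shape)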

Later variants adjust the gating: Switch Transformers route each token to a single expert (Top-1), while DeepSeek MoE adds shared experts that are always active, so knowledge common to every token does not have to be duplicated across the routed experts.

Effective model fusion for LLMs therefore adapts classic ensemble ideas to token-level operations, ensures vocabulary alignment, manages massive parameter counts, and leverages dynamic gating mechanisms to balance performance and efficiency.
