Uncovering Mixtral‑8x7B: How MoE Experts Shape Performance and Training
This article analyses the Mixtral‑8x7B Mixture‑of‑Experts LLM, explains its gate‑driven 8‑expert architecture, presents a simplified PyTorch implementation, and reports a series of experiments that probe top‑2 gating during training, individual expert contributions, task‑specific pre‑training, the impact of expert count, and similarity with Mistral‑7B, ultimately offering hypotheses about its training pipeline.
Introduction
Mistral AI released the Mixtral-8x7B model, a Mixture-of-Experts (MoE) LLM that claims to outperform LLaMA-2 70B and approach GPT-3.5 despite being built from only eight 7B-scale experts. The community quickly benchmarked the model and began dissecting its architecture.
Model Structure
The only structural difference from LLaMA is that each decoder layer's MLP is replicated eight times as separate expert layers. A gate layer selects the top-2 experts for each token. A simplified implementation is shown below.
# Note: simplified code, batch dimension omitted for clarity; x has shape (num_tokens, hidden_dim)
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtralSparseMoeBlock(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_dim = config.hidden_size
        # one router logit per expert for every token
        self.gate = nn.Linear(self.hidden_dim, 8)
        # each expert is an ordinary feed-forward MLP block (definition omitted)
        self.experts = nn.ModuleList([MLP(config) for _ in range(8)])

    def forward(self, x):
        router_logits = self.gate(x)  # (tokens, 8)
        routing_weights = F.softmax(router_logits, dim=-1)
        # keep only the two highest-scoring experts per token
        routing_weights, selected_experts = torch.topk(routing_weights, 2, dim=-1)
        # renormalize the two kept weights so they sum to 1
        routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
        # (8, 2, tokens): for each expert, which tokens selected it and in which slot
        expert_mask = F.one_hot(selected_experts, num_classes=8).permute(2, 1, 0)
        final_hidden_states = torch.zeros_like(x)
        for expert_idx in range(8):
            expert_layer = self.experts[expert_idx]
            idx_list, top_x_list = torch.where(expert_mask[expert_idx])
            current_state = x[top_x_list]
            current_routing_weights = routing_weights[top_x_list, idx_list, None]
            current_hidden_states = expert_layer(current_state) * current_routing_weights
            # scatter each expert's weighted output back to the corresponding token rows
            final_hidden_states.index_add_(0, top_x_list, current_hidden_states)
        return final_hidden_states

Experiment 1: Is Top-2 Gating Used During Training?
To test whether training used all experts, the number of activated experts per token was varied (top-1, top-2, top-3, and so on) and each variant was evaluated on the MMLU benchmark. Results showed a sharp drop for top-1, a slight rise for top-3, and progressive degradation as more experts were activated beyond that, indicating that the model was indeed trained with top-2 gating.
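With the Hugging Face implementation, one way to run this sweep is to override the num_experts_per_tok field of the model config before loading the weights. The sketch below assumes that approach; evaluate_mmlu is a hypothetical placeholder for whatever MMLU harness is used.

from transformers import AutoConfig, AutoModelForCausalLM

def load_with_top_k(k: int):
    # num_experts_per_tok controls how many experts the router activates per token (default 2)
    config = AutoConfig.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
    config.num_experts_per_tok = k
    return AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1", config=config)

for k in (1, 2, 3, 4):
    model = load_with_top_k(k)
    # score = evaluate_mmlu(model)  # hypothetical MMLU harness; compare scores across k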
Experiment 2: Do All Experts Contribute Equally?
Each expert was removed in turn and the resulting seven-expert model was evaluated on MMLU. Deleting expert 3 caused a catastrophic failure, while removing any other expert produced only modest changes, suggesting uneven load balancing: expert 3 carries a disproportionate share of the work, though the remaining experts still make non-trivial contributions.
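A minimal sketch of one way to ablate an expert at inference time, assuming the Hugging Face module layout (model.model.layers[i].block_sparse_moe.gate): force that expert's router logit to negative infinity with a forward hook so the gate never selects it.

import torch

def disable_expert(model, expert_idx: int):
    # Mask one expert's router logit so top-2 routing never picks it (evaluation only).
    def mask_logits(module, inputs, output):
        output[..., expert_idx] = float("-inf")
        return output

    handles = []
    for layer in model.model.layers:
        handles.append(layer.block_sparse_moe.gate.register_forward_hook(mask_logits))
    return handles  # call handle.remove() on each handle to restore the expert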
Experiment 3: Are Experts Pre‑trained on Different Tasks?
The same removal test was performed on the MT‑Bench benchmark (and later on Math and Code tasks). The performance pattern mirrored that of MMLU, implying that the experts were not specialized for distinct downstream tasks.
Experiment 4: How Does the Number of Experts Affect Performance?
A greedy algorithm removed experts one at a time, keeping the best-performing sub-model after each deletion. The resulting deletion order [2, 5, 6, 4, 7, 0, 1, 3] reflects relative importance; expert 3 is consistently the most critical and is removed last. A minimal sketch of the greedy procedure follows, and the scores for sub-models with one to seven experts are tabulated after it.
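The sketch below reuses the hypothetical disable_expert and evaluate_mmlu helpers from the earlier sketches; it is an illustration of the greedy search, not the authors' exact code.

def greedy_expert_deletion(model, num_experts: int = 8):
    remaining = list(range(num_experts))
    deletion_order = []
    while len(remaining) > 1:
        scores = {}
        for candidate in remaining:
            handles = disable_expert(model, candidate)  # temporarily drop this expert
            scores[candidate] = evaluate_mmlu(model)    # score the resulting sub-model
            for h in handles:
                h.remove()                              # restore the expert
        best = max(scores, key=scores.get)              # keep the least damaging deletion
        disable_expert(model, best)                     # make that deletion permanent
        deletion_order.append(best)
        remaining.remove(best)
    deletion_order.extend(remaining)                    # the last surviving expert
    return deletion_order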
Sub-model   Score after removing each candidate expert (columns follow the greedy deletion order; "—" marks an expert already removed)
7x7b    0.5937   0.5813   0.5873   0.5748   0.5790   0.5564   0.5179   0.0040
6x7b    —        0.5448   0.5422   0.5417   0.5389   0.5359   0.4671   0.0296
5x7b    —        —        0.4920   0.4827   0.4762   0.4674   0.3490   0.0004
4x7b    —        —        —        0.4178   0.4138   0.3988   0.2918   0.0002
3x7b    —        —        —        —        0.3553   0.3288   0.2723   0.2524
2x7b    —        —        —        —        —        0.2760   0.2624   0.2510
1x7b    —        —        —        —        —        —        0.2408   0.0028

Experiment 5: Was the Model Built from Early-Stage Checkpoints?
Cosine similarity of attention QKV matrices between Mixtral‑8x7B and Mistral‑7B was high, suggesting the attention layers were copied from Mistral. FFN similarity between each Mixtral expert and Mistral‑7B was around 40 %, indicating that the FFN experts were also derived from Mistral but further diverged. Expert 3 showed the lowest similarity, hinting at a unique role.
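A minimal sketch of the comparison, assuming the Hugging Face module layouts for both checkpoints; the layer and projection names below are assumptions to be adapted to the actual state dicts.

import torch
import torch.nn.functional as F

def weight_cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    # Cosine similarity between two weight matrices, treated as flat vectors.
    return F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item()

# Example: compare the layer-0 query projections of the two checkpoints.
# mixtral_q = mixtral.model.layers[0].self_attn.q_proj.weight
# mistral_q = mistral.model.layers[0].self_attn.q_proj.weight
# print(weight_cosine(mixtral_q, mistral_q))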
Hypothesis: Mixtral‑8x7B was likely constructed by taking early‑stage checkpoints of Mistral‑7B, copying the attention layers directly, replicating the FFN eight times, adding a gating layer, and then continuing pre‑training. This explains the observed similarity patterns and the model’s fragility when certain experts are removed.
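Under that hypothesis, the construction step might look roughly like the sketch below. MixtralSparseMoeBlock is the simplified block defined earlier, and the dense layer's attribute name (mlp) is an assumption about the Mistral-7B layout; this is an illustration, not the actual recipe.

import torch.nn as nn

def upcycle_layer(dense_layer, config):
    # Hypothesized recipe: keep attention untouched, clone the dense FFN into 8 experts,
    # and add a freshly initialized gate to be trained during continued pre-training.
    moe = MixtralSparseMoeBlock(config)
    for expert in moe.experts:
        expert.load_state_dict(dense_layer.mlp.state_dict())  # replicate the dense FFN weights
    nn.init.normal_(moe.gate.weight, mean=0.0, std=0.02)      # gate has no pre-trained counterpart
    dense_layer.mlp = moe                                      # attention weights are left as-is
    return dense_layer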
Conclusion
The analysis introduced Mixtral-8x7B's MoE architecture, demonstrated that training uses top-2 gating, revealed uneven expert contributions (especially the critical role of expert 3), showed that reducing the number of experts degrades performance, and provided evidence consistent with the hypothesis that the model was built from early-stage Mistral-7B checkpoints. The findings suggest avenues for future fine-tuning of sub-models with fewer experts.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.