Demystifying Mixture of Experts: How MoE Boosts LLMs and Vision Models
This article explains the Mixture of Experts (MoE) architecture, detailing experts, routers, dense vs. sparse layers, load‑balancing strategies such as KeepTopK, auxiliary loss, capacity constraints, the Switch Transformer simplification, and how MoE is applied to both language and vision models, illustrated with concrete examples and parameter counts.
Mixture of Experts (MoE) Overview
Mixture of Experts is a modular architecture for large neural networks, especially Transformers, that replaces the dense feed‑forward network (FFN) in each layer with a set of experts (independent FFNs) and a router (gate network) that decides which experts process each input token. Only a small subset of experts is activated per token, reducing compute while increasing model capacity.
Expert definition
In a standard Transformer, each layer contains a dense FFN that applies the same weight matrix to every token. In an MoE layer the dense FFN is split into n experts, each with its own parameters. During a forward pass only k experts (often k = 1) are selected for each token, so the layer becomes sparse. The total number of parameters grows with the number of experts, but the runtime cost grows only with the number of active experts.
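To make the parameter‑versus‑compute trade‑off concrete, here is a minimal PyTorch sketch. The dimensions (d_model = 512, d_ff = 2048, 8 experts) are illustrative assumptions, not values taken from any particular model:

```python
import torch.nn as nn

d_model, d_ff, num_experts = 512, 2048, 8   # toy sizes, chosen for illustration

# A single "expert" is just an ordinary position-wise FFN.
def make_expert() -> nn.Module:
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))

dense_ffn   = make_expert()                                              # standard Transformer FFN
moe_experts = nn.ModuleList(make_expert() for _ in range(num_experts))   # the MoE layer's experts

count = lambda m: sum(p.numel() for p in m.parameters())
print(f"dense FFN parameters : {count(dense_ffn):,}")    # ~2.1 M
print(f"MoE layer parameters : {count(moe_experts):,}")  # ~16.8 M stored...
# ...but with k = 1 routing, only ~2.1 M of them are used for any given token.
```

Storage grows eight‑fold while the per‑token compute stays roughly that of the single dense FFN, plus the small router.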
Routing mechanism
The router is itself a lightweight FFN that maps a token representation x to one logit per expert:

logits = x \times W_{router}

For each token only the k largest logits are kept (top‑k routing); the rest are set to -\infty, and a softmax turns the result into a probability distribution over the selected experts:

g = softmax(logits)

The selected experts receive the token, and their outputs are weighted by the corresponding gate values g_i and summed:
output = \sum_{i \in top\_k} g_i \times Expert_i(x)
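The following PyTorch sketch walks through this routing step end to end. The sizes are toy values, each expert is reduced to a single linear layer to keep the example short, and the per‑expert dispatch loop is written for clarity rather than efficiency:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, num_experts, k, num_tokens = 16, 4, 2, 5    # toy sizes

experts = [torch.nn.Linear(d_model, d_model) for _ in range(num_experts)]  # stand-in FFNs
router  = torch.nn.Linear(d_model, num_experts, bias=False)                # gate network

x = torch.randn(num_tokens, d_model)                 # one row per token
logits = router(x)                                   # logits = x @ W_router

# Keep the top-k logits per token, set the rest to -inf, then softmax:
topk_vals, topk_idx = logits.topk(k, dim=-1)
masked = torch.full_like(logits, float("-inf")).scatter(-1, topk_idx, topk_vals)
gates = F.softmax(masked, dim=-1)                    # non-selected experts get probability 0

# output = sum over the selected experts of g_i * Expert_i(x)
output = torch.zeros_like(x)
for e, expert in enumerate(experts):
    routed = (gates[:, e] > 0).nonzero(as_tuple=True)[0]   # tokens sent to expert e
    if routed.numel() > 0:
        output[routed] += gates[routed, e].unsqueeze(-1) * expert(x[routed])

print(output.shape)   # torch.Size([5, 16])
```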
Load‑balancing strategies
Without additional constraints the router may repeatedly select the same experts, causing under‑utilisation. Two common techniques are used:
KeepTopK: adds trainable Gaussian noise to the logits and forces all non‑selected experts' logits to -\infty, guaranteeing that only the top‑k experts receive non‑zero probability.
Auxiliary (load‑balancing) loss: encourages uniform expert utilisation. For a batch, the router values per expert are summed to obtain importance scores S_i. The coefficient of variation (CV) is computed as CV = std(S) / mean(S). The auxiliary loss penalises a high CV, e.g. L_{bal} = \lambda \times CV^2, where \lambda is a hyper‑parameter that balances this term against the main task loss (see the sketch below).
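A minimal PyTorch sketch of this auxiliary loss; the gate shapes and the λ = 0.01 default are illustrative assumptions only:

```python
import torch

def load_balancing_loss(gates: torch.Tensor, lam: float = 0.01) -> torch.Tensor:
    """CV-based auxiliary loss. `gates` is the [num_tokens, num_experts] router output."""
    importance = gates.sum(dim=0)                # S_i: summed router weight per expert
    cv = importance.std() / importance.mean()    # coefficient of variation
    return lam * cv ** 2                         # L_bal = lambda * CV^2

# A perfectly balanced batch gives zero loss; a skewed one is penalised.
balanced = torch.full((8, 4), 0.25)
skewed   = torch.tensor([[1.0, 0.0, 0.0, 0.0]]).repeat(8, 1)
print(load_balancing_loss(balanced).item())   # 0.0
print(load_balancing_loss(skewed).item())     # 0.04
```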
Expert capacity and token overflow
Each expert has a maximum number of tokens it can process, defined by a capacity factor c_f:
capacity = c_f \times \frac{batch\_size \times seq\_len}{num\_experts}

If an expert reaches its capacity, excess tokens are routed to other experts. When all experts are full, the overflow tokens are sent to the next layer (or dropped), preventing any single expert from dominating training.
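In code the capacity computation is a one‑liner; the batch size, sequence length, and capacity factor below are illustrative values only:

```python
def expert_capacity(batch_size: int, seq_len: int, num_experts: int, capacity_factor: float) -> int:
    """Maximum number of tokens a single expert may accept in one batch."""
    return int(capacity_factor * batch_size * seq_len / num_experts)

# 8 sequences of 512 tokens spread over 8 experts, with c_f = 1.25:
print(expert_capacity(8, 512, 8, 1.25))   # 640 token slots per expert
```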
Switch Transformer: a simplified MoE
The Switch Transformer replaces the standard FFN with a sparse MoE layer that uses top‑1 routing (each token is sent to exactly one expert). It introduces a capacity factor to control per‑expert token limits and a simplified auxiliary loss that compares, for each expert i, the fraction of tokens dispatched to it, f_i, with the mean router probability it receives, p_i:

L_{switch} = \alpha \times N \times \sum_{i=1}^{N} f_i \times p_i

where N is the number of experts and \alpha is a tunable weight. Because tokens are dispatched where the router probability is highest, f and p move together, and the loss is small only when both are spread roughly uniformly across the experts. It is cheaper to compute than the full CV‑based loss and works well with the Switch architecture.
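A minimal sketch of this loss, assuming top‑1 routing and the f_i / p_i definitions above (the α default is a placeholder, not a recommended value):

```python
import torch

def switch_aux_loss(router_probs: torch.Tensor, expert_idx: torch.Tensor, alpha: float = 0.01) -> torch.Tensor:
    """alpha * N * sum_i f_i * p_i for one batch of tokens.

    router_probs : [num_tokens, num_experts] softmax output of the router
    expert_idx   : [num_tokens] expert each token was dispatched to (top-1 routing)
    """
    num_experts = router_probs.shape[-1]
    # f_i: fraction of tokens dispatched to expert i
    f = torch.bincount(expert_idx, minlength=num_experts).float() / expert_idx.numel()
    # p_i: mean router probability assigned to expert i
    p = router_probs.mean(dim=0)
    return alpha * num_experts * torch.sum(f * p)
```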
Applying MoE to Vision Transformers (Vision‑MoE)
Vision Transformers treat image patches as tokens. By swapping the dense FFN in each encoder block with a sparse MoE, the same routing and capacity mechanisms can be used for visual data. Because many patches are of low importance, Vision‑MoE often employs priority routing:
Each patch receives an importance score from a lightweight scorer.
Patches with higher scores are preferentially routed to experts, while low‑score patches may overflow to the next layer.
This allows a relatively small ViT model to scale to billions of parameters without a proportional increase in compute.
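A simplified, loop‑based sketch of priority routing, assuming each patch already has an importance score and a chosen expert (real implementations vectorise this step and pass skipped patches along via the residual connection):

```python
import torch

def priority_route(scores: torch.Tensor, expert_idx: torch.Tensor, capacity: int) -> torch.Tensor:
    """Admit patches to their chosen expert in order of importance.

    scores     : [num_patches] importance score per patch
    expert_idx : [num_patches] expert chosen for each patch
    capacity   : maximum number of patches each expert may accept
    Returns a boolean mask of patches that got a slot; the rest overflow.
    """
    used = torch.zeros(int(expert_idx.max()) + 1, dtype=torch.long)   # slots used per expert
    keep = torch.zeros_like(scores, dtype=torch.bool)
    for patch in scores.argsort(descending=True).tolist():            # most important first
        e = int(expert_idx[patch])
        if used[e] < capacity:
            used[e] += 1
            keep[patch] = True
    return keep

# Four patches, two experts, capacity 1: only the highest-scoring patch per expert fits.
mask = priority_route(torch.tensor([0.9, 0.2, 0.8, 0.1]), torch.tensor([0, 0, 1, 1]), capacity=1)
print(mask)   # tensor([ True, False,  True, False])
```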
Active vs. sparse parameters (Mixtral 8×7B example)
MoE models store parameters for all experts (sparse parameters) but only a subset is active during inference. In Mixtral 8×7B:
Each expert contains ~5.6 billion parameters.
The model has 8 experts per MoE layer, giving a total of ≈ 46.8 B sparse parameters (including shared transformer weights).
During inference the router selects 2 experts per token (top‑2 routing), so roughly 11.2 B of expert parameters are active per token, plus the shared transformer weights.
Consequently the full model must be loaded into memory, but the runtime cost is comparable to a dense model with far fewer parameters.
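The arithmetic behind these figures, using only the numbers quoted above (the ≈ 2 B of shared weights is simply the difference between the stored total and the eight experts):

```python
params_per_expert = 5.6e9            # ≈ 5.6 B per expert (all MoE layers combined)
num_experts, active_experts = 8, 2   # 8 experts stored, 2 chosen per token

expert_params_stored = num_experts * params_per_expert       # 44.8 B
expert_params_active = active_experts * params_per_expert    # 11.2 B per token
shared_params = 46.8e9 - expert_params_stored                # ≈ 2 B of non-expert weights

print(f"stored (sparse) params : {(expert_params_stored + shared_params) / 1e9:.1f} B")  # ≈ 46.8 B
print(f"active params per token: {(expert_params_active + shared_params) / 1e9:.1f} B")  # ≈ 13.2 B
```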
References
https://newsletter.maartengrootendorst.com/p/a-visual-guide-to-mixture-of-experts