MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs
This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.
Mixture of Experts (MoE) Architecture
MoE keeps the attention layers shared and replaces each feed‑forward block with a set of parallel expert networks. During inference a learned router selects a small subset of experts (often just one or two) for each token, activating only those while the rest remain idle; over the course of training, each expert comes to specialise on the kinds of tokens the router sends it. This sparse activation enables the model to contain hundreds of billions of parameters while keeping the per‑token compute comparable to a dense model of a few billion parameters.
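The routing step above can be sketched in a few lines. This is a minimal NumPy illustration, not any specific model's implementation; `router_weights` and `expert_fns` are hypothetical placeholders for the learned router matrix and the expert feed‑forward networks.

```python
import numpy as np

def top_k_routing(token_embeddings, router_weights, expert_fns, k=2):
    """Sketch of sparse top-k expert routing for a batch of tokens.

    token_embeddings: (n_tokens, d_model)
    router_weights:   (d_model, n_experts) -- the learned router
    expert_fns:       list of callables, one per expert FFN
    """
    logits = token_embeddings @ router_weights            # (n_tokens, n_experts)
    # softmax over experts gives the routing distribution
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # indices of the k highest-probability experts per token
    top_k = np.argsort(probs, axis=-1)[:, -k:]
    out = np.zeros_like(token_embeddings)
    for t in range(token_embeddings.shape[0]):
        # renormalise the selected gate values so they sum to 1
        gates = probs[t, top_k[t]]
        gates /= gates.sum()
        # only the k selected experts run; all others stay idle
        for gate, e in zip(gates, top_k[t]):
            out[t] += gate * expert_fns[e](token_embeddings[t])
    return out, top_k
```

With `k=2` and, say, 64 experts, each token pays for two expert FFN evaluations regardless of how many experts the model stores, which is the source of MoE's compute savings.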
Mixture of Recursion (MoR) Architecture
MoR uses a single lightweight Transformer block that is applied repeatedly to the same token sequence. Each token decides autonomously how many recursion steps it needs: simple tokens exit after a few iterations, whereas complex tokens undergo more passes. The same weights are reused across all iterations, so the model’s total parameter count stays modest (e.g., ~118 M) while depth varies per token.
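The per‑token recursion can be sketched as a loop over a single shared block with a boolean "still active" mask. This is a simplified illustration under assumed interfaces (`shared_block` and `halt_fn` are hypothetical callables), not the published MoR implementation.

```python
import numpy as np

def mor_forward(tokens, shared_block, halt_fn, max_depth=8):
    """Per-token recursive depth: the SAME block is applied repeatedly,
    and each token exits as soon as halt_fn flags it as done.

    tokens:       (n_tokens, d_model) initial representations
    shared_block: callable applied at every step (weight reuse)
    halt_fn:      callable returning a (n_tokens,) bool array of 'done' flags
    """
    h = tokens.copy()
    active = np.ones(len(tokens), dtype=bool)
    depths = np.zeros(len(tokens), dtype=int)
    for _ in range(max_depth):
        if not active.any():
            break
        h[active] = shared_block(h[active])  # one set of weights, every iteration
        depths[active] += 1
        active &= ~halt_fn(h)                # halted tokens stop recursing
    return h, depths
</tokens```

Because the same weights serve every iteration, total parameter count is independent of `max_depth`; only the per‑token compute grows with recursion depth.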
Design Comparison
MoE expands model capacity by adding many parallel experts (width) and routing tokens to a subset of them. MoR expands capacity by increasing the effective depth for difficult tokens while keeping a narrow, shared computation graph. Consequently, MoE graphs are highly branched and require sparse‑tensor support, whereas MoR graphs are linear and easier to optimise.
Parameter Efficiency
In practice a MoE model that behaves like a 1.3 B‑parameter dense model may contain >100 B parameters across all experts, incurring high storage and training costs. By contrast, MoR reuses a single block, so a 118 M‑parameter MoR model can outperform a 300 M‑parameter dense Transformer on few‑shot benchmarks, demonstrating superior parameter efficiency.
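The storage-versus-compute gap can be made concrete with simple parameter arithmetic. The numbers below are illustrative placeholders, not drawn from any specific model; only FFN weight matrices are counted (biases, attention, and embeddings ignored).

```python
def moe_ffn_params(d_model, d_ff, n_experts, k):
    """Per-layer FFN parameter counts for a sketched MoE layer.

    Each expert has an up-projection (d_model x d_ff) and a
    down-projection (d_ff x d_model); k experts run per token.
    """
    per_expert = 2 * d_model * d_ff
    total = n_experts * per_expert   # parameters that must be stored
    active = k * per_expert          # parameters actually touched per token
    return total, active

# illustrative configuration (hypothetical, not a real model)
total, active = moe_ffn_params(d_model=2048, d_ff=8192, n_experts=64, k=2)
```

With 64 experts and top‑2 routing, the layer stores 32× more parameters than any single token ever uses; a MoR layer, by contrast, stores exactly the parameters it computes with, however many recursion steps a token takes.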
Inference Latency
MoE’s sparse activation leads to fragmented memory accesses, load‑imbalance across devices, and cross‑GPU communication for routing, which can erase theoretical speed gains on modest hardware. MoR avoids routing and inter‑device coordination; each token follows a predictable memory access pattern and can run efficiently on mid‑range GPUs.
Training Stability
MoE training suffers from expert collapse: some experts receive few gradients and fail to learn. Mitigations include auxiliary load‑balancing losses, entropy regularisation, and careful expert‑capacity allocation, which increase training complexity. MoR eliminates expert imbalance because only one set of weights is trained, resulting in more stable convergence. The main tuning challenge for MoR is selecting the optimal recursion depth per token.
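One of the load‑balancing losses mentioned above can be sketched concretely. The form below follows the widely used Switch‑Transformer‑style auxiliary loss (fraction of tokens per expert times mean router probability per expert); this is an illustrative sketch, not the only formulation in use.

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, n_experts):
    """Auxiliary load-balancing loss: n_experts * sum_i f_i * P_i.

    router_probs:      (n_tokens, n_experts) softmax outputs of the router
    expert_assignment: (n_tokens,) index of the expert each token was routed to
    f_i is the fraction of tokens routed to expert i, P_i the mean router
    probability assigned to expert i. The loss is minimised (value 1.0)
    when routing is perfectly uniform, penalising expert collapse.
    """
    f = np.bincount(expert_assignment, minlength=n_experts) / len(expert_assignment)
    P = router_probs.mean(axis=0)
    return n_experts * float(np.dot(f, P))
```

Adding this term (scaled by a small coefficient) to the language‑modelling loss pushes gradient flow toward under‑used experts; MoR needs no such term because there is only one set of weights to balance.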
Routing Mechanisms
MoE employs a learned router that predicts a probability distribution over experts from token embeddings; the router is trained end‑to‑end and must be balanced to avoid over‑use of a few experts. MoR’s routing is lightweight: either a token‑level decision at each recursion step (continue or exit) or a fixed depth assigned at the start based on the token’s initial representation. This routing focuses on when to stop rather than which module to use.
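The second MoR variant, assigning a fixed recursion budget up front, is especially cheap to sketch. The depth router below is a hypothetical single linear layer for illustration; real implementations may differ.

```python
import numpy as np

def assign_depths(token_embeddings, depth_router, max_depth=4):
    """Sketch of 'fixed depth at the start' routing: a tiny router maps each
    token's initial representation to one of max_depth recursion budgets.

    token_embeddings: (n_tokens, d_model)
    depth_router:     (d_model, max_depth) learned projection
    Returns a (n_tokens,) array of depths in {1, ..., max_depth}.
    """
    logits = token_embeddings @ depth_router
    return logits.argmax(axis=-1) + 1
```

Note the contrast with the MoE router: the output here is a scalar budget per token (when to stop), not a distribution over modules (which expert to use), so there is no balance constraint to enforce.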
Hardware Adaptation and Deployment
MoE requires high‑speed GPU interconnects, sparse‑tensor kernels, and custom framework extensions (e.g., DeepSpeed MoE, GShard). It is therefore suited to large‑scale clusters owned by major AI labs. MoR builds on standard Transformer primitives and adds only a loop‑control mechanism, making it deployable with vanilla PyTorch or JAX on single‑GPU servers, edge devices, or cloud instances without special hardware.
Typical Application Scenarios
MoE: Training massive multi‑task or multilingual models where the primary goal is maximal capacity and the organization can afford extensive engineering effort and high‑end hardware.
MoR: Scenarios that prioritise low inference latency, modest memory footprint, and easy integration—such as model fine‑tuning, few‑shot learning, on‑device inference, or services running on commodity GPUs.
Conclusion
MoE and MoR represent two orthogonal scaling strategies for large language models. MoE achieves capacity by adding many specialised experts with sparse activation, at the cost of storage, routing complexity, and hardware requirements. MoR achieves efficiency by reusing a single block with dynamic depth, offering stable latency and simpler deployment. The appropriate choice depends on the target application’s performance goals, resource constraints, and engineering capabilities.
Source: DeepHub IMBA. About 3,600 words; suggested reading time: 10 minutes.
Data Party THU: the official platform of the Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.