
MoE vs MoR: Deep Dive into Expert and Recursive Mixture Architectures for LLMs

This article provides a comprehensive technical comparison between Mixture of Experts (MoE) and the newly proposed Mixture of Recursion (MoR) architectures, covering design principles, parameter efficiency, inference latency, training stability, routing mechanisms, hardware deployment considerations, and suitable application scenarios.


Mixture of Experts (MoE) Architecture

MoE splits a language model into a shared backbone and a set of expert modules, each expert a small feed‑forward network that comes to specialise on a slice of the input distribution through learned routing. During inference a learned router selects a small number of experts for each token (commonly one or two, sometimes more), activating only those while the rest remain idle. This sparse activation lets the model hold hundreds of billions of parameters while keeping per‑token compute comparable to that of a dense model with only a few billion parameters.
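To make the routing step concrete, here is a minimal sketch of a sparsely activated layer in PyTorch. The layer sizes, the top‑2 choice, and all names are illustrative assumptions, not details of any particular MoE implementation.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative sparse MoE layer: route each token to its top-2 experts."""
    def __init__(self, d_model=256, d_ff=512, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)          # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                                     # x: (batch, seq, d_model)
        logits = self.router(x)                               # (batch, seq, n_experts)
        weights, idx = logits.softmax(dim=-1).topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        # Dense loop over experts for clarity; real systems use sparse dispatch kernels.
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = (idx[..., k] == e)                     # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[..., k][mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Even in this toy form the pain points are visible: the per‑expert gather/scatter is exactly the fragmented, sparse memory access that production systems replace with dedicated dispatch kernels.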

Mixture of Recursion (MoR) Architecture

MoR uses a single lightweight Transformer block that is applied repeatedly to the same token sequence. Each token decides autonomously how many recursion steps it needs: simple tokens exit after a few iterations, whereas complex tokens undergo more passes. The same weights are reused across all iterations, so the model’s total parameter count stays modest (e.g., ~118 M) while depth varies per token.
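A rough sketch of the weight‑reuse idea, assuming the variant in which each token is assigned its recursion count up front from its initial representation; the block size, maximum depth, and all names are illustrative rather than taken from the published MoR design.

```python
import torch
import torch.nn as nn

class TinyMoRBlock(nn.Module):
    """Illustrative MoR-style model: one shared block applied a per-token number of times."""
    def __init__(self, d_model=256, n_heads=4, max_depth=4):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True)
        self.depth_head = nn.Linear(d_model, max_depth)    # predicts how many passes a token gets
        self.max_depth = max_depth

    def forward(self, x):                                   # x: (batch, seq, d_model)
        depths = self.depth_head(x).argmax(dim=-1) + 1      # (batch, seq), values in 1..max_depth
        h = x
        for step in range(1, self.max_depth + 1):
            updated = self.shared_block(h)                   # same weights reused every iteration
            active = (depths >= step).unsqueeze(-1)          # tokens that still need this pass
            h = torch.where(active, updated, h)              # exited tokens keep their state
        return h
```

The hard argmax is only for readability; an actual training setup needs a differentiable surrogate for the depth decision, and a real implementation would skip computation for tokens that have already exited rather than recomputing and discarding it.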

Design Comparison

MoE expands model capacity by adding many parallel experts (width) and routing tokens to a subset of them. MoR expands capacity by increasing the effective depth for difficult tokens while keeping a narrow, shared computation graph. Consequently, MoE graphs are highly branched and require sparse‑tensor support, whereas MoR graphs are linear and easier to optimise.

[Figure: Architecture comparison]

Parameter Efficiency

In practice a MoE model that behaves like a 1.3 B‑parameter dense model may contain >100 B parameters across all experts, incurring high storage and training costs. By contrast, MoR reuses a single block, so a 118 M‑parameter MoR model can outperform a 300 M‑parameter dense Transformer on few‑shot benchmarks, demonstrating superior parameter efficiency.
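A back‑of‑the‑envelope comparison makes the storage‑versus‑compute gap concrete. The configuration below is entirely hypothetical (the sizes are not taken from the models discussed above); it only illustrates how stored and activated parameter counts diverge.

```python
# Hypothetical configs; numbers are illustrative, not measurements of real models.
d_model, d_ff, n_layers = 2048, 8192, 24
ffn_params = 2 * d_model * d_ff                      # one expert / one dense FFN (weights only)

# MoE: 64 experts per layer are stored, but only 2 are activated per token.
n_experts, top_k = 64, 2
moe_stored = n_layers * n_experts * ffn_params       # ~51.5 B FFN parameters kept in memory
moe_active = n_layers * top_k * ffn_params           # ~1.6 B FFN parameters used per token

# MoR: one shared block reused up to 4 times; stored parameters don't grow with depth.
mor_stored = ffn_params                              # ~33.6 M (single shared FFN)
mor_active = 4 * ffn_params                          # compute grows with recursion depth, storage doesn't

print(f"MoE stores {moe_stored/1e9:.1f}B FFN params, activates {moe_active/1e9:.1f}B per token")
print(f"MoR stores {mor_stored/1e6:.1f}M FFN params, compute ~{mor_active/1e6:.1f}M per token")
```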

Inference Latency

MoE’s sparse activation leads to fragmented memory accesses, load‑imbalance across devices, and cross‑GPU communication for routing, which can erase theoretical speed gains on modest hardware. MoR avoids routing and inter‑device coordination; each token follows a predictable memory access pattern and can run efficiently on mid‑range GPUs.

Training Stability

MoE training suffers from expert collapse: some experts receive few gradients and fail to learn. Mitigations include auxiliary load‑balancing losses, entropy regularisation, and careful expert‑capacity allocation, which increase training complexity. MoR eliminates expert imbalance because only one set of weights is trained, resulting in more stable convergence. The main tuning challenge for MoR is selecting the optimal recursion depth per token.
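A typical load‑balancing term follows the Switch‑Transformer recipe: penalise the product of each expert's routing probability mass and its actual share of dispatched tokens. The sketch below is generic; the tensor shapes and call site are assumptions.

```python
import torch

def load_balancing_loss(router_logits, expert_indices, n_experts):
    """Penalise routers that concentrate tokens on a few experts.

    router_logits:  (n_tokens, n_experts) raw router scores
    expert_indices: (n_tokens,) index of the expert each token was dispatched to
    """
    probs = router_logits.softmax(dim=-1)
    mean_prob = probs.mean(dim=0)                                   # avg routing prob per expert
    # Fraction of tokens actually sent to each expert.
    token_fraction = torch.bincount(expert_indices, minlength=n_experts).float()
    token_fraction = token_fraction / expert_indices.numel()
    # Minimised when both distributions are uniform, i.e. load is perfectly balanced.
    return n_experts * torch.sum(mean_prob * token_fraction)
```

The loss is smallest when both the router's average probabilities and the realised token counts are uniform across experts, which is exactly the condition that prevents expert collapse.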

Routing Mechanisms

MoE employs a learned router that predicts a probability distribution over experts from token embeddings; the router is trained end‑to‑end and must be balanced to avoid over‑use of a few experts. MoR’s routing is lightweight: either a token‑level decision at each recursion step (continue or exit) or a fixed depth assigned at the start based on the token’s initial representation. This routing focuses on when to stop rather than which module to use.
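The per‑step variant reduces to a small halting gate evaluated after every recursion (the MoR sketch earlier used the fixed‑upfront variant). The threshold and shapes here are again illustrative assumptions.

```python
import torch
import torch.nn as nn

class ExitGate(nn.Module):
    """Per-recursion-step halting decision: keep computing vs stop here."""
    def __init__(self, d_model=256, threshold=0.5):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)
        self.threshold = threshold

    def forward(self, h, still_active):
        # h: (batch, seq, d_model); still_active: (batch, seq) bool mask of unfinished tokens
        p_continue = torch.sigmoid(self.gate(h)).squeeze(-1)   # (batch, seq)
        keep_going = (p_continue > self.threshold) & still_active
        return keep_going                                       # tokens set to False exit this step
```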

Hardware Adaptation and Deployment

MoE requires high‑speed GPU interconnects, sparse‑tensor kernels, and custom framework extensions (e.g., DeepSpeed MoE, GShard). It is therefore suited to large‑scale clusters owned by major AI labs. MoR builds on standard Transformer primitives and adds only a loop‑control mechanism, making it deployable with vanilla PyTorch or JAX on single‑GPU servers, edge devices, or cloud instances without special hardware.

Typical Application Scenarios

MoE: Training massive multi‑task or multilingual models where the primary goal is maximal capacity and the organization can afford extensive engineering effort and high‑end hardware.

MoR: Scenarios that prioritise low inference latency, modest memory footprint, and easy integration—such as model fine‑tuning, few‑shot learning, on‑device inference, or services running on commodity GPUs.

Conclusion

MoE and MoR represent two orthogonal scaling strategies for large language models. MoE achieves capacity by adding many specialised experts with sparse activation, at the cost of storage, routing complexity, and hardware requirements. MoR achieves efficiency by reusing a single block with dynamic depth, offering stable latency and simpler deployment. The appropriate choice depends on the target application’s performance goals, resource constraints, and engineering capabilities.

Written by

Data Party THU

Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
