ProMoE: Explicit Routing Breaks the Scaling Bottleneck of Diffusion‑Transformer MoE (ICLR 2026)
ProMoE is a Mixture-of-Experts (MoE) framework for diffusion transformers built around a two-step router with explicit semantic guidance. By addressing the high spatial redundancy and functional heterogeneity of visual tokens, it scales efficiently and outperforms dense models and prior MoE approaches in generation quality, convergence speed, and scaling behavior.
Background and Problem
Mixture-of-Experts (MoE) has dramatically increased the capacity of large language models while keeping inference cost low. However, when MoE is applied to Diffusion Transformers (DiT), the performance gain is minimal. The authors identify two visual-token properties that cause this mismatch: high spatial redundancy (image patches are densely coupled in space, leading to homogeneous expert features) and functional heterogeneity (classifier-free guidance, CFG, splits tokens into conditional and unconditional groups, which standard MoE routing treats uniformly).
To quantify these effects, they sampled 1k intermediate-layer tokens from 110 ImageNet classes, performed k-means clustering, and measured the inter-/intra-class distance ratio: 19.283 for language tokens versus 0.748 for visual tokens, showing that visual tokens form diffuse clusters unlike LLM tokens. They also applied singular-value decomposition to each expert's weight matrix in the MoE layers and computed the average similarity of the top-k singular vectors across experts, finding that explicit routing substantially increases expert diversity.
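To make these diagnostics concrete, here is a minimal sketch of how the two measurements could be reproduced. The clustering setup, function names, and use of scikit-learn and NumPy are illustrative assumptions, not the authors' exact protocol.

```python
# Illustrative sketch (not the paper's exact protocol) of the two diagnostics:
# the inter-/intra-class distance ratio of token embeddings, and the average
# similarity between experts' top-k singular vectors.
import numpy as np
from sklearn.cluster import KMeans

def inter_intra_ratio(tokens: np.ndarray, n_clusters: int) -> float:
    """tokens: (N, D) array of intermediate-layer token embeddings."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(tokens)
    centroids = km.cluster_centers_                            # (K, D)
    # Intra-class: mean distance from each token to its assigned centroid.
    intra = np.linalg.norm(tokens - centroids[km.labels_], axis=1).mean()
    # Inter-class: mean pairwise distance between distinct centroids.
    pair = np.linalg.norm(centroids[:, None] - centroids[None, :], axis=-1)
    inter = pair[np.triu_indices(n_clusters, k=1)].mean()
    return inter / intra  # large values => compact, well-separated clusters

def expert_subspace_similarity(expert_weights: list, k: int = 8) -> float:
    """Average |cosine| similarity between the top-k left singular vectors of
    each pair of expert weight matrices (lower => more diverse experts)."""
    bases = [np.linalg.svd(W, full_matrices=False)[0][:, :k]
             for W in expert_weights]
    sims = [np.abs(bases[i].T @ bases[j]).mean()
            for i in range(len(bases)) for j in range(i + 1, len(bases))]
    return float(np.mean(sims))
```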
ProMoE Design
ProMoE addresses the above issues with a two-step router and routing contrastive learning (RCL):
Conditional Routing: Tokens are first split by functional role. Unconditional image patches are hard-routed to dedicated Unconditional Experts, while conditional tokens proceed to the next stage.
Prototypical Routing: A set of learnable prototypes is introduced, each linked to a specific expert. For a conditional token, the cosine similarity between its embedding and each prototype is computed; an identity activation (rather than a softmax) turns the similarities into routing scores, and the token is assigned to the expert with the highest score. A sketch of both steps follows below.
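Putting the two steps together, a minimal PyTorch-style router might look like the following. The tensor shapes, the top-1 assignment, and the CFG mask layout are assumptions made for illustration, not the released implementation.

```python
# Hedged sketch of ProMoE's two-step routing: hard-route unconditional tokens
# to a dedicated expert, then route conditional tokens by prototype similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStepRouter(nn.Module):
    def __init__(self, dim: int, n_experts: int):
        super().__init__()
        # One learnable prototype per conditional expert.
        self.prototypes = nn.Parameter(torch.randn(n_experts, dim))

    def forward(self, tokens: torch.Tensor, is_uncond: torch.Tensor):
        """tokens: (N, D); is_uncond: (N,) bool mask from the CFG batch split."""
        # Step 1 (conditional routing): unconditional tokens bypass the learned
        # router and are hard-routed to the unconditional expert(s).
        cond = tokens[~is_uncond]
        # Step 2 (prototypical routing): cosine similarity to each prototype,
        # with an identity activation (no softmax) as the routing score.
        scores = F.normalize(cond, dim=-1) @ F.normalize(self.prototypes, dim=-1).T
        expert_idx = scores.argmax(dim=-1)  # top-1 expert per conditional token
        return expert_idx, scores
```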
RCL provides explicit semantic guidance without manual labels. It consists of two forces (a loss sketch follows the list):
Pull: pulls each prototype toward the centroid of the tokens it routes, ensuring semantic cohesion within an expert.
Push: pushes a prototype away from the centroids of tokens routed to other experts, encouraging inter-expert diversity and acting as a flexible load-balancing mechanism.
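The pull and push forces suggest a contrastive objective along these lines. The cosine-similarity form, the detached centroids, and the push margin are assumptions made for the sketch; the summary above specifies only the two forces, so the paper's exact formulation may differ.

```python
# Illustrative pull/push objective for RCL (form and margin are assumptions).
import torch
import torch.nn.functional as F

def rcl_loss(prototypes: torch.Tensor, tokens: torch.Tensor,
             assign: torch.Tensor, margin: float = 0.5) -> torch.Tensor:
    """prototypes: (K, D); tokens: (N, D); assign: (N,) expert index per token."""
    K = prototypes.shape[0]
    # Centroid of the tokens each expert receives (detached: targets, not params).
    centroids = torch.stack([
        tokens[assign == k].mean(dim=0) if (assign == k).any()
        else torch.zeros_like(prototypes[0])
        for k in range(K)
    ]).detach()
    # (K, K) prototype-to-centroid cosine similarity matrix.
    sim = F.cosine_similarity(prototypes[:, None, :], centroids[None, :, :], dim=-1)
    pull = (1.0 - sim.diagonal()).mean()          # attract each prototype to its own centroid
    off_diag = ~torch.eye(K, dtype=torch.bool, device=sim.device)
    push = F.relu(sim[off_diag] - margin).mean()  # repel other experts' centroids
    return pull + push
```

Because the push term penalizes prototypes that sit too close to other experts' token centroids, it spreads assignments across experts, which is why the authors describe it as a flexible load-balancing mechanism.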
The overall architecture (see Fig. 2) routes unconditional tokens directly, then distributes conditional tokens via prototype similarity, with RCL shaping the prototype space during training. The implementation is released at https://github.com/ali-vilab/ProMoE.
Experimental Evaluation
Model configurations span several scales (Base → XL) and expert counts (4 → 16). ProMoE-L-Flow, with only 1.063 B total parameters, surpasses the larger Dense-DiT-XL-Flow while activating fewer parameters.
Comparison with dense models: Across all sizes, ProMoE consistently outperforms the corresponding dense DiT baselines. Notably, ProMoE-L-Flow beats Dense-DiT-XL-Flow, demonstrating superior efficiency.
Comparison with SOTA MoE models: ProMoE exceeds DiffMoE (1.846 B, 16 experts) while using only 1.063 B parameters, confirming the advantage of explicit routing.
Text-to-image benchmarks: On the GenEval suite, ProMoE outperforms the standard Token-Choice MoE on every sub-task, indicating strong generalization.
Convergence analysis: Training curves show that ProMoE converges noticeably faster than both dense and existing MoE models.
Scaling experiments: As model size grows from Base to XL and expert count rises from 4 to 16, generation quality improves steadily, evidencing robust scaling behavior.
Ablation studies (Fig. 4) isolate the contributions of conditional routing, prototypical routing, and RCL, confirming that each component adds measurable performance gains.
Conclusion
By dissecting the differences between language and visual tokens, the authors propose ProMoE—a MoE framework with explicit routing guidance that combines conditional routing, prototype‑based routing, and routing contrastive learning. ProMoE achieves higher generation quality, faster convergence, and better scalability than dense models and prior visual‑MoE methods, offering an open‑source blueprint for efficiently scaling large diffusion models. The full paper, “Routing Matters in MoE: Scaling Diffusion Transformers with Explicit Routing Guidance” (arXiv:2510.24711), provides additional technical details.