Breaking the Echo Chamber: MP‑MoE Introduces Ensemble‑Pruning for Diverse Experts

The paper presents MP‑MoE, a Mixture‑of‑Experts architecture that replaces score‑only top‑k expert selection during training with Mahalanobis‑based ensemble pruning. It encourages expert diversity explicitly through an expert co‑occurrence matrix and solves the selection problem with a greedy algorithm that uses incremental Cholesky updates. The reported outcome is higher downstream performance with small training overhead and no added inference cost.

Machine Heart

MP‑MoE Overview

Mixture‑of‑Experts (MoE) has become a key scaling technique for large models, but standard top‑k routing often leads to an "echo chamber" where high‑scoring experts are repeatedly selected, causing their representations to converge. The authors propose Mahalanobis‑Pruned MoE (MP‑MoE) to address this issue.

How MP‑MoE Works

Step 1: Mahalanobis Ensemble Routing – Instead of selecting the top‑k experts solely by gating scores, MP‑MoE formulates expert selection as an ensemble‑pruning problem. It maximizes the Mahalanobis norm of the chosen expert subset, thereby rewarding high‑confidence experts while penalizing redundancy.
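The summary does not give the exact objective, so the following is only a sketch under an assumed form: the score of a subset S is taken to be s_S^T Σ_SS^{-1} s_S, where s holds the gating scores and Σ is an expert covariance matrix. The names mahalanobis_objective and best_subset_bruteforce, the toy scores, and the toy covariance are all illustrative and not taken from the paper. The brute-force search is only feasible for tiny expert counts; it is here to show how this kind of objective can prefer a less redundant pair over the two highest-scoring experts.

```python
from itertools import combinations

import numpy as np


def mahalanobis_objective(scores, cov, subset):
    """Assumed objective: s_S^T (Sigma_SS)^-1 s_S for the experts in `subset`."""
    idx = np.array(subset)
    s = scores[idx]
    sigma = cov[np.ix_(idx, idx)]
    return float(s @ np.linalg.solve(sigma, s))


def best_subset_bruteforce(scores, cov, k):
    """Exhaustively search all size-k subsets (toy sizes only)."""
    n = len(scores)
    return max(
        combinations(range(n), k),
        key=lambda subset: mahalanobis_objective(scores, cov, subset),
    )


# Toy setup: experts 0 and 1 score highest but are strongly correlated;
# expert 2 scores lower and is uncorrelated with both.
scores = np.array([0.90, 0.85, 0.60])
cov = np.array([
    [1.00, 0.95, 0.00],
    [0.95, 1.00, 0.00],
    [0.00, 0.00, 1.00],
])

top_k = tuple(sorted(np.argsort(-scores)[:2].tolist()))  # score-only choice
diverse = best_subset_bruteforce(scores, cov, k=2)       # assumed objective
print(top_k, diverse)  # (0, 1) (0, 2)
```

In this toy case the redundant pair (0, 1) scores about 0.81 under the assumed objective, while the pair (0, 2) scores 1.17, so the less redundant pair is chosen.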

Step 2: Expert Co‑occurrence Matrix – To measure similarity without activating all experts, the method treats the selection of each expert as a Bernoulli variable and estimates a covariance matrix from the co‑occurrence counts of experts during training. This matrix captures how often pairs of experts are jointly activated, providing a cheap proxy for expert similarity.
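A minimal sketch of the estimate described above, assuming each token's routing decision is logged as a list of selected expert indices. The function name cooccurrence_covariance and the ridge parameter are hypothetical. The ridge term is an assumption added here for invertibility: when every token selects exactly k experts, the indicator variables sum to a constant, so the raw covariance is singular. How the paper regularizes or updates the matrix during training is not stated in this summary.

```python
import numpy as np


def cooccurrence_covariance(selections, n_experts, ridge=1e-3):
    """Estimate an expert covariance matrix from routing decisions.

    selections: integer array of shape (n_tokens, k); row t lists the experts
                chosen for token t.
    Returns (cov, counts), where counts[i, j] is the number of tokens that
    activated experts i and j together.
    """
    selections = np.asarray(selections)
    n_tokens = selections.shape[0]

    # One Bernoulli indicator per (token, expert): 1 if the expert was selected.
    z = np.zeros((n_tokens, n_experts))
    z[np.arange(n_tokens)[:, None], selections] = 1.0

    counts = z.T @ z                          # pairwise co-occurrence counts
    p = np.diag(counts) / n_tokens            # per-expert selection frequency
    cov = counts / n_tokens - np.outer(p, p)  # E[z_i z_j] - E[z_i] E[z_j]
    return cov + ridge * np.eye(n_experts), counts


# Four tokens, top-2 routing over four experts (illustrative data).
selections = np.array([[0, 1], [0, 1], [2, 3], [0, 2]])
cov, counts = cooccurrence_covariance(selections, n_experts=4, ridge=0.0)
```

Only the routing decisions are needed, not the expert outputs, which is what makes this a cheap proxy: no extra expert has to be activated to update the counts.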

Step 3: Greedy Subset Optimization – Exact subset selection is a combinatorial problem and intractable at scale, so MP‑MoE uses a greedy algorithm: at each step it evaluates the marginal gain of adding each candidate expert, accounting for both its routing score and its redundancy with the experts already selected. Incremental Cholesky updates avoid recomputing a matrix inverse at every step, which keeps the cost of each greedy step low, and the authors state that the procedure comes with theoretical approximation guarantees.
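The greedy step can be sketched as follows, again assuming the objective s_S^T Σ_SS^{-1} s_S, which is not confirmed by this summary. The function greedy_mahalanobis_select is a hypothetical name, and this is a generic incremental-Cholesky greedy routine rather than the authors' implementation. It keeps, for every candidate, a residual score r and a residual variance d2 (the Schur complement with respect to the selected set), so the marginal gain of adding candidate j is r[j]^2 / d2[j] and no matrix inverse is ever formed.

```python
import numpy as np


def greedy_mahalanobis_select(scores, cov, k, eps=1e-10):
    """Greedily pick up to k experts maximizing s_S^T Sigma_SS^-1 s_S.

    Returns (selected_indices, objective_value).
    """
    scores = np.asarray(scores, dtype=float)
    cov = np.asarray(cov, dtype=float)
    n = scores.shape[0]

    V = np.zeros((k, n))      # V[t, j]: t-th Cholesky coefficient of candidate j
    d2 = np.diag(cov).copy()  # residual variance of each candidate
    r = scores.copy()         # residual score of each candidate
    available = np.ones(n, dtype=bool)
    selected, total = [], 0.0

    for t in range(k):
        # Marginal gain of each remaining candidate given the current set.
        gains = np.where(available & (d2 > eps),
                         r ** 2 / np.maximum(d2, eps), -np.inf)
        i = int(np.argmax(gains))
        if not np.isfinite(gains[i]):
            break  # nothing left that is numerically independent

        d = np.sqrt(d2[i])
        # New Cholesky row for all candidates at once (rank-one update).
        v = (cov[i] - V[:t, i] @ V[:t]) / d
        y = r[i] / d

        V[t] = v
        d2 = d2 - v ** 2
        r = r - v * y
        total += gains[i]
        selected.append(i)
        available[i] = False

    return selected, total


# Same toy data as a strongly correlated pair plus one independent expert.
demo_scores = np.array([0.90, 0.85, 0.60])
demo_cov = np.array([
    [1.00, 0.95, 0.00],
    [0.95, 1.00, 0.00],
    [0.00, 0.00, 1.00],
])
picked, value = greedy_mahalanobis_select(demo_scores, demo_cov, k=2)
```

Each greedy step costs O(n·t) for n candidates and t already-selected experts, instead of a fresh inverse per candidate. The running total equals the objective evaluated directly on the selected set, which gives a simple correctness check.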

Experimental Results

Linear CKA is used to quantify expert output similarity. Across layers 2, 5, and 9, standard MoE shows CKA values of 0.43, 0.36, and 0.37, whereas MP‑MoE reduces them to 0.31, 0.28, and 0.30, indicating substantially lower overlap. PCA visualizations confirm that MP‑MoE experts produce more separated output distributions.
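For reference, linear CKA between two sets of expert outputs can be computed as below. This is the standard linear CKA formula applied to feature matrices with one row per token; the function name linear_cka and the random inputs are illustrative, and how the paper pairs or averages experts to obtain the per-layer numbers is not specified in this summary.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are the same n inputs passed through two different experts.
    The value lies in [0, 1]; 1 means the representations match up to
    rotation and isotropic scaling.
    """
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, "fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, "fro")
    norm_y = np.linalg.norm(y.T @ y, "fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
out_a = rng.normal(size=(200, 16))      # outputs of one expert on 200 tokens
out_b = rng.normal(size=(200, 16))      # outputs of an unrelated expert
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
out_a_rotated = 3.0 * out_a @ q         # same information, rotated and scaled

same = linear_cka(out_a, out_a_rotated)   # close to 1
different = linear_cka(out_a, out_b)      # well below 1
```

Lower values between experts in the same layer indicate less overlapping outputs, which is how the figures above should be read.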

Benchmarking on multiple downstream tasks shows a consistent 1–3% improvement over standard MoE under the same pre‑training budget. Training overhead is modest: with 64 experts and top‑k = 8, MP‑MoE adds roughly 3% extra FLOPs and wall‑clock time. Inference cost is unchanged because standard top‑k routing is retained at inference time.

Conclusion

MP‑MoE demonstrates that MoE routing should prioritize a complementary expert set rather than merely the highest scores. By linking MoE routing to ensemble pruning, leveraging a co‑occurrence matrix for similarity estimation, and employing an efficient greedy optimizer, MP‑MoE improves expert diversity, incurs only slight training overhead, and leaves inference cost untouched, offering a practical path for scaling sparse large models.

Tags: Mixture-of-Experts, Dynamic Routing, Mahalanobis Distance, ICML 2026, Ensemble Pruning, Expert Diversity, MP-MoE