RouteMoA: Dynamic Routing Without Pre‑Inference for Efficient Multi‑Agent Mixtures

RouteMoA moves model selection ahead of inference: a lightweight scorer predicts each model's suitability from the query alone, so only the most promising models are actually run. On a 15‑model pool, this cuts cost by about 90% and latency by about 64% while preserving or improving accuracy.

Machine Learning Algorithms & Natural Language Processing

Recent advances in large language models have shifted from improving a single model to coordinating multiple specialized models. Mixture‑of‑Agents (MoA) frameworks let several models generate answers in parallel, interact, and fuse results, but the standard approach requires every model to run inference before any selection, leading to high computational cost and latency.

Problem with Existing MoA Methods

All current MoA variants assume that to decide which model is better, the system must first see each model’s output. Consequently, the workflow is "full inference → selection → fusion," which creates two major issues:

Even if only a few models are finally used, the cost of running all models cannot be avoided.

Scaling to large model pools quickly becomes infeasible: total inference cost grows with the size of the pool, and the combined outputs can exceed context‑length limits.

The bottleneck, therefore, lies not in the selection algorithm itself but in the mandatory full‑model inference step.
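To make the bottleneck concrete, here is a minimal sketch comparing the two workflows. The model names and per‑query costs are hypothetical placeholders rather than figures from the paper; the only point is that under "full inference → selection → fusion" the bill covers every model in the pool, whereas selecting first means paying only for the models that are invoked.

```python
from typing import Dict, List


def pool_cost(costs: Dict[str, float], invoked: List[str]) -> float:
    """Total per-query cost of the models that are actually invoked."""
    return sum(costs[name] for name in invoked)


# Hypothetical per-query costs in arbitrary units (not taken from the paper).
costs = {
    "model_a": 1.0,
    "model_b": 4.0,
    "model_c": 2.0,
    "model_d": 8.0,
    "model_e": 5.0,
}

# Standard MoA: every model runs before any selection takes place.
full = pool_cost(costs, list(costs))

# Selection before inference: only the chosen subset runs.
routed = pool_cost(costs, ["model_a", "model_c"])

# Fraction of the full-inference cost that is avoided.
saving = 1.0 - routed / full
```

With these made‑up numbers the saving comes entirely from the models that never run, which is why the choice of selection algorithm matters less than when the selection happens.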

RouteMoA: Shifting Selection Before Inference

RouteMoA introduces a three‑step pipeline that moves the "who to run" decision to the pre‑inference stage.

Prior Screening: A lightweight scorer examines only the user query and predicts a coarse performance score for each model. No large‑model inference is performed; the scorer simply ranks models and narrows the pool to a promising subset based on predicted capability.

Posterior Correction: Because the prior screen can make mistakes, RouteMoA adds a "mixture‑of‑judges" layer that evaluates already‑generated outputs without launching new inference calls. Two types of judges are used:

self‑assessment – a model scores its own answer;

cross‑assessment – high‑quality models evaluate the answers of other models.

These assessments rely solely on existing outputs, avoiding any extra inference.

Combined Ranking: The final ranking optimizes three objectives simultaneously: output quality, token cost, and inference latency. The system thus selects a set of models that offers the best trade‑off rather than merely the highest accuracy.
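The three steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the summary does not say how the prior and the judge scores are blended or how the three objectives are combined, so the blending weight alpha, the linear cost and latency penalties, and all names and numbers here are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    name: str
    prior: float    # scorer's predicted quality, from the query alone
    cost: float     # estimated token cost (hypothetical units)
    latency: float  # estimated latency in seconds (hypothetical)


def prior_screen(pool: List[Candidate], k: int) -> List[Candidate]:
    """Step 1: keep the k candidates with the highest predicted score."""
    return sorted(pool, key=lambda c: c.prior, reverse=True)[:k]


def posterior_correct(prior: float, judge_scores: List[float],
                      alpha: float = 0.5) -> float:
    """Step 2: blend the prior with the mean judge score.

    judge_scores stands in for self- and cross-assessment results;
    the linear blend is an assumption made for this sketch.
    """
    if not judge_scores:
        return prior
    mean_judge = sum(judge_scores) / len(judge_scores)
    return alpha * prior + (1 - alpha) * mean_judge


def combined_rank(shortlist: List[Candidate], quality: Dict[str, float],
                  w_cost: float = 0.1, w_latency: float = 0.05) -> List[Candidate]:
    """Step 3: rank by quality minus assumed cost and latency penalties."""
    def utility(c: Candidate) -> float:
        return quality[c.name] - w_cost * c.cost - w_latency * c.latency
    return sorted(shortlist, key=utility, reverse=True)


# Toy pool with made-up numbers.
pool = [
    Candidate("model_a", prior=0.9, cost=4.0, latency=2.0),
    Candidate("model_b", prior=0.8, cost=1.0, latency=1.0),
    Candidate("model_c", prior=0.3, cost=0.5, latency=0.5),
    Candidate("model_d", prior=0.7, cost=2.0, latency=1.0),
]
shortlist = prior_screen(pool, k=3)

# Hypothetical judge scores for the outputs of the shortlisted models.
judge_scores = {"model_a": [0.6, 0.7], "model_b": [0.9, 0.8], "model_d": [0.5]}
quality = {c.name: posterior_correct(c.prior, judge_scores[c.name])
           for c in shortlist}
ranked = combined_rank(shortlist, quality)
```

In this toy run, model_a has the highest prior but drops to last place once the judge scores and its higher cost and latency are taken into account, which is the kind of trade‑off the combined ranking is meant to capture.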

Experimental Results

In a large‑scale experiment with 15 models, RouteMoA achieved:

Cost reduction of 89.8%;

Latency reduction of 63.6%;

Overall accuracy that surpasses both the original MoA and Sparse MoA baselines.

These results show that eliminating unnecessary inference not only saves resources but can also improve the final answer quality.

Key Insight: Sparsity of Effective Models

The authors observe that for the vast majority of queries, only a small subset of models is truly needed. The scorer places the correct model within the top‑3 candidates with roughly 98% probability, meaning the system can safely ignore the rest of the pool without missing the right answer.
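A top‑k hit rate of this kind could be measured as in the sketch below. The function name, the data layout (one dictionary of predicted scores per query, plus the name of the model that actually performed best), and the toy values are assumptions made for illustration; the summary does not describe the authors' evaluation code.

```python
from typing import Dict, List


def top_k_hit_rate(predicted: List[Dict[str, float]], best: List[str],
                   k: int = 3) -> float:
    """Fraction of queries whose best model is among the scorer's top k."""
    if not best or len(predicted) != len(best):
        raise ValueError("need one non-empty score dict per labelled query")
    hits = 0
    for scores, target in zip(predicted, best):
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += target in top_k
    return hits / len(best)


# Toy scorer outputs for four queries over a four-model pool (made up).
predicted = [
    {"a": 0.9, "b": 0.5, "c": 0.4, "d": 0.1},
    {"a": 0.2, "b": 0.8, "c": 0.7, "d": 0.6},
    {"a": 0.3, "b": 0.1, "c": 0.9, "d": 0.5},
    {"a": 0.6, "b": 0.7, "c": 0.2, "d": 0.9},
]
best = ["b", "a", "c", "d"]  # model that actually answered each query best

rate_top3 = top_k_hit_rate(predicted, best, k=3)
rate_top1 = top_k_hit_rate(predicted, best, k=1)
```

A high top‑3 rate alongside a noticeably lower top‑1 rate would suggest that the scorer is good at narrowing the pool but not at picking a single winner, which is consistent with keeping a small shortlist rather than one model.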

Failure Analysis

Analysis of error cases reveals that more than 50% of mistakes stem from "aggregation drift" during the final answer fusion, while mis‑selection of models accounts for a far smaller fraction. This indicates that, as multi‑model systems mature, the primary challenge shifts from selecting models to effectively integrating multiple answers.

Conclusion

RouteMoA is more than a faster MoA variant; it proposes a new paradigm in which model participation is no longer assumed by default. By predicting which models are worth invoking and then using lightweight judges to correct and amplify the best answers, the approach demonstrates that scheduling and coordination are as crucial as model capability in the era of large‑scale multi‑model collaboration.

Tags: Inference Optimization, Model Selection, dynamic routing, Mixture of Agents, ACL 2026, Scorer, Sparse MoA