RouteMoA: Dynamic Routing Without Pre‑Inference for Efficient Multi‑Agent Mixture

The paper introduces RouteMoA, a dynamic routing framework that predicts how well each model will handle a query before any inference runs, so that only promising models are invoked. On large multi‑model pools it reports an 89.8% reduction in cost and a 63.6% reduction in latency while improving accuracy.

Machine Heart

Problem with Existing Mixture‑of‑Agents (MoA)

Current MoA methods assume that every model must first generate an answer before the system can decide which model is best. This leads to the pipeline: all‑model inference → selection → fusion. Two issues follow:

Computational cost cannot be reduced because the initial inference is performed for all models, even if only a few are ultimately used.

Scalability breaks down as the model pool grows; full‑pool inference quickly exceeds resource limits and context windows.

The bottleneck is therefore the pre‑selection inference cost, not the fusion step.
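To make the bottleneck concrete, the short sketch below counts first-layer model calls with and without selection before inference. It is illustrative arithmetic only: the function name and the choice of three selected models are assumptions, and the call-count ratio is not the token-cost figure reported in the paper.

```python
from typing import Optional


def first_layer_calls(pool_size: int, selected: Optional[int] = None) -> int:
    """Large-model calls needed before selection and fusion can start.

    With no prior selection (classic MoA), every model in the pool runs.
    With prior selection, only the chosen subset runs.
    """
    if selected is None:
        return pool_size
    return min(selected, pool_size)


pool = 15                                       # pool size used in the experiments
classic = first_layer_calls(pool)               # every model answers first
routed = first_layer_calls(pool, selected=3)    # hypothetical: keep 3 models
saving = 1 - routed / classic                   # fraction of calls avoided

print(f"classic={classic}, routed={routed}, calls avoided={saving:.0%}")
```

The point of the sketch is that the saving comes from the step before selection, which is exactly the step classic MoA cannot skip.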

RouteMoA: Shifting Model Selection Before Inference

RouteMoA introduces a three‑stage workflow that moves the selection step ahead of any large‑model inference.

1. Prior Screening with a Lightweight Scorer

A lightweight scorer consumes only the user query and predicts a coarse performance score for each model in the pool. No large‑model inference is invoked. The scorer narrows the pool to a promising subset, effectively estimating query‑model match in advance.
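A minimal sketch of query-only screening follows. The paper's scorer is a learned lightweight model; the keyword-overlap scorer here is a toy stand-in, and the model names, skill tags, and the `screen` function are all hypothetical. What the sketch preserves is the interface: the query goes in, a score per model comes out, and only the top k models are passed on.

```python
from typing import Dict, List, Set

# Hypothetical skill tags per model; illustrative only, not from the paper.
MODEL_SKILLS: Dict[str, Set[str]] = {
    "math-model": {"integral", "prove", "equation"},
    "code-model": {"python", "bug", "function"},
    "general-model": {"explain", "summarize", "translate"},
}


def prior_scores(query: str) -> Dict[str, float]:
    """Score every model from the query text alone (no model is called)."""
    words = set(query.lower().split())
    return {
        name: len(words & skills) / len(skills)
        for name, skills in MODEL_SKILLS.items()
    }


def screen(query: str, k: int = 2) -> List[str]:
    """Keep the k highest-scoring models; ties are broken by name."""
    scores = prior_scores(query)
    return sorted(scores, key=lambda m: (-scores[m], m))[:k]


shortlist = screen("fix the bug in this python function", k=2)
print(shortlist)
```

Only the models in `shortlist` would then be sent the query for actual inference.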

2. Posterior Correction Using Existing Outputs

Because the prior screening may miss some models, RouteMoA adds a correction stage that operates solely on already‑generated answers. It employs a mixture‑of‑judges consisting of:

Self‑assessment – each model scores its own answer.

Cross‑assessment – high‑quality models evaluate the answers of other models.

Both assessments rely only on the existing outputs and do not trigger additional inference calls.
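One way the two kinds of assessment could be combined is sketched below. The weighting scheme, the `self_weight` value, and the function name are assumptions made for illustration; the source does not specify how RouteMoA merges the scores.

```python
from statistics import mean
from typing import Dict


def posterior_scores(
    self_scores: Dict[str, float],
    cross_scores: Dict[str, Dict[str, float]],
    self_weight: float = 0.3,  # assumed weight, not taken from the paper
) -> Dict[str, float]:
    """Blend each model's self-assessment with the scores its peers gave it.

    cross_scores[judge][candidate] is the score `judge` gave to `candidate`.
    A model's own entry as a judge is ignored so it is not counted twice.
    """
    combined = {}
    for model, own in self_scores.items():
        peers = [
            given[model]
            for judge, given in cross_scores.items()
            if judge != model and model in given
        ]
        peer = mean(peers) if peers else own  # fall back to self score
        combined[model] = self_weight * own + (1 - self_weight) * peer
    return combined


scores = posterior_scores(
    self_scores={"a": 0.9, "b": 0.6},
    cross_scores={"a": {"b": 0.4}, "b": {"a": 0.8}},
)
print(scores)
```

Weighting peer judgments above self-assessment is a design choice in this sketch, on the assumption that models tend to overrate their own answers.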

3. Integrated Ranking Optimizing Quality, Cost, and Latency

The final ranking jointly optimizes three objectives: output quality, token cost, and inference latency. The decision balances performance with efficiency rather than maximizing accuracy alone.
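The trade-off can be sketched as a single utility score per model. The linear form, the weights, and the candidate values below are assumptions for illustration, with cost and latency taken as already normalized to the range 0 to 1; the paper's actual ranking function may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    quality: float  # estimated answer quality, higher is better
    cost: float     # normalized token cost, lower is better
    latency: float  # normalized latency, lower is better


def utility(c: Candidate, cost_weight: float = 0.2,
            latency_weight: float = 0.1) -> float:
    """Quality minus weighted cost and latency penalties (assumed form)."""
    return c.quality - cost_weight * c.cost - latency_weight * c.latency


def rank(candidates: List[Candidate]) -> List[str]:
    """Return model names ordered from highest to lowest utility."""
    return [c.name for c in sorted(candidates, key=utility, reverse=True)]


order = rank([
    Candidate("big", quality=0.90, cost=1.0, latency=1.0),
    Candidate("small", quality=0.85, cost=0.2, latency=0.3),
    Candidate("mid", quality=0.70, cost=0.5, latency=0.5),
])
print(order)
```

In this toy example the slightly less accurate but much cheaper model ranks first, which is the kind of outcome a quality-only ranking would never produce.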

Experimental Evaluation

Experiments were conducted on a pool of 15 heterogeneous models.

Computational cost reduced by 89.8%.

Inference latency reduced by 63.6%.

Overall accuracy improved relative to standard MoA and Sparse MoA.

The scorer placed the correct model within its top‑3 candidates for 98% of queries, indicating that most queries require only a few key models.
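A top‑k hit rate of this kind can be computed as in the sketch below. The rankings and labels are made-up toy data used only to show the calculation, and the function name is hypothetical; none of it reproduces the paper's evaluation.

```python
from typing import List, Sequence


def top_k_hit_rate(rankings: Sequence[List[str]],
                   best_models: Sequence[str], k: int = 3) -> float:
    """Fraction of queries whose best model appears in the scorer's top k."""
    hits = sum(best in ranked[:k] for ranked, best in zip(rankings, best_models))
    return hits / len(best_models)


# Toy data: one scorer ranking per query, plus the model that was actually best.
toy_rankings = [
    ["m1", "m2", "m3", "m4"],
    ["m2", "m4", "m1", "m3"],
    ["m3", "m1", "m2", "m4"],
    ["m1", "m3", "m2", "m4"],
]
toy_best = ["m2", "m3", "m3", "m4"]

rate = top_k_hit_rate(toy_rankings, toy_best, k=3)
print(f"top-3 hit rate on toy data: {rate:.2f}")
```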

Failure Analysis

Analysis of error cases revealed that more than 50% of failures stem from aggregation drift during the fusion stage, while mis‑selection of models accounts for a much smaller fraction. This suggests that the primary challenge in multi‑model systems has shifted from “which model to invoke” to “how to integrate multiple answers”.

Key Insights

Multi‑model systems are inherently sparse: for the majority of queries, only a small subset of models is truly critical.

Effective pre‑screening that retains the critical models enables downstream collaboration to amplify correct answers without incurring unnecessary computation.

Conclusion

RouteMoA demonstrates a new paradigm for multi‑model orchestration: predict model usefulness before inference, then refine answers through collaborative judging, and finally rank by a multi‑objective utility function. This makes system‑level scheduling as important as model capability in large‑scale LLM deployments.

Paper: https://arxiv.org/abs/2601.18130

Code: https://github.com/Jize-W/RouteMoA
