RouteMoA: Dynamic Routing Without Pre‑Inference for Efficient Multi‑Agent Mixture
The paper introduces RouteMoA, a dynamic routing framework that predicts model capabilities before inference to avoid unnecessary computation, thereby cutting cost by 89.8% and latency by 63.6% while improving accuracy in large‑scale multi‑model pools.
Problem with Existing Mixture‑of‑Agents (MoA)
Current MoA methods assume that, to decide which model is best, every model must first generate an answer. This leads to the pipeline all‑model inference → selection → fusion. Two issues result:
Computational cost cannot be reduced because the initial inference is performed for all models, even if only a few are ultimately used.
Scalability breaks down as the model pool grows; full‑pool inference quickly exceeds resource limits and context windows.
The bottleneck is therefore the pre‑selection inference cost, not the fusion step.
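To make the bottleneck concrete, here is a minimal sketch that compares the first-round cost of querying the full pool against querying a pre-selected subset. The model names and per-call costs are hypothetical placeholders, not figures from the paper.

```python
from typing import Dict, Iterable


def inference_cost(costs: Dict[str, float], invoked: Iterable[str]) -> float:
    """Sum the per-call cost of every model that is actually invoked."""
    return sum(costs[name] for name in invoked)


# Hypothetical per-call costs in arbitrary units (illustrative only).
costs = {"model_a": 1.0, "model_b": 4.0, "model_c": 10.0, "model_d": 2.0}

# Classic MoA: every model answers before any selection happens.
full_pool = inference_cost(costs, costs)

# Routing first: only a pre-selected subset answers.
routed = inference_cost(costs, ["model_a", "model_d"])

# Fraction of first-round cost avoided by routing before inference.
saving = 1 - routed / full_pool
```

With the full-pool design, the first-round cost grows with every model added to the pool, regardless of how many answers the fusion step ends up using.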
RouteMoA: Shifting Model Selection Before Inference
RouteMoA introduces a three‑stage workflow that moves the selection step ahead of any large‑model inference.
1. Prior Screening with a Lightweight Scorer
A lightweight scorer consumes only the user query and predicts a coarse performance score for each model in the pool. No large‑model inference is invoked. The scorer narrows the pool to a promising subset, effectively estimating query‑model match in advance.
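The screening step can be sketched as a top-k selection over per-model scores predicted from the query alone. In the sketch below, `prior_screen`, `toy_scorer`, and the model names are hypothetical; the keyword heuristic merely stands in for a learned scorer and is not the scorer described in the paper.

```python
from typing import Callable, Dict, List

# A scorer maps a query to one predicted score per model in the pool.
Scorer = Callable[[str], Dict[str, float]]


def prior_screen(query: str, scorer: Scorer, k: int) -> List[str]:
    """Return the k models with the highest predicted score for this query."""
    scores = scorer(query)
    # Sort by descending score; break ties by name for determinism.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:k]


def toy_scorer(query: str) -> Dict[str, float]:
    """Stand-in for a learned scorer: a keyword heuristic (assumption)."""
    q = query.lower()
    is_math = any(word in q for word in ("integral", "prove", "equation"))
    is_code = any(word in q for word in ("python", "bug", "function"))
    return {
        "math_model": 0.9 if is_math else 0.3,
        "code_model": 0.9 if is_code else 0.3,
        "general_model": 0.6,
    }


# No model in the pool is called here; only the scorer runs.
selected = prior_screen("Fix the bug in this Python function", toy_scorer, k=2)
```

Only the models in `selected` would go on to generate answers, so the pool size affects the scorer's output dimension rather than the number of large-model calls.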
2. Posterior Correction Using Existing Outputs
Because the prior screening may miss some models, RouteMoA adds a correction stage that operates solely on already‑generated answers. It employs a mixture‑of‑judges consisting of:
Self‑assessment – each model scores its own answer.
Cross‑assessment – high‑quality models evaluate the answers of other models.
Both assessments rely only on the existing outputs and do not trigger additional inference calls.
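One way to combine the two assessments is a weighted average of each model's self-score and the mean score it receives from other judges. The source does not specify the combination rule, so the weighting, the `posterior_scores` function, and the score values below are assumptions for illustration.

```python
from statistics import mean
from typing import Dict


def posterior_scores(
    self_scores: Dict[str, float],
    cross_scores: Dict[str, Dict[str, float]],
    self_weight: float = 0.3,
) -> Dict[str, float]:
    """Blend self-assessment with peer assessment.

    cross_scores[judge][candidate] is the score a judge gave a candidate.
    The weighted average is an assumed combination rule.
    """
    combined = {}
    for candidate, own in self_scores.items():
        # Collect scores from every judge other than the candidate itself.
        peer = [
            given[candidate]
            for judge, given in cross_scores.items()
            if judge != candidate and candidate in given
        ]
        # Fall back to the self-score when no peer judged this candidate.
        peer_avg = mean(peer) if peer else own
        combined[candidate] = self_weight * own + (1 - self_weight) * peer_avg
    return combined


# Hypothetical scores read off already-generated outputs.
self_scores = {"model_a": 0.9, "model_b": 0.5}
cross_scores = {"model_a": {"model_b": 0.8}, "model_b": {"model_a": 0.4}}
corrected = posterior_scores(self_scores, cross_scores)
```

In this toy case `model_a` rates itself highly but is rated poorly by its peer, so its corrected score falls below that of `model_b`, which is the kind of adjustment the correction stage is meant to provide.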
3. Integrated Ranking Optimizing Quality, Cost, and Latency
The final ranking jointly optimizes three objectives: output quality, token cost, and inference latency. The decision balances performance with efficiency rather than maximizing accuracy alone.
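A minimal sketch of such a ranking, assuming a linear utility that subtracts normalized cost and latency penalties from the quality score. The utility form, the weights, and the candidate values are hypothetical; the source does not give the exact objective.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    name: str
    quality: float  # estimated output quality, higher is better
    cost: float     # token cost, lower is better
    latency: float  # inference latency, lower is better


def rank(
    candidates: List[Candidate],
    cost_weight: float = 0.2,
    latency_weight: float = 0.2,
) -> List[Candidate]:
    """Sort candidates by an assumed linear quality/cost/latency utility."""
    # Normalize by the pool maximum so both penalties lie in [0, 1].
    max_cost = max(c.cost for c in candidates) or 1.0
    max_latency = max(c.latency for c in candidates) or 1.0

    def utility(c: Candidate) -> float:
        return (
            c.quality
            - cost_weight * c.cost / max_cost
            - latency_weight * c.latency / max_latency
        )

    return sorted(candidates, key=utility, reverse=True)


pool = [
    Candidate("big", quality=0.90, cost=10.0, latency=8.0),
    Candidate("mid", quality=0.85, cost=2.0, latency=2.0),
    Candidate("small", quality=0.60, cost=1.0, latency=1.0),
]
balanced = [c.name for c in rank(pool)]
quality_only = [c.name for c in rank(pool, cost_weight=0.0, latency_weight=0.0)]
```

With both penalty weights set to zero the ranking reduces to quality alone and the largest model comes first; with the default weights the mid-sized model overtakes it because its small quality deficit is outweighed by its lower cost and latency.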
Experimental Evaluation
Experiments were conducted on a pool of 15 heterogeneous models.
Computational cost reduced by 89.8%.
Inference latency reduced by 63.6%.
Overall accuracy improved relative to standard MoA and Sparse MoA.
The scorer placed the correct model within the top‑3 candidates with a probability of 98%, indicating that most queries require only a few key models.
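A top-k hit rate of this kind can be computed as follows. This is a sketch under the assumption that each query has a single reference "correct" model and that the scorer yields a ranked list per query; the rankings and labels are made up for illustration.

```python
from typing import Sequence


def top_k_hit_rate(
    rankings: Sequence[Sequence[str]],
    best_models: Sequence[str],
    k: int = 3,
) -> float:
    """Fraction of queries whose reference model appears in the top k."""
    hits = sum(
        best in ranked[:k] for ranked, best in zip(rankings, best_models)
    )
    return hits / len(best_models)


# One scorer ranking per query (hypothetical), best model first.
rankings = [
    ["a", "b", "c", "d"],
    ["b", "a", "d", "c"],
    ["d", "c", "b", "a"],
    ["c", "a", "b", "d"],
]
# Reference model for each query (hypothetical labels).
best_models = ["c", "a", "a", "c"]

hit_at_3 = top_k_hit_rate(rankings, best_models, k=3)
hit_at_1 = top_k_hit_rate(rankings, best_models, k=1)
```

A large gap between the top-1 and top-3 rates, as in this toy data, is what makes passing a small subset to the later stages worthwhile: the scorer need not pick the single best model, only keep it among the survivors.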
Failure Analysis
Analysis of error cases revealed that more than 50% of failures stem from aggregation drift during the fusion stage, while mis‑selection of models accounts for a much smaller fraction. This suggests that the primary challenge in multi‑model systems has shifted from “which model to invoke” to “how to integrate multiple answers”.
Key Insights
Multi‑model systems are inherently sparse: for the majority of queries, only a small subset of models is truly critical.
Effective pre‑screening that retains the critical models enables downstream collaboration to amplify correct answers without incurring unnecessary computation.
Conclusion
RouteMoA demonstrates a new paradigm for multi‑model orchestration: predict model usefulness before inference, then refine answers through collaborative judging, and finally rank by a multi‑objective utility function. This makes system‑level scheduling as important as model capability in large‑scale LLM deployments.
Paper: https://arxiv.org/abs/2601.18130
Code: https://github.com/Jize-W/RouteMoA
