HyperEyes: Parallel Multimodal Search Agents Move from Deep to Wide for Efficiency

HyperEyes introduces a Unified Localization-and-Search (UGS) action space, a parallel data synthesis pipeline, and a dual-granularity, efficiency-aware RL framework. Together they let multimodal agents retrieve multiple targets in a single interaction, cutting the number of interaction rounds while improving accuracy and cost-efficiency on benchmark evaluations.

Machine Heart

Problem: Serial "crop‑then‑search" pipeline

Open-source multimodal search agents handle multi-entity images with a serial, N-round workflow: crop each target, then search for it individually. This causes three problems: (1) high interaction latency, because a single query is split into many rounds; (2) error amplification, because a mistake in early localization contaminates every subsequent search; and (3) reward bias in training, because only the correctness of the final answer is rewarded, which encourages "over-search" behavior and harshly penalizes otherwise correct intermediate reasoning.

Method: Full‑stack redesign of action space, data, and reinforcement learning

Unified Localization‑and‑Search (UGS) action space

HyperEyes replaces the separate cropping and retrieval steps with a single action that embeds visual bounding boxes directly as parameters of the retrieval call. One function invocation can therefore carry multiple target boxes, enabling true multi‑target concurrency within a single interaction.
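To make the idea of a single call carrying several boxes concrete, here is a minimal Python sketch. It is not the HyperEyes API: the names `SearchTarget`, `UnifiedSearchAction`, `execute`, and `fake_backend` are hypothetical, and the retrieval backend is a stand-in that only echoes its inputs.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical types for illustration; not taken from the HyperEyes code base.
Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


@dataclass
class SearchTarget:
    box: Box     # image region to retrieve on
    query: str   # text hint that accompanies the region


@dataclass
class UnifiedSearchAction:
    image_id: str
    targets: List[SearchTarget]  # several boxes travel in ONE action


def execute(action: UnifiedSearchAction,
            backend: Callable[[str, Box, str], str]) -> List[str]:
    """Resolve all targets of one action concurrently; results keep target order."""
    with ThreadPoolExecutor(max_workers=max(1, len(action.targets))) as pool:
        futures = [
            pool.submit(backend, action.image_id, t.box, t.query)
            for t in action.targets
        ]
        return [f.result() for f in futures]


def fake_backend(image_id: str, box: Box, query: str) -> str:
    # Stand-in for a real retrieval service (assumption for this demo).
    return f"{image_id}:{box}:{query}"


action = UnifiedSearchAction(
    image_id="img_001",
    targets=[
        SearchTarget(box=(10, 20, 110, 220), query="identify person"),
        SearchTarget(box=(150, 30, 260, 240), query="identify logo"),
    ],
)
results = execute(action, fake_backend)
```

In this sketch the agent spends one interaction round regardless of how many targets the action lists, which is the property the serial crop-then-search workflow lacks.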

Parallel data synthesis pipeline

The team builds seed examples of parallel behavior, suitable for training, in three stages:

Stitch diverse images to create queries that require simultaneous localization and retrieval.

Apply graph‑based random walks to generate multi‑constraint intersection problems and discard shortcut solutions.

Use progressive rejection sampling (PRS) under increasing round budgets to curate 30,000 "zero‑redundancy" parallel seeds for supervised fine‑tuning.
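The third stage can be sketched as follows. This is one plausible reading of "progressive rejection sampling under increasing round budgets", not the authors' implementation: each task is attempted under the smallest budget first, and the first correct trajectory that fits the budget is kept. The function names, the trajectory fields, and the toy sampler are all assumptions.

```python
from typing import Callable, Dict, List

# Hypothetical trajectory record: {"task", "rounds", "correct"}.
Trajectory = Dict[str, object]


def progressive_rejection_sampling(
    tasks: List[str],
    sample: Callable[[str, int], Trajectory],
    budgets: List[int],
    tries_per_budget: int = 4,
) -> Dict[str, Trajectory]:
    """Keep, per task, the first correct trajectory found under the smallest budget."""
    seeds: Dict[str, Trajectory] = {}
    for budget in sorted(budgets):          # budgets grow progressively
        for task in tasks:
            if task in seeds:               # already solved under a tighter budget
                continue
            for _ in range(tries_per_budget):
                traj = sample(task, budget)
                # Reject anything incorrect or over budget.
                if traj["correct"] and traj["rounds"] <= budget:
                    seeds[task] = traj
                    break
    return seeds


# Toy sampler (assumption): each task needs a fixed minimum number of rounds.
REQUIRED_ROUNDS = {"task_a": 1, "task_b": 2, "task_c": 5}


def toy_sample(task: str, max_rounds: int) -> Trajectory:
    needed = REQUIRED_ROUNDS[task]
    return {
        "task": task,
        "rounds": min(needed, max_rounds),
        "correct": max_rounds >= needed,
    }


seeds = progressive_rejection_sampling(
    list(REQUIRED_ROUNDS), toy_sample, budgets=[1, 2, 3]
)
```

Because a task is accepted at the first budget under which it can be solved, the kept trajectory carries no rounds beyond what the task needs; tasks that exceed the largest budget (here `task_c`) are dropped.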

Dual‑granularity efficiency‑aware reinforcement learning

Two complementary mechanisms are introduced:

TRACE (macro-level): a dynamic cost-efficiency reward that grants credit only when the model outperforms a moving efficiency benchmark. After each epoch, the benchmark is refreshed with the best-performing trajectory, which tightens the bar.

OPD (micro-level): on-policy distillation that invokes a 235B teacher model only on trajectories that end incorrectly, providing dense token-level supervision for the failed steps.
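The two mechanisms can be illustrated with a small Python sketch. The article does not give the reward or loss formulas, so everything below is an assumption: cost is measured in tool-call rounds, the efficiency credit is a fixed bonus on top of a correctness reward, and the distillation signal is the per-token log-probability gap between student and teacher on the student's own sampled tokens.

```python
import math
from typing import Dict, List


def trace_reward(correct: bool, cost: float, benchmark: float,
                 bonus: float = 0.5) -> float:
    """Correctness reward, plus an efficiency bonus only when the trajectory
    is strictly cheaper than the current benchmark (assumed reward shape)."""
    if not correct:
        return 0.0
    return 1.0 + (bonus if cost < benchmark else 0.0)


def refresh_benchmark(benchmark: float, epoch: List[Dict]) -> float:
    """After an epoch, tighten the bar to the cheapest correct trajectory seen."""
    costs = [t["cost"] for t in epoch if t["correct"]]
    return min([benchmark] + costs)


def opd_loss(traj: Dict) -> float:
    """Teacher supervision is applied only to failed trajectories.
    Returns the mean of (student log-prob - teacher log-prob) over tokens."""
    if traj["correct"]:
        return 0.0  # teacher is not invoked on successful trajectories
    gaps = [s - t for s, t in zip(traj["student_logp"], traj["teacher_logp"])]
    return sum(gaps) / len(gaps)


# Demo with made-up numbers (not results from the paper).
benchmark = math.inf
epoch_1 = [
    {"correct": True, "cost": 4},
    {"correct": True, "cost": 3},
    {"correct": False, "cost": 2},
]
rewards_1 = [trace_reward(t["correct"], t["cost"], benchmark) for t in epoch_1]
benchmark = refresh_benchmark(benchmark, epoch_1)  # bar tightens to 3 rounds

reward_same = trace_reward(True, 3, benchmark)    # matches the bar: no bonus
reward_better = trace_reward(True, 2, benchmark)  # beats the bar: bonus

failed = {"correct": False,
          "student_logp": [-1.0, -2.0],
          "teacher_logp": [-1.5, -3.0]}
loss_failed = opd_loss(failed)
loss_ok = opd_loss({"correct": True, "student_logp": [], "teacher_logp": []})
```

The design point the sketch tries to capture is the split in granularity: the reward judges a whole trajectory against a bar that moves with the policy, while the teacher signal is spent only on failures, where a sparse outcome reward carries the least information about which step went wrong.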

Evaluation framework

The Image Multi-Entity Benchmark (IMEB) comprises 300 challenging multi-entity visual tasks. Alongside it, the Cost-Aware Score (CAS) combines accuracy, token consumption, and tool-call rounds into a single metric intended to capture "effective information density per unit latency".
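The article does not state the CAS formula. Purely to show how a metric of this kind could behave, the sketch below discounts accuracy by normalized token and round costs. The function `cost_aware_score`, its reference constants, and the input numbers are all hypothetical and should not be read as the paper's definition or results.

```python
def cost_aware_score(accuracy: float, tokens: float, rounds: float,
                     token_ref: float = 10_000.0,
                     round_ref: float = 5.0) -> float:
    """Illustrative only: accuracy divided by (1 + normalized cost).
    token_ref and round_ref are assumed normalization constants."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be a fraction in [0, 1]")
    if tokens < 0 or rounds < 0:
        raise ValueError("costs must be non-negative")
    cost = 1.0 + tokens / token_ref + rounds / round_ref
    return accuracy / cost


# Made-up inputs: same accuracy and token use, different round counts.
few_rounds = cost_aware_score(accuracy=0.6, tokens=8_000, rounds=2)
many_rounds = cost_aware_score(accuracy=0.6, tokens=8_000, rounds=10)
```

Under any aggregation of this shape, two agents with equal accuracy are separated by how many tokens and rounds they spend, which is the behavior the article attributes to CAS.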

Results

On six major multimodal benchmarks, HyperEyes-30B achieves 64.0% accuracy, 9.9 percentage points above the strongest open-source baseline, VDR, while using only 2.2 tool-call rounds versus VDR's 11.6. The 235B version narrows the gap to the closed-source Gemini-3.1-Pro to 1.1%.
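The figures above imply two derived quantities, worked out below. The calculation assumes the 9.9 gain is an absolute difference on the same accuracy measure; the variable names are illustrative.

```python
# Figures quoted in the article.
hypereyes_acc = 64.0      # % accuracy, HyperEyes-30B
abs_gain = 9.9            # absolute gain over VDR
hypereyes_rounds = 2.2    # tool-call rounds, HyperEyes-30B
vdr_rounds = 11.6         # tool-call rounds, VDR

# Derived values (not stated in the article).
implied_vdr_acc = round(hypereyes_acc - abs_gain, 1)           # 54.1
round_reduction = round(vdr_rounds / hypereyes_rounds, 2)      # 5.27
```

On these numbers, VDR's accuracy would be about 54.1%, and HyperEyes-30B uses roughly 5.3 times fewer tool-call rounds.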

Under the CAS metric, the 30B model improves cost‑efficiency by 7.6× relative to the next best open‑source model. Ablation studies show that the UGS action‑space redesign is the primary contributor to this advantage.

Robustness tests that mix true and false evidence demonstrate that the parallel "search‑once‑see‑all" strategy avoids hallucination traps caused by excessive searching. In a real‑world six‑person visual QA scenario, a traditional agent required 12 interaction rounds and failed, whereas HyperEyes completed the task in 3 rounds with correct answers.

Resources

Paper: https://arxiv.org/abs/2605.07177

Code: https://github.com/DeepExperience/HyperEyes
