Can Multimodal LLMs Beat Humans in Real Web Search? GPT‑5.2 Scores Only 36% on New BrowseComp‑V3 Benchmark
A new multimodal browsing benchmark, BrowseComp‑V3, reveals that human experts achieve a 68.03% success rate while the strongest closed‑source model, GPT‑5.2, manages just 36.17%, highlighting current limitations in deep, web‑scale visual‑text reasoning and the critical role of tool‑augmented agents.
Why a New Multimodal Search Benchmark?
Earlier benchmarks such as MM‑BrowseComp and MMSearch‑Plus introduced multi‑hop queries and fine‑grained visual reasoning, but they remained limited to shallow, two‑hop tasks with visual cues only at the start, failing to reflect the complexity of real‑world web search.
BrowseComp‑V3 addresses three major gaps:
Task complexity: expands search paths with multi‑hop variants and three levels of cross‑modal integration (Level 1: intra‑region alignment, Level 2: cross‑region synthesis, Level 3: cross‑image reasoning).
Process‑oriented evaluation: in addition to the success rate, a Process Score tracks how many annotated sub‑goals a model completes, enabling precise failure analysis (see the sketch after this list).
Reliability and reproducibility: all evidence must be retrievable via public search engines, and each question includes a hand‑crafted “golden search trajectory”.
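The summary does not give the exact Process Score formula; a natural reading is the fraction of annotated sub‑goals that a model's interaction trace satisfies. Below is a minimal sketch under that assumption, modeling each sub‑goal as a predicate over the trace; all names and the trace format are illustrative, not the benchmark's actual implementation.

```python
from typing import Callable, Iterable

# A sub-goal is any checkable milestone on the annotated golden trajectory,
# e.g. "found the correct author page" or "read the date off the chart".
# Representing it as a predicate over the trace is an assumption of this sketch.
SubGoal = Callable[[list[str]], bool]

def process_score(trace: list[str], sub_goals: Iterable[SubGoal]) -> float:
    """Fraction of annotated sub-goals the trace satisfies (assumed definition)."""
    goals = list(sub_goals)
    completed = sum(1 for goal in goals if goal(trace))
    return completed / len(goals) if goals else 0.0

# Toy example: two of three sub-goals reached -> Process Score ~0.67,
# even though no final answer was produced, so the success rate is 0.
trace = [
    "search: conference venue 2018",
    "open: example.org/venue",
    "crop: region(120, 40, 300, 200)",
]
sub_goals = [
    lambda t: any(step.startswith("search") for step in t),
    lambda t: any(step.startswith("open") for step in t),
    lambda t: any(step.startswith("answer") for step in t),
]
print(process_score(trace, sub_goals))  # 0.666...
```

Note how a trace can score well on process while still missing the final answer; that is exactly the pattern Finding 4 below reports.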
Benchmark Construction
The dataset contains 300 carefully curated, high‑difficulty questions spanning science, technology, society, culture, and daily life (24 sub‑domains). Construction followed a five‑stage quality‑control pipeline involving over 20 PhD‑level annotators:
Initialization & guideline creation: experts defined evaluation dimensions and produced high‑quality seed examples.
Tool‑enhanced exploratory annotation: annotators used text search, web browsing, image search, and cropping tools to record full interaction traces and sub‑goals.
Dual verification & adversarial filtering: independent reviewers reproduced each trace, then state‑of‑the‑art vision models (e.g., GPT‑5.2, Gemini‑3‑Pro) filtered out easy items.
Structured format conversion: traces were transformed into a unified JSON schema (an illustrative example follows this list).
Expert final audit: domain specialists checked safety, privacy, and factual accuracy.
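The schema itself is not reproduced in this write‑up; the sketch below shows one plausible shape for a converted item. Every field name is an assumption made for illustration, not the benchmark's published format.

```python
# Hypothetical shape of one converted item. All field names below are
# illustrative assumptions, not the benchmark's actual schema.
example_item = {
    "question_id": "bcv3-0042",
    "domain": "science",                  # one of the 24 sub-domains
    "integration_level": 2,               # 1: intra-region, 2: cross-region, 3: cross-image
    "question": "...",                    # the multimodal query text
    "images": ["q0042_a.png"],            # attached image files, if any
    "golden_trajectory": [                # hand-crafted reference search trace
        {"tool": "text_search", "input": "...", "observation": "..."},
        {"tool": "image_search", "input": "q0042_a.png", "observation": "..."},
    ],
    "sub_goals": ["locate the landmark", "confirm the renovation year"],
    "answer": "...",
}
```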
Experimental Findings
Four test settings were evaluated: human experts, tool‑free MLLMs, tool‑enhanced official MLLMs, and the open‑source OmniSeeker framework.
Finding 1 – Performance cliff: Human experts achieved a 68.03% success rate (Process Score 82.93%). The best closed‑source model, GPT‑5.2, reached only 36.17%.
Finding 2 – Tools are essential: In the tool‑free setting, most models dropped to ~10% success, showing that real‑time retrieval and interaction are indispensable for deep multimodal reasoning.
Finding 3 – Open‑source progress: With OmniSeeker, the open‑source model Doubao‑Seed‑1.8 rose to a 33.67% success rate, narrowing the gap to proprietary systems (a minimal agent‑loop sketch follows these findings).
Finding 4 – Process Score insight: Models often complete early sub‑goals but fail later, resulting in Process Scores higher than final success rates, indicating fragility in long‑sequence reasoning.
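OmniSeeker's internals are not detailed here, but every tool‑enhanced setting shares the same basic shape: the model proposes a tool call (text search, browsing, image search, or cropping), receives an observation, and iterates until it commits to an answer. The sketch below shows that loop; the tool backends and call_mllm are placeholders, not the OmniSeeker API.

```python
# Minimal tool-augmented browsing loop (illustrative only; not the OmniSeeker API).
# `call_mllm` and the tool backends are placeholders to be wired to a real
# multimodal model endpoint and real search/browse services.

TOOLS = {
    "text_search": lambda q: f"[results for {q!r}]",
    "browse": lambda url: f"[content of {url}]",
    "image_search": lambda img: f"[pages containing {img}]",
    "crop": lambda spec: f"[cropped region {spec}]",
}

def call_mllm(history: list[dict]) -> dict:
    """Placeholder: ask the model for its next action given the dialogue so far."""
    raise NotImplementedError("wire this to an actual multimodal LLM endpoint")

def run_agent(question: str, max_rounds: int = 20) -> str | None:
    """One rollout: propose a tool call, observe, repeat until an answer or the budget runs out."""
    history = [{"role": "user", "content": question}]
    for _ in range(max_rounds):  # more rounds = more test-time compute
        action = call_mllm(history)  # e.g. {"tool": "text_search", "input": "..."}
        if action.get("final_answer") is not None:
            return action["final_answer"]
        observation = TOOLS[action["tool"]](action["input"])
        history.append({"role": "tool", "content": observation})
    return None  # budget exhausted without committing to an answer
```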
Deep Dive into Model Weaknesses
Analysis shows performance degrades sharply from Level 1 to Level 3, exposing difficulty in cross‑image visual grounding and multimodal integration. Humans suffer from cognitive overload on long texts, while models struggle to fuse visual and textual cues in noisy web layouts.
Increasing test‑time computation improves results: more interaction rounds benefit larger models (e.g., Qwen3‑VL‑235B), and a Best‑of‑N sampling strategy (running multiple independent searches and selecting the best answer) further lifts success rates.
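Read literally, Best‑of‑N means launching N independent rollouts of the same question and keeping the answer a selector prefers. Here is a sketch under that reading; the judge is an assumed plug‑in (in practice often a separate model grading evidence quality), not an interface the paper specifies.

```python
from typing import Callable

def best_of_n(
    question: str,
    n: int,
    run_once: Callable[[str], str | None],
    judge: Callable[[str, str], float],
) -> str | None:
    """Run n independent searches; keep the answer the judge scores highest.

    `run_once(question)` is one full agent rollout (e.g. `run_agent` above);
    `judge(question, answer)` returns a scalar score. Both are assumed
    plug-ins, not components described in the paper.
    """
    answers = [a for a in (run_once(question) for _ in range(n)) if a is not None]
    if not answers:
        return None
    return max(answers, key=lambda a: judge(question, a))
```

This composes directly with the earlier loop, e.g. best_of_n(question, n=4, run_once=run_agent, judge=my_judge), trading N times the compute for a higher chance that at least one rollout succeeds.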
Conclusion and Outlook
BrowseComp‑V3 and the OmniSeeker framework demonstrate that merely adding basic visual perception and tool calls is insufficient. Future research must advance deep cross‑modal integration and long‑range planning to unlock the full potential of multimodal browsing agents.
Paper: https://arxiv.org/abs/2602.12876
Code: https://github.com/Halcyon-Zhang/BrowseComp-V3
Dataset: https://huggingface.co/datasets/Halcyon-Zhang/BrowseComp-V3