NeurIPS 2025: OCRBench v2 Shows Gemini Leading the Chinese OCR Ranking, Yet Scoring Only a Passing Grade
OCRBench v2, introduced at NeurIPS 2025, evaluates 58 multimodal models on 23 OCR-related tasks in Chinese and English, revealing that even top models such as Gemini-2.5-Pro barely exceed the passing threshold, and that most models show weak fine-grained text localization and uneven multilingual performance.
OCRBench v2 Overview
OCRBench v2 evaluates 58 mainstream large multimodal models released between 2023 and 2025 on both Chinese and English OCR tasks.
Task Coverage
The benchmark defines 23 fine‑grained tasks spanning eight core capability dimensions: text recognition, text localization, spotting, relation extraction, element parsing, mathematical calculation, visual text understanding, and knowledge reasoning.
Dataset
The public dataset contains over 10,000 high‑quality QA pairs collected from more than 80 academic datasets and manually verified. An additional private set of 1,500 QA pairs mirrors the public data’s task distribution and scenario coverage.
Dataset download: https://go.hyper.ai/VNHSX
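To get a feel for the data, the sketch below counts QA pairs per task from a local copy of the public set. It is a minimal sketch under stated assumptions: the file name and the "type" field are placeholders for illustration, not the release's documented schema, so adjust them to match the actual download.

```python
import json
from collections import Counter

# Hypothetical local path; the actual file name and schema in the
# OCRBench v2 release may differ.
PATH = "OCRBench_v2.json"

with open(PATH, encoding="utf-8") as f:
    samples = json.load(f)  # assumed: a list of QA dicts

# Assumed field: "type" holds the fine-grained task name.
by_task = Counter(s.get("type", "unknown") for s in samples)

print(f"total QA pairs: {len(samples)}")
for task, n in by_task.most_common():
    print(f"{task:<40} {n}")
```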
Evaluation Results
Gemini‑2.5‑Pro tops the Chinese leaderboard and ranks third on English; Seed1.6‑vision tops the English leaderboard and is runner-up on Chinese. The open-source Qwen3‑Omni‑30B‑A3B‑Instruct takes second on English and third on Chinese.
Even the highest-ranked models average only about 60 out of 100 across the Chinese and English tasks, showing that every model remains lopsided: strong in some capability dimensions yet notably weak in others.
Gemini‑2.5‑Pro performs strongly on computational questions, demonstrating solid logical reasoning, while Llama‑3.1‑Nemotron‑Nano‑VL‑8B‑V1 earns a high English rank thanks to superior text-localization capability.
Basic text recognition remains relatively strong, but scores drop sharply on fine‑grained spatial and structural tasks such as Referring, Spotting, and Parsing. For example, English champion Seed1.6‑vision scores only 38.0 on Spotting.
Cross-language comparison reveals an imbalance: Llama‑3.1‑Nemotron‑Nano‑VL‑8B‑V1 scores 56.4 on English but only 40.1 on Chinese, suggesting training data or strategies skewed toward English.
Closed‑source models (Gemini series, GPT‑5, Seed1.6‑vision) lead overall, yet open‑source models are increasingly competitive. Five of the top‑10 English models and seven of the top‑10 Chinese models are open source (e.g., Qwen‑Omni, InternVL, SAIL‑VL, Ovis), achieving near‑state‑of‑the‑art results on text‑localization and visual‑text understanding.
Rankings will be refreshed quarterly.
Paper: OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning (NeurIPS 2025 Datasets and Benchmarks Track). URL: https://go.hyper.ai/VNHSX
Code repository: https://github.com/Yuliang-Liu/MultimodalOCR
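As a rough illustration of how QA-style answers can be scored, here is a minimal exact-match sketch. Everything in it (function names, the lenient normalization, the toy data) is assumed for illustration only; the repository above implements the official, task-specific evaluation, which should be used for real comparisons.

```python
from typing import Iterable, List

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace for a lenient comparison."""
    return " ".join(text.lower().split())

def exact_match_score(predictions: Iterable[str],
                      references: Iterable[List[str]]) -> float:
    """Fraction of predictions matching any acceptable answer.

    A toy stand-in only: the actual benchmark applies task-specific
    metrics (e.g., IoU-based scoring for localization and spotting),
    so this illustrates scoring for QA-style tasks, nothing more.
    """
    hits = total = 0
    for pred, refs in zip(predictions, references):
        total += 1
        if normalize(pred) in {normalize(r) for r in refs}:
            hits += 1
    return hits / total if total else 0.0

# Toy usage with made-up data:
preds = ["Hello World", "42"]
refs = [[" hello   world "], ["forty-two"]]
print(f"exact match: {exact_match_score(preds, refs):.2f}")  # 0.50
```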