How OmniScience Dataset Boosts Multimodal AI Understanding of Scientific Figures
The OmniScience project introduces a dataset of 1.5 million high-quality image-text pairs and a pipeline that parses complex scientific documents, rewrites figure captions with large language models, and substantially improves multimodal AI performance on benchmark tests.
OmniScience Dataset Overview
OmniScience is an open‑source multimodal dataset containing 1,500,000 high‑quality image‑text pairs extracted from high‑impact open‑access journals and preprint servers. The collection spans ten scientific disciplines (biology, materials science, physics, computer science, etc.) and includes more than 5,000,000 precisely localized sub‑images. In total the dataset comprises 4.3 billion tokens (1.9 billion image tokens and 2.4 billion text tokens), providing a dense training foundation for multimodal AI models.
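For readers who want to browse the corpus directly, the minimal sketch below streams a few records with the Hugging Face datasets library, using the repository linked at the end of this article. The split name and column layout are assumptions about the published schema, so inspect the actual fields before building on them.

```python
# Minimal sketch: stream a few OmniScience records with Hugging Face datasets.
# The repository ID comes from the link at the end of this article; the split
# name and column names are assumptions, not the confirmed schema.
from datasets import load_dataset

ds = load_dataset("UniParser/OmniScience", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())   # inspect the actual field names first
    if i >= 2:
        break
```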
Document Parsing and Curation
Researchers used the Uni‑Parser framework together with OCR to detect character labels embedded in figures. Heuristic rules were applied to resolve cross‑page and cross‑column references, enabling accurate extraction of:
Original figures
Corresponding captions
Surrounding paragraph text
Benchmarking on 500 scientific documents yielded 100 % precision for image‑caption‑text alignment.
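Conceptually, the parsing stage attaches each figure to its caption and to every paragraph that references it, even when the reference lives on another page or in another column. The sketch below illustrates that pairing step; the input structures and the "Figure N" regular expression stand in for Uni-Parser + OCR output and the project's heuristics, which are not published in this article.

```python
import re
from dataclasses import dataclass, field

# Illustrative pairing step. `parsed_figures` and `parsed_paragraphs` stand in
# for Uni-Parser + OCR output; their structure and the "Figure N" heuristic are
# assumptions for the sketch, not the project's actual code.

@dataclass
class FigureRecord:
    figure_id: str                 # e.g. "Figure 3"
    image_path: str                # cropped figure image
    caption: str                   # original caption text
    context: list = field(default_factory=list)   # referencing paragraphs

def pair_figures_with_text(parsed_figures, parsed_paragraphs):
    """Attach every paragraph that mentions a figure to that figure's record,
    regardless of the page or column the mention appears in."""
    records = {
        fig["label"]: FigureRecord(fig["label"], fig["image_path"], fig["caption"])
        for fig in parsed_figures
    }
    for para in parsed_paragraphs:
        for label, record in records.items():
            number = label.split()[-1]
            # Heuristic cross-reference match, e.g. "Figure 3" or "Fig. 3".
            if re.search(rf"\b(?:Fig\.?|Figure)\s*{number}\b", para["text"]):
                record.context.append(para["text"])
    return list(records.values())
```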
Deduplication was performed in two stages:
Document‑level deduplication based on DOI identifiers.
Image‑level deduplication using perceptual hash similarity to remove near‑duplicate figures.
After these steps, the final OmniScience corpus was assembled.
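A minimal sketch of the two-stage deduplication is shown below. It assumes each document carries DOI metadata and uses the imagehash library's perceptual hash as one possible way to detect near-duplicate figures; the Hamming-distance threshold is an illustrative value, not the one used by the project.

```python
from PIL import Image
import imagehash

# Stage 1: document-level deduplication keyed on DOI (illustrative).
def dedupe_documents(documents):
    seen_dois, unique_docs = set(), []
    for doc in documents:
        if doc["doi"] in seen_dois:
            continue
        seen_dois.add(doc["doi"])
        unique_docs.append(doc)
    return unique_docs

# Stage 2: image-level deduplication with a perceptual hash.
# The distance threshold (4 bits) is an assumed value for the sketch.
def dedupe_images(image_paths, max_hamming_distance=4):
    kept_hashes, unique_paths = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        if any(h - kept <= max_hamming_distance for kept in kept_hashes):
            continue   # near-duplicate of an image already kept
        kept_hashes.append(h)
        unique_paths.append(path)
    return unique_paths
```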
Dynamic Model Routing Pipeline
A scheduler routes caption‑rewriting tasks to the most appropriate large language model (LLM) according to three criteria:
Discipline (e.g., biology vs. materials science)
Visual type (SEM, NMR spectra, statistical charts, etc.)
Complexity of the original description (length, presence of long background text)
Specialized models are assigned as follows:
Gemini series – dense scientific visualizations such as scanning electron microscopy (SEM) images, nuclear magnetic resonance (NMR) spectra, and chemical structure diagrams.
Long‑context LLMs – samples that contain extensive narrative text.
Cost‑effective models (e.g., Qwen‑3, GPT‑5) – basic statistical charts and simpler figures.
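The routing decision can be pictured as a small dispatch function over the three criteria above. The sketch below is only an illustration: the visual-type labels and the caption-length threshold are assumptions, and only the model assignments mirror what the article describes.

```python
# Illustrative routing sketch. Visual-type labels and the length threshold are
# assumptions; the model assignments follow the description above.

DENSE_SCIENTIFIC_TYPES = {"sem", "nmr_spectrum", "chemical_structure"}

def route_caption_task(discipline: str, visual_type: str, source_text: str) -> str:
    """Pick an LLM family for one caption-rewriting task."""
    if visual_type in DENSE_SCIENTIFIC_TYPES:
        return "gemini"                    # dense scientific visualizations
    if len(source_text.split()) > 500:     # extensive narrative text (assumed cutoff)
        return "long-context-llm"
    # Basic statistical charts and simpler figures go to cheaper models,
    # e.g. Qwen-3 or GPT-5 per the assignments above. Discipline could further
    # refine the choice; that branch is omitted here for brevity.
    return "cost-effective-llm"
```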
Quality‑Control and Fact‑Checking Loop
Each generated description passes through a fact-checking module built on a vision-language model, which triangulates three sources:
Original figure image
Original caption
Newly generated description
If the module detects fabricated content or logical inconsistency, the error is fed back to the routing pipeline for regeneration.
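The loop can be sketched as a regenerate-until-verified routine. Both callables below (the routed rewriting model and the vision-language-model verifier) are hypothetical stand-ins, and the retry cap is an assumed policy; the article does not specify these details.

```python
# Closed-loop quality control (illustrative). `rewrite_caption` and
# `vlm_fact_check` are hypothetical stand-ins for the routed LLM call and the
# vision-language-model verifier; the retry cap of 3 is an assumed value.

def generate_verified_description(figure_image, original_caption,
                                  rewrite_caption, vlm_fact_check,
                                  max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        candidate = rewrite_caption(figure_image, original_caption, feedback)
        # Triangulate the three sources: figure image, original caption, new text.
        verdict = vlm_fact_check(figure_image, original_caption, candidate)
        if verdict["faithful"]:
            return candidate
        feedback = verdict["issues"]   # feed detected errors back for regeneration
    return None                        # unresolved after max_attempts (assumed policy)
```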
Benchmark Evaluation
Fine‑tuning the Qwen‑2.5 model on OmniScience‑enhanced captions produced the following improvements:
Multimodal similarity score on the OmniScience validation set increased from 0.769 to 0.956.
Human‑aligned scoring (fluency, information consistency, key‑detail accuracy, richness) achieved a Pearson correlation of 0.831 with expert judgments.
When evaluated on external multimodal benchmarks, the OmniScience‑trained system showed absolute gains of:
+0.140 on the MMMU test set.
+0.083 on a remote‑sensing benchmark.
These results demonstrate that richer textual descriptions enable the model to answer complex scientific questions using text alone.
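The article does not state how the multimodal similarity score is computed. One common instantiation of such a metric is a CLIP-style cosine similarity between image and text embeddings, sketched below with the Hugging Face transformers CLIP model; treat it as an illustration of the kind of measurement involved, not the project's evaluation code.

```python
# Illustrative image-text similarity in the CLIP style; NOT the project's
# scoring code, just one plausible way such a metric can be computed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path: str, text: str) -> float:
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())   # cosine similarity in [-1, 1]
```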
Key Technical Takeaways
High‑quality, discipline‑balanced multimodal data can be constructed by combining advanced document parsing (Uni‑Parser + OCR) with rigorous deduplication.
Dynamic routing of caption‑rewriting tasks to domain‑specialized LLMs maximizes both accuracy and cost efficiency.
Closed-loop fact-checking with vision-language models keeps generated descriptions faithful to the source figures.
Training on richly annotated captions yields substantial gains in cross‑modal similarity and downstream benchmark performance.
Dataset repository: https://huggingface.co/datasets/UniParser/OmniScience
Preprint describing the methodology: https://arxiv.org/pdf/2602.13758