How OmniScience Dataset Boosts Multimodal AI Understanding of Scientific Figures
The OmniScience project introduces a dataset of 1.5 million high-quality image-text pairs and a pipeline that parses complex scientific documents, rewrites figure captions with large language models, and substantially improves multimodal AI performance on benchmark tests.
OmniScience Dataset Overview
OmniScience is an open‑source multimodal dataset containing 1,500,000 high‑quality image‑text pairs extracted from high‑impact open‑access journals and preprint servers. The collection spans ten scientific disciplines (biology, materials science, physics, computer science, etc.) and includes more than 5,000,000 precisely localized sub‑images. In total the dataset comprises 4.3 billion tokens (1.9 billion image tokens and 2.4 billion text tokens), providing a dense training foundation for multimodal AI models.
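For readers who want to browse the corpus directly, the minimal sketch below streams a few records with the Hugging Face datasets library, using the repository linked at the end of this article. The split name and column layout are assumptions about the published schema, so inspect the actual fields before building on them.

```python
# Minimal sketch: stream a few OmniScience records with Hugging Face datasets.
# The repository ID comes from the link at the end of this article; the split
# name and column names are assumptions, not the confirmed schema.
from datasets import load_dataset

ds = load_dataset("UniParser/OmniScience", split="train", streaming=True)

for i, sample in enumerate(ds):
    print(sample.keys())   # inspect the actual field names first
    if i >= 2:
        break
```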
Document Parsing and Curation
Researchers used the Uni‑Parser framework together with OCR to detect character labels embedded in figures. Heuristic rules were applied to resolve cross‑page and cross‑column references, enabling accurate extraction of:
Original figures
Corresponding captions
Surrounding paragraph text
Benchmarking on 500 scientific documents yielded 100 % precision for image‑caption‑text alignment.
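Conceptually, the parsing stage attaches each figure to its caption and to every paragraph that references it, even when the reference lives on another page or in another column. The sketch below illustrates that pairing step; the input structures and the "Figure N" regular expression stand in for Uni-Parser + OCR output and the project's heuristics, which are not published in this article.

```python
import re
from dataclasses import dataclass, field

# Illustrative pairing step. `parsed_figures` and `parsed_paragraphs` stand in
# for Uni-Parser + OCR output; their structure and the "Figure N" heuristic are
# assumptions for the sketch, not the project's actual code.

@dataclass
class FigureRecord:
    figure_id: str                 # e.g. "Figure 3"
    image_path: str                # cropped figure image
    caption: str                   # original caption text
    context: list = field(default_factory=list)   # referencing paragraphs

def pair_figures_with_text(parsed_figures, parsed_paragraphs):
    """Attach every paragraph that mentions a figure to that figure's record,
    regardless of the page or column the mention appears in."""
    records = {
        fig["label"]: FigureRecord(fig["label"], fig["image_path"], fig["caption"])
        for fig in parsed_figures
    }
    for para in parsed_paragraphs:
        for label, record in records.items():
            number = label.split()[-1]
            # Heuristic cross-reference match, e.g. "Figure 3" or "Fig. 3".
            if re.search(rf"\b(?:Fig\.?|Figure)\s*{number}\b", para["text"]):
                record.context.append(para["text"])
    return list(records.values())
```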
Deduplication was performed in two stages:
Document‑level deduplication based on DOI identifiers.
Image‑level deduplication using perceptual hash similarity to remove near‑duplicate figures.
After these steps, the final OmniScience corpus was assembled.
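A minimal sketch of the two-stage deduplication is shown below. It assumes each document carries DOI metadata and uses the imagehash library's perceptual hash as one possible way to detect near-duplicate figures; the Hamming-distance threshold is an illustrative value, not the one used by the project.

```python
from PIL import Image
import imagehash

# Stage 1: document-level deduplication keyed on DOI (illustrative).
def dedupe_documents(documents):
    seen_dois, unique_docs = set(), []
    for doc in documents:
        if doc["doi"] in seen_dois:
            continue
        seen_dois.add(doc["doi"])
        unique_docs.append(doc)
    return unique_docs

# Stage 2: image-level deduplication with a perceptual hash.
# The distance threshold (4 bits) is an assumed value for the sketch.
def dedupe_images(image_paths, max_hamming_distance=4):
    kept_hashes, unique_paths = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))
        if any(h - kept <= max_hamming_distance for kept in kept_hashes):
            continue   # near-duplicate of an image already kept
        kept_hashes.append(h)
        unique_paths.append(path)
    return unique_paths
```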
Dynamic Model Routing Pipeline
A scheduler routes caption‑rewriting tasks to the most appropriate large language model (LLM) according to three criteria:
Discipline (e.g., biology vs. materials science)
Visual type (SEM, NMR spectra, statistical charts, etc.)
Complexity of the original description (length, presence of long background text)
Specialized models are assigned as follows:
Gemini series – dense scientific visualizations such as scanning electron microscopy (SEM) images, nuclear magnetic resonance (NMR) spectra, and chemical structure diagrams.
Long‑context LLMs – samples that contain extensive narrative text.
Cost‑effective models (e.g., Qwen‑3, GPT‑5) – basic statistical charts and simpler figures.
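The routing decision can be pictured as a small dispatch function over the three criteria above. The sketch below is only an illustration: the visual-type labels and the caption-length threshold are assumptions, and only the model assignments mirror what the article describes.

```python
# Illustrative routing sketch. Visual-type labels and the length threshold are
# assumptions; the model assignments follow the description above.

DENSE_SCIENTIFIC_TYPES = {"sem", "nmr_spectrum", "chemical_structure"}

def route_caption_task(discipline: str, visual_type: str, source_text: str) -> str:
    """Pick an LLM family for one caption-rewriting task."""
    if visual_type in DENSE_SCIENTIFIC_TYPES:
        return "gemini"                    # dense scientific visualizations
    if len(source_text.split()) > 500:     # extensive narrative text (assumed cutoff)
        return "long-context-llm"
    # Basic statistical charts and simpler figures go to cheaper models,
    # e.g. Qwen-3 or GPT-5 per the assignments above. Discipline could further
    # refine the choice; that branch is omitted here for brevity.
    return "cost-effective-llm"
```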
Quality‑Control and Fact‑Checking Loop
Each generated description passes through a fact-checking module built on a vision-language model, which triangulates three sources:
Original figure image
Original caption
Newly generated description
If the module detects fabricated content or logical inconsistency, the error is fed back to the routing pipeline for regeneration.
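The loop can be sketched as a regenerate-until-verified routine. Both callables below (the routed rewriting model and the vision-language-model verifier) are hypothetical stand-ins, and the retry cap is an assumed policy; the article does not specify these details.

```python
# Closed-loop quality control (illustrative). `rewrite_caption` and
# `vlm_fact_check` are hypothetical stand-ins for the routed LLM call and the
# vision-language-model verifier; the retry cap of 3 is an assumed value.

def generate_verified_description(figure_image, original_caption,
                                  rewrite_caption, vlm_fact_check,
                                  max_attempts=3):
    feedback = None
    for _ in range(max_attempts):
        candidate = rewrite_caption(figure_image, original_caption, feedback)
        # Triangulate the three sources: figure image, original caption, new text.
        verdict = vlm_fact_check(figure_image, original_caption, candidate)
        if verdict["faithful"]:
            return candidate
        feedback = verdict["issues"]   # feed detected errors back for regeneration
    return None                        # unresolved after max_attempts (assumed policy)
```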
Benchmark Evaluation
Fine‑tuning the Qwen‑2.5 model on OmniScience‑enhanced captions produced the following improvements:
Multimodal similarity score on the OmniScience validation set increased from 0.769 to 0.956.
Human‑aligned scoring (fluency, information consistency, key‑detail accuracy, richness) achieved a Pearson correlation of 0.831 with expert judgments.
When evaluated on external multimodal benchmarks, the OmniScience‑trained system showed absolute gains of:
+0.140 on the MMMU test set.
+0.083 on a remote‑sensing benchmark.
These results demonstrate that richer textual descriptions enable the model to answer complex scientific questions using text alone.
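The article does not state how the multimodal similarity score is computed. One common instantiation of such a metric is a CLIP-style cosine similarity between image and text embeddings, sketched below with the Hugging Face transformers CLIP model; treat it as an illustration of the kind of measurement involved, not the project's evaluation code.

```python
# Illustrative image-text similarity in the CLIP style; NOT the project's
# scoring code, just one plausible way such a metric can be computed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(image_path: str, text: str) -> float:
    inputs = processor(text=[text], images=Image.open(image_path),
                       return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())   # cosine similarity in [-1, 1]
```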
Key Technical Takeaways
High‑quality, discipline‑balanced multimodal data can be constructed by combining advanced document parsing (Uni‑Parser + OCR) with rigorous deduplication.
Dynamic routing of caption‑rewriting tasks to domain‑specialized LLMs maximizes both accuracy and cost efficiency.
Closed-loop fact-checking with vision-language models keeps generated descriptions faithful to the source figures.
Training on richly annotated captions yields substantial gains in cross‑modal similarity and downstream benchmark performance.
Dataset repository: https://huggingface.co/datasets/UniParser/OmniScience
Preprint describing the methodology: https://arxiv.org/pdf/2602.13758