Essential Multimodal Datasets for AI Research – Links, Stats, and Quick Overview
This article compiles a curated list of widely used multimodal datasets—including CLEVR, Visual Genome, Pangea, Touch‑Vision‑Language, WIT, and more—providing download URLs, key statistics, and brief descriptions to help researchers quickly locate the right data for vision‑language and multimodal model training.
CLEVR (Multimodal)
Download link: http://eu5bx.ensl.cn/4f
CLEVR is a synthetic image‑question answering dataset designed to evaluate reasoning abilities such as counting, comparison, and logical relationships.
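A minimal sketch for inspecting the question annotations, assuming the standard CLEVR_v1.0 release layout (a JSON file with a top‑level "questions" list); the path and field names below may differ in other versions.

```python
import json

# Minimal sketch: inspect CLEVR questions. Assumes the standard CLEVR_v1.0
# layout with a top-level "questions" list; adjust the path/keys to your copy.
with open("CLEVR_v1.0/questions/CLEVR_train_questions.json") as f:
    data = json.load(f)

for q in data["questions"][:3]:
    # Each entry pairs a rendered image with a question, its answer, and the
    # functional program used to generate the question.
    print(q["image_filename"], "|", q["question"], "->", q["answer"])
```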
Visual Genome
Download link: http://eu5ba.ensl.cn/ad
Visual Genome is a large-scale dataset and knowledge base that links structured image concepts with language. Key statistics:
108,077 images
5.4 million region descriptions
1.7 million visual question‑answer pairs
3.8 million object instances
2.8 million attributes
2.3 million relationships
All content mapped to WordNet synsets
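A minimal loading sketch, assuming the standard Visual Genome JSON dump (region_descriptions.json); exact field names can vary between releases, so check your download.

```python
import json

# Minimal sketch, assuming the standard Visual Genome region_descriptions.json
# dump; field names may differ between releases.
with open("region_descriptions.json") as f:
    images = json.load(f)

first = images[0]
for region in first["regions"][:5]:
    # Each region is a bounding box plus a free-form phrase describing it.
    print(region["phrase"], (region["x"], region["y"],
                             region["width"], region["height"]))
```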
Pangea‑7B
Download link: http://eu5bd.ensl.cn/9c
Pangea‑7B is an open‑source multilingual multimodal large language model (MLLM) trained on a 6‑million‑sample instruction dataset covering 39 languages. It is evaluated on the PangeaBench suite, which spans 14 datasets across 47 languages.
MultiCaRe
Download link: http://eu5be.ensl.cn/36
MultiCaRe is an open‑source clinical case dataset for medical image classification and multimodal AI. It contains over 72 K de‑identified PubMed Central case reports, summarizing more than 93 K clinical cases and 130 K images across specialties (oncology, cardiology, surgery, pathology). The dataset provides a hierarchical classification with >140 classes and logical constraints such as mutual exclusivity.
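To illustrate what a mutual‑exclusivity constraint means in practice, here is a small self‑contained check; the label group below is hypothetical and is not taken from the MultiCaRe taxonomy.

```python
# Illustrative sketch only: enforcing a mutual-exclusivity constraint over
# predicted image labels. The group below is hypothetical, not the actual
# MultiCaRe label taxonomy.
MUTUALLY_EXCLUSIVE = [{"ct", "mri", "ultrasound"}]

def violates_exclusivity(labels: set[str]) -> bool:
    """Return True if more than one label from any exclusive group is present."""
    return any(len(labels & group) > 1 for group in MUTUALLY_EXCLUSIVE)

print(violates_exclusivity({"ct", "mri"}))       # True: conflicting modalities
print(violates_exclusivity({"ct", "oncology"}))  # False
```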
Touch‑Vision‑Language (TVL) Dataset
Download link: http://eu5b8.ensl.cn/20
TVL pairs tactile and visual observations with human annotations and VLM‑generated tactile semantic labels. Data were collected using a handheld 3D‑printed device equipped with a DIGIT tactile sensor (a vision‑based sensor that captures RGB images of its deformable gel surface) and a Logitech BRIO camera for synchronized visual capture. Each sample includes time‑aligned tactile‑visual‑language triples.
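A sketch of what one time‑aligned triple might look like as a record; this schema is hypothetical and does not reflect the actual TVL file format or field names.

```python
from dataclasses import dataclass

# Hypothetical record layout for one time-aligned sample; the real TVL
# file format and field names may differ.
@dataclass
class TVLSample:
    timestamp: float        # shared capture time for both sensors
    tactile_image: str      # path to the DIGIT tactile frame
    visual_image: str       # path to the synchronized BRIO camera frame
    labels: list[str]       # human or VLM-generated tactile adjectives

sample = TVLSample(0.033, "tactile/000001.jpg", "vision/000001.jpg",
                   ["smooth", "hard"])
print(sample)
```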
WIT (Multimodal)
Download link: http://eu5bu.ensl.cn/b2
WIT (Wikipedia Image Text) is a large multilingual multimodal dataset containing 37.6 million image‑text examples, 11.5 million unique images, and text in 108 Wikipedia languages, suitable for training large‑scale vision‑language models.
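WIT is typically distributed as gzipped TSV shards; a minimal scanning sketch follows. The shard filename and column names here are assumptions, so check the header row of your downloaded files.

```python
import csv
import gzip

# Minimal sketch for scanning one WIT TSV shard; the shard name and column
# names are assumptions (verify against the header of your download).
with gzip.open("wit_v1.train.all-00000-of-00010.tsv.gz", "rt", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")
    for i, row in enumerate(reader):
        print(row.get("language"), row.get("image_url"))
        if i == 4:
            break
```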
Wukong Dataset
Download link: http://eu5br.ensl.cn/8f
Wukong is a massive Chinese multimodal dataset containing 100 million image‑text pairs. Images are filtered for size (>200 px) and aspect ratio (1/3 – 3). Text is screened for language, length, frequency, privacy, and sensitive content.
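The image‑side filter described above is straightforward to express in code; the sketch below follows the stated thresholds, but the function itself is illustrative and not taken from the Wukong pipeline.

```python
# Illustrative re-implementation of the image filter described above
# (minimum side length and aspect-ratio bounds); thresholds follow the text,
# but this function is not from the actual Wukong pipeline.
def keep_image(width: int, height: int,
               min_side: int = 200,
               min_ratio: float = 1 / 3,
               max_ratio: float = 3.0) -> bool:
    if min(width, height) <= min_side:
        return False
    ratio = width / height
    return min_ratio <= ratio <= max_ratio

print(keep_image(640, 480))   # True
print(keep_image(1200, 150))  # False: one side too small, ratio too extreme
```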
MINT‑1T
Dataset link: http://edvvz.ensl.cn/c3
MINT‑1T is an open‑source multimodal interleaved dataset with one trillion text tokens and 3.4 billion images, expanding existing open datasets by roughly an order of magnitude.
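"Interleaved" means each document is an ordered mix of text spans and image references. The sketch below is illustrative of that idea only; it is not the exact MINT‑1T record schema.

```python
# Hypothetical sketch of an interleaved document: an ordered mix of text
# spans and image references. Illustrative only, not the MINT-1T format.
doc = [
    {"type": "text", "content": "The launch vehicle sits on the pad."},
    {"type": "image", "content": "images/launch_pad.jpg"},
    {"type": "text", "content": "Minutes later it lifts off."},
]

# A loader would typically tokenize text spans and insert image placeholders
# in reading order, preserving the interleaving.
for block in doc:
    print(block["type"], "->", block["content"][:40])
```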
WuDaoCorpora Text Pre‑training Dataset
Dataset link: http://edvvt.ensl.cn/ce
WuDaoCorpora, built by the Beijing Academy of Artificial Intelligence, is a large‑scale high‑quality dataset for large‑model training. It comprises text, dialogue, image‑text pairs, and video‑text pairs, aiming to bridge language, vision, and video modalities.
Conceptual Captions
Dataset link: http://edvv7.ensl.cn/09
Conceptual Captions contains over 3 million images paired with natural‑language captions derived from web alt‑text, useful for image‑text pre‑training.
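The dataset ships as tab‑separated caption/URL pairs; a minimal reading sketch follows. The filename is the commonly used training‑split name and may differ in your copy.

```python
import csv

# Minimal sketch: Conceptual Captions ships as tab-separated caption/URL
# pairs with no header. The filename is the usual training-split name;
# adjust it to whatever your download is called.
with open("Train_GCC-training.tsv", newline="") as f:
    for i, (caption, url) in enumerate(csv.reader(f, delimiter="\t")):
        print(caption[:60], "|", url)
        if i == 4:
            break
```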
SBU Captions Dataset
Dataset link: http://edvvj.ensl.cn/76
SBU Captions provides 1 million Flickr images paired with the user‑written descriptions attached to the original photos.
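A minimal sketch, assuming the classic SBU distribution of two line‑aligned text files (one image URL and one caption per line); the filenames are the commonly used ones and may differ in your copy.

```python
# Minimal sketch, assuming the classic SBU distribution of two line-aligned
# files (one URL and one caption per line); filenames may differ.
with open("SBU_captioned_photo_dataset_urls.txt") as fu, \
     open("SBU_captioned_photo_dataset_captions.txt") as fc:
    for i, (url, caption) in enumerate(zip(fu, fc)):
        print(url.strip(), "|", caption.strip())
        if i == 4:
            break
```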
MiniGPT‑4 Dataset
Dataset link: http://edvv5.ensl.cn/7a
This dataset supplies high‑quality image‑text pairs for the second‑stage fine‑tuning of the MiniGPT‑4 model.
Ego‑Exo4D
Dataset link: https://ego-exo4d-data.org/
Ego‑Exo4D offers three kinds of time‑synchronized video‑language annotations: expert commentary, narrate‑and‑act descriptions, and one‑sentence atomic action captions, supporting research on video‑language understanding.