Exploring WuDaoMM: A 650M Chinese‑English Multimodal Dataset for Pre‑training
The article introduces WuDaoMM and WuDaoCorpora 2.0, massive Chinese‑English multimodal datasets—including 650 million image‑text pairs, 3 TB of text, 93 TB of images, and 181 GB of dialogue—detailing their composition, formats, access options, and potential research applications.
WuDaoMM Base Dataset
WuDaoMM is a Chinese‑English multimodal dataset with ~650 million image‑text pairs (≈93 TB). It contains 5 × 10⁷ strongly correlated pairs and 6 × 10⁸ weakly correlated pairs, organized into 19 high‑level categories (including energy, emotion, industry, medicine, scenery, animals, news, flowers, education, art, people, science, ocean, trees, cars, social, technology, and sports). Each category holds 70 k–400 k examples.
For rapid prototyping, the authors release a baseline subset WuDaoMM‑base comprising 5 million strongly related pairs, sampled evenly across the 19 categories. Full‑dataset access requires contacting [email protected].
WuDaoCorpora 2.0
WuDaoCorpora 2.0 aggregates three large sub‑datasets:
Text Corpus: 3 TB of cleaned text data (≈200 GB released as open source) in JSON format. The raw source exceeds 100 TB of web pages; cleaning applies more than 20 rules to remove privacy‑sensitive content. The corpus covers 50+ industry tags (e.g., education, technology).
Image‑Text Corpus: 6.5 × 10⁸ high‑quality image‑text pairs (~93 TB) stored as JSON metadata plus JPG images. Each record includes url (download link), captions (description), name (file name), and tag (category). The data span 60+ categories and include both Chinese and Western sources, with illegal or sensitive content filtered out.
Dialogue Corpus: 181 GB of Chinese dialogue data (~1.4 billion dialogue turns) in JSON format, filtered from 9 TB of raw data. It is intended for research on intelligent assistants, virtual companions, and open‑domain dialogue systems.
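To make the image‑text record layout concrete, here is a minimal sketch of parsing one JSON metadata record using the four fields named above (url, captions, name, tag). The example values and the single‑record string are hypothetical; the real corpus ships as large JSON files, and the exact file layout may differ.

```python
import json

# One image-text record, using the field names described above
# (url, captions, name, tag). The values are made up for illustration.
record_json = '''
{
  "url": "https://example.com/images/0001.jpg",
  "captions": "A snow-capped mountain at sunrise",
  "name": "0001.jpg",
  "tag": "scenery"
}
'''

record = json.loads(record_json)

def summarize(rec):
    """Return a short human-readable summary of one image-text pair."""
    return f"[{rec['tag']}] {rec['name']}: {rec['captions']}"

print(summarize(record))
# → [scenery] 0001.jpg: A snow-capped mountain at sunrise
```

In practice one would iterate over the metadata file, fetch each `url`, and pair the downloaded JPG with its `captions` string for pre‑training.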
Access and References
Paper: WuDaoMM: A large‑scale Multi‑Modal Dataset for Pre‑training models (arXiv:2203.11480).
Data portal: https://data.wudaoai.cn/home
The dataset supports large‑scale multimodal pre‑training for Chinese AI models such as Wenlan and CogView, addressing the shortage of high‑quality Chinese multimodal data.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.