Exploring WuDaoMM: A 650M Chinese‑English Multimodal Dataset for Pre‑training
The article introduces WuDaoMM and WuDaoCorpora 2.0, massive Chinese‑English multimodal datasets—including 650 million image‑text pairs, 3 TB of text, 93 TB of images, and 181 GB of dialogue—detailing their composition, formats, access options, and potential research applications.
WuDaoMM Base Dataset
WuDaoMM is a Chinese‑English multimodal dataset with ~650 million image‑text pairs (≈93 TB). It contains 5 × 10⁷ strongly correlated pairs and 6 × 10⁸ weakly correlated pairs, organized into 19 high‑level categories (including energy, emotion, industry, medicine, scenery, animals, news, flowers, education, art, people, science, ocean, trees, cars, social, technology, and sports). Each category holds 70 k–400 k examples.
For rapid prototyping, the authors release a baseline subset WuDaoMM‑base comprising 5 million strongly related pairs, sampled evenly across the 19 categories. Full‑dataset access requires contacting [email protected].
WuDaoCorpora 2.0
WuDaoCorpora 2.0 aggregates three large sub‑datasets:
Text Corpus: 3 TB of cleaned text data (≈200 GB released as open source) in JSON format. The raw source exceeds 100 TB of web pages; cleaning applies more than 20 rules to remove privacy‑sensitive content. The corpus covers 50+ industry tags (e.g., education, technology).
Image‑Text Corpus: 6.5 × 10⁸ high‑quality image‑text pairs (~93 TB) stored as JSON metadata plus JPG images. Each record includes url (download link), captions (description), name (file name), and tag (category). The data span 60+ categories and include both Chinese and Western sources, with illegal or sensitive content filtered out.
Dialogue Corpus: 181 GB of Chinese dialogue data (~1.4 billion dialogue turns) in JSON format, filtered from 9 TB of raw data. It is intended for research on intelligent assistants, virtual companions, and open‑domain dialogue systems.
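To make the image‑text record layout concrete, here is a minimal sketch of parsing one JSON metadata record using the four fields named above (url, captions, name, tag). The example values and the single‑record string are hypothetical; the real corpus ships as large JSON files, and the exact file layout may differ.

```python
import json

# One image-text record, using the field names described above
# (url, captions, name, tag). The values are made up for illustration.
record_json = '''
{
  "url": "https://example.com/images/0001.jpg",
  "captions": "A snow-capped mountain at sunrise",
  "name": "0001.jpg",
  "tag": "scenery"
}
'''

record = json.loads(record_json)

def summarize(rec):
    """Return a short human-readable summary of one image-text pair."""
    return f"[{rec['tag']}] {rec['name']}: {rec['captions']}"

print(summarize(record))
# → [scenery] 0001.jpg: A snow-capped mountain at sunrise
```

In practice one would iterate over the metadata file, fetch each `url`, and pair the downloaded JPG with its `captions` string for pre‑training.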
Access and References
Paper: WuDaoMM: A large‑scale Multi‑Modal Dataset for Pre‑training models (arXiv:2203.11480).
Data portal: https://data.wudaoai.cn/home
The dataset supports large‑scale multimodal pre‑training for Chinese AI models such as Wenlan and CogView, addressing the shortage of high‑quality Chinese multimodal data.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.