CMIngre: A Cross‑Modal Ingredient‑Level Dataset for Chinese Food Understanding
The CMIngre dataset, created by Meituan’s in‑store R&D platform and Tianjin University, offers 8,001 image‑text pairs of 429 Chinese dishes with 95,290 ingredient bounding boxes. It supports fine‑grained ingredient detection and cross‑modal ingredient retrieval. In baseline experiments, DINO performs best among the detectors and CLIP‑based models perform best for retrieval.
Meituan’s in‑store R&D platform and Professor Liu An‑an’s team from Tianjin University collaborated on a research project titled “Cross‑Modal Ingredient‑Level Knowledge Graph Construction”. The work introduces a new benchmark dataset, CMIngre (Cross‑Modal Ingredient‑level Dataset), which provides fine‑grained ingredient annotations for Chinese dishes.
The dataset contains 8,001 image‑text pairs collected from three sources (dish photos, recipe images, and user‑generated content). It covers 429 distinct Chinese dishes and includes 95,290 ingredient bounding boxes spanning 429 ingredient categories, the label set that remains after cleaning and hierarchical merging.
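To get a feel for annotation density, the headline counts can be turned into rough averages. This is a back‑of‑the‑envelope sketch that assumes one image per image‑text pair; the variable names are illustrative and not part of any released tooling.

```python
# Headline counts quoted in the text above.
num_pairs = 8_001        # image-text pairs (assumed: one image per pair)
num_boxes = 95_290       # ingredient bounding boxes
num_categories = 429     # ingredient categories after merging

# Simple averages; they say nothing about how boxes are actually distributed.
boxes_per_image = num_boxes / num_pairs
boxes_per_category = num_boxes / num_categories

print(f"{boxes_per_image:.2f} boxes per image on average")        # 11.91
print(f"{boxes_per_category:.1f} boxes per category on average")  # 222.1
```

Roughly a dozen boxes per image suggests densely annotated scenes, though the real per‑image and per‑category distributions may differ considerably from these means.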
CMIngre enables two core tasks: (1) ingredient detection, which locates and classifies individual ingredients in dish images, and (2) cross‑modal ingredient retrieval, which matches images to ingredient sets and vice versa. Baseline experiments evaluate classic CNN detectors (Faster R‑CNN, YOLO v5) and a Transformer‑based detector (DINO) for the detection task, and image‑text matching models built on several image backbones (ResNet‑50, ViT B/16, CLIP ViT B/16) for retrieval.
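Detection benchmarks of this kind are usually scored by how well a predicted box overlaps a ground‑truth box of the same class. The sketch below shows the standard intersection‑over‑union (IoU) computation and a thresholded match test. The `(x_min, y_min, x_max, y_max)` box format, the function names, and the example labels are assumptions for illustration, not the dataset’s actual annotation schema or the authors’ evaluation code.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred_box: Box, pred_label: str,
             gt_box: Box, gt_label: str,
             threshold: float = 0.5) -> bool:
    """A prediction counts as correct if the label agrees and IoU >= threshold."""
    return pred_label == gt_label and iou(pred_box, gt_box) >= threshold


if __name__ == "__main__":
    pred = (10.0, 10.0, 50.0, 50.0)
    gt = (12.0, 12.0, 52.0, 52.0)
    print(round(iou(pred, gt), 3))
    print(is_match(pred, "tofu", gt, "tofu", threshold=0.5))
    print(is_match(pred, "tofu", gt, "tofu", threshold=0.9))
```

Raising `threshold` demands tighter localization, which is why the same detector can score well under a loose criterion and noticeably worse under a strict one.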
Results show that while existing detectors achieve moderate performance on CMIngre, DINO outperforms Faster R‑CNN at higher IoU thresholds, indicating the benefit of large‑scale pre‑training. For retrieval, CLIP‑based backbones achieve the best median rank (medR) and Recall@K scores, and a two‑stage approach that first extracts region features with an ingredient detector further improves performance over end‑to‑end methods.
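The retrieval metrics mentioned above can be computed from a query‑by‑candidate similarity matrix. The following is a minimal sketch, assuming that query `i`’s correct candidate sits at index `i` and that higher scores mean better matches; the function names and the toy scores are hypothetical, and the authors’ evaluation protocol may differ in details such as candidate pool size.

```python
from statistics import median
from typing import Dict, Sequence


def rank_of_match(scores: Sequence[float], match_index: int) -> int:
    """1-based rank of the correct candidate when sorted by descending score."""
    target = scores[match_index]
    return 1 + sum(1 for s in scores if s > target)


def retrieval_metrics(score_matrix: Sequence[Sequence[float]],
                      ks: Sequence[int] = (1, 5, 10)) -> Dict[str, float]:
    """medR (lower is better) and Recall@K (higher is better).

    Assumes row i holds query i's scores and column i is its true match.
    """
    ranks = [rank_of_match(row, i) for i, row in enumerate(score_matrix)]
    metrics = {"medR": float(median(ranks))}
    for k in ks:
        # Fraction of queries whose true match appears in the top k.
        metrics[f"R@{k}"] = sum(r <= k for r in ranks) / len(ranks)
    return metrics


if __name__ == "__main__":
    # Toy image-to-ingredient-set scores: 3 queries, 3 candidates.
    scores = [
        [0.9, 0.1, 0.2],  # true match ranked 1st
        [0.8, 0.5, 0.3],  # true match ranked 2nd
        [0.7, 0.6, 0.1],  # true match ranked 3rd
    ]
    print(retrieval_metrics(scores, ks=(1, 2)))
```

The same function covers both retrieval directions: transposing the score matrix swaps the roles of images and ingredient sets.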
The authors contribute (1) the definition of a new Chinese‑food understanding task, (2) the CMIngre dataset, (3) baseline detection and retrieval methods, and (4) extensive experiments demonstrating the dataset’s difficulty compared with COCO and Recipe1M. The work aims to stimulate research on fine‑grained food understanding and cross‑modal food retrieval.
Meituan Technology Team
Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform. Supporting hundreds of millions of consumers, millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.
