
Food2K: A Large-Scale Food Image Dataset and Progressive Region Enhancement Network

This article reviews the Food2K dataset and the Progressive Region Enhancement Network proposed for large‑scale food image recognition. It covers dataset construction, method design, experiments, ablation studies, visualizations, and future research directions, as reported in the IEEE T‑PAMI 2023 paper.

Meituan Technology Team

Feb 23, 2023

1 Introduction

The Visual Intelligence team at Meituan and the Institute of Computing Technology, Chinese Academy of Sciences collaborated in 2020‑2021 on a fine‑grained food image recognition and retrieval project, resulting in the IEEE T‑PAMI 2023 paper "Large Scale Visual Food Recognition" (Min et al., 2023). The paper introduces the Food2K dataset and a Progressive Region Enhancement Network (PREN) for food image classification.

2 Food2K Dataset

Food2K contains 1,036,564 images covering 2,000 food categories organized into 12 super‑classes (e.g., vegetables, meat, barbecue, fried foods) and 26 sub‑classes. Compared with existing datasets such as Food‑101, Vireo Food‑172, and ISIA Food‑500, Food2K is an order of magnitude larger in both class count and image quantity. The dataset was built with strict cleaning, iterative labeling, and multiple expert checks to ensure high quality. It exhibits a long‑tail distribution (images per class ranging from 153 to 1,999) and includes finer‑grained annotations (e.g., multiple pizza sub‑categories) and diverse visual appearances caused by ingredient combinations, accessories, and layouts.
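The scale and long‑tail figures quoted above imply two simple statistics. The snippet below is only a quick arithmetic check on those numbers; the variable names are illustrative.

```python
# Figures quoted in the text above.
total_images = 1_036_564
num_classes = 2_000
min_per_class, max_per_class = 153, 1_999

# Average class size and the ratio between the largest and smallest class.
mean_per_class = total_images / num_classes
imbalance_ratio = max_per_class / min_per_class

print(f"mean images per class: {mean_per_class:.1f}")      # 518.3
print(f"largest / smallest class: {imbalance_ratio:.1f}x")  # 13.1x
```

The largest class therefore holds roughly 13 times as many images as the smallest one.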

3 Method

3.1 Global‑Local Feature Learning

Food images display both global characteristics (overall shape, color, structure) and subtle local details (ingredient‑specific regions). PREN extracts global features via Global Average Pooling (GAP) on the last convolutional layer and learns complementary local features through a progressive training strategy.
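As a rough illustration of the global branch, the sketch below applies Global Average Pooling to a toy feature map in plain Python. The nested-list layout and the toy values are assumptions made for readability; a real model would run this on framework tensors produced by the backbone.

```python
def global_average_pool(feature_map):
    """Average each channel's H x W activations into a single value.

    feature_map: list of C channels, each a list of H rows of W floats.
    Returns a list of C floats (the global feature vector).
    """
    pooled = []
    for channel in feature_map:
        values = [v for row in channel for v in row]
        pooled.append(sum(values) / len(values))
    return pooled

# Toy 2-channel, 2x2 feature map (made-up values).
toy_map = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
global_feature = global_average_pool(toy_map)
print(global_feature)  # [2.5, 2.0]
```

Because every spatial position contributes equally, GAP summarizes overall appearance and dilutes small, localized responses such as the single strong activation in the second channel.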

3.2 Progressive Local Feature Learning

The local‑feature sub‑network is trained in stages. Early stages use shallow layers with small receptive fields to capture stable fine‑grained details; later stages expand the receptive field to learn coarser patterns. Each stage’s output is passed through a convolution + Global Max Pooling (GMP) to obtain a local feature vector. KL‑divergence is introduced between adjacent stages to force the network to focus on different regions, increasing inter‑stage diversity.
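The sketch below illustrates the two ingredients named above: Global Max Pooling to get a per-stage local feature vector, and a KL divergence between adjacent stages. Turning each vector into a distribution with softmax is an assumption made here for illustration; the paper's exact formulation may differ, and the convolution step is omitted.

```python
import math

def global_max_pool(feature_map):
    """Keep the strongest activation of each channel (C x H x W -> C)."""
    return [max(v for row in channel for v in row) for channel in feature_map]

def softmax(vector):
    """Numerically stable softmax over a list of floats."""
    m = max(vector)
    exps = [math.exp(x - m) for x in vector]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Toy feature maps for two adjacent stages (made-up values).
stage1_map = [[[0.1, 0.9], [0.2, 0.3]], [[0.0, 0.1], [0.1, 0.2]]]
stage2_map = [[[0.1, 0.2], [0.0, 0.1]], [[0.3, 1.5], [0.2, 0.4]]]

p = softmax(global_max_pool(stage1_map))
q = softmax(global_max_pool(stage2_map))

same = kl_divergence(p, p)       # identical distributions give zero divergence
different = kl_divergence(p, q)  # positive when the stages respond differently
print(same, different > 0)
```

A larger divergence means the two stages respond to different channels, which is the inter-stage diversity the training objective encourages.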

3.3 Region Feature Enhancement

To model relationships among local features, a self‑attention (Non‑Local) module aggregates multi‑scale context and produces enhanced local representations of the same spatial size. The enhanced local maps from all stages are fused with a convolutional layer and then combined with the global feature via a feature‑fusion layer.
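To make the self-attention idea concrete, here is a minimal sketch over a flattened feature map, where each of the N spatial positions carries a C-dimensional vector. It is a simplification: the learned projections of a real Non-Local block are replaced by identity mappings, and the scaled dot-product similarity is an assumption rather than the paper's stated choice.

```python
import math

def self_attention(features):
    """Enhance each position with a similarity-weighted sum of all positions.

    features: list of N vectors, each a list of C floats.
    Returns N vectors of length C (same shape as the input).
    """
    n, c = len(features), len(features[0])
    enhanced = []
    for i in range(n):
        # Scaled dot-product similarity between position i and every position j.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(c)
            for j in range(n)
        ]
        # Softmax turns similarities into attention weights.
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Context vector: attention-weighted sum over all positions.
        context = [sum(weights[j] * features[j][k] for j in range(n)) for k in range(c)]
        # Residual connection keeps the original local response.
        enhanced.append([f + ctx for f, ctx in zip(features[i], context)])
    return enhanced

# Toy input: 3 positions with 2 channels each (made-up values).
toy_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_enhanced = self_attention(toy_features)
print(len(toy_enhanced), len(toy_enhanced[0]))  # 3 2
```

The output keeps the input's shape, which is what allows the enhanced maps from different stages to be fused afterwards.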

3.4 Training and Inference

Training uses a cross‑entropy loss at each stage and an additional loss at the fusion stage. A KL‑divergence term is added to the total loss to increase stage‑wise discrepancy, with the balance parameters α and β controlling the relative weight of the loss terms. During inference, predictions from all stages and the fused representation are summed with equal weight to obtain the final class scores.
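The classification part of training and the equal-weight inference rule can be sketched as follows. The function names and toy scores are hypothetical, the fusion-stage loss is assumed to be cross-entropy as well, and the KL term is left out because its exact form and sign are not specified in this summary.

```python
import math

def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed from raw class scores."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[label]

def classification_loss(stage_logits, fused_logits, label):
    """Sum of per-stage cross-entropy plus the fused-branch cross-entropy."""
    stage_part = sum(cross_entropy(logits, label) for logits in stage_logits)
    return stage_part + cross_entropy(fused_logits, label)

def predict(stage_logits, fused_logits):
    """Sum all branch scores with equal weight and return (class, scores)."""
    branches = stage_logits + [fused_logits]
    summed = [sum(column) for column in zip(*branches)]
    best = max(range(len(summed)), key=summed.__getitem__)
    return best, summed

# Toy scores for 3 stages and the fused branch over 3 classes (made-up values).
toy_stage_logits = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [1.0, 1.2, 0.1]]
toy_fused_logits = [1.5, 1.0, 0.0]

toy_loss = classification_loss(toy_stage_logits, toy_fused_logits, label=0)
predicted_class, class_scores = predict(toy_stage_logits, toy_fused_logits)
print(predicted_class)  # 0
```

In this toy case stages 2 and 3 on their own would pick class 1, while the summed scores pick class 0, showing how combining branches can change the final decision.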

4 Experiments

4.1 Performance on Food2K

Using ResNet‑101 as the backbone, PREN improves Top‑1 accuracy by 2.24 percentage points and Top‑5 accuracy by 1.4 percentage points over the ResNet‑101 baseline. Table 1 (in the original paper) shows that PREN outperforms existing food‑recognition methods on Food2K.
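For readers less familiar with the two metrics, the following sketch shows how Top‑1 and Top‑5 style accuracy are typically computed from class scores. The data is a made-up three-class toy example, so Top‑2 stands in for Top‑5.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy scores for 3 samples over 3 classes (made-up values).
toy_scores = [
    [0.1, 0.7, 0.2],  # true class 1: ranked first
    [0.5, 0.3, 0.2],  # true class 1: ranked second
    [0.2, 0.3, 0.5],  # true class 0: ranked last
]
toy_labels = [1, 1, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(round(top1, 3), round(top2, 3))  # 0.333 0.667
```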

4.2 Ablation Study

Adding the progressive learning (PL) component yields a noticeable gain; combining PL with Region Enhancement (RE) further boosts performance.

Increasing the number of progressive stages U from 1 to 3 raises Top‑1 accuracy from 81.45 % to 83.03 %; U=4 causes a drop, likely because shallow layers focus on non‑discriminative features.

Using predictions from individual stages versus the combined score shows that fusion of all stage scores achieves the best accuracy.

Balancing parameters α and β: when only the KL‑divergence term is used (the cross‑entropy weight set to 0), the model fails to converge; when only cross‑entropy is used (the KL weight set to 0), performance degrades; the best results are obtained with a proper mix of both losses.

4.3 Visualization

Grad‑CAM visualizations on samples such as “Wasabi Octopus” demonstrate that baseline methods attend to limited regions, whereas PREN progressively focuses on different informative parts (e.g., vegetable leaf in stage 1, octopus body in stage 2, and overall shape in stage 3), confirming the effectiveness of progressive and attention‑based enhancements.
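For context, Grad‑CAM weights each channel of a convolutional feature map by the spatial average of the class-score gradient for that channel, sums the weighted channels, and applies a ReLU. The sketch below shows only that combination step; the activations and gradients are made-up values, whereas in practice they come from a forward and backward pass through the network.

```python
def grad_cam(activations, gradients):
    """Combine C x H x W activations and gradients into an H x W heatmap."""
    height, width = len(activations[0]), len(activations[0][0])
    # Channel weight = spatial mean of the gradient for that channel.
    weights = [
        sum(v for row in grad for v in row) / (height * width)
        for grad in gradients
    ]
    # Weighted sum over channels, then ReLU to keep positive evidence only.
    return [
        [
            max(0.0, sum(w * act[i][j] for w, act in zip(weights, activations)))
            for j in range(width)
        ]
        for i in range(height)
    ]

# Toy example: channel 0 supports the class, channel 1 opposes it.
toy_activations = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[0.0, 0.0], [0.0, 2.0]],
]
toy_gradients = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[-1.0, -1.0], [-1.0, -1.0]],
]
heatmap = grad_cam(toy_activations, toy_gradients)
print(heatmap)  # [[1.0, 0.0], [0.0, 0.0]]
```

Only the top-left position lights up, because the strong response in channel 1 carries a negative weight and is removed by the ReLU.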

4.4 Generalization Experiments

Models pretrained on Food2K were fine‑tuned on five downstream tasks:

Food image classification on ETH Food‑101, Vireo Food‑172, and ISIA Food‑500 – all show consistent accuracy gains.

Food detection – mAP and AP75 improve more with Food2K than with Food‑101.

Food segmentation – all pretrained models achieve higher segmentation metrics.

Food image retrieval – mAP and Recall@1 increase by 4–5 % on the three benchmarks, especially on Vireo Food‑172.

Cross‑modal recipe‑image retrieval (Recipe1M) – pretraining on Food2K yields larger performance gains than pretraining on Food‑101.
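The retrieval metrics mentioned in the list above can be sketched as follows. The sketch assumes each query comes with a ranked gallery already marked as relevant or not, which is a common but not universal evaluation setup; the function names and data are illustrative.

```python
def recall_at_1(ranked_relevance):
    """Fraction of queries whose top-ranked result is relevant."""
    return sum(1 for r in ranked_relevance if r[0]) / len(ranked_relevance)

def average_precision(relevance):
    """Mean of precision values taken at each rank holding a relevant item."""
    hits, precisions = 0, []
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0

def mean_average_precision(ranked_relevance):
    """Average precision averaged over all queries."""
    return sum(average_precision(r) for r in ranked_relevance) / len(ranked_relevance)

# Two toy queries, each with a ranked gallery of three items (made-up data).
toy_rankings = [
    [True, False, True],   # AP = (1/1 + 2/3) / 2
    [False, True, False],  # AP = (1/2) / 1
]
r_at_1 = recall_at_1(toy_rankings)
m_ap = mean_average_precision(toy_rankings)
print(round(r_at_1, 3), round(m_ap, 3))  # 0.5 0.667
```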

5 Future Work

The authors outline several directions:

Robust large‑scale food recognition: current fine‑grained methods (e.g., PMG, PAR‑Net) underperform on Food2K; Transformers may offer improvements.

Human visual evaluation of food recognition, considering cultural and regional biases.

Cross‑modal and cross‑cuisine transfer learning, including scene‑level and super‑class transfer.

Large‑scale few‑shot food recognition (e.g., LS‑FSFR) using Food2K as a benchmark.

Food image generation with GANs and attribute‑rich extensions of Food2K.

Enriching Food2K with finer annotations (region‑level, pixel‑level, aesthetic attributes) to support new tasks.

6 Conclusion

Food2K is presented as a new large‑scale benchmark for food image recognition, supporting a wide range of visual and multimodal tasks. The proposed Progressive Region Enhancement Network, comprising progressive local feature learning and self‑attention‑based region enhancement, demonstrates superior performance across all evaluated tasks.
