Re‑annotating ImageNet: 1.28 M Images Gain Multi‑Labels, Boosting COCO mAP by 4 Points
A Rochester research team automatically relabeled the entire 1.28 M‑image ImageNet training set with multi‑labels using self‑supervised object discovery and a lightweight region classifier, resulting in a pretrained model that raises COCO mAP by 4.2 points and VOC mAP by 2.3 points.
ImageNet has long been the backbone of computer‑vision research, yet its single‑label design limits supervision for images that contain multiple objects. Approximately 15% of ImageNet images actually have two or more relevant categories, causing noisy training signals and misleading performance metrics.
Automatic multi‑label relabeling pipeline
The Rochester team built a three‑stage, fully automated workflow to convert ImageNet into a dense multi‑label dataset.
1. Unsupervised object discovery
They leveraged the self‑supervised vision backbone DINOv3 together with the unsupervised object detector MaskCut to generate candidate object regions for every image. Unlike the Segment‑Anything Model (SAM), MaskCut is optimized for object‑level proposals, producing stable masks that cover salient regions. Each image yields up to N candidate boxes, and any remaining area is treated as an unlabeled instance.
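To make the discovery step concrete, below is a minimal, single-object sketch of the normalized-cut idea behind MaskCut, run on patch embeddings from a frozen self-supervised ViT. The affinity threshold, the patch-grid handling, and the single-cut simplification are illustrative assumptions, not the paper's exact settings; MaskCut proper repeats the cut, masking out each discovered object before the next round, which is how one image can yield several candidate boxes.

```python
import numpy as np
from scipy.linalg import eigh

def maskcut_style_box(patch_feats, grid_h, grid_w, tau=0.2):
    """One normalized-cut round on self-supervised patch features.

    patch_feats: (grid_h * grid_w, D) patch embeddings from a frozen
    DINO-style ViT. Returns a binary foreground mask (in patch grid
    coordinates) and its bounding box. Simplified to a single cut.
    """
    # Cosine-similarity affinity between patches, binarized as in NCut-based methods.
    f = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    W = f @ f.T
    W = np.where(W > tau, 1.0, 1e-5)  # keep a tiny weight so the graph stays connected

    # Normalized cut: second-smallest eigenvector of (D - W) x = lambda * D x.
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    fiedler = vecs[:, 0]

    # Bipartition patches by the eigenvector; treat the smaller side as the object.
    mask = (fiedler > fiedler.mean()).reshape(grid_h, grid_w)
    if mask.sum() > mask.size / 2:
        mask = ~mask

    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)  # patch coordinates
    return mask, box
```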
2. Training a region classifier
Using soft‑label maps from the prior ReLabel work as guidance, the pipeline selects the region most aligned with the original ImageNet single label as a positive example. A lightweight classification head is then trained to predict the ImageNet class for any given region, forcing the model to focus on the object itself rather than background cues.
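A rough PyTorch sketch of that selection-and-training step is shown below. The ReLabel-style score map, the ROI pooling resolution, and the helper names (pick_positive_region, RegionHead) are hypothetical stand-ins for illustration, not the authors' exact implementation.

```python
import torch
from torchvision.ops import roi_align

def pick_positive_region(score_map, boxes, target_class):
    """Choose the candidate box best aligned with the original single label.

    score_map: (C, H, W) ReLabel-style per-class score map for one image.
    boxes: (N, 4) candidate boxes in (x1, y1, x2, y2) image coordinates.
    Returns the index of the box whose pooled score for target_class is highest.
    """
    cls_map = score_map[target_class][None, None]                    # (1, 1, H, W)
    rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)  # prepend batch idx
    pooled = roi_align(cls_map, rois, output_size=1)                 # (N, 1, 1, 1)
    return pooled.flatten().argmax().item()

class RegionHead(torch.nn.Module):
    """Lightweight classifier head applied to ROI-pooled backbone features."""
    def __init__(self, feat_dim, num_classes=1000, pool_size=7):
        super().__init__()
        self.pool_size = pool_size
        self.fc = torch.nn.Linear(feat_dim * pool_size ** 2, num_classes)

    def forward(self, feat_map, boxes, spatial_scale):
        # feat_map: (1, D, h, w) frozen backbone features; boxes in image coordinates.
        rois = torch.cat([boxes.new_zeros(len(boxes), 1), boxes], dim=1)
        x = roi_align(feat_map, rois, output_size=self.pool_size,
                      spatial_scale=spatial_scale)
        return self.fc(x.flatten(1))

# Training step (sketch): the selected region is the positive for the image label.
# logits = head(feat_map, boxes[pos_idx:pos_idx + 1], spatial_scale=1 / 16)
# loss = torch.nn.functional.cross_entropy(logits, torch.tensor([imagenet_label]))
```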
3. Multi‑label inference
After training, the classifier is applied to all candidate regions of every image. Each region outputs a class‑score vector; scores above a threshold are kept, and overlapping predictions are merged. The result is an explicit multi‑label set for each of the 1.28 M training images, with every label traceable to a specific region.
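The thresholding-and-merging step could look roughly like the sketch below; the sigmoid scoring, the 0.5 threshold, and the per-class NMS merge rule are assumptions chosen for illustration, not values reported by the authors.

```python
import torch
from torchvision.ops import nms

def regions_to_multilabels(region_logits, boxes, score_thresh=0.5, iou_thresh=0.65):
    """Turn per-region class scores into an explicit multi-label set.

    region_logits: (N, C) classifier outputs for the N candidate regions of one image.
    boxes: (N, 4) candidate boxes, (x1, y1, x2, y2).
    Returns {class_id: [(box, score), ...]}, so every label stays traceable
    to the region(s) that produced it.
    """
    probs = region_logits.sigmoid()  # independent per-class confidences
    labels = {}
    for cls in range(probs.shape[1]):
        keep = (probs[:, cls] > score_thresh).nonzero(as_tuple=True)[0]
        if keep.numel() == 0:
            continue
        cls_boxes, cls_scores = boxes[keep], probs[keep, cls]
        # Merge overlapping predictions of the same class, keeping the best-scoring box.
        merged = nms(cls_boxes, cls_scores, iou_thresh)
        labels[cls] = [(cls_boxes[i].tolist(), cls_scores[i].item())
                       for i in merged.tolist()]
    return labels
```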
Results and impact
Retraining the same architecture on the newly annotated dataset yields a substantial boost: on COCO the model gains +4.2 mAP, and on VOC it improves by +2.3 mAP. Moreover, the roughly 14% accuracy drop typically observed on ImageNet‑V2 is largely mitigated, showing that better supervision, not larger models, drives the gains.
Limitations and future direction
The current approach still confines predictions to the original 1,000 ImageNet classes, so novel objects (e.g., a climbing boot not in the taxonomy) are mapped to the nearest known class, illustrating the “closed‑vocabulary” ceiling. The authors argue that future work should aim for open‑vocabulary detection, enabling models to recognize and describe any object beyond the fixed label set.
https://arxiv.org/pdf/2603.05729
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; a working AI practitioner sharing hardcore tech, engineering practice, and deep insights.
