Open Vocabulary Detection Contest 2023: Summary of Winning Teams' Technical Solutions
The article reviews the Open Vocabulary Detection Contest organized by the Chinese Society of Image and Graphics and 360 AI Institute, describing the competition setup, dataset characteristics, and detailed winning approaches that combine Detic, CLIP, prompt learning, and multi‑stage pipelines to achieve strong few‑shot and zero‑shot object detection performance.
The Open Vocabulary Detection Contest (OVD) 2023, co‑hosted by the Chinese Society of Image and Graphics and 360 AI Institute, concluded with 140 participating teams from universities and companies, focusing on open‑world object detection in e‑commerce product images.
OVD Technology Overview: OVD leverages cross‑modal models such as CLIP, ALIGN, and R2D2 to enable few‑shot detection of predefined categories and zero‑shot detection of unseen categories, addressing the core limitation of traditional closed‑set detectors, which can only recognize the classes they were trained on.
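The mechanism can be sketched in a few lines: an open‑vocabulary detector replaces the fixed classifier weights of a closed‑set head with class‑name text embeddings, so adding a category only requires encoding a new name. This is a minimal NumPy sketch with random stand‑in features (not any team's actual code); dimensions are illustrative:

```python
import numpy as np

def open_vocab_logits(region_feats, text_embs):
    """Score detected regions against arbitrary class-name embeddings.

    A closed-set detector learns a fixed (K, d) classifier matrix; here the
    'classifier' IS the matrix of text embeddings, so a new class is added
    by encoding one more name -- no retraining of the detection head.
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return r @ t.T  # (n_regions, n_classes) cosine similarities

# random stand-ins for region features and CLIP-style text embeddings
regions = np.random.default_rng(0).standard_normal((5, 512))
class_embs = np.random.default_rng(1).standard_normal((466, 512))
logits = open_vocab_logits(regions, class_embs)  # shape (5, 466)
```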
Competition Details: The dataset contains 466 product categories (233 base and 233 novel), with simple backgrounds and few objects per image. Base categories provide limited annotated samples, while novel categories have no annotations. Evaluation uses mAP@50 on both sets.
Champion Solution (Nanyang Technological University): Uses Detic with ResNet‑50/Swin‑B backbones and the CenterNet2 detector, combines web‑crawled image‑level classification data with detection data, employs a learnable 4‑token prompt in the CoOp style, and uses CLIP text embeddings as the classification head. Key hyper‑parameters: 18 000 iterations, batch sizes of 8×96 (image‑level) and 8×4 (detection), and image resolutions of 448/896.
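The CoOp‑style prompt can be sketched as follows: a small set of learnable context tokens is prepended to each (frozen) class‑name embedding before it enters CLIP's text encoder. This is a hedged sketch in plain PyTorch, not the champion's code; the toy class embeddings stand in for CLIP's tokenized class names, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class CoOpPrompt(nn.Module):
    """CoOp-style prompt learner: a few learnable context tokens are
    prepended to every class-name embedding (champion used 4 tokens)."""
    def __init__(self, n_ctx=4, dim=512, n_classes=466):
        super().__init__()
        # learnable context, shared across all classes
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # stand-in for the frozen class-name embeddings from CLIP's tokenizer
        self.cls_emb = nn.Parameter(torch.randn(n_classes, 1, dim) * 0.02,
                                    requires_grad=False)

    def forward(self):
        n_cls = self.cls_emb.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)  # (n_cls, n_ctx, dim)
        # the concatenated sequence would then be fed to CLIP's text encoder
        return torch.cat([ctx, self.cls_emb], dim=1)       # (n_cls, n_ctx+1, dim)

prompts = CoOpPrompt()()  # shape: (466, 5, 512)
```

During training only `ctx` receives gradients, so the prompt adapts to the product domain while CLIP's text encoder stays frozen.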
Runner‑up Solution (Huazhong University of Science and Technology & Institute of Automation): Trains a class‑agnostic foreground detector (no category labels required), uses LLM prompt engineering to generate diverse textual prompts, and relies on Chinese‑CLIP for multimodal alignment. The pipeline runs foreground proposal generation, a prompt ensemble (uniform averaging of text embeddings), and OCR‑assisted post‑processing.
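The uniform prompt ensemble can be illustrated with a short sketch: each class name is formatted into several templates, each template is encoded and unit‑normalized, and the normalized embeddings are averaged and re‑normalized. The `toy_encoder` below is a hypothetical stand‑in for the Chinese‑CLIP text encoder, and the templates are examples, not the team's actual prompts:

```python
import numpy as np

def ensemble_text_embedding(embed_fn, class_name, templates):
    """Uniformly average text embeddings over several prompt templates,
    then re-normalize -- the runner-up's prompt-ensemble scheme."""
    embs = np.stack([embed_fn(t.format(class_name)) for t in templates])
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)  # unit-normalize each
    mean = embs.mean(axis=0)                             # uniform average
    return mean / np.linalg.norm(mean)                   # back to unit length

def toy_encoder(text):
    """Hypothetical stand-in for a Chinese-CLIP text encoder."""
    rng = np.random.default_rng(abs(hash(text)) % (2 ** 32))
    return rng.standard_normal(512)

templates = ["a photo of a {}",
             "a product image of a {}",
             "a {} on a plain background"]
emb = ensemble_text_embedding(toy_encoder, "backpack", templates)  # (512,)
```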
Third‑Place Solution (Institute of Automation): Proposes a two‑stage framework that separates box regression (Cascade‑RCNN with a Swin‑Transformer‑Small backbone) from classification (Chinese‑CLIP ViT‑H‑224, fine‑tuned). Data augmentation includes large‑scale web‑crawled images and LLM‑generated descriptions, with LoRA/LiT fine‑tuning. Inference adds global feature fusion, OCR‑based correction, and class‑consistency rules.
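The two‑stage split can be sketched like this: the first stage emits class‑agnostic boxes, and the second stage crops each box and scores it against class text embeddings by cosine similarity. This is a minimal sketch, not the team's implementation; `toy_crop_encode` is a hypothetical stand‑in for the fine‑tuned Chinese‑CLIP image encoder:

```python
import numpy as np

def classify_proposals(image, boxes, crop_encode, text_embs, temperature=0.01):
    """Stage two of the two-stage pipeline: crop each class-agnostic box,
    embed the crop, and score it against class text embeddings."""
    labels, scores = [], []
    for (x1, y1, x2, y2) in boxes:
        feat = crop_encode(image[y1:y2, x1:x2])  # image-side embedding
        feat = feat / np.linalg.norm(feat)
        sims = text_embs @ feat                  # cosine similarity per class
        probs = np.exp(sims / temperature)       # softmax over classes
        probs /= probs.sum()
        labels.append(int(probs.argmax()))
        scores.append(float(probs.max()))
    return labels, scores

# toy data: a blank image, two boxes, 3 unit-norm class embeddings
img = np.zeros((64, 64, 3), dtype=np.uint8)
text_embs = np.eye(3, 512)

def toy_crop_encode(crop):
    """Hypothetical stand-in for the Chinese-CLIP image encoder."""
    v = np.zeros(512)
    v[0] = 1.0  # always matches class 0 in this toy setup
    return v

labels, scores = classify_proposals(
    img, [(0, 0, 32, 32), (16, 16, 48, 48)], toy_crop_encode, text_embs)
```

Because the first stage never sees category labels, its boxes generalize to novel classes, and all category knowledge lives in the CLIP side.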
Experimental Results: The champion achieved 50.08 % AP50 on novel classes and 54.16 % AP50 on base classes. Detailed ablation studies show the impact of prompt engineering, CLIP model ensembles, and data augmentation.
Open‑Source Links:
Champion code: https://github.com/wusize/OVD_Contest
Runner‑up code: https://github.com/FX-STAR/OVD2023
Third‑place code: https://github.com/xuliu-cyber/OVD2023
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.