LLMDet: LLM‑Powered Open‑Vocabulary Detector Beats Grounding DINO

LLMDet introduces a novel training pipeline that leverages large language models to generate detailed image‑level captions and region‑level phrases, fine‑tunes an open‑vocabulary detector with the GroundingCap‑1M dataset, and achieves state‑of‑the‑art zero‑shot performance surpassing Grounding DINO across multiple benchmarks.

AIWalker

Overview

Open‑vocabulary object detection aims to locate arbitrary categories specified by user‑provided text labels, extending beyond the closed‑set paradigm. LLMDet demonstrates that supervising an open‑vocabulary detector with large language models (LLMs) and rich image‑level descriptions can substantially improve performance.

GroundingCap‑1M Dataset

The authors construct a dataset of roughly one million samples called GroundingCap‑1M. Each sample is a four‑tuple (I, T_s, B, T_i), where I is an image, T_s is a short grounding text, B is a set of bounding boxes aligned with the phrases in T_s, and T_i is a detailed image‑level caption. Captions follow two principles: (1) maximal detail (object type, texture, color, parts, actions, precise location, and any text in the image) and (2) factual accuracy (no speculative language). The dataset aggregates existing detection, grounding, and image‑text corpora (COCO, V3Det, GoldG, LCS‑558k) and augments them with captions generated by Qwen2‑VL‑72b. After filtering and merging, GroundingCap‑1M contains 1.12 M samples.
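
To make the four‑tuple concrete, here is a minimal Python sketch of one sample. It is not the authors' data format: the class names (`GroundingCapSample`, `PhraseBox`), the use of character spans to tie a box to a phrase, and the (x1, y1, x2, y2) pixel box layout are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed box layout: (x1, y1, x2, y2) in pixels. The article does not specify one.
Box = Tuple[float, float, float, float]


@dataclass
class PhraseBox:
    """One element of B: a box tied to a phrase inside the short text T_s."""
    span: Tuple[int, int]  # character span [start, end) into short_text (assumption)
    box: Box


@dataclass
class GroundingCapSample:
    """Hypothetical container for the four-tuple (I, T_s, B, T_i)."""
    image_path: str          # I
    short_text: str          # T_s
    boxes: List[PhraseBox]   # B
    caption: str             # T_i

    def phrases(self) -> List[str]:
        """Return the phrase that each box is aligned with."""
        return [self.short_text[pb.span[0]:pb.span[1]] for pb in self.boxes]

    def validate(self) -> None:
        """Raise ValueError if a span or a box is malformed."""
        for pb in self.boxes:
            start, end = pb.span
            if not (0 <= start < end <= len(self.short_text)):
                raise ValueError(f"span {pb.span} is outside the short text")
            x1, y1, x2, y2 = pb.box
            if x2 <= x1 or y2 <= y1:
                raise ValueError(f"box {pb.box} has non-positive area")


# Made-up example values, for illustration only.
sample = GroundingCapSample(
    image_path="example.jpg",
    short_text="a brown dog. a red ball.",
    boxes=[
        PhraseBox(span=(0, 11), box=(10.0, 20.0, 200.0, 220.0)),
        PhraseBox(span=(13, 23), box=(210.0, 180.0, 260.0, 230.0)),
    ],
    caption="A brown dog stands on grass next to a small red ball.",
)
sample.validate()
```

Keeping the box‑to‑phrase link explicit is what lets the same sample drive both the detection losses (through B and T_s) and the caption loss (through T_i).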

Training LLMDet Under LLM Supervision

Training proceeds in two stages (Fig. 3). In the first stage, a projection layer is trained to align the detector’s visual features with the LLM input space while the detector backbone stays frozen: the detector’s p5 feature map is passed through the projector to the LLM, which is trained with a language modeling loss to generate the full‑image caption. In the second stage, the whole system is fine‑tuned jointly with four losses: the detector’s standard grounding loss, the bounding‑box regression loss, the image‑level caption generation loss, and the region‑level short‑phrase generation loss (Fig. 4). For region‑level generation, each positive query (a query matched to a ground‑truth box) is given to the LLM together with a prompt such as “describe this region”, and the LLM is trained to produce the corresponding short phrase. Cross‑attention layers are added only to the region‑level branch to allow the LLM to attend to the corresponding visual token.
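
The second‑stage objective described above can be written compactly. The notation below is an assumption made for this summary, not taken from the source: L_align stands for the first listed loss (the detector's standard region‑to‑text loss), L_box for box regression, and the two L_lm terms for the image‑level and region‑level generation losses. The article does not state loss weights, so the unweighted sum is also an assumption.

```latex
% Assumed notation; equal weighting is an assumption, not stated in the article.
\mathcal{L}_{\text{stage 2}}
  = \mathcal{L}_{\text{align}}
  + \mathcal{L}_{\text{box}}
  + \mathcal{L}_{\text{lm}}^{\text{image}}
  + \mathcal{L}_{\text{lm}}^{\text{region}}

% Each generation term is a standard next-token language modeling loss,
% where y_1..y_T are the target tokens and v the visual input from the projector.
\mathcal{L}_{\text{lm}}
  = -\sum_{t=1}^{T} \log p_\theta\!\left(y_t \mid y_{<t},\, v\right)
```

For the image‑level term, v would be the projected p5 feature map and y the detailed caption; for the region‑level term, v would be a positive query and y its short phrase.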

Implementation Details

MM‑Grounding‑DINO (MM‑GDINO) serves as the baseline detector. The detector backbone is frozen; the LLM (initialized from LLaVA‑OneVision‑0.5b‑ov) is kept trainable. Image‑level generation uses a maximum length of 1600 tokens and region‑level generation a maximum of 40 tokens, with up to 16 regions per image. Training runs for two epochs (≈16 k iterations) on eight NVIDIA L20 GPUs with batch size 16, using mixed precision and gradient checkpointing.
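
The three generation limits can be illustrated with a short Python sketch. Only the numbers (1600, 40, 16) come from the article; the function name `build_generation_targets`, truncating from the right, and choosing regions by uniform random sampling are assumptions, since the article does not say how over‑long captions or surplus regions are handled.

```python
import random
from typing import List, Sequence, Tuple

MAX_IMAGE_TOKENS = 1600      # image-level caption limit (from the article)
MAX_REGION_TOKENS = 40       # region-level phrase limit (from the article)
MAX_REGIONS_PER_IMAGE = 16   # regions per image (from the article)


def truncate(tokens: Sequence[int], max_len: int) -> List[int]:
    """Keep at most max_len tokens (right truncation is an assumption)."""
    return list(tokens[:max_len])


def build_generation_targets(
    caption_tokens: Sequence[int],
    region_phrase_tokens: Sequence[Sequence[int]],
    rng: random.Random,
) -> Tuple[List[int], List[Tuple[int, List[int]]]]:
    """Apply the length and region-count limits to one training image.

    Returns the truncated caption and a list of (region_index, truncated_phrase).
    """
    image_target = truncate(caption_tokens, MAX_IMAGE_TOKENS)

    indices = list(range(len(region_phrase_tokens)))
    if len(indices) > MAX_REGIONS_PER_IMAGE:
        # Uniform random subsampling is an assumption made for this sketch.
        indices = sorted(rng.sample(indices, MAX_REGIONS_PER_IMAGE))

    region_targets = [
        (i, truncate(region_phrase_tokens[i], MAX_REGION_TOKENS)) for i in indices
    ]
    return image_target, region_targets
```

Capping the region count bounds the number of extra LLM forward passes per image, which matters because every selected region adds its own generation target.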

Zero‑Shot Transfer Evaluation

LLMDet is evaluated zero‑shot on LVIS, ODinW, and COCO‑O, and on referring expression comprehension (REC) benchmarks including RefCOCO+ and RefCOCOg. The LLM is discarded at inference, so the computational cost matches that of the baseline. On LVIS minival, LLMDet improves average precision (AP) over MM‑GDINO by 3.3 to 14.3 points depending on the backbone (the Swin‑L variant reaches 50.6 % AP). On ODinW35, LLMDet attains the highest AP among the compared methods, confirming strong cross‑dataset transfer. On COCO‑O, LLMDet gains 2.1 AP points, showing robustness to domain shift. The REC tasks also see consistent gains, which the authors attribute to richer visual‑language alignment.

Ablation Studies

Impact of Dataset Components

Fine‑tuning with only the bounding‑box labels, without any caption supervision, raises AP from 41.4 % to 43.8 %. Adding image‑level captions yields a modest further gain, while combining image‑level and region‑level generation provides the largest boost, especially for rare categories (see Table 6).

Effect of Different LLMs

Replacing the default LLaVA‑OneVision‑0.5b‑ov with larger models (e.g., LLaVA‑OneVision‑7B) yields marginal AP improvements, suggesting that scaling the LLM benefits reasoning more than visual representation.

Caption Quality

Substituting Qwen2‑VL‑72b captions with those from LLaVA‑OneVision‑7B degrades performance noticeably, and further replacing them with COCO captions or short localization texts reduces AP by an additional 0.4 %. An automated evaluation using GPT‑4o as the judge indicates that GroundingCap‑1M captions have the highest detail scores and lower hallucination rates (0.90 vs. 1.34 for LLaVA).

Pre‑training of the Projection Layer

Pre‑training the projector before end‑to‑end fine‑tuning preserves alignment and improves rare‑category AP; omitting this step harms performance.

Conclusion

LLMDet introduces a new training objective that fuses large‑scale image‑level descriptions and region‑level phrases from LLMs with standard detection losses. This joint supervision yields richer visual‑language representations, leading to state‑of‑the‑art zero‑shot results on diverse open‑vocabulary benchmarks. The work also shows that a stronger detector can, in turn, serve as a foundation for more powerful multimodal models.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
