YOLOE: Open‑Source Real‑Time Anything Detector Beats YOLO‑World v2
YOLOE unifies object detection and segmentation in a single efficient model that supports text, visual, and prompt‑free inference. It introduces the RepRTA, SAVPE, and LRPC strategies and, in extensive LVIS and COCO experiments, achieves higher AP than YOLO‑World v2 with up to three‑fold lower training cost and about 1.4× faster inference on both GPUs and mobile devices.
1. Introduction
Object detection and segmentation are fundamental computer‑vision tasks, but traditional YOLO models rely on a closed set of categories, limiting their adaptability in open‑world scenarios. Recent open‑set approaches use text prompts, visual cues, or prompt‑free paradigms, yet they often incur high computational cost or complex deployment.
This work proposes YOLOE, a unified, efficient model that handles multiple open prompt mechanisms (text, visual, and prompt‑free) while maintaining real‑time performance. For text prompts, the authors introduce a Re‑parameterizable Region‑Text Alignment (RepRTA) strategy that refines pre‑trained CLIP text embeddings with a lightweight auxiliary network, adding negligible training overhead and no extra inference cost. For visual prompts, they design a Semantic‑Activated Visual Prompt Encoder (SAVPE) that decouples a semantic branch and an activation branch to generate low‑dimensional prompt‑aware weights with minimal complexity. In prompt‑free settings, they propose Lazy Region‑Prompt Contrast (LRPC), which looks objects up in a built‑in large vocabulary without relying on large language models.
2. Related Work
Traditional closed‑set detectors (e.g., Faster R‑CNN, YOLO series) and recent transformer‑based detectors (DETR, DINO) are surveyed. Open‑vocabulary detection methods such as GLIP, DetCLIP, Grounding DINO, and YOLO‑World are discussed, highlighting their reliance on cross‑modal fusion or language models that increase computation. Visual‑prompt methods (OV‑DETR, OWL‑ViT, T‑Rex2) and prompt‑free approaches (GenerateU, GRiT) are also reviewed, establishing the gap that YOLOE aims to fill.
3. Method
3.1 Model Architecture
YOLOE follows the classic YOLO backbone‑PAN design, adding a segmentation head and a target‑embedding head. The final convolutional layer of the classification head outputs an embedding vector instead of fixed class logits, enabling comparison with any number of prompt embeddings.
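To make the embedding head concrete, the sketch below (a minimal PyTorch illustration, not the authors' code; names and shapes are assumptions) scores every anchor‑point embedding against an arbitrary number of prompt embeddings with a cosine similarity, which is all the head needs in order to support any prompt set at inference.

```python
import torch
import torch.nn.functional as F

def open_vocab_logits(anchor_embeds: torch.Tensor, prompt_embeds: torch.Tensor) -> torch.Tensor:
    """Score every anchor point against every prompt embedding.

    anchor_embeds: (num_anchors, D) embeddings from the detection head.
    prompt_embeds: (num_prompts, D) text/visual prompt embeddings.
    Returns (num_anchors, num_prompts) similarity logits.
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    a = F.normalize(anchor_embeds, dim=-1)
    p = F.normalize(prompt_embeds, dim=-1)
    return a @ p.t()

# Example: 8400 anchors, 512-d embeddings, 3 prompts ("person", "dog", "bicycle").
logits = open_vocab_logits(torch.randn(8400, 512), torch.randn(3, 512))
print(logits.shape)  # torch.Size([8400, 3])
```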
3.2 Re‑parameterizable Region‑Text Alignment (RepRTA)
For text prompts P of length L, a pre‑trained CLIP text encoder produces corresponding embeddings. These embeddings are cached offline, so the text encoder adds no cost during training. A lightweight auxiliary network (a single feed‑forward block with parameters θ) refines them into enhanced text embeddings f_θ(P). The alignment loss compares these refined embeddings with the embedding at each anchor point:
\mathrm{label}=R_{D\times H\times W\rightarrow HW\times D}(I\circledast K)\cdot f_{\theta}(P)^{\top}

After training, the auxiliary network is folded into the classification head, preserving the original YOLO inference graph.
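A minimal sketch of the re‑parameterization idea follows; the toy auxiliary block and the 1×1 convolution head are illustrative assumptions, not YOLOE's exact layers. Because the auxiliary network only transforms the cached text embeddings, its output can be computed once after training and written into the head's convolution weights, so deployment sees the plain YOLO graph.

```python
import torch
import torch.nn as nn

D, C = 512, 3                               # embedding dim, number of text prompts
cached_clip_embeds = torch.randn(C, D)      # offline CLIP text embeddings P

# Lightweight auxiliary network f_theta, used only during training.
aux = nn.Sequential(nn.Linear(D, D), nn.GELU(), nn.Linear(D, D))

with torch.no_grad():
    refined = aux(cached_clip_embeds)       # f_theta(P), shape (C, D)

# Fold the refined prompt embeddings into a 1x1 conv of the classification
# head: each output channel now scores one prompt, and the auxiliary network
# is no longer needed at inference.
head = nn.Conv2d(D, C, kernel_size=1, bias=False)
with torch.no_grad():
    head.weight.copy_(refined.view(C, D, 1, 1))

anchor_feats = torch.randn(1, D, 80, 80)    # per-anchor embedding map (I ⊛ K)
scores = head(anchor_feats)                 # (1, C, 80, 80) prompt logits
print(scores.shape)
```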
3.3 Semantic‑Activated Visual Prompt Encoder (SAVPE)
SAVPE contains two branches. The semantic branch processes multi‑scale PAN features with two convolutions per scale, then concatenates and projects them to obtain prompt‑independent semantic features. The activation branch treats the visual prompt as a binary mask, downsamples it, and convolves it to obtain prompt features, which are fused with image features from the backbone. The fused features are split into G groups, each producing prompt‑aware weights that are softmax‑normalized within the masked region. These weights then aggregate the semantic features over the masked region to form the final prompt embedding, which is compared with anchor‑point embeddings. A sketch of this mask‑guided pooling is given below.
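The snippet illustrates the activation‑branch idea with a mask‑restricted softmax and grouped weighted pooling; tensor shapes, the group count, and the function name are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def savpe_style_pooling(sem_feats, mask, weight_logits):
    """Aggregate semantic features into a prompt embedding using prompt-aware
    weights that are softmax-normalized inside the visual-prompt mask.

    sem_feats:      (B, D, H, W) prompt-independent semantic features.
    mask:           (B, 1, H, W) binary visual-prompt mask (>= 1 active cell).
    weight_logits:  (B, G, H, W) activation-branch outputs, split into G groups.
    Returns (B, D) prompt embeddings.
    """
    B, D, H, W = sem_feats.shape
    G = weight_logits.shape[1]
    # Softmax over spatial positions, restricted to the masked region.
    logits = weight_logits.flatten(2).masked_fill(mask.flatten(2) == 0, float("-inf"))
    weights = logits.softmax(dim=-1).view(B, G, H, W)            # (B, G, H, W)
    # Split channels into G groups and pool each group with its own weights.
    sem = sem_feats.view(B, G, D // G, H, W)
    pooled = (sem * weights.unsqueeze(2)).sum(dim=(-1, -2))      # (B, G, D//G)
    return pooled.flatten(1)                                     # (B, D)

emb = savpe_style_pooling(torch.randn(2, 512, 40, 40),
                          (torch.rand(2, 1, 40, 40) > 0.5).float(),
                          torch.randn(2, 16, 40, 40))
print(emb.shape)  # torch.Size([2, 512])
```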
3.4 Lazy Region‑Prompt Contrast (LRPC)
In the prompt‑free case, YOLOE treats detection as a retrieval problem. A built‑in vocabulary of V = 4585 class names is stored. During inference, only anchor points with confidence above a threshold τ are matched against the vocabulary, avoiding exhaustive comparison over all anchors and eliminating language‑model overhead.
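As an illustration of this retrieval view, the following sketch filters anchors by a confidence threshold and matches only the survivors against the vocabulary embeddings; the function name, shapes, and the source of the confidence scores are assumptions.

```python
import torch
import torch.nn.functional as F

def lazy_region_prompt_contrast(anchor_embeds, confidence, vocab_embeds, tau=0.3):
    """Prompt-free inference as retrieval: only confident anchors are matched
    against the built-in vocabulary.

    anchor_embeds: (N, D) anchor-point embeddings.
    confidence:    (N,) per-anchor object confidence.
    vocab_embeds:  (V, D) embeddings of the built-in class names.
    Returns indices of kept anchors and their predicted class ids.
    """
    keep = confidence > tau                                # lazy selection
    sims = F.normalize(anchor_embeds[keep], dim=-1) @ \
           F.normalize(vocab_embeds, dim=-1).t()           # (n_keep, V)
    return keep.nonzero(as_tuple=True)[0], sims.argmax(dim=-1)

idx, cls = lazy_region_prompt_contrast(torch.randn(8400, 512),
                                       torch.rand(8400),
                                       torch.randn(4585, 512))
print(idx.shape, cls.shape)
```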
3.5 Training Objectives
Following prior work, YOLOE uses binary cross‑entropy for classification, IoU loss plus focal loss for bounding‑box regression, and binary cross‑entropy for mask prediction. Text prompt training follows the protocol of YOLO‑World, visual prompt training freezes all modules except SAVPE for two epochs, and prompt‑free training optimizes the dedicated prompt embedding for one epoch.
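A compact sketch of how these terms might be combined is shown below; the weights and shapes are placeholders rather than YOLOE's hyper‑parameters, and the focal component of the box loss is omitted, so this is only an approximation of the stated recipe.

```python
import torch
import torch.nn.functional as F

def composite_loss(cls_logits, cls_targets, pair_iou, mask_logits, mask_targets,
                   w_cls=1.0, w_box=1.0, w_mask=1.0):
    """Illustrative composite objective: BCE for classification, an IoU-based
    penalty for boxes (focal term omitted here), and BCE for masks."""
    loss_cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    loss_box = (1.0 - pair_iou).mean()        # higher overlap -> lower loss
    loss_mask = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    return w_cls * loss_cls + w_box * loss_box + w_mask * loss_mask

# Toy example: 16 assigned anchors, 3 prompts, 32x32 mask crops.
loss = composite_loss(torch.randn(16, 3), torch.rand(16, 3).round(),
                      torch.rand(16),
                      torch.randn(16, 32, 32), torch.rand(16, 32, 32).round())
print(float(loss))
```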
4. Experiments
4.1 Implementation Details
YOLOE is built on the YOLOv8 architecture and also evaluated on YOLO‑11. Three model scales (S, M, L) are trained. Text prompts use a pre‑trained MobileCLIP‑B(LT) text encoder; visual prompts use the default SAVPE settings. Datasets include Objects365, GoldG, and segmentation pseudo‑masks generated by SAM‑2.1. Training runs 30 epochs for text prompts, 2 epochs for SAVPE, and 1 epoch for the prompt‑free embedding, achieving 3× lower training cost than YOLO‑World v2.
4.2 Prompt Evaluation
On LVIS, YOLOE‑v8‑S/M/L surpass YOLO‑World v2‑S/M/L by 3.5/0.2/0.4 AP respectively, while delivering 1.4×/1.3×/1.3× higher FPS on an Nvidia T4 and 1.3×/1.2×/1.2× on an iPhone 12. Rare‑class AP<sub>r</sub> improves by 5.2 % (S) and 7.6 % (L). Compared with T‑Rex2, YOLOE‑v8‑L gains 3.3 AP<sub>r</sub> and 0.4 AP while using half the training data and fewer GPUs.
4.3 Prompt‑Free Evaluation
YOLOE‑v8‑L achieves 27.2 AP and 23.5 AP<sub>r</sub>, outperforming GenerateU by 0.4 AP and 3.5 AP<sub>r</sub> while using 6.3× fewer parameters and running 53× faster at inference.
4.4 Downstream Transfer
When transferred to COCO, linear probing (only the classification head trainable) reaches more than 80 % of YOLO‑11‑M's performance with less than 2 % of the training time. Full fine‑tuning further narrows the gap, delivering 0.4–0.6 AP improvements over YOLOv8‑M/L with roughly four‑fold fewer epochs.
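As a sketch of the linear‑probing setup, the snippet below freezes every parameter except a classification head; the attribute name cls_head and the stand‑in model are assumptions, not YOLOE's actual module names.

```python
import torch.nn as nn

def configure_linear_probing(model: nn.Module, head_name: str = "cls_head"):
    """Freeze everything except the classification head, so downstream
    transfer only trains the last layer (the linear-probing setting).
    `head_name` is an illustrative attribute name."""
    for p in model.parameters():
        p.requires_grad = False
    for p in getattr(model, head_name).parameters():
        p.requires_grad = True
    return [p for p in model.parameters() if p.requires_grad]

# Example with a stand-in model: only `cls_head` weights remain trainable.
toy = nn.Module()
toy.backbone = nn.Conv2d(3, 64, 3)
toy.cls_head = nn.Conv2d(64, 512, 1)
trainable = configure_linear_probing(toy)
print(sum(p.numel() for p in trainable))
```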
4.5 Ablation Studies
Removing the cross‑modal fusion module reduces AP but speeds up inference; adding a stronger MobileCLIP encoder recovers AP. RepRTA consistently adds several AP points without extra cost. SAVPE’s activation branch yields a 1.5 AP gain over simple mask‑pooling, and varying group count shows diminishing returns beyond one group. LRPC provides comparable AP to a baseline that uses the full vocabulary but cuts inference time by up to 2×; adjusting the threshold τ trades a 0.2 AP drop for a 1.5× speed boost.
5. Conclusion
YOLOE demonstrates that a single, efficient architecture can jointly perform detection and segmentation across text, visual, and prompt‑free modalities. The RepRTA, SAVPE, and LRPC components enable real‑time, open‑world perception with significantly reduced training cost and hardware requirements, establishing a strong baseline for future research.
References
[1] YOLOE: Real‑Time Seeing Anything.