YOLOE: Real‑Time Open‑World Object Detection and Segmentation Unveiled
The paper introduces YOLOE, a new YOLO‑based model that supports text, visual, and no‑prompt open‑world detection and segmentation, detailing its lightweight RepRTA, SAVPE, and LRPC modules and showing benchmark gains in speed and zero‑shot performance on LVIS and COCO.
Background
Since the original YOLO was introduced in 2015, the series has excelled at fast closed‑set object detection but requires a predefined class dictionary, limiting flexibility in open environments.
YOLOE Overview
YOLOE extends the classic YOLO architecture (backbone, PAN, regression head, segmentation head) with an object‑embedding head. The final 1×1 convolution outputs an embedding vector E instead of a fixed class score, enabling open‑world detection. Three prompting modes are supported:
Textual prompts encoded by RepRTA (Re‑parameterizable Region‑Text Alignment).
Visual prompts encoded by SAVPE (Semantic‑Activated Visual Prompt Encoder).
A prompt‑free mode handled by LRPC (Lazy Region‑Prompt Contrast).
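In all three modes, classification reduces to comparing anchor embeddings against prompt embeddings rather than reading off fixed class logits. The following minimal numpy sketch illustrates that idea; the function name and shapes are illustrative, not from the paper's code:

```python
import numpy as np

def open_vocab_scores(anchor_emb, prompt_emb):
    """Score each anchor against each prompt embedding.

    anchor_emb: (num_anchors, D) embeddings E from the object-embedding head.
    prompt_emb: (num_classes, D) prompt embeddings P (from text or visual prompts).
    Returns a (num_anchors, num_classes) similarity matrix; the per-anchor
    argmax gives the predicted open-vocabulary class.
    """
    # L2-normalize so the dot product is a cosine similarity
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    p = prompt_emb / np.linalg.norm(prompt_emb, axis=1, keepdims=True)
    return a @ p.T

rng = np.random.default_rng(0)
E = rng.normal(size=(4, 8))   # 4 anchors, 8-dim embeddings (toy sizes)
P = rng.normal(size=(3, 8))   # 3 prompted classes
scores = open_vocab_scores(E, P)
pred = scores.argmax(axis=1)  # predicted class index per anchor
```

Because the class set lives in the prompt embeddings rather than in the head's weights, swapping prompts changes the detectable vocabulary without retraining.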
RepRTA
During training, a lightweight auxiliary network refines pretrained text embeddings T into normalized prompt embeddings P. The refinement is re‑parameterizable, so at inference time the extra network is folded into the main model, incurring zero additional cost. This improves alignment between text embeddings and anchor embeddings E without affecting latency.
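The key property is that the refined prompt embeddings depend only on the (fixed) text embeddings, so they can be computed once offline and baked into fixed head weights. A toy sketch of this folding, with a hypothetical linear `refine` standing in for the paper's auxiliary network:

```python
import numpy as np

def refine(text_emb, W_aux):
    """Hypothetical lightweight refinement of frozen text embeddings T."""
    out = text_emb @ W_aux
    return out / np.linalg.norm(out, axis=1, keepdims=True)

rng = np.random.default_rng(1)
T = rng.normal(size=(3, 8))       # pretrained text embeddings for 3 classes
W_aux = rng.normal(size=(8, 8))   # auxiliary refinement weights (training-time only)
E = rng.normal(size=(5, 8))       # anchor embeddings from the image branch

# Training-time path: refine the text embeddings on the fly, then score.
scores_train = E @ refine(T, W_aux).T

# Inference-time path: fold the refinement once into fixed class weights,
# turning the head back into a plain fixed-weight (e.g. 1x1 conv) layer.
P_folded = refine(T, W_aux)       # computed once, offline
scores_infer = E @ P_folded.T
```

The two paths produce identical scores, which is why the auxiliary network adds zero inference cost.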
SAVPE
SAVPE consists of two decoupled branches:
Semantic branch: produces prompt‑independent features with D channels.
Activation branch: fuses the visual prompt V with image features in a low‑cost subset of channels, generating grouped, prompt‑aware weights.
The two branches are aggregated to form a rich visual‑prompt embedding P_v while keeping computational overhead minimal.
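A simplified, single‑group sketch of this aggregation (the real SAVPE uses grouped weights and convolutional branches; the softmax pooling below is an illustrative stand‑in):

```python
import numpy as np

def savpe_sketch(feat, prompt_mask):
    """Aggregate semantic features into a visual-prompt embedding.

    feat: (HW, D) prompt-independent semantic features (semantic branch).
    prompt_mask: (HW,) binary mask marking the user-prompted region.
    Returns a (D,) visual-prompt embedding P_v.
    """
    # Activation branch (simplified): softmax weights over the prompted
    # region only; unprompted positions get effectively zero weight.
    logits = np.where(prompt_mask > 0, feat.mean(axis=1), -1e9)
    w = np.exp(logits - logits.max())
    w = w / w.sum()
    # Aggregation: prompt-aware weighted pooling of semantic features
    return w @ feat

rng = np.random.default_rng(2)
feat = rng.normal(size=(16, 8))      # e.g. a 4x4 feature map flattened, D=8
mask = np.zeros(16)
mask[5:9] = 1                        # region indicated by the visual prompt
p_v = savpe_sketch(feat, mask)       # (8,) embedding usable like a text prompt
```

The resulting P_v plugs into the same anchor-embedding comparison used for text prompts, which is what makes the two prompt types interchangeable at the head.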
LRPC
Open‑world detection is reformulated as a retrieval problem. For each anchor, LRPC lazily looks up the most likely class name from a built‑in large vocabulary instead of generating names with a language model. This eliminates language‑model dependence and preserves real‑time efficiency.
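The retrieval step can be sketched as a nearest-neighbor lookup gated by objectness, so the vocabulary is only consulted for anchors likely to contain an object. Names and thresholds below are illustrative:

```python
import numpy as np

def lrpc_retrieve(anchor_emb, vocab_emb, vocab_names, obj_conf, conf_thresh=0.5):
    """For each confident anchor, retrieve the closest vocabulary entry.

    Anchors below the objectness threshold are skipped entirely ("lazy"
    matching), so most regions never touch the large vocabulary.
    """
    a = anchor_emb / np.linalg.norm(anchor_emb, axis=1, keepdims=True)
    v = vocab_emb / np.linalg.norm(vocab_emb, axis=1, keepdims=True)
    results = []
    for i, conf in enumerate(obj_conf):
        if conf < conf_thresh:
            continue  # unlikely to be an object: skip the lookup
        idx = int(np.argmax(a[i] @ v.T))
        results.append((i, vocab_names[idx]))
    return results

# Toy example: 3 anchors, a 3-word vocabulary, 4-dim embeddings
vocab_names = ["cat", "dog", "bus"]
vocab_emb = np.eye(3, 4)                     # each word on its own axis
anchors = np.array([[1., 0, 0, 0],
                    [0., 1, 0, 0],
                    [0., 0, 1, 0]])
found = lrpc_retrieve(anchors, vocab_emb, vocab_names, obj_conf=[0.9, 0.1, 0.8])
# anchors 0 and 2 pass the objectness gate and match "cat" and "bus"
```

Since the lookup is a dot product against a fixed embedding table rather than an autoregressive generation step, it keeps the prompt-free mode within real-time budgets.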
Experimental Setup
YOLOE variants were built on top of YOLOv8 and YOLOv11 at three scales (S, M, L). All models were trained on the LVIS benchmark using the same data‑augmentation pipeline as YOLO‑Worldv2. Training time and zero‑shot average precision (AP) were recorded, and inference latency was measured on an NVIDIA T4 GPU and an iPhone 12.
Results on LVIS
Key numbers (YOLOv8‑based models):
Training time reduced by ≈3× compared with YOLO‑Worldv2.
AP change vs. YOLO‑Worldv2: +3.5 (S), −0.2 (M), −0.4 (L).
Inference speed increase on T4: 1.4× (S), 1.3× (M), 1.3× (L).
Inference speed increase on iPhone 12: 1.3× (S), 1.2× (M), 1.2× (L).
YOLOE‑v8‑M/L show slightly lower AP than YOLO‑Worldv2‑M/L; the authors attribute this to the integration of detection and segmentation in a single model, which introduces a modest trade‑off between mask quality and classification accuracy.
Additional Evaluations
Segmentation performance (Table 2): YOLOE achieves mask quality comparable to state‑of‑the‑art open‑world segmenters.
No‑prompt evaluation (Table 3): Zero‑shot detection works well even without any textual cue.
Transferability on COCO (Table 4): Linear probing and full fine‑tuning both retain strong performance, demonstrating that the learned embeddings transfer effectively to downstream tasks.
Visual Scenarios
Four qualitative cases illustrate the prompting flexibility:
Zero‑shot inference on LVIS using class names as text prompts.
Arbitrary user‑provided text prompts.
Visual cues supplied as prompts (e.g., drawn bounding boxes).
Prompt‑free mode where the model autonomously discovers all objects.
In all scenarios YOLOE delivers accurate detection and segmentation in real time.
References & Resources
Paper: https://arxiv.org/abs/2503.07465
Demo and code repository: https://github.com/THU-MIG/yoloe?tab=readme-ov-file#demo
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
AIWalker
Focused on computer vision, image processing, color science, and AI algorithms; sharing hardcore tech, engineering practice, and deep insights as a diligent AI technology practitioner.