YOLOE: Real‑Time Open‑World Object Detection and Segmentation Unveiled

The paper introduces YOLOE, a new YOLO‑based model that supports text, visual, and no‑prompt open‑world detection and segmentation, detailing its lightweight RepRTA, SAVPE, and LRPC modules and showing benchmark gains in speed and zero‑shot performance on LVIS and COCO.

Background

Since the original YOLO was introduced in 2015, the series has excelled at fast closed‑set object detection but requires a predefined class dictionary, limiting flexibility in open environments.

YOLOE Overview

YOLOE extends the classic YOLO architecture (backbone, PAN, regression head, segmentation head) with an object‑embedding head. The final 1×1 convolution outputs an embedding vector E instead of fixed class scores, enabling open‑world detection; a minimal sketch of the resulting prompt‑based scoring follows the list below. Three prompting modes are supported:

Textual prompts encoded by RepRTA (Re‑parameterizable Region‑Text Alignment).

Visual prompts encoded by SAVPE (Semantic‑Activated Visual Prompt Encoder).

A prompt‑free mode handled by LRPC (Lazy Region‑Prompt Contrast).
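
Whichever encoder is used, classification reduces to comparing each anchor's embedding E against the active set of prompt embeddings. A minimal sketch of that scoring step in PyTorch, with hypothetical shapes (not the authors' exact implementation):

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: N anchors, D embedding channels, C active prompts.
N, D, C = 8400, 512, 80
anchor_emb = F.normalize(torch.randn(N, D), dim=-1)   # E from the final 1x1 conv
prompt_emb = F.normalize(torch.randn(C, D), dim=-1)   # P from RepRTA (text) or SAVPE (visual)

# Similarity against prompts replaces the fixed class-score layer.
logits = anchor_emb @ prompt_emb.t()                  # (N, C)
best_score, best_cls = logits.max(dim=-1)             # per-anchor class under the given prompts
```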

RepRTA

During training, a lightweight auxiliary network refines pretrained text embeddings T into normalized prompt embeddings P. The refinement is re‑parameterizable, so at inference time the extra network is folded into the main model, incurring zero additional cost. This improves alignment between text embeddings and anchor embeddings E without affecting latency.
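
The zero‑cost claim follows from the fact that the auxiliary network transforms only the fixed text embeddings, never the per‑image features, so its output can be computed once and cached. A rough sketch of this folding under assumed shapes; `refine` is a hypothetical stand‑in for the auxiliary network:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, C = 512, 80                              # assumed embedding dim and prompt count
text_emb = torch.randn(C, D)                # frozen pretrained text embeddings T
refine = nn.Linear(D, D, bias=False)        # lightweight auxiliary network (training only)
anchor_emb = torch.randn(1, 8400, D)        # anchor embeddings E from the embedding head

# Training: the refinement runs inside the forward pass.
p = F.normalize(refine(text_emb), dim=-1)
scores_train = anchor_emb @ p.t()

# Deployment: cache the refined prompts once, offline; per-image inference
# then costs exactly the same as a plain YOLO forward pass.
with torch.no_grad():
    p_cached = F.normalize(refine(text_emb), dim=-1)
scores_infer = anchor_emb @ p_cached.t()

assert torch.allclose(scores_train.detach(), scores_infer, atol=1e-6)
```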

SAVPE

SAVPE consists of two decoupled branches:

Semantic branch: produces prompt‑independent image features with D channels.

Activation branch: fuses the visual prompt V with image features in a low‑cost, low‑channel subspace, producing grouped prompt‑aware weights.

The two branches are aggregated to form a rich visual‑prompt embedding P_v while keeping computational overhead minimal.
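
A minimal sketch of this decoupled design, assuming the visual prompt arrives as a binary mask rasterized from a user box; channel counts and layer shapes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed sizes: D semantic channels, A activation groups, H x W feature map.
B, D, A, H, W = 1, 512, 16, 64, 64
feats = torch.randn(B, D, H, W)              # image features from the neck

semantic = nn.Conv2d(D, D, 1)                # prompt-independent semantic branch
activation = nn.Conv2d(D + 1, A, 1)          # cheap prompt-aware branch (few channels)

prompt_mask = torch.zeros(B, 1, H, W)        # visual prompt V rasterized to a mask
prompt_mask[:, :, 16:48, 16:48] = 1.0        # e.g., a user-drawn box

S = semantic(feats)                                       # (B, D, H, W)
Wa = activation(torch.cat([feats, prompt_mask], dim=1))   # (B, A, H, W)
Wa = F.softmax(Wa.flatten(2), dim=-1).view(B, A, H, W)    # normalize per group

# Aggregate: split the D semantic channels into A groups and pool each
# group spatially under its prompt-aware weight map.
S_groups = S.view(B, A, D // A, H, W)
P_v = (S_groups * Wa.unsqueeze(2)).sum(dim=(-1, -2)).reshape(B, D)
P_v = F.normalize(P_v, dim=-1)               # visual-prompt embedding
```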

LRPC

Open‑world detection is reformulated as a retrieval problem. For each anchor, LRPC lazily looks up the most likely class name from a built‑in large vocabulary instead of generating names with a language model. This eliminates language‑model dependence and preserves real‑time efficiency.
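
In effect, classification becomes nearest‑neighbor retrieval over precomputed name embeddings, gated so that only promising anchors pay the lookup cost. A toy sketch with made‑up sizes and an assumed objectness gate:

```python
import torch
import torch.nn.functional as F

E = F.normalize(torch.randn(8400, 512), dim=-1)       # anchor embeddings
vocab = F.normalize(torch.randn(4585, 512), dim=-1)   # precomputed vocabulary embeddings
objectness = torch.rand(8400)                         # per-anchor object likelihood

keep = objectness > 0.3          # "lazy": skip retrieval for unpromising anchors
sims = E[keep] @ vocab.t()       # contrast against the vocabulary (retrieval, not generation)
cls_id = sims.argmax(dim=-1)     # most likely category name per kept anchor
```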

Experimental Setup

YOLOE variants were built on top of YOLOv8 and YOLO11 at three scales (S, M, L). All models were trained with the same data and augmentation pipeline as YOLO‑Worldv2 and evaluated zero‑shot on the LVIS benchmark. Training cost, average precision (AP), and inference latency were compared, with latency measured on an NVIDIA T4 GPU and an iPhone 12.

Results on LVIS

Key numbers (YOLOv8‑based models):

Training time reduced by ≈3× compared with YOLO‑Worldv2.

AP improvement over YOLO‑Worldv2: +3.5 (S), +0.2 (M), +0.4 (L).

Inference speed increase on T4: 1.4× (S), 1.3× (M), 1.3× (L).

Inference speed increase on iPhone 12: 1.3× (S), 1.2× (M), 1.2× (L).

The gains for the M and L variants are smaller than for S; the authors attribute this to integrating detection and segmentation in a single model, which introduces a modest trade‑off between mask quality and classification accuracy.

Additional Evaluations

Segmentation performance (Table 2): YOLOE achieves mask quality comparable to state‑of‑the‑art open‑world segmenters.

No‑prompt evaluation (Table 3): Zero‑shot detection works well even without any textual cue.

Transferability on COCO (Table 4): Both linear probing and full fine‑tuning retain strong performance, demonstrating that the learned embeddings transfer effectively to downstream tasks (the two regimes are sketched below).
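
For context, linear probing freezes the pretrained weights and trains only the final classification layer, while full fine‑tuning updates everything. A generic sketch of the two regimes (`head_prefix` is a hypothetical module name, not YOLOE's actual path):

```python
import torch.nn as nn

def set_transfer_mode(model: nn.Module, mode: str, head_prefix: str = "head") -> None:
    """Configure transfer learning: 'probe' trains only the head,
    'finetune' unfreezes all parameters."""
    assert mode in ("probe", "finetune")
    for name, param in model.named_parameters():
        param.requires_grad = (mode == "finetune") or name.startswith(head_prefix)
```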

Visual Scenarios

Four qualitative cases illustrate the prompting flexibility:

Zero‑shot inference on LVIS using class names as text prompts.

Arbitrary user‑provided text prompts.

Visual cues supplied as prompts (e.g., drawn bounding boxes).

Prompt‑free mode where the model autonomously discovers all objects.

In all scenarios YOLOE delivers accurate detection and segmentation in real time.

References & Resources

Paper: https://arxiv.org/abs/2503.07465

Demo and code repository: https://github.com/THU-MIG/yoloe?tab=readme-ov-file#demo

Figures

YOLOE architecture diagram
LVIS zero‑shot detection results
Segmentation evaluation
No‑prompt evaluation
Transferability on COCO
Qualitative visualizations