Zero-Shot Domain Adaptation for Object Detection: How UPRE Boosts Cross-Domain Performance

The UPRE framework introduces multi‑view domain prompts and unified representation enhancement to achieve zero‑shot domain adaptation for object detection, dramatically improving detection accuracy on unseen target domains across diverse visual scenarios.

ICCV (International Conference on Computer Vision) is a top‑tier computer‑vision conference recommended by the China Computer Federation (CCF) as an A‑class international event. The 2025 edition will be held in Hawaii, USA, with an acceptance rate of 24% (2,698 papers accepted out of 11,239 submissions). The Amap (Gaode) technology team contributed five accepted papers.

Paper Title

UPRE: Zero‑Shot Domain Adaptation for Object Detection via Unified Prompt and Representation Enhancement

Paper link: https://arxiv.org/pdf/2507.00721

Abstract

UPRE jointly optimizes textual prompts and visual representations by designing a Multi‑view Domain Prompt (MDP) and a Unified Representation Enhancement (URE) module. MDP provides language‑modal priors for the target domain and captures diverse adaptation knowledge needed for cross‑domain object detection. URE generates target‑domain representations from source‑domain data to reduce domain shift. Two enhancement strategies—Relative Domain Distance (RDD) and Positive‑Negative Separation (PNS)—form a multi‑level training framework that significantly improves adaptation and detection performance. Experiments show that UPRE excels on three domain‑adaptation tasks, markedly boosting detector performance on unseen domains.

Research Background

Domain‑adaptation methods have attracted extensive research because they improve model generalization. In practice, however, even unlabeled target‑domain images can be difficult to acquire, which limits their applicability. Zero‑Shot Domain Adaptation (ZSDA) removes this dependency by adapting to a target domain without any image priors. Recent advances in Visual‑Language Models (VLMs) make ZSDA feasible: their inherent zero‑shot capabilities allow textual prompts to describe unseen target domains.

Existing VLM‑based approaches face two main challenges: domain bias (distribution differences between source and target introduce task‑irrelevant noise) and detection bias (emphasis on holistic image representation neglects instance‑level details needed for precise localization). Manual prompts also fail to capture contextual attributes of foreground and background objects.

Key Contributions

Multi‑view Domain Prompt (MDP): Combines static and learnable dynamic prompts to retain human‑defined background information while allowing the model to acquire multi‑view knowledge for cross‑domain adaptation and object localization.
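
As a concrete illustration, here is a minimal PyTorch sketch of how static and learnable prompts might be combined. The class name, shapes, and the idea of one learnable context sequence per "view" are our assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class MultiViewDomainPrompt(nn.Module):
    """Hypothetical MDP sketch: a fixed, human-written prompt is kept
    verbatim while a bank of learnable context vectors is prepended,
    one context sequence per "view" of the target domain."""

    def __init__(self, embed_dim=512, n_ctx=8, n_views=3):
        super().__init__()
        # One learnable context sequence per view (shapes assumed).
        self.ctx = nn.Parameter(torch.randn(n_views, n_ctx, embed_dim) * 0.02)

    def forward(self, static_embed):
        # static_embed: (n_tokens, embed_dim) token embeddings of a static
        # template such as "a photo of a {class} in foggy weather".
        n_views = self.ctx.shape[0]
        static = static_embed.unsqueeze(0).expand(n_views, -1, -1)
        # Prepend the learnable context to the static prompt for each view,
        # yielding (n_views, n_ctx + n_tokens, embed_dim).
        return torch.cat([self.ctx, static], dim=1)
```

The resulting prompt sequences would then be passed through a frozen VLM text encoder (e.g. CLIP's), so that only the context vectors are trained.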

Unified Representation Enhancement (URE): Generates target‑domain representations across visual and language modalities, mitigating domain bias. It incorporates learnable mean‑enhancement and bias‑enhancement modules to perform fine‑grained style adjustments on source features, producing pseudo‑target features that improve adaptability under varying styles.
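
The style-adjustment idea can be sketched as an AdaIN-like statistic shift. The module below is our illustrative reading, assuming learnable channel-wise offsets on the feature mean and standard deviation; it is not the paper's released code.

```python
import torch
import torch.nn as nn

class RepresentationEnhancement(nn.Module):
    """Illustrative URE-style module (assumed form): re-styles source
    features by perturbing their channel statistics with learnable
    mean- and bias-enhancement offsets, producing pseudo-target features."""

    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        # Learnable offsets applied to the per-channel statistics.
        self.delta_mean = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.delta_std = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):
        # x: (B, C, H, W) source-domain feature map.
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + self.eps
        normalized = (x - mean) / std
        # Re-style with shifted statistics to emulate a target-domain look.
        return normalized * (std + self.delta_std) + (mean + self.delta_mean)
```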

Multi‑level Enhancement Strategies:

Relative Domain Distance (RDD) – an image‑level enhancement that balances semantic preservation and style diversity during source‑to‑target feature transformation.

Positive‑Negative Separation (PNS) – an instance‑level enhancement that refines proposal selection by narrowing the search space for foreground objects and better distinguishing background classes.

These strategies are jointly trained within a unified framework, alleviating detection bias and boosting performance on complex domain‑adaptation tasks; a sketch of both objectives follows below.
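
One plausible form of the two objectives is sketched below in PyTorch. Both functions are hedged reconstructions from the descriptions above (function names, the margin, and the temperature are assumptions), not the paper's exact formulas.

```python
import torch
import torch.nn.functional as F

def rdd_loss(enhanced, source, target_text, margin=0.1):
    """Image-level relative-domain-distance sketch: the enhanced feature
    should sit closer to the target-domain text embedding than to the
    original source feature, preserving semantics while shifting style.
    Shapes: enhanced/source (B, D), target_text (D,)."""
    d_src = 1 - F.cosine_similarity(enhanced, source, dim=-1)
    d_tgt = 1 - F.cosine_similarity(enhanced, target_text.unsqueeze(0), dim=-1)
    return F.relu(d_tgt - d_src + margin).mean()

def pns_loss(pos_feats, neg_feats, class_text, tau=0.07):
    """Instance-level positive-negative-separation sketch: foreground
    proposals are pulled toward their class text embedding while
    background proposals are pushed away.
    Shapes: pos_feats (P, D), neg_feats (N, D), class_text (D,)."""
    pos_sim = F.cosine_similarity(pos_feats, class_text.unsqueeze(0), dim=-1) / tau
    neg_sim = F.cosine_similarity(neg_feats, class_text.unsqueeze(0), dim=-1) / tau
    logits = torch.cat([pos_sim, neg_sim], dim=0)
    labels = torch.zeros_like(logits)
    labels[: pos_sim.shape[0]] = 1.0
    return F.binary_cross_entropy_with_logits(logits, labels)
```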

Experimental Results

Qualitative Analysis & Generalization

t‑SNE visualizations of image embeddings across five domains show that CLIP provides coarse generalization, whereas UPRE achieves superior adaptation for each target domain.
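
A minimal way to reproduce this kind of visualization, assuming image embeddings have already been extracted with a VLM encoder, is sketched below; the domain names and the random placeholder embeddings are ours, not the paper's data.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Placeholder arrays standing in for real (n_images, 512) image
# embeddings per domain; the domain names here are hypothetical.
domains = ["clear", "foggy", "rainy", "night", "synthetic"]
rng = np.random.default_rng(0)
feats = np.concatenate([rng.normal(i, 1.0, (200, 512)) for i in range(len(domains))])

# Project all embeddings jointly into 2-D.
points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(feats)

for i, name in enumerate(domains):
    seg = points[i * 200:(i + 1) * 200]
    plt.scatter(seg[:, 0], seg[:, 1], s=4, label=name)
plt.legend()
plt.title("t-SNE of image embeddings by domain")
plt.show()
```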

Main Comparison Experiments

We evaluated UPRE on nine open‑source datasets across three challenging domain‑adaptation scenarios: (1) adverse weather conditions, (2) cross‑city geographic shifts, and (3) synthetic‑to‑real transfer. Results demonstrate that UPRE consistently outperforms state‑of‑the‑art methods in detection accuracy and robustness.

Quantitative Tables (summarized)

Tables and figures (omitted for brevity) compare UPRE with OA‑DG, CLIP‑GAP, and other baselines under varying weather, city, and synthetic‑real settings, highlighting its superior performance.

Conclusion & Outlook

UPRE shows strong potential for enhancing domain adaptation and detection capabilities in complex environments. Future work will focus on improving robustness to more diverse real‑world scenarios, efficient utilization of cross‑domain data, and integration with other machine‑learning techniques to broaden the impact of domain‑adaptation research.

object detection, prompt engineering, visual-language models, cross-domain learning, zero-shot domain adaptation
Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
