How Anchored Attributes Boost Prompt Learning for Vision‑Language Models
The paper introduces ATPrompt, a method that inserts fixed attribute tokens into learnable prompts for CLIP‑style vision‑language models, enabling the soft prompts to capture generic attribute representations and significantly improve base‑to‑novel generalization without extra regularization losses.
Paper Overview
Title: Advancing Textual Prompt Learning with Anchored Attributes
ArXiv: https://arxiv.org/abs/2412.09442
Code: https://github.com/zhengli97/ATPrompt
Keywords: prompt learning, multimodal learning, CLIP
Background
Vision‑Language Models (VLMs) such as CLIP consist of an image encoder (e.g., ResNet or ViT) and a text encoder (a transformer). For classification, class names are inserted into a textual template (e.g., “a photo of a {class}”) and encoded; the cosine similarity between the image feature and each class's text feature (a scaled dot product of the normalized embeddings) yields the classification logits.
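A minimal sketch of this zero‑shot pipeline, with random projections standing in for CLIP's pretrained towers (`image_encoder`, `text_encoder`, and the `logit_scale` value here are placeholders, not the real model):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 512
class_names = ["cat", "dog", "bird"]

# Stand-in encoders: random features in place of CLIP's pretrained towers.
image_encoder = lambda image: torch.randn(1, embed_dim)
text_encoder = lambda prompts: torch.randn(len(prompts), embed_dim)

prompts = [f"a photo of a {c}" for c in class_names]
img_feat = F.normalize(image_encoder(None), dim=-1)   # (1, D)
txt_feat = F.normalize(text_encoder(prompts), dim=-1) # (C, D)

logit_scale = 100.0  # CLIP's learned temperature is roughly exp(t) ~ 100
logits = logit_scale * img_feat @ txt_feat.t()        # (1, C)
print(logits.softmax(dim=-1))
```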
Prompt learning (e.g., CoOp) replaces hand‑crafted templates with learnable soft tokens. While this improves performance on known (base) classes, the learned prompts remain centered on the class tokens, limiting their ability to generalize to unseen categories.
Problem Statement
The conventional prompt format [soft tokens] + [class token] can only capture representations tied to known classes. Consequently, when encountering novel classes the prompt lacks generic attribute knowledge, leading to poor base‑to‑novel transfer.
ATPrompt Method
ATPrompt inserts fixed attribute (anchor) tokens between the soft tokens and the class token:
[soft tokens] + [anchor attribute tokens] + [class token]
Because the anchor tokens are constant, the learnable soft tokens are guided to capture attribute‑related, class‑agnostic representations, improving generalization.
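A hedged sketch of this token layout, assuming access to CLIP's token‑embedding table (the toy `vocab` and `embed` below are stand‑ins for that table, and the two anchors are illustrative):

```python
import torch
import torch.nn as nn

embed_dim, n_soft = 512, 4
vocab = {"shape": 0, "color": 1, "cat": 2}   # toy vocabulary
embed = nn.Embedding(len(vocab), embed_dim)  # stand-in for CLIP's embedding table

# Learnable soft tokens, as in CoOp.
soft_tokens = nn.Parameter(torch.randn(n_soft, embed_dim) * 0.02)

with torch.no_grad():  # anchor attribute tokens and the class token stay fixed
    anchors = embed(torch.tensor([vocab["shape"], vocab["color"]]))
    cls_tok = embed(torch.tensor([vocab["cat"]]))

# [soft tokens] + [anchor attribute tokens] + [class token]
prompt = torch.cat([soft_tokens, anchors, cls_tok], dim=0)
print(prompt.shape)  # (n_soft + 2 + 1, embed_dim)
```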
Deep ATPrompt
A deep variant inserts anchor tokens at multiple transformer layers, allowing attribute information to influence deeper representations.
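A rough sketch of those mechanics, using plain linear layers as stand‑ins for CLIP's transformer blocks (the `blocks` stack, `deep_prompts`, and the drop/re‑add policy below are illustrative assumptions, not the paper's exact recipe):

```python
import torch
import torch.nn as nn

embed_dim, n_prompt, depth = 512, 6, 3
# Stand-ins for transformer blocks; real CLIP layers are attention blocks.
blocks = nn.ModuleList([nn.Linear(embed_dim, embed_dim) for _ in range(4)])
# One fresh set of learnable prompt tokens per prompted layer.
deep_prompts = nn.ParameterList(
    [nn.Parameter(torch.randn(n_prompt, embed_dim) * 0.02) for _ in range(depth)]
)

x = torch.randn(n_prompt + 10, embed_dim)  # [prompt tokens | rest of sequence]
for i, blk in enumerate(blocks):
    if 1 <= i <= depth:
        # Drop the previous layer's prompt tokens and re-add fresh ones,
        # so attribute-anchored prompts act at several depths.
        x = torch.cat([deep_prompts[i - 1], x[n_prompt:]], dim=0)
    x = blk(x)
print(x.shape)
```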
Differentiable Attribute Search
To avoid manual attribute selection, ATPrompt employs a two‑step differentiable search:
Use a large language model (LLM) to generate a set of generic attribute bases (e.g., shape, color, material, function, size).
Form every non‑empty attribute combination (2^5 − 1 = 31 for five bases) and learn a weight vector over them. Alternating optimization updates the soft tokens and the combination weights for 40 epochs; the highest‑weight combination is kept as the final attribute set.
On Caltech‑101, for example, the search selects (shape, size) as the optimal attribute set; a sketch of the search space follows.
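A small sketch of that search space and the final selection step, assuming each candidate's task loss is already available (the alternating updates of soft tokens and combination weights are abbreviated to the read‑off at the end):

```python
import itertools
import torch
import torch.nn.functional as F

bases = ["shape", "color", "material", "function", "size"]
combos = [c for r in range(1, len(bases) + 1)
          for c in itertools.combinations(bases, r)]
assert len(combos) == 31  # 2^5 - 1 non-empty subsets

# Learnable weights over combinations; alternating optimization would
# update these against validation loss while also training the soft tokens.
alpha = torch.zeros(len(combos), requires_grad=True)

weights = F.softmax(alpha, dim=0)
best = combos[int(weights.argmax())]
print(best)  # arbitrary here (untrained); the paper reports (shape, size) on Caltech-101
```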
Experimental Results
ATPrompt consistently improves the harmonic mean of base and novel class accuracy across 11 benchmarks when integrated into existing baselines.
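For reference, the reported metric is the harmonic mean (HM) of base‑ and novel‑class accuracy, which rewards balanced gains on both splits (the numbers below are illustrative, not results from the paper):

```python
def harmonic_mean(base_acc: float, novel_acc: float) -> float:
    # HM = 2 * base * novel / (base + novel); low when either split lags.
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)

print(harmonic_mean(82.0, 74.0))  # ~77.79
```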
Cross‑dataset evaluation also shows gains.
Ablation Studies
Attribute relevance: Using the searched attributes yields the best performance; random generic attributes still help, while irrelevant attributes degrade performance only slightly.
Attribute order: Permuting the order of the anchor tokens changes the harmonic mean by less than 0.2%, confirming that order is not critical.
Deep version operations: Various token‑drop/re‑add strategies were compared; the proposed deep‑ATPrompt configuration achieves the highest scores.
Vision and Future Work
ATPrompt demonstrates that redesigning the prompt structure to include anchored attributes is a more principled solution than adding ever‑more regularization terms. Future work should explore richer anchor vocabularies and automated discovery of attribute sets for diverse domains.
Frequently Asked Questions
Why fine‑tune VLMs with prompt learning?
Full‑parameter fine‑tuning requires large image‑text datasets and risks over‑fitting, whereas prompt learning introduces only a few learnable tokens, preserving CLIP’s zero‑shot capabilities while adapting to specific tasks.
Is an attribute learned per sample?
No. The attribute search yields a single attribute combination for the entire dataset, which is then used throughout ATPrompt training.