How KgCoOp Uses Knowledge‑Guided Context Optimization to Prevent Prompt Tuning Forgetting

The article analyzes why standard prompt tuning (CoOp) causes catastrophic forgetting in visual‑language models, introduces the KgCoOp framework that adds a knowledge‑guided loss to regularize prompts, and shows through extensive experiments on 11 benchmarks that KgCoOp improves unseen‑class accuracy, harmonic mean, and efficiency while discussing trade‑offs and limitations.

DeepHub IMBA

Background

Visual‑language models (VLMs) such as CLIP are trained on massive image‑text pairs and can recognize classes that were never explicitly seen during training. When these models are adapted to downstream tasks via Prompt Tuning—specifically Context Optimization (CoOp)—they become specialists on the base classes but inevitably sacrifice the original, generic knowledge, a problem known as the Base‑to‑New generalization dilemma.
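CLIP's zero-shot recipe can be sketched as follows. The embeddings here are toy stand-ins (real CLIP features come from its frozen text and image encoders and are 512- or 768-dimensional), but the scoring logic is the same: embed one prompt per class, then rank classes by cosine similarity with the image feature.

```python
import numpy as np

def zero_shot_classify(image_feat, class_text_feats):
    """Score an image against one text embedding per class.

    Both sides are L2-normalized, so cosine similarity reduces
    to a dot product; the predicted class is the argmax.
    """
    image_feat = image_feat / np.linalg.norm(image_feat)
    class_text_feats = class_text_feats / np.linalg.norm(
        class_text_feats, axis=1, keepdims=True)
    sims = class_text_feats @ image_feat   # one score per class
    return int(np.argmax(sims)), sims

# Toy example: 3 classes, 8-dim features (stand-ins, not real CLIP output).
rng = np.random.default_rng(0)
text_feats = rng.standard_normal((3, 8))
image_feat = text_feats[1] + 0.1 * rng.standard_normal(8)  # near class 1
pred, sims = zero_shot_classify(image_feat, text_feats)
```

Because nothing here is trained per-task, any class with a text prompt can be scored, which is exactly the generic capability that aggressive prompt tuning erodes.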

Why CoOp Fails on Unseen Classes

CoOp replaces hand‑crafted templates (e.g., "a photo of a [Class]") with learnable context vectors, boosting performance on the base (seen) categories. However, with only a few labeled samples the model learns prompts that are discriminative for those categories alone, drifting far from the original CLIP prompt. Empirical data across 11 benchmarks reveal that standard CoOp raises base accuracy while dropping new‑class performance below the zero‑shot baseline.

The larger the Euclidean distance between the learned prompt and the hand‑crafted prompt, the more severe the degradation on unseen classes.

KgCoOp: Knowledge‑Guided Context Optimization

KgCoOp introduces a knowledge‑guided loss L_{kg} that minimizes the Euclidean distance between the fine‑tuned prompt embedding w_i^{coop} and the original CLIP prompt embedding w_i^{clip}. The overall training objective becomes: L = L_{ce} + \lambda \cdot L_{kg}, where L_{ce} is the standard cross‑entropy loss and \lambda balances the new constraint.

By pulling the learned prompt toward the CLIP anchor, KgCoOp forces the model to retain the generic pre‑training features while still adapting to the target task.
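A minimal sketch of the objective, assuming precomputed per-class embeddings: `w_coop` and `w_clip` mirror the notation above, but in the real method w_i^{coop} is recomputed each step by passing the learnable context tokens through CLIP's frozen text encoder, and the only trainable parameters are those context vectors.

```python
import numpy as np

def cross_entropy(logits, label):
    # Standard softmax cross-entropy for a single example.
    z = logits - logits.max()
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[label]

def kg_loss(w_coop, w_clip):
    # L_kg: mean squared Euclidean distance between the tuned class
    # embeddings and the frozen hand-crafted CLIP anchors.
    return np.mean(np.sum((w_coop - w_clip) ** 2, axis=1))

def kgcoop_objective(logits, label, w_coop, w_clip, lam=8.0):
    # L = L_ce + lambda * L_kg (lam = 8.0 follows the ablation below).
    return cross_entropy(logits, label) + lam * kg_loss(w_coop, w_clip)

# Toy numbers: 3 classes, 4-dim embeddings.
w_clip = np.eye(3, 4)            # stand-in anchor embeddings
w_coop = w_clip + 0.1            # tuned prompts, drifted slightly
logits = np.array([2.0, 0.5, -1.0])
loss = kgcoop_objective(logits, label=0, w_coop=w_coop, w_clip=w_clip)
```

The gradient of L_kg points every tuned class embedding back toward its CLIP anchor, so the further the prompt drifts, the harder it is pulled back; this is the mechanism that caps the distance correlated with unseen-class degradation.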

Experimental Setup

KgCoOp is evaluated on eleven diverse image‑classification benchmarks (ImageNet, Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, EuroSAT, UCF101, DTD, SUN397) using ResNet‑50 and ViT‑B/16 backbones under a 16‑shot setting.

Results

KgCoOp achieves the highest harmonic mean across all settings. Compared with standard CoOp, KgCoOp improves new‑class accuracy by 5.61 % (ViT‑B/16, 16‑shot) and outperforms CoCoOp by 1.91 %. It also surpasses ProGrad on new classes while keeping comparable base performance.
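The harmonic mean used in these comparisons is H = \frac{2 \cdot Base \cdot New}{Base + New}; a quick sketch with illustrative accuracies (not the paper's numbers) shows why it rewards balanced methods:

```python
def harmonic_mean(base_acc, new_acc):
    # H penalizes imbalance: a method that wins big on base classes
    # but collapses on new classes scores below a balanced one.
    return 2 * base_acc * new_acc / (base_acc + new_acc)

# Illustrative: same arithmetic mean (75.0), different harmonic means.
balanced = harmonic_mean(76.0, 74.0)
lopsided = harmonic_mean(85.0, 65.0)
```

This is why a method can trade a little base accuracy for unseen-class accuracy and still come out ahead on H.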

In domain‑generalization experiments (ImageNet‑V2, Sketch, A, R), KgCoOp consistently yields the best unseen‑class scores.

Efficiency and Compatibility

Computing the Euclidean distance adds negligible overhead. KgCoOp's throughput reaches roughly 6M samples/s, far ahead of CoCoOp (about 26× slower) and ProGrad (about 22 ms per image). The L_{kg} constraint is also portable: plugging it into CoCoOp and ProGrad raises their new‑class accuracy and harmonic mean.

Ablation Studies

\lambda sensitivity: performance peaks at \lambda = 8.0. Larger values over‑constrain the prompt, reducing base‑class accuracy.

Context length M: Increasing the number of context tokens from 4 to 8 yields consistent gains on both base and new classes.

Limitations

Stronger L_{kg} improves unseen‑class scores but can slightly depress base‑class performance, reflecting a trade‑off between stability and adaptability. Selecting an appropriate \lambda adds an extra hyper‑parameter tuning burden.

Conclusion

KgCoOp provides a lightweight, geometry‑based regularization that mitigates catastrophic forgetting in prompt tuning, delivering a practical balance between task‑specific accuracy and zero‑shot generalization while remaining computationally efficient and compatible with existing frameworks.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Prompt Tuning · Visual-Language Models · ResNet · ViT · Zero-shot Learning · Catastrophic Forgetting · Knowledge-guided Optimization
Written by

DeepHub IMBA

A public account sharing practical AI insights: internet + machine learning + big data + architecture = IMBA.
