ICML 2026: Certifying VLM Robustness with Text‑Prompted Semantic Intervals

This paper introduces a semantic robustness certification framework for vision‑language models (VLMs). It uses paired text prompts as semantic proxies to define a continuous transformation in the shared embedding space, derives closed‑form intervals over which the prediction is unchanged, and validates the method on CLIP ViT‑B/32 with both synthetic and real‑world datasets.

Data Party THU

Background

Vision‑language models (VLMs) such as CLIP are now core components for open‑vocabulary recognition, image‑text retrieval, detection, segmentation and visual question answering. In real deployments, images undergo semantic changes (shape, size, style, background, viewpoint, illumination) that traditional pixel‑level or geometric robustness certifications do not capture.

Related Work

Existing VLM robustness studies focus on distribution shift, adversarial attacks, multimodal security, robust optimization, distillation and interpretability, but they do not provide closed‑form prediction‑invariant intervals. Classical certification methods (randomized smoothing, PixelDP, DeepPoly, CROWN, PRIMA, ReluVal, branch‑and‑bound) target pixel perturbations or internal network structure and cannot directly express open‑vocabulary semantic changes. Recent works such as DeepG, GeoRobust, ApproxLine and GCERT model input transformations via geometry or generative latent spaces, yet they require training or data for each semantic direction.

Problem Definition

The paper considers a dual‑encoder VLM where an image x is encoded to a unit embedding z and a class prompt c to a unit embedding u_c. Classification selects the class with maximal cosine similarity (inner product) between z and u_c. A semantic transformation γ(φ) is defined by a source prompt embedding u_a and a target prompt embedding u_a′; the two embeddings span a two‑dimensional semantic plane. The image embedding is decomposed into a component z∥ lying in this plane and an orthogonal component z⊥. The semantic extent φ controls the angle of z∥ between the source and target directions.
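The setup above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: random unit vectors stand in for CLIP embeddings, and the helper names (unit, classify, plane_basis, decompose) as well as the Gram‑Schmidt choice of basis are assumptions made for this example.

```python
import numpy as np

def unit(v):
    """Scale a vector to unit L2 norm."""
    return v / np.linalg.norm(v)

def classify(z, U):
    """Index of the class embedding (row of U) with the largest inner product with z."""
    return int(np.argmax(U @ z))

def plane_basis(u_src, u_tgt):
    """Orthonormal basis (e1, e2) of span{u_src, u_tgt}; e1 is the source direction.
    Gram-Schmidt is an assumed choice; it requires u_src and u_tgt not to be parallel."""
    e1 = unit(u_src)
    e2 = unit(u_tgt - (u_tgt @ e1) * e1)
    return e1, e2

def decompose(z, e1, e2):
    """Split z into its in-plane component and the orthogonal remainder."""
    z_par = (z @ e1) * e1 + (z @ e2) * e2
    return z_par, z - z_par

# Random stand-ins for embeddings (hypothetical data, not real CLIP outputs).
rng = np.random.default_rng(0)
dim = 16
z = unit(rng.normal(size=dim))                                # image embedding
U = np.stack([unit(rng.normal(size=dim)) for _ in range(4)])  # class prompt embeddings u_c
u_src = unit(rng.normal(size=dim))                            # source prompt embedding u_a
u_tgt = unit(rng.normal(size=dim))                            # target prompt embedding u_a'

e1, e2 = plane_basis(u_src, u_tgt)
z_par, z_perp = decompose(z, e1, e2)
print("predicted class:", classify(z, U))
print("in-plane norm:", np.linalg.norm(z_par))
```

Because z_par and z_perp are orthogonal and sum to z, changing only the direction of z_par while keeping its length leaves the embedding on the unit sphere.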

Method

1. Text‑based semantic proxy: a pair of prompts (e.g., “a photo of a gyoza” and “a photo of triangular gyoza”) defines the semantic plane.

2. Semantic transformation: only the in‑plane component z∥ is moved along the plane while z⊥ remains unchanged. The endpoint can be specified either by the target prompt embedding (text‑specified) or by an image embedding of a reference example (image‑specified).

3. Closed‑form certification: VLM decision boundaries are pairwise bisectors of class text embeddings, forming Voronoi regions on the unit sphere. Substituting γ(φ) into the margin equations yields analytic expressions for the φ values where class switches occur. Collecting all switch points and sorting them partitions the semantic extent into intervals where the predicted class is invariant.
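The three steps can be sketched as follows. This is a hedged illustration under one assumed parameterization, which the summary does not spell out: γ(φ) rotates z∥ by angle φ inside the semantic plane, and φ ranges over [0, φ_max], with φ_max taken as the angle between the source and target prompt embeddings. Under that assumption, the margin between classes i and j has the form A cos φ + B sin φ + k, so its zeros are φ = atan2(B, A) ± arccos(−k / √(A² + B²)) whenever √(A² + B²) > |k|. The function names and the random stand‑in embeddings are hypothetical.

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

def transform(z, e1, e2, phi):
    """Assumed gamma(phi): rotate the in-plane part of z by phi, keep the orthogonal part."""
    a, b = z @ e1, z @ e2
    z_perp = z - a * e1 - b * e2
    c, s = np.cos(phi), np.sin(phi)
    return (a * c - b * s) * e1 + (a * s + b * c) * e2 + z_perp

def switch_points(z, e1, e2, d, phi_max):
    """Zeros in (0, phi_max) of the margin <gamma(phi), d>, where d = u_i - u_j."""
    a, b = z @ e1, z @ e2
    k = (z - a * e1 - b * e2) @ d          # contribution of the fixed orthogonal part
    p, q = e1 @ d, e2 @ d
    A, B = a * p + b * q, a * q - b * p    # margin = A cos(phi) + B sin(phi) + k
    R = np.hypot(A, B)
    if R <= abs(k):                        # margin never changes sign
        return []
    alpha, delta = np.arctan2(B, A), np.arccos(-k / R)
    roots = np.mod([alpha + delta, alpha - delta], 2 * np.pi)
    return [float(r) for r in roots if 0.0 < r < phi_max]

def certify(z, U, e1, e2, phi_max):
    """Partition [0, phi_max] into (lo, hi, label) intervals with a constant prediction."""
    cuts = {0.0, float(phi_max)}
    for i in range(len(U)):
        for j in range(i + 1, len(U)):
            cuts.update(switch_points(z, e1, e2, U[i] - U[j], phi_max))
    cuts = sorted(cuts)
    intervals = []
    for lo, hi in zip(cuts[:-1], cuts[1:]):
        # No pairwise margin changes sign inside (lo, hi), so one probe suffices.
        label = int(np.argmax(U @ transform(z, e1, e2, 0.5 * (lo + hi))))
        if intervals and intervals[-1][2] == label:
            intervals[-1] = (intervals[-1][0], hi, label)  # merge equal neighbours
        else:
            intervals.append((lo, hi, label))
    return intervals

# Random stand-ins for embeddings (hypothetical data, not real CLIP outputs).
rng = np.random.default_rng(0)
dim = 16
z = unit(rng.normal(size=dim))
U = np.stack([unit(rng.normal(size=dim)) for _ in range(5)])
u_src, u_tgt = unit(rng.normal(size=dim)), unit(rng.normal(size=dim))
e1 = u_src
e2 = unit(u_tgt - (u_tgt @ e1) * e1)
phi_max = float(np.arccos(np.clip(u_src @ u_tgt, -1.0, 1.0)))

intervals = certify(z, U, e1, e2, phi_max)
for lo, hi, label in intervals:
    print(f"phi in [{lo:.3f}, {hi:.3f}] -> class {label}")
```

The sketch enumerates all class pairs for simplicity, which costs O(C²) root computations for C classes; cuts between two classes that are both below the top score are harmless because neighbouring intervals with the same label are merged.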

Experiments

All experiments use CLIP ViT‑B/32. Qualitative results show how certificate intervals vary with descriptors such as color, shape, material, style, texture, background, viewpoint and illumination (e.g., a “wallflower” image remains stable for “red flower” but flips for “spiral flower”). Quantitative evaluation introduces a misalignment budget δ to model cross‑modal embedding gaps; increasing δ reduces stable coverage but empirical and conditional invariance remain high, indicating conservative yet reliable certificates.

Synthetic semantic changes are generated with a multimodal LLM across OxfordPets, Flowers102 and Food101. The mean absolute discrepancy between the constructed transformation and the reference semantic change is reported for ExactLine, text‑specified (T‑Spec) and image‑specified (I‑Spec) methods, with lower values indicating better alignment. On eight real‑world datasets (DTD, FGVCAircraft, Caltech101, StanfordCars, Flowers102, OxfordPets, Food101, UCF101) the authors approximate semantic sequences by sorting images according to prompt similarity. The proposed method consistently yields longer stable intervals than ExactLine; I‑Spec usually outperforms T‑Spec because it leverages a concrete target image.
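As a rough illustration of the real‑data protocol, the sketch below orders a set of image embeddings by cosine similarity to a prompt embedding. The summary does not say which prompt is used or in which direction the images are sorted, so ranking by ascending similarity to the target prompt is an assumption, as are the function name and the random stand‑in data.

```python
import numpy as np

def order_by_prompt_similarity(Z, u_prompt):
    """Return image indices sorted by ascending cosine similarity to the prompt,
    together with the sorted similarities."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # normalise each image embedding
    u = u_prompt / np.linalg.norm(u_prompt)
    sims = Z @ u
    order = np.argsort(sims)
    return order, sims[order]

# Random stand-ins for embeddings (hypothetical data, not real CLIP outputs).
rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 8))   # six image embeddings
u_tgt = rng.normal(size=8)    # target prompt embedding
order, sims = order_by_prompt_similarity(Z, u_tgt)
print("approximate semantic sequence:", order.tolist())
```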

Discussion

Certificates can be used for robustness auditing (e.g., checking stability under “darker color” or “street background”), failure‑mode diagnosis (short intervals reveal sensitivity), prompt engineering (interval length guides prompt selection), and downstream tasks that reuse the image‑text scoring function such as retrieval, detection and segmentation. Limitations include dependence on the quality of text proxies and the degree of image‑text alignment, and the difficulty of isolating pure semantic changes in real data, which may mix multiple factors.

Conclusion

The work presents a semantic robustness certification framework for VLMs that defines parametric semantic transformations via text prompts, exploits the closed‑form geometry of VLM decision boundaries, and computes certifiable semantic‑extent intervals without training additional generative models or collecting extra annotations. This advances robustness analysis from pixel‑level perturbations to open‑vocabulary semantic shifts, offering a tool for model audit, drift monitoring, prompt selection and failure analysis.
