ICML 2026: Certifying VLM Robustness with Text‑Prompted Semantic Intervals
This paper introduces a semantic robustness certification framework for vision‑language models that leverages paired text prompts as semantic proxies to define a continuous transformation in the shared embedding space, derives closed‑form interval bounds where predictions remain unchanged, and validates the method on CLIP ViT‑B/32 with both synthetic and real‑world datasets.
Background
Vision‑language models (VLMs) such as CLIP are now core components for open‑vocabulary recognition, image‑text retrieval, detection, segmentation and visual question answering. In real deployments, images undergo semantic changes (shape, size, style, background, viewpoint, illumination) that traditional pixel‑level or geometric robustness certifications do not capture.
Related Work
Existing VLM robustness studies focus on distribution shift, adversarial attacks, multimodal security, robust optimization, distillation and interpretability, but they do not provide closed‑form prediction‑invariant intervals. Classical certification methods (randomized smoothing, PixelDP, DeepPoly, CROWN, PRIMA, ReluVal, branch‑and‑bound) target pixel perturbations or internal network structure and cannot directly express open‑vocabulary semantic changes. Recent works such as DeepG, GeoRobust, ApproxLine and GCERT model input transformations via geometry or generative latent spaces, yet they require training or data for each semantic direction.
Problem Definition
The paper considers a dual‑encoder VLM where an image x is encoded to a unit embedding z and a class prompt c to a unit embedding u_c. Classification selects the class with maximal cosine similarity (inner product) between z and u_c. A semantic transformation γ(φ) is defined by a source prompt embedding u_a and a target prompt embedding u_a′; the two embeddings span a two‑dimensional semantic plane. The image embedding is decomposed into a component z∥ lying in this plane and an orthogonal component z⊥. The semantic extent φ controls the angle of z∥ between the source and target directions.
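The setup above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: it assumes that γ(φ) rotates the in‑plane component z∥ by the angle φ inside the plane spanned by the two prompt embeddings, which is one reading of "φ controls the angle of z∥". The function names (`classify`, `plane_basis`, `gamma`) are hypothetical, and the toy 3‑dimensional vectors stand in for real CLIP embeddings.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    n = math.sqrt(dot(v, v))
    return [a / n for a in v]

def classify(z, class_embs):
    """Index of the class prompt embedding with the largest inner product."""
    return max(range(len(class_embs)), key=lambda c: dot(z, class_embs[c]))

def plane_basis(u_src, u_tgt):
    """Orthonormal basis (e1, e2) of the plane spanned by the two prompts.
    Gram-Schmidt; assumes the two prompt embeddings are not parallel."""
    e1 = normalize(u_src)
    proj = dot(u_tgt, e1)
    e2 = normalize([t - proj * s for t, s in zip(u_tgt, e1)])
    return e1, e2

def gamma(z, e1, e2, phi):
    """Assumed transformation: rotate the in-plane part of z by phi,
    leaving the orthogonal part z_perp untouched."""
    a, b = dot(z, e1), dot(z, e2)                      # in-plane coordinates
    z_perp = [zi - a * p - b * q for zi, p, q in zip(z, e1, e2)]
    r, theta0 = math.hypot(a, b), math.atan2(b, a)     # polar form of z_par
    c, s = r * math.cos(theta0 + phi), r * math.sin(theta0 + phi)
    return [c * p + s * q + w for p, q, w in zip(e1, e2, z_perp)]

# Toy usage: two class prompts that coincide with the source/target prompts.
classes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
e1, e2 = plane_basis(classes[0], classes[1])
z = normalize([1.0, 0.0, 1.0])
moved = gamma(z, e1, e2, math.pi / 2)
print(classify(z, classes), classify(moved, classes))
```

Because only the in‑plane part is rotated, the embedding keeps unit norm, so the cosine‑similarity classifier can be applied to γ(φ) unchanged.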
Method
1. Text‑based semantic proxy: a pair of prompts (e.g., “a photo of a gyoza” and “a photo of triangular gyoza”) defines the semantic plane.
2. Semantic transformation: only the in‑plane component z∥ is moved along the plane while z⊥ remains unchanged. The endpoint can be specified either by the target prompt embedding (text‑specified) or by an image embedding of a reference example (image‑specified).
3. Closed‑form certification: VLM decision boundaries are pairwise bisectors of class text embeddings, forming Voronoi regions on the unit sphere. Substituting γ(φ) into the margin equations yields analytic expressions for the φ values where class switches occur. Collecting all switch points and sorting them partitions the semantic extent into intervals where the predicted class is invariant.
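The certification step can be illustrated with a short sketch. It assumes the in‑plane motion is a rotation, so that each class score takes the form s_c(ψ) = a_c·cos ψ + b_c·sin ψ + c_c, where ψ is the in‑plane angle of z∥, a_c and b_c are the class embedding's in‑plane coordinates scaled by |z∥|, and c_c is its inner product with z⊥. The paper's exact parameterization may differ. The function names `roots_in_range` and `certify` are hypothetical. Each pairwise margin is then a single sinusoid plus a constant, whose zeros have a closed form.

```python
import math
from itertools import combinations

def roots_in_range(A, B, C, lo, hi, eps=1e-12):
    """Solve A*cos(psi) + B*sin(psi) + C = 0 for psi in [lo, hi].
    Uses A*cos(psi) + B*sin(psi) = R*cos(psi - alpha)."""
    R = math.hypot(A, B)
    if R < eps or abs(C) > R:
        return []                      # the margin never crosses zero
    alpha = math.atan2(B, A)
    delta = math.acos(max(-1.0, min(1.0, -C / R)))
    out = []
    for base in (alpha + delta, alpha - delta):
        # Shift the root by multiples of 2*pi until it is >= lo.
        k = math.ceil((lo - base) / (2 * math.pi))
        psi = base + 2 * math.pi * k
        while psi <= hi:
            out.append(psi)
            psi += 2 * math.pi
    return out

def certify(coeffs, lo, hi):
    """coeffs[c] = (a, b, c0) with score_c(psi) = a*cos(psi) + b*sin(psi) + c0.
    Returns [(start, end, predicted_class), ...] covering [lo, hi]."""
    def predict(psi):
        scores = [a * math.cos(psi) + b * math.sin(psi) + c0
                  for a, b, c0 in coeffs]
        return max(range(len(scores)), key=scores.__getitem__)

    cuts = {lo, hi}
    for (a1, b1, c1), (a2, b2, c2) in combinations(coeffs, 2):
        cuts.update(roots_in_range(a1 - a2, b1 - b2, c1 - c2, lo, hi))
    cuts = sorted(cuts)

    intervals = []
    for start, end in zip(cuts, cuts[1:]):
        label = predict(0.5 * (start + end))   # class is constant between cuts
        if intervals and intervals[-1][2] == label:
            intervals[-1] = (intervals[-1][0], end, label)  # merge neighbours
        else:
            intervals.append((start, end, label))
    return intervals

# Toy usage: two classes whose scores are cos(psi) and sin(psi).
print(certify([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], 0.0, math.pi / 2))
```

The sketch collects the zeros of every pairwise margin, which is a superset of the true switch points. Labelling each piece by its midpoint prediction and merging neighbours with the same label recovers the prediction‑invariant intervals.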
Experiments
All experiments use CLIP ViT‑B/32. Qualitative results show how certificate intervals vary with descriptors such as color, shape, material, style, texture, background, viewpoint and illumination (e.g., a “wallflower” image remains stable for “red flower” but flips for “spiral flower”). Quantitative evaluation introduces a misalignment budget δ to model cross‑modal embedding gaps; increasing δ reduces stable coverage but empirical and conditional invariance remain high, indicating conservative yet reliable certificates.
Synthetic semantic changes are generated with a multimodal LLM across OxfordPets, Flowers102 and Food101. The mean absolute discrepancy between the constructed transformation and the reference semantic change is reported for ExactLine, text‑specified (T‑Spec) and image‑specified (I‑Spec) methods, with lower values indicating better alignment.
On eight real‑world datasets (DTD, FGVCAircraft, Caltech101, StanfordCars, Flowers102, OxfordPets, Food101, UCF101) the authors approximate semantic sequences by sorting images according to prompt similarity. The proposed method consistently yields longer stable intervals than ExactLine; I‑Spec usually outperforms T‑Spec because it leverages a concrete target image.
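As a rough illustration of the real‑data protocol, the sketch below orders image embeddings by their inner product with a target prompt embedding. The summary does not say which similarity the authors sort by, so the choice of the target prompt alone is an assumption. The helper name `order_by_prompt_similarity` is hypothetical.

```python
def order_by_prompt_similarity(image_embs, u_target):
    """Return image indices sorted by increasing inner product with u_target.
    Assumes all embeddings are already unit-normalized."""
    sims = [sum(a * b for a, b in zip(z, u_target)) for z in image_embs]
    return sorted(range(len(image_embs)), key=sims.__getitem__)

# Toy usage with 2-dimensional stand-ins for image embeddings.
sequence = order_by_prompt_similarity([[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]],
                                      [1.0, 0.0])
print(sequence)
```

The resulting index order serves as a proxy for increasing semantic extent, against which the certified intervals can be compared.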
Discussion
Certificates can be used for robustness auditing (e.g., checking stability under “darker color” or “street background”), failure‑mode diagnosis (short intervals reveal sensitivity), prompt engineering (interval length guides prompt selection), and downstream tasks that reuse the image‑text scoring function such as retrieval, detection and segmentation. Limitations include dependence on the quality of text proxies and the degree of image‑text alignment, and the difficulty of isolating pure semantic changes in real data, which may mix multiple factors.
Conclusion
The work presents a semantic robustness certification framework for VLMs that defines parametric semantic transformations via text prompts, exploits the closed‑form geometry of VLM decision boundaries, and computes certifiable semantic‑extent intervals without training additional generative models or collecting extra annotations. This advances robustness analysis from pixel‑level perturbations to open‑vocabulary semantic shifts, offering a tool for model audit, drift monitoring, prompt selection and failure analysis.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.