ICCV Spotlight: Pixel Tracing for Copy Detection and Skip-Vision Model Acceleration
The ICCV 2025 live session will take a deep dive into two recent papers: PixTrace with CopyNCE, for precise image copy detection, and Skip-Vision, for faster training and inference of vision-language models. The session covers their methods, results, and real-world impact.
Paper 1: Tracing Copied Pixels and Regularizing Patch Affinity in Copy Detection
Image copy detection aims to identify infringement relationships between image pairs by learning robust feature representations. Existing view-level contrastive methods struggle with fine-grained correspondence, which limits their ability to detect complex local infringements.
To overcome this, the authors introduce a pixel-tracking module called PixTrace, which maintains spatial mappings between copied and original pixels across arbitrary edits. Building on PixTrace, they propose CopyNCE, a geometry-guided contrastive loss that uses these pixel mappings to compute overlap rates between patch pairs and regularize their affinity.
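To give a rough picture of what an overlap-guided loss of this kind could look like, here is a minimal PyTorch sketch. It is for exposition only: `copynce_loss`, its signature, and the soft-target cross-entropy formulation are assumptions, not the authors' released code, and the paper's exact loss may differ.

```python
import torch
import torch.nn.functional as F

def copynce_loss(patches_a, patches_b, overlap, tau=0.07):
    """Overlap-guided patch-level contrastive loss (illustrative sketch).

    patches_a: (N, D) patch embeddings of the edited/copied image
    patches_b: (M, D) patch embeddings of the original image
    overlap:   (N, M) overlap rates, e.g. the fraction of patch i's pixels
               that a PixTrace-style mapping traces into patch j
    """
    a = F.normalize(patches_a, dim=-1)
    b = F.normalize(patches_b, dim=-1)
    affinity = (a @ b.t()) / tau            # (N, M) patch-affinity logits

    # Turn overlap rates into soft targets over candidate original patches.
    row_sum = overlap.sum(dim=-1, keepdim=True)
    targets = overlap / row_sum.clamp_min(1e-8)

    # Soft-target cross-entropy pulls the learned affinity matrix toward
    # the geometric overlap matrix; patches with no traced pixels are skipped.
    valid = row_sum.squeeze(-1) > 0
    log_probs = F.log_softmax(affinity, dim=-1)
    return -(targets[valid] * log_probs[valid]).sum(dim=-1).mean()

# Example with random tensors standing in for real patch features:
# loss = copynce_loss(torch.randn(64, 256), torch.randn(64, 256), torch.rand(64, 64))
```

The key design idea this illustrates is that supervision comes from geometry rather than from a single positive pair: each edited patch's affinity distribution is regularized toward how much of its traced pixel mass lands in each original patch.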
Experiments show that CopyNCE achieves state‑of‑the‑art performance on multiple image‑copy‑detection benchmarks and demonstrates strong transferability to video copy detection, while also offering superior interpretability compared with prior methods.
Paper 2: Skip‑Vision: Efficient and Scalable Acceleration of Vision‑Language Models via Adaptive Token Skipping
Transformer‑based models have propelled multimodal large language models (MLLMs), but their computational cost grows sharply with image resolution, data size, and model parameters. The core bottleneck is the explosion of visual tokens required for fine‑grained image understanding.
The Skip-Vision framework addresses inefficiencies in both training and inference. For training, the authors observe that the feed-forward network (FFN) changes visual-token features only marginally, motivating a SkipFFN strategy that lets visual tokens bypass the FFN. For inference, they design a selective KV-cache removal mechanism that prunes the key-value pairs of skipped tokens while preserving model performance.
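To make the two mechanisms concrete, here is a minimal PyTorch sketch under stated assumptions: `SkipFFNBlock`, `prune_kv_cache`, the `visual_mask`/`skip_mask` interfaces, and the pre-norm layout are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SkipFFNBlock(nn.Module):
    """Transformer block in which visual tokens bypass the FFN.

    Minimal sketch of the skip-FFN idea: all tokens participate in
    self-attention, but only text tokens pass through the FFN. The class
    name and the `visual_mask` interface are assumptions for illustration.
    """

    def __init__(self, dim, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, visual_mask):
        # x: (B, T, D); visual_mask: (B, T) bool, True at visual tokens.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out

        # Apply the FFN only where the mask is False (text tokens); visual
        # tokens keep their post-attention features unchanged.
        h = self.norm2(x)
        ffn_out = torch.zeros_like(x)
        keep = ~visual_mask
        ffn_out[keep] = self.ffn(h[keep])
        return x + ffn_out


def prune_kv_cache(keys, values, skip_mask):
    """Drop cached key/value entries of skipped visual tokens at decode time.

    keys, values: (B, H, T, Dh) cached projections; skip_mask: (T,) bool.
    The rule for deciding which tokens to skip is an assumption here.
    """
    kept = ~skip_mask
    return keys[:, :, kept, :], values[:, :, kept, :]
```

Because skipped visual tokens contribute little through the FFN, removing their cached key-value pairs at decode time shrinks both attention FLOPs and cache memory, which is consistent with the FLOP and latency reductions reported below.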
Results indicate that Skip‑Vision can reduce training time by up to 35%, cut inference FLOPs by 75%, and lower latency by 45%, all while matching or surpassing existing baselines.
In the live session, the authors will present the design ideas behind these methods and how they were validated.
