Tagged articles
22 articles
Machine Heart
May 8, 2026 · Artificial Intelligence

How Laser Cuts Token Use by 97% with Probabilistic Superposition for Implicit Multimodal Reasoning

Laser introduces a latent‑superposition paradigm that replaces step‑by‑step token prediction with dynamic windowed alignment, achieving over 97% token‑consumption reduction, new SOTA performance on six visual benchmarks, and improved interpretability for multimodal large models.

ACL 2026 · Dynamic Window Alignment · Latent Superposition
0 likes · 13 min read
Machine Learning Algorithms & Natural Language Processing
May 7, 2026 · Artificial Intelligence

Latent Action RL Shrinks Exploration Space for Multimodal Dialogue Fine‑Tuning

By learning a compact latent‑action space from paired image‑text and large‑scale text data, the authors reduce the RL search space from a vocabulary of over 150k tokens to a 128‑entry codebook, enabling more efficient fine‑tuning of multimodal conversational agents and achieving consistent gains across several RL algorithms.

Vision-Language Models · dialogue agents · latent actions
0 likes · 11 min read
Machine Heart
Apr 29, 2026 · Artificial Intelligence

Boost Black-Box VLMs Without Training: Class-Aware Prompt Reweighting (CARPRT)

The article analyzes the prompt‑sensitivity problem of zero‑shot classification in vision‑language models, critiques class‑agnostic prompt weighting, and presents CARPRT—a training‑free, black‑box compatible method that reweights prompts per class using similarity scores and pseudo‑labels, achieving consistent gains across datasets and model architectures.

Black-Box Optimization · Class-Aware Modeling · Prompt Reweighting
0 likes · 11 min read
Machine Heart
Apr 27, 2026 · Artificial Intelligence

What Do Your Logits Know? Surprising Insights from Apple’s New AI Paper

Apple’s recent AI paper probes whether large vision‑language models truly forget user data by examining residual streams and final logits, revealing that hidden image attributes persist in top‑k outputs and exposing significant privacy and security risks.

AI security · Vision-Language Models · information bottleneck
0 likes · 11 min read
Data Party THU
Apr 14, 2026 · Artificial Intelligence

Heterogeneous Hyperbolic Manifolds for Better Vision-Language Tree Alignment

This paper introduces a novel framework that constructs and aligns dual visual‑textual trees on heterogeneous hyperbolic manifolds, addressing asymmetric modality alignment in hierarchical classification tasks and achieving state‑of‑the‑art performance on benchmarks such as CIFAR‑100, ImageNet and Rare Species datasets.

Cross-Attention · Hierarchical Classification · Vision-Language Models
0 likes · 8 min read
Machine Learning Algorithms & Natural Language Processing
Mar 18, 2026 · Artificial Intelligence

Breaking the ‘See‑then‑Think’ Barrier: Real‑Time ‘See‑and‑Think’ for VLMs (CVPR 2026)

The paper introduces TaYS (Think‑as‑You‑See), a streaming chain‑of‑thought framework that replaces the traditional “watch‑then‑think” video inference pipeline with a parallel, real‑time “watch‑and‑think” approach, dramatically reducing latency and improving accuracy on complex video reasoning tasks.

Dual KV-Cache · Real-time Video · Streaming Inference
0 likes · 8 min read
AI Algorithm Path
Feb 16, 2026 · Artificial Intelligence

Why Visual Tokenizers Bridge the Gap Between Pixels and Meaning

Vision‑language models turn continuous images into discrete tokens through patch extraction, encoding, and projection, enabling Transformers to reason jointly over vision and text, but this compression introduces limits in spatial reasoning, counting, and resolution sensitivity that users must understand.

Self-Attention · Vision-Language Models · counting
0 likes · 22 min read
HyperAI Super Neural
Sep 28, 2025 · Artificial Intelligence

Weekly AI Paper Digest: Vision‑Language Models for Safety, Unstable Singularities, and RL‑Driven Reasoning

This week’s AI paper roundup highlights five recent studies: a construction‑site vision‑language dataset and safety inspection tasks, a deep CORAL method for unsupervised domain adaptation, the discovery of a new family of unstable singularities in nonlinear PDEs, a reinforcement‑learning approach that boosts reasoning in large language models, and the PANORAMA architecture for omnidirectional vision in embodied AI.

Construction Safety · Omnidirectional Vision · PDE Research
0 likes · 6 min read
AntTech
Sep 25, 2025 · Artificial Intelligence

ICCV Spotlight: Pixel Tracing for Copy Detection and Skip-Vision Model Acceleration

The ICCV 2025 live session will deep‑dive into two cutting‑edge papers—PixTrace with CopyNCE for precise image copy detection and Skip‑Vision for dramatically faster training and inference of vision‑language models—showcasing their methods, results, and real‑world impact.

Computer Vision · ICCV 2025 · Vision-Language Models
0 likes · 5 min read
AI Algorithm Path
Sep 8, 2025 · Artificial Intelligence

Understanding MolmoAct: The Next‑Generation Large Action Model for Robotics

This article analyzes the MolmoAct large action model, detailing its three‑stage perception‑planning‑control architecture, novel depth‑aware tokenization, extensive pre‑training and fine‑tuning pipelines, and benchmark results that demonstrate superior efficiency and generalization over prior vision‑language‑action systems.

Model Training · MolmoAct · Robotics
0 likes · 12 min read
Data Party THU
Aug 10, 2025 · Artificial Intelligence

Can Evolutionary Algorithms Auto-Design Training-Free Vision-Language Model Adaptations?

This study introduces EvoVLMA, an evolutionary vision-language model adaptation framework that automatically searches for training-free VLM adaptation algorithms using a two-stage LLM-guided evolution. It reports improvements such as a 1.91% accuracy gain on 8-shot image classification, and the code is publicly released.

Evolutionary Algorithms · LLM · Model Adaptation
0 likes · 5 min read
AI Algorithm Path
Aug 9, 2025 · Artificial Intelligence

How LoRA Enables Multimodal Capabilities in Large Language Models

This article compares two ways to add vision to large language models—training a native multimodal model from scratch or attaching a visual module to a pretrained LLM—then details the VoRA approach that uses LoRA adapters to inject visual knowledge without extra inference cost.

Chameleon · LLaVA · LoRA
0 likes · 7 min read
AI Frontier Lectures
Jul 18, 2025 · Artificial Intelligence

How Anchored Attributes Boost Prompt Learning for Vision‑Language Models

The paper introduces ATPrompt, a method that inserts fixed attribute tokens into learnable prompts for CLIP‑style vision‑language models, enabling the soft prompts to capture generic attribute representations and significantly improve base‑to‑novel generalization without extra regularization losses.

ATPrompt · Vision-Language Models · attribute anchoring
0 likes · 20 min read
DataFunTalk
Jul 11, 2025 · Artificial Intelligence

When AI Sees Six Fingers: Why Vision Models Miss the Mark

The article examines how multimodal AI models repeatedly miscount a six‑finger image, explores the underlying bias revealed in the paper “Vision Language Models are Biased,” and warns that such prior‑knowledge‑driven errors can have serious safety implications in real‑world applications.

AI bias · Multimodal AI · Vision-Language Models
0 likes · 10 min read
AI Frontier Lectures
Jun 16, 2025 · Artificial Intelligence

What Do the CVPR 2025 Awards Reveal About the Future of Computer Vision?

The CVPR 2025 awards spotlight groundbreaking work—from the VGGT transformer that predicts full 3D scenes in a single feed‑forward pass to neural inverse rendering that reconstructs geometry from time‑resolved light—offering a comprehensive view of emerging trends, novel architectures, and performance breakthroughs across computer‑vision research.

3D reconstruction · CVPR 2025 · Deep Learning
0 likes · 11 min read
AIWalker
Apr 8, 2025 · Artificial Intelligence

AgenticIR: An Agentic System for Restoring Images with Complex Degradations

AgenticIR combines vision‑language models and large language models in a multi‑stage reasoning workflow (perception, planning, execution, reflection, and adjustment) to evaluate, plan, and iteratively apply specialized restoration tools, outperforming baseline methods on images with complex degradations.

ICLR 2025 · Image Restoration · Vision-Language Models
0 likes · 15 min read
AntTech
Mar 18, 2025 · Artificial Intelligence

MoLE: Decoding by Mixture of Layer Experts Alleviates Hallucination in Large Vision-Language Models

Researchers from Ant Insurance and Zhejiang University propose MoLE, a Mixture of Layer Experts decoding method that reduces hallucinations in large vision‑language models, demonstrating state‑of‑the‑art performance on LVLM benchmarks and enabling reliable end‑to‑end medical‑record‑to‑claim automation.

AI · Mixture of Experts · Vision-Language Models
0 likes · 7 min read
AIWalker
Mar 17, 2025 · Artificial Intelligence

How UNIFIEDREWARD Breaks Task Boundaries to Boost Image and Video Performance

The paper introduces UNIFIEDREWARD, the first unified reward model for multimodal understanding and generation that supports pairwise ranking and pointwise scoring, builds a 236K human‑preference dataset across image and video tasks, and uses DPO to align VLMs and diffusion models, achieving significant performance gains on both image and video benchmarks.

Direct Preference Optimization · Multimodal AI · Preference Modeling
0 likes · 19 min read
NewBeeNLP
Nov 11, 2024 · Artificial Intelligence

What Do Recent Multimodal LLM Papers Reveal About Vision‑Language Models?

This article surveys ten recent multimodal large language model papers, covering vision representation laws, a stricter instruction benchmark, safety impacts of visual adaptation, the Mini‑Gemini architecture, automatic pruning, vision capability boosting, long‑context transfer, efficient token sparsification, math reasoning, and hallucination mitigation.

Benchmark · Training Strategies · Vision-Language Models
0 likes · 18 min read
DaTaobao Tech
Jul 1, 2024 · Artificial Intelligence

Recent Progress in Vision-Language Models (VLMs)

Over the past year, Vision‑Language Models have surged from early multimodal experiments to competitive open‑source systems rivaling GPT‑4, driven by higher‑resolution processing, richer vision encoders, better projection layers, and larger curated datasets, yet they still face evaluation difficulties, hallucinations, speed limits, and limited multimodal output.

Computer Vision · Deep Learning · Vision-Language Models
0 likes · 24 min read