Tagged articles

visual language models

20 articles · Page 1 of 1

Jul 4, 2026 · Artificial Intelligence

When Swapping Two Images Breaks VLMs: EgoTSR Enables Robots to Judge Real Task Progress

The paper shows that visual language models often rely on chronological bias, mistaking later frames for progress. It introduces EgoTSR, a 46‑million‑sample ego‑centric dataset and three‑stage curriculum that teaches models to assess task state, evaluates them with forward‑reverse tests, and reports over 92% accuracy on long‑term robotic tasks.

chronological-bias · curriculum-learning · ego-centric reasoning

0 likes · 11 min read

Design Hub

Jun 20, 2026 · Artificial Intelligence

Can AI Really Judge Good Design? Findings from the Design Crit Study

Contra Labs' Design Crit dataset shows that although AI can generate images, current AI judges barely outperform random guessing when assessing design quality. A small fine‑tuned model, however, can close nearly half the gap to human agreement by learning from expert‑rated criteria.

AI design evaluation · Design Crit dataset · Generative AI

0 likes · 16 min read

Bighead's Algorithm Notes

Jun 20, 2026 · Artificial Intelligence

Can Large Vision‑Language Models Really Understand Candlestick Charts?

This paper builds a multi‑scale candlestick‑chart dataset and a standardized evaluation framework to measure how well visual language models (VLMs) extract price information, using confusion‑matrix diagnostics and Information Coefficient (IC) metrics, and finds that VLMs excel only on monotonic trends and struggle with precise time‑based predictions.

Prompt engineering · candlestick chart · information coefficient

0 likes · 13 min read

HyperAI Super Neural

Jun 11, 2026 · Artificial Intelligence

ChartNet: MIT/IBM’s Million‑Scale Synthetic Chart Dataset with 1.5M Diverse Samples

MIT and IBM researchers introduce ChartNet, the largest code‑guided synthetic chart dataset containing 1.5 million multimodal samples across 24 chart types and six libraries, and demonstrate that fine‑tuning visual‑language models on it yields consistent, significant gains on chart reconstruction, data extraction, summarization, and reasoning tasks, outperforming much larger off‑the‑shelf models including GPT‑4o.

AI research · ChartNet · chart understanding

0 likes · 13 min read

JD Retail Technology

Jun 2, 2026 · Artificial Intelligence

RTPrune: Two‑Stage Reading‑Inspired Token Pruning for Efficient DeepSeek‑OCR Inference

The paper presents RTPrune, a token‑pruning technique for DeepSeek‑OCR that exploits a two‑stage reading behavior in LLM decoding, first keeping high‑norm visual tokens and then fusing the rest via optimal‑transport matching with a dynamic pruning‑rate strategy, achieving up to 15% GFLOPs reduction and 18.9% speedup while preserving over 99% OCR accuracy across multiple benchmarks.

DeepSeek-OCR · OCR efficiency · dynamic pruning

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

May 24, 2026 · Artificial Intelligence

The First Visual‑Language Parallel Thinking Framework: Unpacking Its Core Mechanisms

The paper introduces Visual Para-Thinker, a parallel‑thinking framework for large‑scale visual‑language models that uses visual‑centered block and scan path partitions, Path‑aware Attention and Learnable Parallel Rotary Position Embedding, and demonstrates consistent gains across counting, visual search, hallucination and grounding benchmarks.

LPRoPE · Multimodal AI · Pa-Attention

0 likes · 11 min read

SuanNi

Mar 27, 2026 · Artificial Intelligence

How OmniScience Dataset Boosts Multimodal AI Understanding of Scientific Figures

The OmniScience project introduces a 1.5‑million high‑quality image‑text pair dataset and a sophisticated pipeline that parses complex scientific documents, rewrites figure captions with large language models, and dramatically improves multimodal AI performance on benchmark tests.

Multimodal AI · data annotation · scientific dataset

0 likes · 9 min read

DeepHub IMBA

Mar 23, 2026 · Artificial Intelligence

How KgCoOp Uses Knowledge‑Guided Context Optimization to Prevent Prompt Tuning Forgetting

The article analyzes why standard prompt tuning (CoOp) causes catastrophic forgetting in visual‑language models, introduces the KgCoOp framework that adds a knowledge‑guided loss to regularize prompts, and shows through extensive experiments on 11 benchmarks that KgCoOp improves unseen‑class accuracy, harmonic mean, and efficiency while discussing trade‑offs and limitations.

Catastrophic Forgetting · Knowledge-guided Optimization · Prompt Tuning

0 likes · 11 min read

AI Explorer

Feb 28, 2026 · Artificial Intelligence

How VLAW Unites World Models and Visual Language Models to Advance Embodied AI

The VLAW framework, developed by researchers from Tsinghua and Stanford, integrates high‑fidelity world models with visual‑language models, enabling real‑time physical interaction and intent understanding, which could dramatically improve training efficiency for embodied robots and mark a milestone toward safe, autonomous agents in complex real‑world environments.

Embodied AI · Simulation · VLAW

0 likes · 6 min read

AI Algorithm Path

Feb 17, 2026 · Artificial Intelligence

Why Contrastive Learning Is the Core Foundation of Visual Language Models

The article explains how contrastive learning replaces fixed‑category visual training with a relationship‑based approach, detailing the dual‑encoder architecture, cosine similarity loss, batch scaling, temperature control, zero‑shot capabilities, scalability from web data, and the method's strengths and limitations in modern multimodal AI.

CLIP · Multimodal AI · contrastive learning

0 likes · 25 min read

Tencent Advertising Technology

Feb 5, 2026 · Artificial Intelligence

How Multi-Agent VLMs and PNU Loss Achieve High‑Accuracy Harmful Content Detection with Only 50 Labels

This article presents a low‑resource offensive content detection framework that leverages multi‑agent visual‑language models (MA‑VLMs) for self‑training and a novel Positive‑Negative‑Unlabeled (PNU) loss, enabling accurate classification with as few as 50 annotated samples across multimodal datasets.

Multi-modal AI · PNU loss · Self‑Training

0 likes · 20 min read

Amap Tech

Jul 14, 2025 · Artificial Intelligence

Zero-Shot Domain Adaptation for Object Detection: How UPRE Boosts Cross-Domain Performance

The UPRE framework introduces multi‑view domain prompts and unified representation enhancement to achieve zero‑shot domain adaptation for object detection, dramatically improving detection accuracy on unseen target domains across diverse visual scenarios.

Prompt engineering · cross-domain learning · object detection

0 likes · 10 min read

AI Algorithm Path

Jun 22, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 3: Contrastive Learning Loss Functions

This article systematically introduces the most common contrastive learning loss functions—including Contrastive Loss, Triplet Loss, N‑pair Loss, InfoNCE, and Cross‑Entropy—explaining their mathematical formulations, advantages, challenges, and typical applications in visual, textual, and multimodal representation learning.

InfoNCE · Loss Functions · contrastive learning

0 likes · 10 min read

AI Algorithm Path

Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 2: Understanding Contrastive Learning

This article explains contrastive learning for visual language models, covering its definition, four‑step workflow, how to choose positive and negative pairs, the difference between supervised and self‑supervised variants, and why the technique is essential for zero‑shot and cross‑modal capabilities.

contrastive learning · data augmentation · representation learning

0 likes · 6 min read

AI Algorithm Path

Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 1: What They Are and Why They Matter

This article introduces visual‑language models (VLMs), explaining how they combine large language models with visual encoders, why they overcome the rigidity of traditional computer‑vision systems, their key advantages, modular architecture, training methods, and practical applications such as image captioning and visual question answering.

AI Applications · Large Language Models · Multimodal AI

0 likes · 8 min read

AIWalker

May 26, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting

VisionReasoner presents a reinforcement‑learning‑driven unified framework that simultaneously tackles detection, segmentation, and counting tasks, employing a novel multi‑target cognition strategy and efficient Hungarian‑based matching, and demonstrates substantial gains—29.1% on COCO detection, 22.1% on ReasonSeg, and 15.3% on CountBench—using only 7,000 training samples.

Multi-Task Learning · Segmentation · VisionReasoner

0 likes · 20 min read

AI Frontier Lectures

May 23, 2025 · Artificial Intelligence

How SuperEdit Boosts Instruction-Based Image Editing with Rectified Supervision

SuperEdit introduces rectified instruction generation and contrastive supervision to fix noisy supervision in instruction‑based image editing, achieving up to 9.19% performance gains on Real‑Edit benchmarks without extra model parameters or pre‑training, and releases all data and code publicly.

Diffusion Models · image editing · visual language models

0 likes · 15 min read

AI Algorithm Path

Apr 20, 2025 · Artificial Intelligence

Boosting Visual Reasoning in VLMs with Reinforcement Learning

The article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to visual‑language models to overcome the limitations of traditional chain‑of‑thought prompting and supervised fine‑tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

Chain-of-Thought · LLM · RL Training

0 likes · 10 min read

Ximalaya Technology Team

Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a novel multimodal generation model that uses generative vokens to interleave text and image synthesis. It integrates Stable Diffusion with LLMs through two-stage training that requires no domain-specific annotations, achieving state‑of‑the‑art coherence and quality on benchmarks like CC3M, VIST, and MMDialog.

AI research · Multimodal Generation · Stable Diffusion

0 likes · 9 min read

Huolala Tech

Jul 21, 2023 · Artificial Intelligence

Visual Language Models Power Open-Set Detection and Surgical Tool Segmentation

Recent advances in visual language models enable zero-shot multimodal tasks, and this article explores their application to open-set object detection, prompt learning, and promptable surgical instrument segmentation, highlighting methods like CLIP, CoOp, and the DetPro framework with experimental results across multiple benchmarks.

Multimodal · Semantic Segmentation · computer vision

0 likes · 12 min read
