Tagged articles

visual language models

20 articles · Page 1 of 1
Machine Heart
Jul 4, 2026 · Artificial Intelligence

When Swapping Two Images Breaks VLMs: EgoTSR Enables Robots to Judge Real Task Progress

The paper reveals that visual language models often rely on chronological bias, mistaking later frames for progress. It introduces EgoTSR, a 46‑million‑sample ego‑centric dataset with a three‑stage curriculum that teaches models to assess task state, is evaluated with forward‑reverse tests, and achieves over 92% accuracy on long‑term robotic tasks.

chronological-bias · curriculum-learning · ego-centric reasoning
0 likes · 11 min read
Design Hub
Jun 20, 2026 · Artificial Intelligence

Can AI Really Judge Good Design? Findings from the Design Crit Study

Contra Labs' Design Crit dataset reveals that although AI can generate images, current AI judges barely outperform random guessing in assessing design quality. A small fine‑tuned model, however, can close nearly half the gap to human agreement by learning from expert‑rated criteria.

AI design evaluation · Design Crit dataset · Generative AI
0 likes · 16 min read
Bighead's Algorithm Notes
Jun 20, 2026 · Artificial Intelligence

Can Large Vision‑Language Models Really Understand Candlestick Charts?

This paper builds a multi‑scale candlestick‑chart dataset and a standardized evaluation framework to measure how well visual language models (VLMs) extract price information, using confusion‑matrix diagnostics and Information Coefficient (IC) metrics, and finds that VLMs excel only on monotonic trends and struggle with precise time‑based predictions.

Prompt engineering · candlestick chart · information coefficient
0 likes · 13 min read
HyperAI Super Neural
Jun 11, 2026 · Artificial Intelligence

ChartNet: MIT/IBM’s Million‑Scale Synthetic Chart Dataset with 1.5M Diverse Samples

MIT and IBM researchers introduce ChartNet, the largest code‑guided synthetic chart dataset containing 1.5 million multimodal samples across 24 chart types and six libraries, and demonstrate that fine‑tuning visual‑language models on it yields consistent, significant gains on chart reconstruction, data extraction, summarization, and reasoning tasks, outperforming much larger off‑the‑shelf models including GPT‑4o.

AI research · ChartNet · chart understanding
0 likes · 13 min read
JD Retail Technology
Jun 2, 2026 · Artificial Intelligence

RTPrune: Two‑Stage Reading‑Inspired Token Pruning for Efficient DeepSeek‑OCR Inference

The paper presents RTPrune, a token‑pruning technique for DeepSeek‑OCR that exploits a two‑stage reading behavior in LLM decoding, first keeping high‑norm visual tokens and then fusing the rest via optimal‑transport matching with a dynamic pruning‑rate strategy, achieving up to 15% GFLOPs reduction and 18.9% speedup while preserving over 99% OCR accuracy across multiple benchmarks.

DeepSeek-OCR · OCR efficiency · dynamic pruning
0 likes · 9 min read
Machine Learning Algorithms & Natural Language Processing
May 24, 2026 · Artificial Intelligence

The First Visual‑Language Parallel Thinking Framework: Unpacking Its Core Mechanisms

The paper introduces Visual Para-Thinker, a parallel‑thinking framework for large‑scale visual‑language models that uses visual‑centered block and scan path partitions, Path‑aware Attention and Learnable Parallel Rotary Position Embedding, and demonstrates consistent gains across counting, visual search, hallucination and grounding benchmarks.

LPRoPE · Multimodal AI · Pa-Attention
0 likes · 11 min read
SuanNi
Mar 27, 2026 · Artificial Intelligence

How OmniScience Dataset Boosts Multimodal AI Understanding of Scientific Figures

The OmniScience project introduces a dataset of 1.5 million high‑quality image‑text pairs, built with a pipeline that parses complex scientific documents and rewrites figure captions with large language models, and reports dramatic improvements in multimodal AI performance on benchmark tests.

Multimodal AI · data annotation · scientific dataset
0 likes · 9 min read
DeepHub IMBA
Mar 23, 2026 · Artificial Intelligence

How KgCoOp Uses Knowledge‑Guided Context Optimization to Prevent Prompt Tuning Forgetting

The article analyzes why standard prompt tuning (CoOp) causes catastrophic forgetting in visual‑language models, introduces the KgCoOp framework that adds a knowledge‑guided loss to regularize prompts, and shows through extensive experiments on 11 benchmarks that KgCoOp improves unseen‑class accuracy, harmonic mean, and efficiency. It also discusses the method's trade‑offs and limitations.

Catastrophic Forgetting · Knowledge-guided Optimization · Prompt Tuning
0 likes · 11 min read
AI Explorer
Feb 28, 2026 · Artificial Intelligence

How VLAW Unites World Models and Visual Language Models to Advance Embodied AI

The VLAW framework, developed by researchers from Tsinghua and Stanford, integrates high‑fidelity world models with visual‑language models, enabling real‑time physical interaction and intent understanding, which could dramatically improve training efficiency for embodied robots and mark a milestone toward safe, autonomous agents in complex real‑world environments.

Embodied AI · Simulation · VLAW
0 likes · 6 min read
AI Algorithm Path
Feb 17, 2026 · Artificial Intelligence

Why Contrastive Learning Is the Core Foundation of Visual Language Models

The article explains how contrastive learning replaces fixed‑category visual training with a relationship‑based approach, detailing the dual‑encoder architecture, cosine similarity loss, batch scaling, temperature control, zero‑shot capabilities, scalability from web data, and the method's strengths and limitations in modern multimodal AI.

CLIP · Multimodal AI · contrastive learning
0 likes · 25 min read
Tencent Advertising Technology
Feb 5, 2026 · Artificial Intelligence

How Multi-Agent VLMs and PNU Loss Achieve High‑Accuracy Harmful Content Detection with Only 50 Labels

This article presents a low‑resource offensive content detection framework that leverages multi‑agent visual‑language models (MA‑VLMs) for self‑training and a novel Positive‑Negative‑Unlabeled (PNU) loss, enabling accurate classification with as few as 50 annotated samples across multimodal datasets.

Multi-modal AI · PNU loss · Self‑Training
0 likes · 20 min read
AI Algorithm Path
Jun 22, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 3: Contrastive Learning Loss Functions

This article systematically introduces the most common contrastive learning loss functions—including Contrastive Loss, Triplet Loss, N‑pair Loss, InfoNCE, and Cross‑Entropy—explaining their mathematical formulations, advantages, challenges, and typical applications in visual, textual, and multimodal representation learning.

InfoNCE · Loss Functions · contrastive learning
0 likes · 10 min read
AI Algorithm Path
Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 2: Understanding Contrastive Learning

This article explains contrastive learning for visual language models, covering its definition, four‑step workflow, how to choose positive and negative pairs, the difference between supervised and self‑supervised variants, and why the technique is essential for zero‑shot and cross‑modal capabilities.

contrastive learning · data augmentation · representation learning
0 likes · 6 min read
AI Algorithm Path
Jun 20, 2025 · Artificial Intelligence

Beginner’s Guide to Visual Language Models – Day 1: What They Are and Why They Matter

This article introduces visual‑language models (VLMs), explaining how they combine large language models with visual encoders, why they overcome the rigidity of traditional computer‑vision systems, their key advantages, modular architecture, training methods, and practical applications such as image captioning and visual question answering.

AI Applications · Large Language Models · Multimodal AI
0 likes · 8 min read
AIWalker
May 26, 2025 · Artificial Intelligence

VisionReasoner: RL‑Unified Model Beats YOLO‑World Detection, Segmentation, Counting

VisionReasoner presents a reinforcement‑learning‑driven unified framework that simultaneously tackles detection, segmentation, and counting tasks, employing a novel multi‑target cognition strategy and efficient Hungarian‑based matching, and demonstrates substantial gains—29.1% on COCO detection, 22.1% on ReasonSeg, and 15.3% on CountBench—using only 7,000 training samples.

Multi-Task Learning · Segmentation · VisionReasoner
0 likes · 20 min read
AI Frontier Lectures
May 23, 2025 · Artificial Intelligence

How SuperEdit Boosts Instruction-Based Image Editing with Rectified Supervision

SuperEdit introduces rectified instruction generation and contrastive supervision to fix noisy supervision in instruction‑based image editing, achieving up to 9.19% performance gains on Real‑Edit benchmarks without extra model parameters or pre‑training, and releases all data and code publicly.

Diffusion Models · image editing · visual language models
0 likes · 15 min read
AI Algorithm Path
Apr 20, 2025 · Artificial Intelligence

Boosting Visual Reasoning in VLMs with Reinforcement Learning

The article analyzes how reinforcement learning, which transformed LLM reasoning in DeepSeek, can be applied to visual‑language models to overcome the limitations of traditional chain‑of‑thought prompting and supervised fine‑tuning, presenting concrete reward designs, training pipelines, and a critical assessment of their strengths and weaknesses.

Chain-of-Thought · LLM · RL Training
0 likes · 10 min read
Ximalaya Technology Team
Oct 10, 2023 · Artificial Intelligence

MiniGPT-5: A Novel Multimodal Generation Model for Coherent Text-Image Synthesis

MiniGPT-5 is a novel multimodal generation model using generative vokens to interleave text and image synthesis, integrating Stable Diffusion and LLMs with a two-stage training that requires no domain-specific annotations, achieving state‑of‑the‑art coherence and quality on benchmarks like CC3M, VIST, and MMDialog.

AI research · Multimodal Generation · Stable Diffusion
0 likes · 9 min read
Huolala Tech
Jul 21, 2023 · Artificial Intelligence

Visual Language Models Power Open-Set Detection and Surgical Tool Segmentation

Recent advances in visual language models enable zero-shot multimodal tasks, and this article explores their application to open-set object detection, prompt learning, and promptable surgical instrument segmentation, highlighting methods like CLIP, CoOp, and the DetPro framework with experimental results across multiple benchmarks.

Multimodal · Semantic Segmentation · computer vision
0 likes · 12 min read