Tagged articles

Visual Reasoning

17 articles

Jun 10, 2026 · Artificial Intelligence

How Visual Para-Thinker Tackles Visual Hallucination with a Clever Parallel Reasoning Design

The article introduces Visual Para-Thinker, a parallel reasoning framework for large vision‑language models that mitigates attention drift and visual hallucination by employing path‑aware attention, learnable parallel rotary position embeddings, and hybrid block‑and‑scan visual token partitions, and validates the approach with extensive multimodal benchmarks.

LPRoPE · Multimodal Benchmarks · Parallel Attention

0 likes · 10 min read

Alibaba Cloud Developer

Jun 3, 2026 · Artificial Intelligence

Qwen3.7-Plus: Deep Reasoning, Visual Understanding, and End‑to‑End Multimodal Execution

Qwen3.7-Plus is a large multimodal model that unifies vision and language, ranks in the global top 5 on Vision Arena, excels on a wide range of pure‑text, visual‑reasoning, and video benchmarks, and powers autonomous agents that perceive screens, generate code, and complete complex GUI/CLI workflows end‑to‑end.

Multimodal AI · Visual Reasoning · agent automation

0 likes · 14 min read

AI Programming Lab

Jun 1, 2026 · Artificial Intelligence

Claude Code Meets Step‑3.7‑Flash: Small Model, Big Multimodal Power

The article reviews Step‑3.7‑Flash, a high‑efficiency multimodal flash model designed for production‑grade agents, detailing its architecture, cost, benchmark results, native visual capabilities, integration with Claude Code via ccmr, and hands‑on experiments that illustrate its strengths and limits in multi‑step tasks.

Agent · Claude Code · Multimodal

0 likes · 10 min read

Architect's Guide

Jun 1, 2026 · Artificial Intelligence

How OpenAI’s Images 2.0 Ushers in the “Thinking” Era of AI Image Generation

OpenAI’s Images 2.0 (gpt-image-2) replaces the traditional image‑generator model with an interactive creative engine that plans, searches the web, and self‑verifies before rendering, offering higher‑quality multi‑language text, batch consistency, and real‑time information at the cost of a token‑based pricing model and limited access to its most advanced features.

AI image generation · Competitive Analysis · GPT Image 2

0 likes · 32 min read

Machine Heart

May 22, 2026 · Artificial Intelligence

ATLAS: One Word Unifies Agentic and Latent Visual Reasoning

ATLAS introduces a discrete functional token that simultaneously serves as an agentic operation and a latent reasoning unit, enabling large multimodal models to perform visual tasks without external tools or intermediate image generation, and achieves competitive results through SFT‑plus‑RL training and a token‑level gradient‑anchor technique.

ATLAS · Multimodal AI · Visual Reasoning

0 likes · 11 min read

Machine Heart

May 20, 2026 · Artificial Intelligence

How VChain Gives Video Generation a Visual Thought Chain for Explicit Spatiotemporal Planning

The VChain framework injects multimodal large‑model reasoning into video generation, using a three‑stage visual‑thought pipeline, sparse inference‑time adaptation, and guided sampling to produce physically consistent, logically coherent videos, as demonstrated by qualitative and quantitative experiments.

Multimodal Large Models · Sparse Fine‑tuning · Visual Reasoning

0 likes · 8 min read

SuanNi

Apr 30, 2026 · Artificial Intelligence

DeepSeek’s New Multimodal Paradigm Compresses Images 7,056× and Outperforms GPT‑4/Claude in Visual Reasoning

DeepSeek’s multimodal model, built on the V4‑Flash architecture and a visual‑primitive reasoning approach, compresses a full‑resolution image by 7,056 times, achieves comparable or superior performance to GPT‑5.4 and Claude‑Sonnet‑4.6 on counting and spatial‑reasoning benchmarks, and does so with dramatically lower compute.

DeepSeek · Multimodal AI · Visual Primitives

0 likes · 12 min read

Architects' Tech Alliance

Apr 23, 2026 · Artificial Intelligence

ChatGPT Images 2.0 Unleashes Terrifyingly Real Synthetic Images – How It Works and What Risks It Brings

OpenAI launched ChatGPT Images 2.0, a model that scores 242 on Image Arena, can generate photorealistic scenes, accurately render text and layouts, and even fabricate social‑media posts, financial receipts, and academic papers, triggering a serious crisis of trust in visual information.

AI image generation · ChatGPT Images 2.0 · OpenAI

0 likes · 9 min read

Machine Heart

Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings, and demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning through three‑stage supervised fine‑tuning and a novel visual‑latent policy optimization method.

Latent Embedding · MLLM · Multimodal

0 likes · 15 min read

Machine Learning Algorithms & Natural Language Processing

Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work

Claude Opus 4.7, announced as Anthropic’s most capable publicly available model, dramatically improves visual reasoning, long‑context task handling and instruction following, delivering up to a 2.4‑fold boost on benchmarks such as XBOW, SWE‑bench and structural biology, while also introducing new security guardrails and token‑usage costs.

AI benchmarks · Anthropic · Claude Opus 4.7

0 likes · 11 min read

PaperAgent

Apr 4, 2026 · Artificial Intelligence

Can AI Master Contextual Photo Search? Inside DeepImageSearch, DISBench, and ImageSeeker

This article examines the DeepImageSearch project, which redefines image retrieval as contextual reasoning, introduces the challenging DISBench benchmark for visual agents, and details the ImageSeeker framework that equips models with multi‑tool interaction and hierarchical memory to tackle complex, multi‑event photo queries.

AI agents · DISBench · DeepImageSearch

0 likes · 9 min read

JavaEdge

Apr 2, 2026 · Artificial Intelligence

Unlocking Qwen3.6-Plus: Features, Multimodal Performance, and API Guide

This article provides an in‑depth overview of the Qwen3.6‑Plus model, detailing its million‑token context window, enhanced multimodal reasoning, benchmark results across language and vision tasks, and step‑by‑step instructions for using the official API and integrating the model with popular coding assistants.

API integration · Multimodal AI · Qwen3.6-Plus

0 likes · 12 min read

AI Frontier Lectures

Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning

0 likes · 16 min read

AI Algorithm Path

Apr 2, 2025 · Artificial Intelligence

Vision‑Reasoning Model: Enabling LLMs to See and Think

The article analyzes the limitations of current visual language models and large reasoning models, proposes a combined Vision‑Reasoning Model (VRM), details its architecture using LLaVA, describes end‑to‑end fine‑tuning and reinforcement‑learning reward design, and argues that such models will become the next breakthrough in AI.

DeepSeek · LLaVA · Large Language Model

0 likes · 9 min read

JavaEdge

Mar 27, 2025 · Artificial Intelligence

Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.

LLM · Visual Reasoning · artificial-intelligence

0 likes · 8 min read

AntTech

Oct 29, 2024 · Artificial Intelligence

Three Ant Group Papers Featured at EMNLP 2024: Dynamic Transformers, Plug‑and‑Play Visual Reasoner, and Efficient Fine‑Tuning of Large Language Models

This announcement introduces three Ant Group papers accepted at EMNLP 2024—Mixture‑of‑Modules for dynamic Transformer assembly, a plug‑and‑play visual reasoning framework built via data synthesis, and a layer‑wise importance‑aware efficient fine‑tuning method for large language models—highlighting their innovations and upcoming live presentations.

AI research · EMNLP 2024 · Visual Reasoning

0 likes · 6 min read

Rare Earth Juejin Tech Community

Jun 22, 2024 · Artificial Intelligence

Claude 3.5 Sonnet: Performance Review and Real‑World Tests

Claude 3.5 Sonnet, Anthropic’s latest large language model, is evaluated across a range of Chinese‑language tasks, visual reasoning, coding, and game creation, showing faster, cheaper, and often superior results compared to GPT‑4o, while also revealing occasional failures in simple games and math problems.

AI model · Anthropic · Claude 3.5

0 likes · 8 min read
