Tagged articles
12 articles
Machine Heart
May 20, 2026 · Artificial Intelligence

How VChain Gives Video Generation a Visual Thought Chain for Explicit Spatiotemporal Planning

The VChain framework injects multimodal large‑model reasoning into video generation, using a three‑stage visual‑thought pipeline, sparse inference‑time adaptation, and guided sampling to produce physically consistent, logically coherent videos, as demonstrated by qualitative and quantitative experiments.

Multimodal Large Models · Sparse Fine‑tuning · Video Generation
8 min read
SuanNi
Apr 30, 2026 · Artificial Intelligence

DeepSeek’s New Multimodal Paradigm Compresses Images 7,056× and Outperforms GPT‑4/Claude in Visual Reasoning

DeepSeek’s multimodal model, built on the V4‑Flash architecture and a visual‑primitive reasoning approach, compresses a full‑resolution image by a factor of 7,056 and matches or exceeds GPT‑5.4 and Claude‑Sonnet‑4.6 on counting and spatial‑reasoning benchmarks at dramatically lower compute.

DeepSeek · Multimodal AI · Visual Primitives
12 min read
Architects' Tech Alliance
Apr 23, 2026 · Artificial Intelligence

ChatGPT Images 2.0 Unleashes Terrifyingly Real Synthetic Images – How It Works and What Risks It Brings

OpenAI launched ChatGPT Images 2.0, a model that scores 242 on Image Arena, generates photorealistic scenes, renders text and layouts accurately, and can even fabricate social‑media posts, financial receipts, and academic papers, creating a severe crisis of trust in visual information.

AI image generation · ChatGPT Images 2.0 · OpenAI
9 min read
Machine Heart
Apr 21, 2026 · Artificial Intelligence

Monet Enables Multimodal Models to Perform Human‑like Abstract Visual Thinking

Monet introduces a training paradigm that lets multimodal large language models reason directly in a continuous latent visual space, replacing external tool calls with implicit visual embeddings. Using three‑stage supervised fine‑tuning and a novel visual‑latent policy optimization method, it demonstrates significant gains on both in‑distribution perception tasks and out‑of‑distribution abstract visual reasoning.

Latent Embedding · MLLM · Visual Reasoning
15 min read
Machine Learning Algorithms & Natural Language Processing
Apr 17, 2026 · Artificial Intelligence

Claude Opus 4.7’s Visual and Long‑Context Leap: Near‑Full Vision and 1M‑Token Tasks Redefine Knowledge Work

Claude Opus 4.7, announced as Anthropic’s most capable publicly available model, dramatically improves visual reasoning, long‑context task handling and instruction following, delivering up to a 2.4‑fold boost on benchmarks such as XBOW, SWE‑bench and structural biology, while also introducing new security guardrails and token‑usage costs.

AI benchmarks · Anthropic · Claude Opus 4.7
11 min read
PaperAgent
Apr 4, 2026 · Artificial Intelligence

Can AI Master Contextual Photo Search? Inside DeepImageSearch, DISBench, and ImageSeeker

This article examines the DeepImageSearch project, which redefines image retrieval as contextual reasoning, introduces the challenging DISBench benchmark for visual agents, and details the ImageSeeker framework that equips models with multi‑tool interaction and hierarchical memory to tackle complex, multi‑event photo queries.

AI agents · Benchmark · DISBench
9 min read
JavaEdge
Apr 2, 2026 · Artificial Intelligence

Unlocking Qwen3.6-Plus: Features, Multimodal Performance, and API Guide

This article provides an in‑depth overview of the Qwen3.6‑Plus model, detailing its million‑token context window, enhanced multimodal reasoning, benchmark results across language and vision tasks, and step‑by‑step instructions for using the official API and integrating the model with popular coding assistants.

Multimodal AI · Qwen3.6-Plus · Visual Reasoning
12 min read
AI Frontier Lectures
Feb 6, 2026 · Artificial Intelligence

Can Merging Text‑Only and Grounded Visual Reasoning Unlock Better Vision‑Language Models?

The paper introduces Mixture‑of‑Visual‑Thoughts (MoVT), a context‑adaptive reasoning paradigm that integrates pure‑text and visually‑grounded inference modes within a single model, and presents the two‑stage AdaVaR training framework with a novel AdaGRPO reinforcement‑learning algorithm to automatically select the optimal mode for each visual‑language task, achieving consistent gains across eight benchmarks and surpassing strong baselines including GPT‑4o.

AdaVaR · Mixture-of-Visual-Thoughts · Visual Reasoning
16 min read
AI Algorithm Path
Apr 2, 2025 · Artificial Intelligence

Vision‑Reasoning Model: Enabling LLMs to See and Think

The article analyzes the limitations of current visual language models and large reasoning models, proposes a combined Vision‑Reasoning Model (VRM), details its architecture using LLaVA, describes end‑to‑end fine‑tuning and reinforcement‑learning reward design, and argues that such models will become the next breakthrough in AI.

DeepSeek · LLaVA · Vision-Language Model
9 min read
JavaEdge
Mar 27, 2025 · Artificial Intelligence

Can a Single LLM Both See and Reason? Exploring Visual Reasoning Models (VRM)

This article examines the limitations of current vision‑language and reasoning models, proposes a visual reasoning model (VRM) that can process images and perform deep logical inference, and discusses architecture, training methods, reinforcement‑learning reward designs, and practical challenges.

Deep Learning · LLM · Vision-Language Model
8 min read
AntTech
Oct 29, 2024 · Artificial Intelligence

Three Ant Group Papers Featured at EMNLP 2024: Dynamic Transformers, Plug‑and‑Play Visual Reasoner, and Efficient Fine‑Tuning of Large Language Models

This announcement introduces three Ant Group papers accepted at EMNLP 2024—Mixture‑of‑Modules for dynamic Transformer assembly, a plug‑and‑play visual reasoning framework built via data synthesis, and a layer‑wise importance‑aware efficient fine‑tuning method for large language models—highlighting their innovations and upcoming live presentations.

AI research · EMNLP 2024 · Visual Reasoning
6 min read
Rare Earth Juejin Tech Community
Jun 22, 2024 · Artificial Intelligence

Claude 3.5 Sonnet: Performance Review and Real‑World Tests

Claude 3.5 Sonnet, Anthropic’s latest large language model, is evaluated across a range of Chinese‑language tasks, visual reasoning, coding, and game creation, showing faster, cheaper, and often superior results compared to GPT‑4o, while also revealing occasional failures in simple games and math problems.

AI model · Anthropic · Claude 3.5
8 min read