Tagged articles

37 articles

May 9, 2026 · Artificial Intelligence

BARD-VL Achieves New SOTA for Multimodal Diffusion Models via Autoregressive‑Diffusion Bridge

The BARD-VL framework bridges pretrained autoregressive vision‑language models to diffusion‑based VLMs, preserving or surpassing the original performance while boosting decoding throughput by up to three times. It does so through progressive block merging, stage‑wise diffusion distillation, and engineering optimizations, and is validated on multiple benchmarks.

BARD-VL · Benchmark · diffusion

0 likes · 9 min read

ArcThink

Apr 29, 2026 · Artificial Intelligence

DeepSeek V4 Vision Mode: Architecture Breakdown and Benchmark vs Top Models

The article dissects DeepSeek V4's newly released vision mode, explains its mounted visual‑language architecture, compares its multimodal capabilities and costs against GPT‑5.5, Gemini 3 and Claude Opus 4.7, and outlines a roadmap from image understanding to native multimodal AI.

AI · Benchmark · DeepSeek

0 likes · 15 min read

SuanNi

Apr 29, 2026 · Artificial Intelligence

SenseNova U1: Open‑Source SOTA Multimodal Model Unifies Vision and Language

SenseNova U1, an open‑source multimodal model from SenseTime, replaces traditional visual encoders and VAEs with a native NEO‑unify architecture, delivering near‑lossless pixel‑level fidelity, a Mixture‑of‑Transformers backbone, and unified training objectives that achieve SOTA performance on diverse vision‑language benchmarks while running efficiently on multiple Chinese chips.

Benchmark · NEO-Unify · SenseNova U1

0 likes · 9 min read

Machine Heart

Apr 27, 2026 · Artificial Intelligence

Google DeepMind Open‑Sources TIPSv2: State‑of‑the‑Art Patch‑Text Alignment at CVPR 2026

The DeepMind team unveils TIPSv2, a vision‑language pre‑training model that dramatically improves patch‑level image‑text alignment through iBOT++, Head‑only EMA, and multi‑granularity captions, achieving record‑breaking results on nine tasks across twenty datasets while remaining fully open‑source.

Computer Vision · DeepMind · Multimodal Pretraining

0 likes · 12 min read

Old Zhang's AI Learning

Apr 19, 2026 · Artificial Intelligence

From Zero to Deployment: A Complete Qwen3.5 Fine‑Tuning Guide

This guide shows how to fine‑tune Qwen3.5 models—from 0.8B to 122B—using Unsloth Studio or pure code, covering text SFT, vision fine‑tuning, MoE models, reinforcement‑learning (GRPO), extensive GGUF quantization benchmarks, hardware requirements, export formats, and deployment tips.

Fine-tuning · LLM · Unsloth

0 likes · 12 min read

PaperAgent

Apr 13, 2026 · Artificial Intelligence

How Keyframe‑Chaining VLA Gives Robots Long‑Term Memory and Faster Reasoning

The article introduces the Keyframe‑Chaining VLA (KC‑VLA) framework, which replaces dense video sampling with semantic keyframe linking to provide robots with global temporal awareness, presents a new long‑term memory benchmark, and demonstrates superior performance in both simulation and real‑world robotic experiments.

AI · Keyframe Chaining · Long-term Memory

0 likes · 9 min read

Machine Learning Algorithms & Natural Language Processing

Mar 31, 2026 · Artificial Intelligence

Unified Multimodal Modeling: How LongCat-Next Bridges Understanding and Generation

The article analyzes why text models naturally combine understanding and generation, explains the fundamental conflicts that prevent images from sharing the same tokenization, and details LongCat-Next’s discrete autoregressive approach—using SAE visual encoders, residual vector quantization, and a unified LLM backbone—to achieve a single model that can both comprehend and create multimodal content.

LongCat-Next · RVQ · dNaViT

0 likes · 21 min read

Machine Learning Algorithms & Natural Language Processing

Mar 28, 2026 · Artificial Intelligence

Do All Physical Signals Reduce to a Single Discrete Token? LongCat‑Next Explained

LongCat‑Next, Meituan’s new 3‑billion‑parameter foundation model, adopts a pure‑discrete DiNA architecture with next‑token prediction, converting vision, audio and text into unified tokens; it surpasses same‑size multimodal models on OmniDocBench‑EN, CharXiv‑RQ and SWE‑bench, avoids catastrophic forgetting, and introduces dNaViT, RVQ compression and a dual‑path detokenizer for high‑fidelity generation.

DiNA · LongCat-Next · SWE-bench

0 likes · 10 min read

Tencent Technical Engineering

Jan 30, 2026 · Artificial Intelligence

Can Rendering Thought Chains as Images Speed Up LLM Reasoning?

This article introduces Render‑of‑Thought (RoT), a novel paradigm that compresses chain‑of‑thought reasoning into visual embeddings using frozen vision encoders, achieving 3‑4× token reduction, faster inference, and improved interpretability while requiring minimal pre‑training.

Inference Optimization · Latent Space · Token Compression

0 likes · 12 min read

PaperAgent

Jan 27, 2026 · Artificial Intelligence

How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding

DeepSeek-OCR 2 introduces a novel dual‑stream (bidirectional + causal) attention architecture that replaces fixed raster scanning, leverages a Qwen2‑0.5B encoder, and achieves state‑of‑the‑art accuracy on OmniDocBench while reducing token budget and improving reading‑order consistency.

DeepEncoder · DeepSeek · Dual-Stream Attention

0 likes · 8 min read

PaperAgent

Jan 23, 2026 · Artificial Intelligence

Top AAAI 2026 Papers: New Vision‑Language‑Action Model, LLM2CLIP and More

AAAI 2026 in Singapore received 23,680 submissions; highlighted papers include ReconVLA’s reconstructive vision‑language‑action model, LLM2CLIP’s language‑enhanced multimodal representation, a sheaflet‑based hypergraph neural network design, advances in description logic modeling, and a novel causal discovery method for dynamical systems.

AAAI 2026 · AI Papers · LLM

0 likes · 7 min read

Kuaishou Tech

Nov 28, 2025 · Artificial Intelligence

Keye-VL-671B-A37B Leads Vision, Video, and Math Benchmarks

Kwai has open‑sourced its new flagship multimodal model Keye‑VL‑671B‑A37B, which upgrades visual perception, cross‑modal alignment and complex reasoning, achieving top scores on image, video, and mathematical reasoning benchmarks while detailing its architecture, three‑stage pre‑training, post‑training strategies, and future multimodal agent plans.

Deep Learning · large language model · multimodal

0 likes · 10 min read

Data Party THU

Sep 28, 2025 · Artificial Intelligence

How YOLO-Count Enables Precise Object Counting in Text-to-Image Generation

This article reviews the YOLO-Count model, a fully differentiable, open‑vocabulary object counting system that guides text‑to‑image generators to produce the exact number of objects specified in prompts, achieving state‑of‑the‑art results on both generic counting and controlled image synthesis tasks.

Object Counting · YOLO-Count · differentiable model

0 likes · 8 min read

Kuaishou Large Model

Sep 8, 2025 · Artificial Intelligence

Keye-VL-1.5-8B: The New Multimodal LLM That Beats GPT-4o on Vision Benchmarks

Kwai's newly released Keye-VL-1.5-8B multimodal large language model dramatically improves visual, reasoning, and temporal understanding, achieving top scores on public video benchmarks and surpassing closed‑source models like GPT‑4o, while offering an open‑source release and detailed technical documentation.

benchmark performance · multimodal LLM · open-source

0 likes · 11 min read

Kuaishou Tech

Aug 23, 2025 · Artificial Intelligence

How Thyme Enables Models to Think Beyond Images with Code‑Driven Multimodal Reasoning

The Kwai Keye team presents Thyme, a novel multimodal reasoning framework that lets large language models generate and safely execute Python code for image manipulation and complex calculations, achieving significant performance gains over existing vision‑language models across perception, reasoning, and hallucination‑reduction benchmarks.

AI research · Code Generation · large language model

0 likes · 12 min read

AIWalker

Aug 6, 2025 · Artificial Intelligence

Why ByteDance’s 7B BAGEL Model Rivals GPT‑4o in Unified Multimodal Understanding and Generation

The article provides an in‑depth technical analysis of ByteDance’s 7‑billion‑parameter BAGEL model, detailing its MoT architecture, high‑quality interleaved multimodal pre‑training data, multi‑stage training strategy, emergent capabilities, and extensive benchmark results that show BAGEL matching or surpassing GPT‑4o on vision‑language tasks.

BAGEL · GPT-4o comparison · Multimodal AI

0 likes · 24 min read

Xiaohongshu Tech REDtech

Aug 6, 2025 · Artificial Intelligence

dots.vlm1: Open‑Source Multimodal Vision‑Language Model Near SOTA Performance

dots.vlm1, the first open‑source multimodal large model from Xiaohongshu hi‑lab, combines a 1.2‑billion‑parameter NaViT visual encoder with the DeepSeek V3 LLM, achieving near‑state‑of‑the‑art visual understanding and reasoning while remaining competitive on text tasks, and is available on GitHub and HuggingFace.

AI · deep-learning · large-model

0 likes · 11 min read

Volcano Engine Developer Services

Jul 8, 2025 · Artificial Intelligence

Unlocking Autonomous GUI Agents: Inside UI‑TARS Multimodal Vision Model

This article introduces UI‑TARS, a multimodal visual model combined with the Model Context Protocol (MCP) to build next‑generation cross‑platform autonomous GUI agents, detailing its architecture, workflow, code examples, incremental inference, applications, challenges, and future research directions.

AI · Automation · GUI Agent

0 likes · 20 min read

Network Intelligence Research Center (NIRC)

May 14, 2025 · Artificial Intelligence

Hands‑On CLIP: Implementing Multimodal Vision‑Language Understanding

This article introduces OpenAI’s CLIP multimodal model, explains its architecture and contrastive training, details hardware and installation steps, and demonstrates a hands‑on zero‑shot image classification workflow that achieves 97% confidence on a cat image without any task‑specific fine‑tuning.

CLIP · Python · contrastive learning

0 likes · 6 min read

Baidu Geek Talk

Apr 2, 2025 · Artificial Intelligence

DeepSeek-VL2 Multimodal Model: Architecture, Training, and Code Walkthrough

DeepSeek‑VL2 is a state‑of‑the‑art multimodal model built on a Mixture‑of‑Experts architecture that combines a SigLIP‑L vision encoder with dynamic tiling, a two‑layer VL adaptor, and a DeepSeek‑MoE language model using Multi‑head Latent Attention, trained in three stages on diverse visual‑language and text data, and achieving strong results on benchmarks such as DocVQA and TextVQA, with full implementation and inference code available in PaddleMIX.

DeepSeek-VL2 · Inference · Mixture of Experts

0 likes · 36 min read

Meituan Technology Team

Mar 27, 2025 · Artificial Intelligence

Q-Eval-100K Dataset and Q-Eval-Score Evaluation Framework for Text-to-Visual Generation

The Q‑Eval‑100K dataset, comprising 100K AIGC images and videos with separate visual‑quality and textual‑consistency annotations, powers the open‑source Q‑Eval‑Score framework that fine‑tunes multimodal models to deliver state‑of‑the‑art, scalable, and objective evaluation (including a “vague‑to‑specific” strategy for long prompts), surpassing existing benchmarks.

AIGC · Dataset · evaluation

0 likes · 9 min read

DaTaobao Tech

Dec 30, 2024 · Artificial Intelligence

AI Research Highlights: AAAI 2025 & NeurIPS 2024 Breakthroughs in Image Generation

This article compiles recent AI research breakthroughs presented at AAAI 2025 and NeurIPS 2024, summarizing eight papers on multi‑condition image generation, mixed auto‑regressive models, hallucination mitigation in vision‑language models, quantized diffusion denoising, facial part swapping, language‑guided concept vectors, attribution consistency, and video virtual try‑on, with links to each work.

AAAI 2025 · AI research · Generative Models

0 likes · 13 min read

21CTO

Dec 4, 2024 · Artificial Intelligence

Introducing Pi-zero: A General‑Purpose AI Foundation Model for Robotics

Physical Intelligence's new Pi-zero model, built on a vision‑language foundation and fine‑tuned with extensive robot data, outperforms prior baselines across multiple tasks, showcasing the promise of large multimodal foundation models for flexible, robust robot control.

AI · Pi-zero · foundation-models

0 likes · 6 min read

HyperAI Super Neural

Nov 20, 2024 · Artificial Intelligence

From Computer Vision to Medical AI: Prof. Xie's Work Hits Nature, NeurIPS, CVPR

Professor Xie's team at Shanghai Jiao Tong University reports rapid progress in AI for Science, detailing multimodal medical AI models, large open datasets, language and vision‑language models, and knowledge‑enhanced representations that outperform existing baselines across multiple benchmarks.

Knowledge Graphs · Open Datasets · large language models

0 likes · 14 min read

DataFunSummit

Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Model architecture · evaluation benchmarks · future research

0 likes · 15 min read

DataFunSummit

Oct 28, 2024 · Artificial Intelligence

Exploration and Practice of Multimodal Large Models at 360

This article presents 360's comprehensive exploration of image‑text multimodal large models, covering background concepts, research routes, three generations of model development, proprietary architectures like SEEChat, 360VL and Inner‑Adaptor, and real‑world AI applications across various products and services.

AI applications · Model architecture · vision-language

0 likes · 19 min read

360 Tech Engineering

May 17, 2024 · Artificial Intelligence

360VL: An Open‑Source Multimodal Large Language Model Based on Llama‑3‑70B

The article introduces 360VL, an open‑source multimodal large language model built on Llama‑3‑70B, describes its novel C‑abs bridge architecture for high‑resolution visual understanding, outlines the two‑stage training with bilingual data, and presents benchmark results showing superior performance over prior LMMs.

AI research · Llama3 · large language model

0 likes · 8 min read

21CTO

May 14, 2024 · Artificial Intelligence

What Makes OpenAI’s New GPT‑4o a Game‑Changing Multimodal AI?

OpenAI’s latest flagship model GPT‑4o combines text, audio, image and video processing in a single, faster, cheaper multimodal system that delivers near‑human response times, expanded API access, and new safety measures, reshaping how developers and users interact with AI.

AI model · Audio Processing · GPT-4o

0 likes · 10 min read

DataFunSummit

Mar 27, 2024 · Artificial Intelligence

Generative Multimodal Pretraining (OFA) and Representational Multimodal Pretraining (ONE-PEACE): Research Overview and Findings

This article reviews Tongyi Lab's work on the OFA framework for generative multimodal pretraining and the ONE-PEACE model for unified multimodal representation learning, detailing their architectures, training strategies, experimental results across vision‑language and audio tasks, and future research directions.

OFA · ONE-PEACE · multimodal

0 likes · 15 min read

DataFunTalk

Sep 26, 2023 · Artificial Intelligence

MiniGPT-4: Enhancing Vision‑Language Understanding with Large Language Models

This article presents MiniGPT-4, a multimodal system that combines a frozen visual encoder (Q‑Former + ViT) with an open‑source large language model (Vicuna), describes its motivation, training pipeline, demo capabilities, observed limitations, and includes a brief Q&A session.

AI research · Image Captioning · MiniGPT-4

0 likes · 15 min read

DataFunTalk

Aug 11, 2023 · Artificial Intelligence

Multimodal Dialogue Large Model mPLUG-Owl: Technology, Applications, and Evaluation

mPLUG-Owl is a modular multimodal dialogue large model from Alibaba DAMO Academy that builds on the mPLUG series, offering advanced image, video, OCR, and multilingual capabilities, with extensive evaluations showing superior performance over MiniGPT‑4, LLaVA, and other multimodal LLMs across various tasks.

Multimodal AI · evaluation · mPLUG-Owl

0 likes · 17 min read

Alibaba Cloud Big Data AI Platform

Jul 11, 2023 · Artificial Intelligence

How FashionKLIP Boosts E‑Commerce Image‑Text Retrieval with a Multimodal Knowledge Graph

The ACL 2023 paper introduces FashionKLIP, an e‑commerce visual‑language model enhanced by a multimodal concept knowledge graph, detailing its automated knowledge graph construction, dual‑stream training strategy, and superior performance on FashionGen retrieval benchmarks compared to state‑of‑the‑art methods.

FashionKLIP · Knowledge Graph · Multimodal Retrieval

0 likes · 10 min read

DataFunTalk

Oct 13, 2022 · Artificial Intelligence

Multimodal Attribute-Level Sentiment Analysis for Social Media: Background, Tasks, and Recent Advances

This article reviews the rapid development of multimodal attribute-level sentiment analysis on social media, outlining its background, defining four core sub‑tasks, summarizing representative recent models—including unified multimodal transformers, coarse‑to‑fine image‑target matching, and vision‑language pre‑training—and discussing experimental results and future research directions.

Deep Learning · NLP · aspect based sentiment

0 likes · 21 min read

DataFunSummit

Oct 9, 2022 · Artificial Intelligence

Understanding the GIT Image‑to‑Text Model: Architecture, Examples, and Performance Comparison

The article introduces the GIT image‑to‑text (image captioning) model, explains its transformer‑based architecture, showcases multiple example outputs, discusses training details, compares its performance with Flamingo and COCO, and highlights its applicability to tasks such as VQA, video captioning, and image classification.

GIT model · Image Captioning · Multimodal AI

0 likes · 12 min read

Baobao Algorithm Notes

Jun 7, 2022 · Artificial Intelligence

How CoCa Unifies Image Captioning and Contrastive Learning in Vision-Language Models

This article examines the CoCa model, explaining how it extends CLIP with image captioning by combining contrastive and generative objectives, detailing its architecture, training tricks, and performance gains on ImageNet and zero‑shot benchmarks.

CoCa · Image Captioning · vision-language

0 likes · 7 min read

DaTaobao Tech

May 24, 2022 · Artificial Intelligence

GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection

GEN‑VLKT introduces a Guided‑Embedding Network with position‑ and instance‑guided embeddings to remove costly post‑processing and leverages CLIP‑based visual‑linguistic knowledge transfer for interaction understanding, achieving state‑of‑the‑art HOI detection performance and zero‑shot capability, now deployed in Alibaba’s Taobao services.

CLIP · HOI detection · Transformer

0 likes · 7 min read

DataFunTalk

Mar 20, 2020 · Artificial Intelligence

UNITER: Unified Image‑Text Representation Learning for Vision‑Language Tasks

This article introduces UNITER, a unified image‑text representation learning framework pretrained on four large multimodal datasets, describes its three pretraining tasks (MLM, ITM, MRM), details model architecture, training optimizations, and evaluates performance across six vision‑language downstream tasks, achieving state‑of‑the‑art results.

AI · ITM · MLM

0 likes · 11 min read
