Tagged articles

visual language model

26 articles · Page 1 of 1

Jun 5, 2026 · Artificial Intelligence

How PaddleOCR‑VL‑1.6’s 0.9B Model Achieved 96.33% SOTA on OmniDocBench v1.6

PaddleOCR‑VL‑1.6, a compact 0.9B visual‑language model, diagnoses three types of weak regions, enriches targeted data, and applies a three‑stage CPT‑SFT‑RL training pipeline to reach a 96.33% overall score on OmniDocBench v1.6, surpassing much larger models across all document‑parsing tasks.

OmniDocBench · PaddleOCR-VL-1.6 · SOTA

0 likes · 10 min read

Machine Learning Algorithms & Natural Language Processing

Jun 4, 2026 · Artificial Intelligence

How CRAFTER Turns AI‑Generated Research Figures into Editable SVGs

The article analyzes CRAFTER and its companion CRAFTEDITOR, which together generate research diagrams with AI and convert raster outputs into fully editable SVGs, detailing their multi‑agent workflow, benchmark results, multi‑condition input support, and open‑source availability.

AI figure generation · CRAFTEDITOR · CRAFTER

0 likes · 7 min read

Machine Learning Algorithms & Natural Language Processing

Jun 3, 2026 · Artificial Intelligence

Can Multimodal Models Ditch Frame Sampling? LLaVA‑OneVision‑2.0’s Codec‑Stream

LLaVA‑OneVision‑2.0 replaces uniform frame sampling with a codec‑stream visual unit, integrates a OneVision‑Encoder that tokenizes video as state‑plus‑incremental evidence, and demonstrates consistent gains on 18 video, 11 spatial‑reasoning and 4 tracking benchmarks while open‑sourcing its model, data and code.

JumpScore · LLaVA-OneVision-2.0 · Multimodal

0 likes · 17 min read

AI Engineering

Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a visual language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI benchmark · GUI Agent · Holo3

0 likes · 4 min read

SuanNi

Mar 15, 2026 · Artificial Intelligence

How LabClaw, LabOS, and MedOS Are Turning AI into a Collaborative Scientist

This article explores the LabClaw skill library, LabOS laboratory operating system, and MedOS surgical platform—detailing their modular AI capabilities, multi‑agent architectures, benchmark results, and how they together create a self‑evolving ecosystem that transforms AI into a real‑time collaborative scientist for biomedical research and clinical practice.

AI · Autonomous Agents · XR

0 likes · 14 min read

PaperAgent

Feb 24, 2026 · Artificial Intelligence

How AI Agents Can Auto‑Generate High‑Quality Research Flowcharts

This article introduces PaperBanana, a multi‑agent AI framework that automates the creation of academic illustrations by retrieving references, planning descriptions, styling, visualizing, and iteratively refining images, and evaluates its performance on the new PaperBananaBench benchmark against existing baselines.

AI illustration · Automation · academic graphics

0 likes · 8 min read

PaperAgent

Feb 2, 2026 · Artificial Intelligence

How Kimi K2.5 Achieves Multimodal Mastery with Joint Training and Agent Swarms

The Kimi K2.5 technical report reveals how a Chinese team combined joint text‑vision training, a novel Zero‑Vision SFT method, and a parallel agent‑swarm architecture to deliver top‑ranked multimodal performance, dramatically faster inference, and open‑source access for broader AI research.

AI research · Agent Swarm · Kimi K2.5

0 likes · 9 min read

Amap Tech

Oct 3, 2025 · Artificial Intelligence

How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines visual‑language models and video‑generation diffusion to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

3D synthesis · Graph Modeling · human-scene interaction

0 likes · 12 min read

Amap Tech

Sep 19, 2025 · Artificial Intelligence

How FSDrive Uses Spatio‑Temporal CoT to Revolutionize Autonomous Driving

FSDrive introduces a spatio‑temporal chain‑of‑thought approach that enables visual language models to generate future driving scenes as images, improving trajectory planning accuracy and safety by eliminating cross‑modal gaps and enforcing physical constraints in autonomous driving.

AI research · autonomous driving · spatio-temporal CoT

0 likes · 10 min read

Tencent Technical Engineering

Sep 12, 2025 · Artificial Intelligence

How POINTS-Reader Achieves State‑of‑the‑Art PDF Extraction Without Teacher Models

The POINTS-Reader paper, accepted at EMNLP 2025, introduces a two‑stage, fully automated data generation pipeline that enables a lightweight visual‑language model to extract text, tables, and LaTeX formulas from diverse PDF layouts with superior performance and high throughput, all without relying on costly teacher‑model distillation.

AI · Document Parsing · OCR

0 likes · 12 min read

Baidu Geek Talk

Aug 25, 2025 · Artificial Intelligence

How ERNIE‑4.5‑VL Redefines Multimodal AI with 100+ Language Support

The ERNIE‑4.5‑VL visual‑language model goes beyond single‑modality limits with image, video, and text understanding across more than 100 languages, offering lightweight yet competitive performance against models like Qwen2.5‑VL, a 128K context window, dual “thinking” modes, and extensive deployment resources.

AI research · ERNIE · Large Language Model

0 likes · 4 min read

Xiaohongshu Tech REDtech

Jul 31, 2025 · Artificial Intelligence

How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

dots.ocr is a 1.7 billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single visual-language model, delivering state-of-the-art performance across text, tables, formulas and reading order while remaining efficient and extensible for future multimodal AI research.

AI · Document Parsing · OCR

0 likes · 10 min read

DataFunSummit

Jul 23, 2025 · Artificial Intelligence

Multimodal RAG: Techniques, Challenges, and Scaling the Future of AI

This article presents a comprehensive overview of multimodal Retrieval‑Augmented Generation (RAG), detailing three implementation paths—semantic extraction, Transformer‑based, and Visual Language Model approaches—along with scaling strategies using tensor indexing, performance comparisons, and guidance on selecting the most suitable technical route.

AI Retrieval · Document processing · Multimodal RAG

0 likes · 12 min read

AI Algorithm Path

Jul 20, 2025 · Artificial Intelligence

How to Build an Open‑Set Object Detection Workflow: A Comprehensive Guide

This article presents a step‑by‑step agentic object detection pipeline that combines open‑vocabulary detectors such as Grounding‑DINO with visual language models (GPT‑4o, o1) for concept extraction, critique, refinement, and validation, complete with code snippets, design rationale, and real‑world examples.

Grounding DINO · Open-Vocabulary Detection · Python

0 likes · 33 min read

DataFunTalk

Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI Agents · GLM-4.1V · Multimodal AI

0 likes · 10 min read

AI Frontier Lectures

May 19, 2025 · Artificial Intelligence

How SuperEdit Boosts Instruction-Based Image Editing with Rectified Supervision

SuperEdit introduces rectified instruction generation and contrastive supervision to fix noisy training signals in instruction‑based image editing, achieving up to 9.19% performance gains without extra parameters or pre‑training, as demonstrated on the Real‑Edit benchmark.

Diffusion Models · image editing · supervision

0 likes · 13 min read

58 Tech

Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · Multimodal · Quantization

0 likes · 19 min read

MaGe Linux Operations

Mar 26, 2025 · Artificial Intelligence

Why Qwen2.5‑VL‑32B Is the New AI Breakthrough for Vision and Math

Alibaba's newly released Qwen2.5‑VL‑32B multimodal model delivers state‑of‑the‑art visual and textual performance, offering human‑aligned responses, superior mathematical reasoning, fine‑grained image understanding, and efficient deployment features that make it a compelling tool for developers and AI researchers alike.

AI research · Large Language Model · Qwen2.5-VL-32B

0 likes · 9 min read

AIWalker

Mar 18, 2025 · Artificial Intelligence

How ImageRAG Boosts Text‑to‑Image Generation with Retrieval‑Augmented Generation

ImageRAG introduces a retrieval‑augmented generation framework that dynamically fetches relevant images to guide diffusion models, dramatically improving the synthesis of rare and fine‑grained concepts across multiple text‑to‑image systems, as demonstrated by extensive quantitative and user studies.

AI generation · Diffusion Models · ImageRAG

0 likes · 17 min read

DataFunSummit

Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AI · RAG · Tensor Retrieval

0 likes · 11 min read

AIWalker

Feb 14, 2025 · Artificial Intelligence

ImageRAG: Leveraging RAG and AIGC to Elevate Image Generation Quality

ImageRAG introduces a dynamic retrieval‑augmented generation framework that integrates visual language models and CLIP‑based similarity search to supply reference images, enabling diffusion models like OmniGen and SDXL to better render rare and fine‑grained concepts, as demonstrated through extensive quantitative and qualitative experiments.

AIGC · Diffusion Models · ImageRAG

0 likes · 18 min read

Sohu Tech Products

Jan 8, 2025 · Artificial Intelligence

Multimodal RAG: Implementation Paths and Development Prospects

The talk outlines Multimodal RAG implementation routes, comparing OCR‑based object recognition, transformer encoder‑decoder encoding, and Visual Language Model processing, explains the ColPali late‑interaction method for multi‑dimensional vector matching, addresses scaling tensors with binarization and reranking, and recommends a hybrid long‑term strategy where VLM excels on abstract imagery while traditional OCR remains valuable.

ColPali · Document processing · Multimodal RAG

0 likes · 10 min read

NewBeeNLP

Jan 2, 2025 · Artificial Intelligence

Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions

This article examines the implementation paths and future prospects of multimodal Retrieval‑Augmented Generation, covering semantic extraction, transformer‑based OCR, visual language models, scaling challenges, tensor indexing, and practical evaluations with tools like Infinity and ColPali.

AI Retrieval · Infinity Database · Multimodal RAG

0 likes · 12 min read

Baobao Algorithm Notes

Jul 4, 2024 · Artificial Intelligence

Vitron: How a Pixel‑Level Multimodal LLM Bridges Vision and Language

Vitron is a unified pixel‑level visual multimodal large language model that integrates image, video, and region encoders with a text‑centric strategy, delivering precise pixel‑wise perception and a comprehensive suite of vision tasks from understanding to generation and editing.

AI · LLM · Multimodal

0 likes · 12 min read

21CTO

May 21, 2024 · Artificial Intelligence

How Google’s ScreenAI Could Redefine UI Understanding and UX Design

Google’s new ScreenAI visual‑language model, built on the PaLI architecture, can interpret user interfaces and infographics, answer UI‑related questions, generate summaries and navigate screens, and sets new benchmarks that may reshape future user‑experience research and applications.

Google AI · Multimodal AI · ScreenAI

0 likes · 9 min read

21CTO

Jan 31, 2024 · Artificial Intelligence

Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model

This article introduces LLaVA (Large Language and Vision Assistant), an open‑source multimodal model that aims to replicate GPT‑4V‑style capabilities, explains its architecture, training process, and key features, and provides step‑by‑step instructions for using the web demo, running it locally via Ollama or Hugging Face, and building a simple Gradio chatbot with code examples.

Gradio · LLaVA · Multimodal AI

0 likes · 11 min read
