Tagged articles
23 articles
AI Engineering
Apr 1, 2026 · Artificial Intelligence

Holo3 AI Model Beats GPT‑5.4 at One‑Tenth the Cost for Computer Use

H Company’s new Holo3 series delivers a visual language model that outperforms GPT‑5.4 on the OSWorld‑Verified benchmark with a 78.85% score while costing only about one‑tenth as much, offering both a flagship API‑only version and an open‑source lightweight variant optimized for GUI agents.

AI Benchmark · GUI Agent · Holo3
0 likes · 4 min read
SuanNi
Mar 15, 2026 · Artificial Intelligence

How LabClaw, LabOS, and MedOS Are Turning AI into a Collaborative Scientist

This article explores the LabClaw skill library, LabOS laboratory operating system, and MedOS surgical platform—detailing their modular AI capabilities, multi‑agent architectures, benchmark results, and how they together create a self‑evolving ecosystem that transforms AI into a real‑time collaborative scientist for biomedical research and clinical practice.

AI · Autonomous Agents · Biomedical Research
0 likes · 14 min read
PaperAgent
Feb 24, 2026 · Artificial Intelligence

How AI Agents Can Auto‑Generate High‑Quality Research Flowcharts

This article introduces PaperBanana, a multi‑agent AI framework that automates the creation of academic illustrations by retrieving references, planning descriptions, styling, visualizing, and iteratively refining images. It also evaluates the framework against existing baselines on the new PaperBananaBench benchmark.

AI illustration · Automation · Benchmark
0 likes · 8 min read
PaperAgent
Feb 2, 2026 · Artificial Intelligence

How Kimi K2.5 Achieves Multimodal Mastery with Joint Training and Agent Swarms

The Kimi K2.5 technical report reveals how a Chinese team combined joint text‑vision training, a novel Zero‑Vision SFT method, and a parallel agent‑swarm architecture to deliver top‑ranked multimodal performance, dramatically faster inference, and open‑source access for broader AI research.

AI research · Agent Swarm · Kimi-K2.5
0 likes · 9 min read
Amap Tech
Oct 3, 2025 · Artificial Intelligence

How FantasyHSI Enables Autonomous 3D Human Interaction in Any Scene

FantasyHSI introduces a graph‑based multi‑agent framework that combines visual‑language models and video‑generation diffusion to let digital humans perceive, plan, and interact autonomously in any 3D scene, producing physically plausible, long‑duration actions for animation creation and embodied‑AI simulation.

3D synthesis · Graph Modeling · Video Generation
0 likes · 12 min read
Amap Tech
Sep 19, 2025 · Artificial Intelligence

How FSDrive Uses Spatio‑Temporal CoT to Revolutionize Autonomous Driving

FSDrive introduces a spatio‑temporal chain‑of‑thought approach that enables visual language models to generate future driving scenes as images, improving trajectory planning accuracy and safety by eliminating cross‑modal gaps and enforcing physical constraints in autonomous driving.

AI research · autonomous driving · spatio-temporal CoT
0 likes · 10 min read
Tencent Technical Engineering
Sep 12, 2025 · Artificial Intelligence

How POINTS-Reader Achieves State‑of‑the‑Art PDF Extraction Without Teacher Models

The POINTS-Reader paper, accepted at EMNLP 2025, introduces a two‑stage, fully automated data generation pipeline that enables a lightweight visual‑language model to extract text, tables, and LaTeX formulas from diverse PDF layouts with superior performance and high throughput, all without relying on costly teacher‑model distillation.

AI · Document Parsing · OCR
0 likes · 12 min read
Baidu Geek Talk
Aug 25, 2025 · Artificial Intelligence

How ERNIE‑4.5‑VL Redefines Multimodal AI with 100+ Language Support

The ERNIE‑4.5‑VL visual‑language model handles image, video, and text understanding across more than 100 languages. It is lightweight yet competitive with models such as Qwen2.5‑VL, supports a 128K context window and dual “thinking” modes, and comes with extensive deployment resources.

AI research · Ernie · Multimodal AI
0 likes · 4 min read
Xiaohongshu Tech REDtech
Jul 31, 2025 · Artificial Intelligence

How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

dots.ocr is a 1.7 billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single visual-language model, delivering state-of-the-art performance across text, tables, formulas and reading order while remaining efficient and extensible for future multimodal AI research.

AI · Benchmark · Document Parsing
0 likes · 10 min read
DataFunSummit
Jul 23, 2025 · Artificial Intelligence

Multimodal RAG: Techniques, Challenges, and Scaling the Future of AI

This article presents a comprehensive overview of multimodal Retrieval‑Augmented Generation (RAG), detailing three implementation paths—semantic extraction, Transformer‑based, and Visual Language Model approaches—along with scaling strategies using tensor indexing, performance comparisons, and guidance on selecting the most suitable technical route.

AI Retrieval · Document Processing · Multimodal RAG
0 likes · 12 min read
AI Algorithm Path
Jul 20, 2025 · Artificial Intelligence

How to Build an Open‑Set Object Detection Workflow: A Comprehensive Guide

This article presents a step‑by‑step agentic object detection pipeline that combines open‑vocabulary detectors such as Grounding‑DINO with visual language models (GPT‑4o, o1) for concept extraction, critique, refinement, and validation, complete with code snippets, design rationale, and real‑world examples.

Grounding DINO · Pipeline · Python
0 likes · 33 min read
DataFunTalk
Jul 2, 2025 · Artificial Intelligence

How GLM-4.1V-Thinking Sets New Standards in Multimodal AI Reasoning

Zhipu AI unveiled the GLM-4.1V-Thinking series, an open‑source multimodal model that outperforms larger rivals on visual‑language tasks, supports video analysis, GUI agents, and advanced scientific reasoning, while introducing a curriculum‑sampling reinforcement‑learning framework and a new Agent application platform.

AI agents · GLM-4.1V · Multimodal AI
0 likes · 10 min read
58 Tech
Apr 11, 2025 · Artificial Intelligence

Optimization of Multimodal Visual Large Model Inference: Pre‑processing, ViT TensorRT, CUDA Graphs, Tokenization, Prefix Cache, and Quantization

This report details a comprehensive set of optimizations for multimodal visual large‑model (VLM) inference—including image pre‑processing acceleration, TensorRT integration for the ViT module, CUDA‑Graph replay, token‑count reduction, prefix‑cache handling, and weight quantization—demonstrating up to three‑fold throughput gains while maintaining accuracy.

CUDA Graph · TensorRT · inference-optimization
0 likes · 19 min read
MaGe Linux Operations
Mar 26, 2025 · Artificial Intelligence

Why Qwen2.5‑VL‑32B Is the New AI Breakthrough for Vision and Math

Alibaba's newly released Qwen2.5‑VL‑32B multimodal model delivers state‑of‑the‑art visual and textual performance, offering human‑aligned responses, superior mathematical reasoning, fine‑grained image understanding, and efficient deployment features that make it a compelling tool for developers and AI researchers alike.

AI research · Qwen2.5-VL-32B · large language model
0 likes · 9 min read
AIWalker
Mar 18, 2025 · Artificial Intelligence

How ImageRAG Boosts Text‑to‑Image Generation with Retrieval‑Augmented Generation

ImageRAG introduces a retrieval‑augmented generation framework that dynamically fetches relevant images to guide diffusion models, dramatically improving the synthesis of rare and fine‑grained concepts across multiple text‑to‑image systems, as demonstrated by extensive quantitative and user studies.

AI Generation · Benchmark · ImageRAG
0 likes · 17 min read
DataFunSummit
Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AI · RAG · Tensor Retrieval
0 likes · 11 min read
AIWalker
Feb 14, 2025 · Artificial Intelligence

ImageRAG: Leveraging RAG and AIGC to Elevate Image Generation Quality

ImageRAG introduces a dynamic retrieval‑augmented generation framework that integrates visual language models and CLIP‑based similarity search to supply reference images, enabling diffusion models like OmniGen and SDXL to better render rare and fine‑grained concepts, as demonstrated through extensive quantitative and qualitative experiments.

AIGC · ImageRAG · OmniGen
0 likes · 18 min read
Sohu Tech Products
Jan 8, 2025 · Artificial Intelligence

Multimodal RAG: Implementation Paths and Development Prospects

The talk outlines Multimodal RAG implementation routes, comparing OCR‑based object recognition, transformer encoder‑decoder encoding, and Visual Language Model processing, explains the ColPali late‑interaction method for multi‑dimensional vector matching, addresses scaling tensors with binarization and reranking, and recommends a hybrid long‑term strategy where VLM excels on abstract imagery while traditional OCR remains valuable.

ColPali · Document Processing · Multimodal RAG
0 likes · 10 min read
NewBeeNLP
Jan 2, 2025 · Artificial Intelligence

Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions

This article examines the implementation paths and future prospects of multimodal Retrieval‑Augmented Generation, covering semantic extraction, transformer‑based OCR, visual language models, scaling challenges, tensor indexing, and practical evaluations with tools like Infinity and ColPali.

AI Retrieval · Infinity Database · Multimodal RAG
0 likes · 12 min read
21CTO
May 21, 2024 · Artificial Intelligence

How Google’s ScreenAI Could Redefine UI Understanding and UX Design

Google’s new ScreenAI visual‑language model, built on the PaLI architecture, can interpret user interfaces and infographics, answer UI‑related questions, generate summaries and navigate screens, and sets new benchmarks that may reshape future user‑experience research and applications.

Google AI · Multimodal AI · ScreenAI
0 likes · 9 min read
21CTO
Jan 31, 2024 · Artificial Intelligence

Unlocking LLaVA: A Hands‑On Guide to the Open‑Source Visual Language Model

This article introduces LLaVA (Large Language and Vision Assistant), an open‑source multimodal assistant that aims to replicate GPT‑4V capabilities. It explains the model's architecture, training process, and key features, and provides step‑by‑step instructions for using the web demo, running it locally via Ollama or HuggingFace, and building a simple Gradio chatbot with code examples.

Gradio · LLaVA · Multimodal AI
0 likes · 11 min read