Tagged articles

VLM

12 articles

Jun 23, 2026 · Artificial Intelligence

Doubao Model 2.1 Launch: Production‑Grade End‑to‑End Coding and Multi‑Agent Breakthrough

Doubao's Model 2.1, unveiled at the Force conference, pushes daily token usage past 180 trillion, captures 49.5% of China's public‑cloud MaaS market, tops code and agent benchmarks, delivers repository‑level coding, advanced multi‑modal reasoning, and introduces cost‑effective Pro and Turbo variants with a new Deep Think inference mode.

AI benchmarking · Doubao · LLM

0 likes · 11 min read

Qunhe Technology Quality Tech

Jun 23, 2026 · Artificial Intelligence

Why Pixel Diff Failed and How VLM Fine‑Tuning Became the Eyes of UI Automation

Traditional pixel‑by‑pixel UI comparison breaks on complex CAD drawings because it cannot account for semantic changes, so a team built a visual‑language‑model fine‑tuning pipeline that turns failure cases into training data, achieves ~95% accuracy, improves regression efficiency by over 40%, and now powers hundreds of daily automation tests.

AI monitoring · UI automation · VLM

0 likes · 12 min read

Machine Learning Algorithms & Natural Language Processing

Jun 18, 2026 · Artificial Intelligence

UniRL: Tencent Hunyuan’s Open‑Source Framework Unifying Multimodal RL Training

UniRL is an open‑source, distributed reinforcement‑learning post‑training framework that consolidates fragmented pipelines for image, video, and language‑vision models, offering a unified rollout‑reward‑advantage‑train‑sync contract, extensive model support, built‑in algorithms, and multi‑modal reward components to lower engineering barriers in AIGC research.

Diffusion Models · LLM · Multimodal RL

0 likes · 10 min read

Machine Heart

Jun 9, 2026 · Artificial Intelligence

Why Standard Vision‑Language Models + Scale Data Beat Specialized 3D Vision Designs (VLM³)

Meta’s VLM³ demonstrates that a plain vision‑language model, when trained on large‑scale data with simple camera‑focal‑length and pixel‑space normalization, matches or surpasses expert 3D vision models across monocular depth estimation, object‑level understanding, pixel‑matching and camera‑pose tasks, eliminating the need for task‑specific architectures, loss functions, data augmentations or regression formulations.

3D Vision · Depth Estimation · Meta

0 likes · 6 min read

Machine Heart

May 31, 2026 · Artificial Intelligence

How a Near‑Invisible Image Can Make GPT‑5.4 and Claude Opus 4.6 Spread False Claims

Researchers from ETH Zurich show that tiny, human‑imperceptible perturbations to a single image can fool leading visual language models—including GPT‑5.4, Claude Opus 4.6, and Grok—into confidently delivering fabricated answers, enabling misinformation amplification, defamation, content‑filter evasion, and large‑scale AI authority laundering.

AI safety · Claude Opus · GPT-5.4

0 likes · 7 min read

AI Engineer Programming

May 9, 2026 · Artificial Intelligence

Why PDF Parsing Is Hard for RAG and Which Mainstream Solutions Work

The article examines the intrinsic challenges of extracting structured text from PDFs for Retrieval‑Augmented Generation—such as missing reading order, table reconstruction, font encoding, and scanned images—and compares lightweight libraries, AI‑enhanced frameworks, commercial APIs, and visual language models as practical solutions.

AI frameworks · OCR · PDF parsing

0 likes · 23 min read

Old Zhang's AI Learning

Jan 30, 2026 · Artificial Intelligence

PaddleOCR‑VL‑1.5: 0.9B Model Beats Billion‑Parameter OCR Models with 94.5% Accuracy

PaddleOCR‑VL‑1.5, Baidu's latest release, uses only 0.9B parameters to achieve 94.5% accuracy on OmniDocBench v1.5, surpassing larger open‑source and commercial OCR models while offering multi‑task, multi‑language support and lightweight deployment; the article also presents detailed performance benchmarks.

DeepSeek-OCR · GPU inference · OCR

0 likes · 9 min read

Data Party THU

Nov 5, 2025 · Artificial Intelligence

How VLM‑FO1 Turns Vision‑Language Models into Precise Perception Machines

VLM‑FO1 introduces a generate‑plus‑reference paradigm that replaces coordinate generation with region token referencing, adding plug‑in modules such as a proposal generator, a hybrid fine‑grained encoder, and a region‑language connector to give any pretrained visual language model accurate, fine‑grained perception while preserving its original capabilities.

AI research · Multimodal · Plug-and-Play

0 likes · 15 min read

AI Algorithm Path

Jul 20, 2025 · Artificial Intelligence

How to Build an Open‑Set Object Detection Workflow: A Comprehensive Guide

This article presents a step‑by‑step agentic object detection pipeline that combines open‑vocabulary detectors such as Grounding‑DINO with visual language models (GPT‑4o, o1) for concept extraction, critique, refinement, and validation, complete with code snippets, design rationale, and real‑world examples.

Grounding DINO · Open-Vocabulary Detection · Python

0 likes · 33 min read

Sohu Tech Products

Jan 8, 2025 · Artificial Intelligence

Multimodal RAG: Implementation Paths and Development Prospects

The talk outlines Multimodal RAG implementation routes, comparing OCR‑based object recognition, transformer encoder‑decoder encoding, and Visual Language Model processing, explains the ColPali late‑interaction method for multi‑dimensional vector matching, addresses scaling tensors with binarization and reranking, and recommends a hybrid long‑term strategy where VLM excels on abstract imagery while traditional OCR remains valuable.

ColPali · Document processing · Multimodal RAG

0 likes · 10 min read

Baobao Algorithm Notes

Nov 18, 2024 · Artificial Intelligence

Boosting Vision‑Language Model Performance: Prompt‑First vs. Fine‑Tuning Strategies

This guide explains when to rely on prompt engineering versus SFT fine‑tuning for Vision‑Language Models, emphasizing data quality, appropriate dataset sizes, training epochs, hyper‑parameter tuning, and practical steps to build robust VLM pipelines.

.ai · DPO · Data Quality

0 likes · 10 min read

AI Large Model Application Practice

Oct 14, 2024 · Artificial Intelligence

Build a Multimodal RAG Pipeline with Kotaemon, Azure Document Intelligence, and VLM

This guide walks through setting up the open‑source Kotaemon framework, configuring Azure Document Intelligence and a vision‑language model (VLM), and implementing code to extract and caption images and tables from PDFs for end‑to‑end multimodal RAG applications.

Azure · Python · RAG

0 likes · 12 min read
