Tagged articles

document understanding

21 articles · Page 1 of 1

Apr 29, 2026 · Artificial Intelligence

Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights

Doc‑V* introduces a dynamic, thumbnail‑driven approach that lets a model decide which pages to read, achieving a 49.7% improvement over RAG variants on multi‑page document QA benchmarks without larger models or longer context windows, and demonstrates how strategic evidence acquisition outperforms naïve full‑document reading.

AI · RAG · document understanding

0 likes · 10 min read

Xiaomi Tech

Apr 10, 2026 · Artificial Intelligence

Xiaomi AI’s 8× Faster Mobile Inference and OCR‑Free 80‑Page Document Understanding at ACL 2026

Xiaomi’s AI team announced seven ACL 2026 papers that span low‑bit KV‑cache quantization for 8.3× faster LLM inference, OCR‑free multi‑page document VQA, a new attention‑basin analysis, non‑autoregressive spoken dialogue generation, a comprehensive mobile‑agent benchmark, a success‑rate‑aware training policy, and a progressive universal information‑extraction framework.

Inference Optimization · Large Language Models · benchmark

0 likes · 12 min read

Old Zhang's AI Learning

Jan 31, 2026 · Artificial Intelligence

How a 0.1B‑Parameter OCR Model Beats Multi‑Billion‑Parameter Vision‑Language Models

UniRec‑0.1B, a lightweight OCR model with only 0.1B parameters, achieves accuracy comparable to or better than multi‑billion‑parameter vision‑language models across text, formula, and mixed‑content tasks, thanks to hierarchical supervision training, a semantic‑decoupled tokenizer, and a large 40 M‑sample dataset, while delivering 2‑9× faster inference and full open‑source availability.

Hierarchical Supervision · OCR · Open Source

0 likes · 12 min read

HyperAI Super Neural

Jan 30, 2026 · Artificial Intelligence

Frontier OCR Advances: DeepSeek, Tencent, and Baidu Push From Text Recognition to Structured Document Understanding

This weekly AI paper roundup reviews five cutting‑edge OCR studies—DeepSeek‑OCR 2, LightOnOCR‑2‑1B, HunyuanOCR, PaddleOCR‑VL, and GOT—detailing their novel visual‑language architectures, training data, benchmark evaluations, and performance gains over previous models.

DeepSeek · GoT · LightOnOCR

0 likes · 9 min read

PaperAgent

Jan 27, 2026 · Artificial Intelligence

How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding

DeepSeek-OCR 2 introduces a novel dual‑stream (bidirectional + causal) attention architecture that replaces fixed raster scanning, leverages a Qwen2‑0.5B encoder, and achieves state‑of‑the‑art accuracy on OmniDocBench while reducing token budget and improving reading‑order consistency.

DeepEncoder · DeepSeek · Dual-Stream Attention

0 likes · 8 min read

vivo Internet Technology

Sep 10, 2025 · Artificial Intelligence

How Structured Input Boosts Multimodal LLMs in Document QA Without Retraining

This article presents a training‑free, architecture‑agnostic method that leverages LaTeX‑style structured inputs to preserve document hierarchy and spatial relationships, thereby improving multimodal large language model performance on document question answering tasks across multiple benchmarks.

AI · DocQA · attention analysis

0 likes · 8 min read

AntTech

Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI research · Large Language Models · Long Context

0 likes · 6 min read

AI Frontier Lectures

Mar 7, 2025 · Artificial Intelligence

Can Mistral’s New OCR Model Really Beat the Competition? A Deep Dive

Mistral AI’s newly launched OCR API claims to deliver world‑class document understanding with multilingual support, high speed, and self‑hosting options, and benchmark tests show it outperforms Azure OCR and Google Doc AI, yet independent evaluations reveal limitations on complex tables and legal forms, prompting a balanced assessment of its readiness for enterprise use.

AI model · Mistral AI · OCR

0 likes · 7 min read

DataFunSummit

Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AI · RAG · Tensor Retrieval

0 likes · 11 min read

Baidu Geek Talk

Jan 6, 2025 · Information Security

MarkupLM-based Detection of Malicious Content Scraping

The article presents a MarkupLM‑based approach that enriches BERT with XPath embeddings to jointly model webpage text and structure, enabling site‑level detection of malicious content‑scraping pages that bypass traditional rule‑based filters and demonstrating the critical role of structural cues in improving spam classification accuracy.

MarkupLM · XPath embedding · content scraping detection

0 likes · 16 min read

NewBeeNLP

Jan 2, 2025 · Artificial Intelligence

Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions

This article examines the implementation paths and future prospects of multimodal Retrieval‑Augmented Generation, covering semantic extraction, transformer‑based OCR, visual language models, scaling challenges, tensor indexing, and practical evaluations with tools like Infinity and ColPali.

AI Retrieval · Infinity Database · Multimodal RAG

0 likes · 12 min read

360 Tech Engineering

Nov 15, 2024 · Artificial Intelligence

Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)

At the 2024 Global Machine Learning Conference in Beijing, 360 AI Research Institute showcased cutting‑edge multimodal large‑model research, fine‑grained open‑world object detection, and document understanding technologies, highlighting open‑source releases, real‑world deployments, and competitive achievements in AI competitions.

AI research · Knowledge Graph · Multimodal AI

0 likes · 7 min read

Sohu Tech Products

Nov 6, 2024 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

The talk outlines RAG2.0’s design challenges (low vector recall, complex documents, semantic gaps) and presents a two‑stage architecture using deep multimodal understanding and knowledge‑graph‑enhanced retrieval, detailing advanced chunking, multi‑index and multi‑path retrieval, efficient ranking models such as ColBERT, and future multi‑modal and memory‑augmented agent directions.

ColBERT · Delayed Interaction · Enterprise AI

0 likes · 23 min read

360 Tech Engineering

Jul 3, 2024 · Artificial Intelligence

360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios

The 360LayoutAnalysis project from 360 AI Lab releases lightweight, YOLOv8‑based layout analysis models covering Chinese and English papers, Chinese research reports, and a general document scenario, providing fast inference, paragraph‑level detection, and open‑source code and weights for flexible document‑understanding pipelines.

AI model · Layout Analysis · Multimodal

0 likes · 9 min read

DataFunSummit

Sep 5, 2023 · Artificial Intelligence

Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, the evolution of related technologies, large‑model approaches such as multimodal pre‑training and domain‑specific models, and concrete enterprise use cases across various business functions.

Enterprise AI · Multimodal AI · document intelligence

0 likes · 14 min read

AntTech

Aug 25, 2023 · Artificial Intelligence

LayoutGCN: A Lightweight Graph Convolutional Network for Visually Rich Document Understanding

LayoutGCN is a lightweight, graph‑based framework that jointly encodes text, layout, and image features of visually rich documents, achieving competitive performance on multiple downstream tasks while drastically reducing model size and computational cost, making it suitable for edge deployment.

Graph Neural Network · LayoutGCN · document understanding

0 likes · 24 min read

AntTech

Jul 31, 2023 · Artificial Intelligence

LayoutMask: Enhancing Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

LayoutMask introduces a novel multi-modal pre‑training model that replaces global 1D position with local 1D position and adds Whole Word Masking, Layout‑Aware Masking, and Masked Position Modeling, achieving state‑of‑the‑art results on various visually‑rich document understanding tasks.

AI · Multimodal Pretraining · NLP

0 likes · 15 min read

DataFunSummit

Apr 7, 2023 · Artificial Intelligence

Comprehensive Overview of OCR: Types, Models, Pre‑training Techniques, and DIY Pipelines on ModelScope

This article provides a detailed introduction to OCR technology, covering its fundamental concepts, major categories (document, scene, and handwritten OCR), typical processing pipelines, a suite of open‑source models on ModelScope—including detection, recognition, and table OCR—and recent multimodal pre‑training methods such as VLDoc and VLPT.

ModelScope · OCR · Table OCR

0 likes · 15 min read

AntTech

Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.

Multimodal · Vision Transformer · XYCut

0 likes · 10 min read

Architects Research Society

Jan 9, 2022 · Artificial Intelligence

Five Key Trends in AI-Powered Search and Unstructured Data Analysis

The article outlines five major trends—neural-network-enhanced search, semantic search, document understanding, image and voice search, and knowledge graphs—that are transforming enterprise use of unstructured data by leveraging AI to deliver precise, context-aware answers and insights.

AI · Knowledge Graph · Search

0 likes · 15 min read

Architects Research Society

Aug 22, 2020 · Artificial Intelligence

Five Key Trends Shaping Enterprise Search and Unstructured Data Analysis

The article outlines how advances in neural networks, semantic search, document understanding, image and voice search, and knowledge graphs are transforming enterprise search of unstructured data, enabling more accurate, context‑aware answers and new business use cases across organizations.

AI · Knowledge Graph · document understanding

0 likes · 13 min read