Tagged articles
20 articles
Page 1 of 1
Machine Heart
Machine Heart
Apr 29, 2026 · Artificial Intelligence

Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights

Doc‑V* introduces a dynamic, thumbnail‑driven approach that lets a model decide which pages to read, achieving a 49.7% improvement over RAG variants on multi‑page document QA benchmarks without larger models or longer context windows, and demonstrates how strategic evidence acquisition outperforms naïve full‑document reading.

AIRAGdocument understanding
0 likes · 10 min read
Doc‑V*: Reading Only 5 Pages Beats RAG on 80‑Page Docs – 10 Key Insights
Old Zhang's AI Learning
Old Zhang's AI Learning
Jan 31, 2026 · Artificial Intelligence

How a 0.1B‑Parameter OCR Model Beats Multi‑Billion‑Parameter Vision‑Language Models

UniRec‑0.1B, a lightweight OCR model with only 0.1 B parameters, achieves accuracy comparable to or better than multi‑billion‑parameter visual‑language models across text, formula, and mixed‑content tasks, thanks to hierarchical supervision training, a semantic‑decoupled tokenizer, and a large 40 M‑sample dataset, while delivering 2‑9× faster inference and full open‑source availability.

Hierarchical SupervisionOCRSemantic Decoupled Tokenizer
0 likes · 12 min read
How a 0.1B‑Parameter OCR Model Beats Multi‑Billion‑Parameter Vision‑Language Models
PaperAgent
PaperAgent
Jan 27, 2026 · Artificial Intelligence

How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding

DeepSeek-OCR 2 introduces a novel dual‑stream (bidirectional + causal) attention architecture that replaces fixed raster scanning, leverages a Qwen2‑0.5B encoder, and achieves state‑of‑the‑art accuracy on OmniDocBench while reducing token budget and improving reading‑order consistency.

DeepEncoderDeepSeekDual-Stream Attention
0 likes · 8 min read
How DeepSeek-OCR 2’s Dual-Flow Attention Redefines Document Understanding
AntTech
AntTech
Apr 10, 2025 · Artificial Intelligence

Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase

At the ICLR 2025 live session in Singapore, Ant Group showcased four cutting‑edge papers—CodePlan, Animate‑X, Group Position Embedding, and OmniKV—demonstrating advances in large‑language‑model reasoning, universal character animation, layout‑aware document understanding, and efficient long‑context inference.

AI researchdocument understandinglarge language models
0 likes · 6 min read
Ant Group Presents Four AI Research Papers at ICLR 2025 Live Showcase
AI Frontier Lectures
AI Frontier Lectures
Mar 7, 2025 · Artificial Intelligence

Can Mistral’s New OCR Model Really Beat the Competition? A Deep Dive

Mistral AI’s newly launched OCR API claims to deliver world‑class document understanding with multilingual support, high speed, and self‑hosting options, and benchmark tests show it outperforms Azure OCR and Google Doc AI, yet independent evaluations reveal limitations on complex tables and legal forms, prompting a balanced assessment of its readiness for enterprise use.

AI modelBenchmarkMistral AI
0 likes · 7 min read
Can Mistral’s New OCR Model Really Beat the Competition? A Deep Dive
DataFunSummit
DataFunSummit
Feb 21, 2025 · Artificial Intelligence

Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects

This article explores multimodal Retrieval‑Augmented Generation (RAG), detailing five core topics—including semantic extraction, visual‑language models, scaling strategies, technical roadmap choices, and a Q&A—while presenting three implementation pathways, performance evaluations, and future directions for AI‑driven document understanding.

Multimodal AIRAGTensor Retrieval
0 likes · 11 min read
Multimodal Retrieval‑Augmented Generation (RAG): Implementation Paths and Future Prospects
Baidu Geek Talk
Baidu Geek Talk
Jan 6, 2025 · Information Security

MarkupLM-based Detection of Malicious Content Scraping

The article presents a MarkupLM‑based approach that enriches BERT with XPath embeddings to jointly model webpage text and structure, enabling site‑level detection of malicious content‑scraping pages that bypass traditional rule‑based filters and demonstrating the critical role of structural cues in improving spam classification accuracy.

MarkupLMXPath embeddingcontent scraping detection
0 likes · 16 min read
MarkupLM-based Detection of Malicious Content Scraping
NewBeeNLP
NewBeeNLP
Jan 2, 2025 · Artificial Intelligence

Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions

This article examines the implementation paths and future prospects of multimodal Retrieval‑Augmented Generation, covering semantic extraction, transformer‑based OCR, visual language models, scaling challenges, tensor indexing, and practical evaluations with tools like Infinity and ColPali.

AI RetrievalInfinity DatabaseMultimodal RAG
0 likes · 12 min read
Unlocking Multimodal RAG: From Semantic Extraction to Scalable VLM Solutions
360 Tech Engineering
360 Tech Engineering
Nov 15, 2024 · Artificial Intelligence

Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)

At the 2024 Global Machine Learning Conference in Beijing, 360 AI Research Institute showcased cutting‑edge multimodal large‑model research, fine‑grained open‑world object detection, and document understanding technologies, highlighting open‑source releases, real‑world deployments, and competitive achievements in AI competitions.

AI researchKnowledge GraphMultimodal AI
0 likes · 7 min read
Advances in Multimodal Large Models and Document Understanding Presented at the 2024 Global Machine Learning Conference (Beijing)
Sohu Tech Products
Sohu Tech Products
Nov 6, 2024 · Artificial Intelligence

RAG2.0 Engine Design Challenges and Implementation

The talk outlines RAG2.0’s design challenges—low vector recall, complex documents, semantic gaps—and presents a two‑stage architecture using deep multimodal understanding and knowledge‑graph‑enhanced retrieval, detailing advanced chunking, multi‑index and multi‑path retrieval, efficient sorting models like ColBERT, and future multi‑modal and memory‑augmented agent directions.

ColBERTDelayed InteractionEnterprise AI
0 likes · 23 min read
RAG2.0 Engine Design Challenges and Implementation
360 Tech Engineering
360 Tech Engineering
Jul 3, 2024 · Artificial Intelligence

360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios

The 360LayoutAnalysis project from 360 AI Lab releases lightweight, yolov8‑based layout analysis models covering Chinese and English papers, Chinese research reports, and a general document scenario, providing fast inference, paragraph‑level detection, and open‑source code and weights for flexible document‑understanding pipelines.

AI modelLayout AnalysisYOLOv8
0 likes · 9 min read
360LayoutAnalysis: Open‑Source Lightweight Document Layout Analysis Models for Multiple Scenarios
DataFunSummit
DataFunSummit
Sep 5, 2023 · Artificial Intelligence

Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications

This article presents a comprehensive overview of document intelligence, covering its background, the evolution of related technologies, large‑model approaches such as multimodal pre‑training and domain‑specific models, and concrete enterprise use cases across various business functions.

Document IntelligenceEnterprise AIMultimodal AI
0 likes · 14 min read
Document Intelligence: Background, Technology Stack, Large‑Model Advances, and Enterprise Applications
AntTech
AntTech
Aug 25, 2023 · Artificial Intelligence

LayoutGCN: A Lightweight Graph Convolutional Network for Visually Rich Document Understanding

LayoutGCN is a lightweight, graph‑based framework that jointly encodes text, layout, and image features of visually rich documents, achieving competitive performance on multiple downstream tasks while drastically reducing model size and computational cost, making it suitable for edge deployment.

Graph Neural NetworkLayoutGCNdocument understanding
0 likes · 24 min read
LayoutGCN: A Lightweight Graph Convolutional Network for Visually Rich Document Understanding
DataFunSummit
DataFunSummit
Apr 7, 2023 · Artificial Intelligence

Comprehensive Overview of OCR: Types, Models, Pre‑training Techniques, and DIY Pipelines on ModelScope

This article provides a detailed introduction to OCR technology, covering its fundamental concepts, major categories (document, scene, and handwritten OCR), typical processing pipelines, a suite of open‑source models on ModelScope—including detection, recognition, and table OCR—and recent multimodal pre‑training methods such as VLDoc and VLPT.

ModelScopeOCRTable OCR
0 likes · 15 min read
Comprehensive Overview of OCR: Types, Models, Pre‑training Techniques, and DIY Pipelines on ModelScope
AntTech
AntTech
Jun 15, 2022 · Artificial Intelligence

XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM introduces a layout‑aware multimodal network that improves visually‑rich document understanding by augmenting XY‑Cut for robust reading order generation and employing a Dilated Conditional Position Encoding to handle variable‑length inputs, achieving state‑of‑the‑art performance on XFUN and FUNSD datasets.

Vision TransformerXYCutdocument understanding
0 likes · 10 min read
XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding