Tagged articles
23 articles
DataFunTalk
May 4, 2026 · Artificial Intelligence

Engineering and Algorithm Innovations for RAG Engines in Office Applications

This article analyzes the challenges and practical solutions of building a Retrieval‑Augmented Generation (RAG) system for office scenarios, covering background issues, modular architecture, offline and online pipelines, hybrid retrieval, ranking models, knowledge filtering, prompt design, and two‑stage generation techniques.

AI · Document Parsing · Hybrid Retrieval
22 min read
DataFunTalk
Apr 21, 2026 · Artificial Intelligence

Will Multimodal GraphRAG Revolutionize Document Intelligence? A Technical Deep Dive

This article provides a comprehensive technical analysis of multimodal GraphRAG, detailing intelligent document parsing pipelines, multimodal graph construction, retrieval and generation, and the role of knowledge graphs in enhancing chunk relationships, while comparing traditional RAG, GraphRAG, and KG‑QA approaches.

AI · Document Parsing · Knowledge Graph
26 min read
Wu Shixiong's Large Model Academy
Mar 22, 2026 · Artificial Intelligence

How to Overcome MinerU’s Top 9 Limitations for Reliable Document Parsing

This article examines MinerU’s strengths and nine critical shortcomings—such as reading order errors, split tables, merged cells, OCR misrecognition, formula handling, heading hierarchy loss, output inconsistency, hardware limits, and licensing issues—and provides concrete improvement strategies and interview‑ready talking points for engineers.

Document Parsing · Interview Tips · MinerU
12 min read
Wu Shixiong's Large Model Academy
Mar 7, 2026 · Artificial Intelligence

Mastering Offline Document Parsing for RAG: From PDFs to Multimodal Knowledge Bases

This article provides a comprehensive guide to offline document parsing for Retrieval‑Augmented Generation, covering multi‑format extraction, layout analysis, OCR pitfalls, chunking strategies, hierarchical metadata tagging, and how these steps directly affect retrieval accuracy and overall RAG performance.

Document Parsing · RAG · metadata
14 min read
High Availability Architecture
Mar 6, 2026 · Artificial Intelligence

How to Trim Massive JSON Outputs for Real‑World AI Agents

The article explains why raw JSON from document‑parsing APIs overwhelms an AI agent's context window and presents a practical workflow that separates readable Markdown content from metadata, uses prompt engineering, and leverages sandboxed code to keep agents efficient and accurate.

AI agents · Document Parsing · Prompt engineering
11 min read
DataFunTalk
Feb 26, 2026 · Artificial Intelligence

How RAG Can Overcome Large‑Model Pitfalls in Enterprise Knowledge Work

This article explains the challenges large language models face in real‑world applications, introduces Retrieval‑Augmented Generation (RAG) as a solution, and details a modular RAG architecture, its components, and practical techniques for document parsing, query rewriting, hybrid retrieval, ranking, and answer generation in an enterprise setting.

Document Parsing · LLM deployment · RAG
22 min read
Tech Freedom Circle
Jan 5, 2026 · Artificial Intelligence

A Three‑Step Guide to Mastering RAG Semantic‑Loss Interview Questions

RAG (Retrieval‑Augmented Generation) is a hot interview topic, and many candidates stumble on semantic‑loss issues; this article dissects a real JD interview case, identifies three core shortcomings, and presents a three‑step technical solution—structure restoration, semantic splitting, and hybrid retrieval—plus a ready‑to‑use answer template.

AI Interview · Document Parsing · Hybrid Search
25 min read
Java Captain
Jan 3, 2026 · Backend Development

Integrate Apache Tika with Spring Boot for Powerful Document Parsing

This guide shows how to integrate Apache Tika into a Spring Boot application by adding Maven dependencies, configuring a tika-config.xml file, creating a @Configuration class that provides a Tika bean, and using the bean to detect, translate, and parse various document formats.

Apache Tika · Backend Development · Document Parsing
5 min read
phodal
Nov 27, 2025 · Artificial Intelligence

How AutoDev’s Agentic RAG Turns Docs into a Programmable Knowledge Base

This article explains how AutoDev builds an Agentic Retrieval‑Augmented Generation system with a Document Query Language (DocQL) that lets LLM agents navigate hierarchical code and documentation structures using JSONPath‑like queries, detailing implementation, multi‑level keyword expansion, and experimental findings.

AI · Agentic RAG · DocQL
12 min read
Tencent Technical Engineering
Sep 12, 2025 · Artificial Intelligence

How POINTS-Reader Achieves State‑of‑the‑Art PDF Extraction Without Teacher Models

The POINTS-Reader paper, accepted at EMNLP 2025, introduces a two‑stage, fully automated data generation pipeline that enables a lightweight vision‑language model to extract text, tables, and LaTeX formulas from diverse PDF layouts with superior performance and high throughput, all without relying on costly teacher‑model distillation.

AI · Document Parsing · OCR
12 min read
Xiaohongshu Tech REDtech
Jul 31, 2025 · Artificial Intelligence

How dots.ocr Achieves SOTA Multilingual Document Parsing with a 1.7B VLM

dots.ocr is a 1.7-billion-parameter multilingual document-parsing model that unifies layout detection and content recognition within a single vision-language model, delivering state-of-the-art performance across text, tables, formulas, and reading order while remaining efficient and extensible for future multimodal AI research.

AI · Benchmark · Document Parsing
10 min read
Architect's Guide
Jan 23, 2025 · Backend Development

Integrating Apache Tika with Spring Boot for Document Parsing

This article demonstrates how to add Apache Tika dependencies to a Spring Boot project, configure tika-config.xml, create a Java configuration class, and use the injected Tika bean to detect, translate, and parse various document formats such as PDF, PPT, and XLS.

Apache Tika · Configuration · Document Parsing
6 min read
Spring Full-Stack Practical Cases
Oct 31, 2024 · Backend Development

Master Document Parsing in Spring Boot 3 with Apache Tika: Code Samples & Tips

This article introduces Apache Tika for document parsing, outlines its key advantages, and provides step‑by‑step Spring Boot 3 examples—including facade parsing, text, PDF, auto‑detect, HTML conversion, custom configuration, and file‑upload integration—complete with code snippets and output screenshots.

Apache Tika · AutoDetectParser · Document Parsing
10 min read
Alibaba Cloud Big Data AI Platform
Sep 2, 2024 · Artificial Intelligence

Turning PDFs and Word Docs into Searchable Knowledge for RAG Systems

This article explains why generic large language models struggle with domain‑specific data, introduces Retrieval‑Augmented Generation (RAG) as a solution, compares Word and PDF formats, outlines document‑parsing pipelines, reviews open‑source PDF tools, and presents Alibaba Cloud's rule‑based parsing architecture with performance results.

AI · Document Parsing · LLM
13 min read
Full-Stack Cultivation Path
Aug 8, 2024 · Artificial Intelligence

MegaParse: A Precision Document Parser Built for LLMs

MegaParse is an open‑source document parser that transforms PDFs, Word, PPT, Excel and CSV files into LLM‑friendly formats, preserving full information, boosting processing efficiency, and enabling deeper semantic analysis, with quick‑start installation steps and a roadmap for future features.

AI tools · Document Parsing · LLM
4 min read
AI Large Model Application Practice
Jul 4, 2024 · Artificial Intelligence

Mastering Multimodal RAG: From PDF Parsing to Advanced Query Rewriting

This article explains how to handle complex multimodal PDFs in RAG systems, outlines extraction, indexing, and multimodal model integration, details four query‑rewriting strategies (HyDE, stepwise, sub‑question, backward), and presents key evaluation metrics and tools for assessing RAG performance.

Document Parsing · Query Rewriting · RAG
12 min read
Java High-Performance Architecture
Jun 7, 2024 · Backend Development

How to Parse Documents in Spring Boot with Apache Tika

Learn how to integrate Apache Tika into a Spring Boot application to parse a wide range of document formats, including the necessary Maven dependencies, XML configuration, custom configuration class, and usage examples, enabling efficient content extraction and processing within your Java backend.

Apache Tika · Backend Development · Document Parsing
5 min read
Java Tech Enthusiast
Mar 3, 2024 · Backend Development

Integrating Apache Tika with Spring Boot for Document Parsing

This guide demonstrates how to add Apache Tika to a Spring Boot project by declaring the tika‑bom, core and parser dependencies, providing a custom tika‑config.xml, creating a @Configuration class that builds a Tika bean, and then injecting the bean to detect, parse, or translate documents.

Apache Tika · Configuration · Document Parsing
5 min read
DataFunSummit
Jan 23, 2023 · Artificial Intelligence

Intelligent Document Processing: Core Technologies, Techniques, and Practical Insights

This article explains intelligent document processing (IDP) by describing its core components—OCR, document parsing, and information extraction—detailing various OCR and text‑detection algorithms, discussing document layout reconstruction, table parsing, domain‑specific model adaptation, system optimization, and productization challenges, and outlining future research directions.

AI · Document Parsing · Information Extraction
27 min read