Tagged articles
6 articles
AI Engineer Programming
May 9, 2026 · Artificial Intelligence

Why PDF Parsing Is Hard for RAG and Which Mainstream Solutions Work

The article examines the intrinsic challenges of extracting structured text from PDFs for Retrieval‑Augmented Generation (missing reading order, table reconstruction, font encoding, and scanned images) and compares lightweight libraries, AI‑enhanced frameworks, commercial APIs, and vision‑language models as practical solutions.

AI frameworks · OCR · PDF parsing
0 likes · 23 min read
Data STUDIO
Apr 9, 2026 · Artificial Intelligence

Two Weeks of RAG Troubles: How Bad PDF Parsing Made My LLM Look Stupid

After two weeks of failed RAG queries caused by fragmented tables, multi‑column layouts, and poor OCR, the author switched from open‑source PDF parsers to the commercial TextIn xParse engine, boosting retrieval accuracy from under 30% to over 95% and sharing practical integration tips.

AI · LangChain · PDF parsing
0 likes · 12 min read
Fun with Large Models
Nov 30, 2025 · Artificial Intelligence

Multimodal RAG with LangChain: PDF Parsing, Chunking, and Citation Guide

This article walks through building a LangChain‑based multimodal RAG system that parses PDFs (both native and scanned), splits them into semantic chunks, stores embeddings in a vector database, and generates answers with precise source citations, complete with code samples and API integration.

FastAPI · LangChain · Multimodal AI
0 likes · 20 min read
Lobster Programming
Nov 1, 2024 · Backend Development

How to Parse PDFs and Extract Metadata with Apache Tika and Spring Boot

This guide explains Apache Tika's document parsing capabilities, shows how to download and run the Tika app, demonstrates extracting text and metadata from a PDF, and provides step‑by‑step instructions for integrating Tika into a Spring Boot project with full code examples.

Apache Tika · Document Processing · Java
0 likes · 7 min read
JD Tech
Jun 7, 2024 · Artificial Intelligence

Automated Test Case Generation Using LangChain, Vector Databases, and Large Language Models

This article presents a practical approach to generating software test cases automatically with LangChain, PDF parsing, vector‑database retrieval, and large language models. It compares the approach with existing tools and covers implementation steps, code examples, experimental results, and directions for future improvement.

LLM · LangChain · PDF parsing
0 likes · 14 min read