Tagged articles

PDF parsing

7 articles

Jun 2, 2026 · Fundamentals

Lightning‑Fast Open‑Source Local PDF Parser: LiteParse Processes 400‑Page PDFs in 1 Second

LiteParse, an open‑source Rust‑based local PDF parser from the LlamaIndex team, extracts text from a 400‑page PDF in about one second and offers multi‑language bindings, flexible OCR, bounding‑box output, and Agent Skill integration; its limitations are only basic table handling and limited support for complex layouts.

Agent Skill · LiteParse · Local processing

0 likes · 9 min read

AI Engineer Programming

May 9, 2026 · Artificial Intelligence

Why PDF Parsing Is Hard for RAG and Which Mainstream Solutions Work

The article examines the intrinsic challenges of extracting structured text from PDFs for Retrieval‑Augmented Generation—such as missing reading order, table reconstruction, font encoding, and scanned images—and compares lightweight libraries, AI‑enhanced frameworks, commercial APIs, and visual language models as practical solutions.

AI frameworks · OCR · PDF parsing

0 likes · 23 min read

Data STUDIO

Apr 9, 2026 · Artificial Intelligence

Two Weeks of RAG Troubles: How Bad PDF Parsing Made My LLM Look Stupid

After two weeks of failed RAG queries caused by fragmented tables, multi‑column layouts, and poor OCR, the author switched from open‑source PDF parsers to the commercial TextIn xParse engine, boosting retrieval accuracy from under 30% to over 95% and sharing practical integration tips.

AI · LangChain · PDF parsing

0 likes · 12 min read

Fun with Large Models

Nov 30, 2025 · Artificial Intelligence

Multimodal RAG with LangChain: PDF Parsing, Chunking, and Citation Guide

This article walks through building a LangChain‑based multimodal RAG system that parses PDFs (both native and scanned), splits them into semantic chunks, stores embeddings in a vector database, and generates answers with precise source citations, complete with code samples and API integration.

FastAPI · LangChain · Multimodal AI

0 likes · 20 min read

Architect

May 23, 2025 · Artificial Intelligence

How We Won the RAG Challenge: Multi‑Router & Dynamic Knowledge Base Techniques Revealed

This article details the end‑to‑end design, parsing tricks, vector database setup, retrieval strategies, prompt engineering, and LLM reranking that powered the winning solution in a company‑annual‑report question‑answering competition.

FAISS · LLM · PDF parsing

0 likes · 37 min read

Lobster Programming

Nov 1, 2024 · Backend Development

How to Parse PDFs and Extract Metadata with Apache Tika and Spring Boot

This guide explains Apache Tika's document parsing capabilities, shows how to download and run the Tika app, demonstrates extracting text and metadata from a PDF, and provides step‑by‑step instructions for integrating Tika into a Spring Boot project with full code examples.

Apache Tika · Document processing · Java

0 likes · 7 min read

JD Tech

Jun 7, 2024 · Artificial Intelligence

Automated Test Case Generation Using LangChain, Vector Databases, and Large Language Models

This article presents a practical approach to automatically generate software test cases by leveraging LangChain, PDF parsing, vector‑database retrieval, and large language models, comparing it with existing tools, detailing implementation steps, code examples, experimental results, and future improvement directions.

LLM · LangChain · PDF parsing

0 likes · 14 min read
