Two Weeks of RAG Troubles: How Bad PDF Parsing Made My LLM Look Stupid

After two weeks of failed RAG queries caused by fragmented tables, multi‑column layouts, and poor OCR, the author switched from open‑source PDF parsers to the commercial TextIn xParse engine, boosting retrieval accuracy from under 30% to over 95% and sharing practical integration tips.


1. Document Parsing Is Harder Than You Think

The author expected "document parsing" to be a simple text extraction task, but real‑world PDFs proved far more complex.

Case 1: Multi‑page tables

A four‑page API error‑code table was split into four separate fragments by traditional OCR/PDF libraries (e.g., PyPDF2, pdfplumber), losing the header and making LLM answers guesswork.

Case 2: Multi‑column layout

Technical white‑papers with double‑column designs were read top‑to‑bottom, causing the parser to jump between columns and break semantic flow.

Case 3: Table‑to‑plain‑text conversion

Some libraries flattened tables into plain text, losing column boundaries; a product table with IDs, names, prices, and stock levels needed regexes to re‑associate values with columns, and merged cells made the parser crash outright.

Other real‑world pain points piled up quickly:

Skewed scans and shadows

Borderless tables and merged cells

Headers, footers, watermarks, and handwritten notes

Mixed text and images, formulas, and charts

Open‑source solutions may suffice for personal projects, but production environments encounter one pitfall after another.

2. After Surveying Options, I Tried a Commercial Solution

Initially the author tried to stick with open‑source tools: PyPDF2 (almost no table support), PaddleOCR (good recognition quality, but heavy and still breaks multi‑page tables), and emerging projects like marker/surya (still buggy). A recommendation in a technical chat group pointed to TextIn xParse, a commercial document‑parsing engine from Hehe Information.

A comparison image showed the same multi‑page complex table rendered as four broken text blocks by open‑source tools, while TextIn produced a complete Markdown table with proper row/column alignment and preserved merged cells.

3. What TextIn xParse Does Differently

1. Intelligent stitching of multi‑page tables

TextIn detects that a table spans multiple pages, automatically concatenates the fragments, aligns headers, and merges rows. Output can be Markdown or JSON, ready for downstream processing.
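The stitching idea can be sketched in a few lines. This is a toy illustration of the concept, not TextIn's actual implementation: a continuation fragment is detected by its repeated header row, and only its data rows are appended.

```python
def stitch_tables(fragments):
    """Merge Markdown table fragments that share the same header row.

    Each fragment is a list of lines: header, separator, then data rows.
    A continuation fragment (same header as the previous one) contributes
    only its data rows, so a table split across pages becomes one table.
    """
    merged = []
    last_header = None
    for frag in fragments:
        lines = [l for l in frag.strip().splitlines() if l.strip()]
        header, sep, rows = lines[0], lines[1], lines[2:]
        if header == last_header:
            merged.extend(rows)                    # continuation: data rows only
        else:
            merged.extend([header, sep] + rows)    # new table: keep header
            last_header = header
    return "\n".join(merged)

page1 = "| Code | Meaning |\n| --- | --- |\n| 400 | Bad request |"
page2 = "| Code | Meaning |\n| --- | --- |\n| 401 | Unauthorized |"
print(stitch_tables([page1, page2]))
```

A production engine also has to handle rotated pages, shifted column widths, and rows split mid-cell, which is exactly the hard part being paid for.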

2. Restoration of complex table structures

Borderless tables, nested tables, and merged cells (all common in business manuals) are reconstructed by understanding the table's two‑dimensional layout instead of emitting a line‑by‑line text dump.

3. Document pre‑processing correction

For scanned or phone‑captured PDFs, TextIn automatically corrects skew and removes shadows and noise before OCR, making it friendly to non‑standard PDFs.

4. Handwritten & formula recognition

Handwritten annotations and mathematical formulas are recognised more accurately than with open‑source alternatives.

5. Standardised output ready for downstream

TextIn can emit Markdown, JSON, or HTML. The author chose Markdown, preserving headings, tables, and lists, allowing zero‑clean‑up ingestion into LangChain or RAGFlow.

4. Code Demo: TextIn + LangChain for RAG

The following minimal code integrates TextIn with LangChain.

from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="technical_manual.pdf",
    api_key="your_api_key"
)

docs = loader.load()  # Directly get structured Markdown
# Then normal RAG flow...

After adding a vector store, retrieval accuracy rose from below 30% to over 95%.

5. Advanced RAG Tips (Practical Tricks)

Tip 1: Metadata filtering – retrieve with conditions

Keep metadata (source file, chapter, page, table ID) during parsing and filter on it at query time.

# `docs` comes from the XParseLoader above; `vectorstore` is the index built from them
for doc in docs:
    doc.metadata["source"] = "2026_Q2_report.pdf"
    doc.metadata["table_id"] = "table_3_2"

retriever = vectorstore.as_retriever(
    search_kwargs={"filter": {"table_id": "table_3_2"}}
)

Effect: querying “East China Q2 failure rate” only searches table 3‑2.

Tip 2: Multi‑stage routing – summary then detail

For very long manuals, first retrieve relevant chapters via their summary vectors, then perform fine‑grained search inside the selected chapter (e.g., using ParentDocumentRetriever or separate summary and chunk stores).
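The routing pattern can be shown in miniature. In a real system the summaries and chunks would live in two separate vector stores; here simple token overlap stands in for embedding similarity, and the chapter data is invented for illustration.

```python
# Toy two-stage retrieval: route by chapter summary, then search chunks.

def score(query, text):
    """Crude relevance score: count of shared lowercase tokens."""
    return len(set(query.lower().split()) & set(text.lower().split()))

chapters = {
    "ch3": {"summary": "regional failure rates quarterly statistics",
            "chunks": ["East China Q2 failure rate 3.2%",
                       "North China Q2 failure rate 1.8%"]},
    "ch7": {"summary": "installation wiring safety instructions",
            "chunks": ["Ground the chassis before powering on"]},
}

def two_stage(query):
    # Stage 1: pick the chapter whose summary best matches the query.
    best = max(chapters, key=lambda c: score(query, chapters[c]["summary"]))
    # Stage 2: fine-grained search inside the selected chapter only.
    return max(chapters[best]["chunks"], key=lambda ch: score(query, ch))

print(two_stage("East China Q2 failure rate"))
```

The payoff is that stage 2 never wastes retrieval budget on irrelevant chapters, which matters once a manual runs to hundreds of pages.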

Tip 3: Structured table retrieval – let LLM generate SQL

After TextIn outputs a Markdown table, store it as CSV/JSON in SQLite or DuckDB. Let the LLM produce a SQL statement for aggregation queries, execute it, and return precise results.

Insert table data into a database.

LLM generates SQL from the user question.

Execute SQL and return the answer.
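The three steps above can be sketched with SQLite from the standard library. The table rows and the SQL string here are illustrative; in practice the SQL would be generated by the LLM from the user's question.

```python
import sqlite3

# Step 1: insert the parsed table data into a database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE failure_rates (region TEXT, quarter TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO failure_rates VALUES (?, ?, ?)",
    [("East China", "Q2", 3.2), ("North China", "Q2", 1.8), ("East China", "Q1", 2.9)],
)

# Step 2: pretend the LLM turned "East China Q2 failure rate" into this SQL.
sql = "SELECT rate FROM failure_rates WHERE region = 'East China' AND quarter = 'Q2'"

# Step 3: execute it and return the precise value, not a fuzzy text match.
rate = conn.execute(sql).fetchone()[0]
print(f"{rate}%")
```

Because the answer is computed from structured data, aggregations ("average Q2 failure rate across regions") become a single query instead of a retrieval gamble.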

Tip 4: Rerank for higher top‑K precision

Apply a rerank model (e.g., Cohere Rerank, BGE‑reranker) to the top‑K retrieved chunks to improve relevance.
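The rerank step slots in after retrieval and before the LLM. A real deployment would call a cross-encoder such as BGE‑reranker; in this sketch a token‑overlap scorer stands in for the model so the pattern itself is visible.

```python
# Toy rerank: re-score retrieved candidates against the query and keep top_n.

def rerank(query, candidates, top_n=2):
    def model_score(q, c):
        # Stand-in for a cross-encoder's relevance score.
        return len(set(q.lower().split()) & set(c.lower().split()))
    return sorted(candidates, key=lambda c: model_score(query, c), reverse=True)[:top_n]

hits = [
    "Shipping policy for overseas orders",        # retrieved, but irrelevant
    "East China Q2 failure rate is 3.2%",
    "Q2 failure rates by region, East China included",
]
print(rerank("East China Q2 failure rate", hits))
```

The design point: the retriever optimizes recall over the whole corpus cheaply, while the reranker spends more compute per candidate on just the top‑K, so precision improves without re-indexing anything.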

6. Open‑source vs Commercial: A Real‑World Choice

Open‑source fits personal learning, clean PDFs, and when you have time to debug. Commercial solutions like TextIn suit production, messy documents (cross‑page tables, scans, handwritten notes), and when you prefer to focus on business logic rather than data‑cleaning.

Open‑source is not truly free; debugging time and error‑induced business loss become hidden costs.

7. Practical Advice for Fellow Engineers

Dump and inspect parsed results; never trust a parser blindly.

Prefer tools that output Markdown to preserve structure for LLMs.

Keep table data in its two‑dimensional form; pure text makes precise row/column queries impossible.

Align chunking strategy with document hierarchy; MarkdownHeaderTextSplitter works smoothly with Markdown output.

In production, hand the parsing problem to specialised services to maximise ROI.
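The chunking advice above can be made concrete. This pure‑Python sketch mirrors what LangChain's MarkdownHeaderTextSplitter does: split on headings and carry the heading into each chunk's metadata, so chunk boundaries follow the document hierarchy instead of an arbitrary character count.

```python
# Minimal header-aligned chunking sketch (illustrative, not LangChain's code).

def split_by_headers(markdown):
    chunks, header, body = [], None, []
    for line in markdown.splitlines():
        if line.startswith("## "):
            if body:  # close out the previous section as one chunk
                chunks.append({"metadata": {"header": header},
                               "text": "\n".join(body).strip()})
            header, body = line[3:], []
        else:
            body.append(line)
    if body:
        chunks.append({"metadata": {"header": header},
                       "text": "\n".join(body).strip()})
    return chunks

doc = ("## Error Codes\n| Code | Meaning |\n| --- | --- |\n"
       "| 400 | Bad request |\n## Setup\nInstall the SDK first.")
for c in split_by_headers(doc):
    print(c["metadata"]["header"], "->", len(c["text"]), "chars")
```

Note that the error‑code table stays inside one chunk under its own heading, which is exactly what naive fixed‑size splitting would have destroyed.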

Conclusion

The RAG system now answers “Table 3‑2 East China Q2 failure rate” with “3.2 %” and a citation, simply because the underlying document parser was replaced.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: AI, LangChain, RAG, PDF parsing, TextIn
Written by

Data STUDIO

Data STUDIO focuses on original data science articles, centered on Python, covering machine learning, data analysis, visualization, MySQL, and other practical topics and project case studies.
