Two Weeks of RAG Troubles: How Bad PDF Parsing Made My LLM Look Stupid
After two weeks of failed RAG queries caused by fragmented tables, multi‑column layouts, and poor OCR, the author switched from open‑source PDF parsers to the commercial TextIn xParse engine, boosting retrieval accuracy from under 30% to over 95% and sharing practical integration tips.
1. Document Parsing Is Harder Than You Think
The author expected "document parsing" to be a simple text extraction task, but real‑world PDFs proved far more complex.
Case 1: Multi‑page tables
A four‑page API error‑code table was split into four separate fragments by traditional OCR/PDF libraries (e.g., PyPDF2, pdfplumber), losing the header and making LLM answers guesswork.
Case 2: Multi‑column layout
Technical white‑papers with double‑column designs were read top‑to‑bottom, causing the parser to jump between columns and break semantic flow.
Case 3: Table‑to‑plain‑text conversion
Some libraries turned tables into plain text, losing column boundaries; for example, a product table with IDs, names, prices, and stock required regex to map values, and merged cells caused crashes.
Other real-world headaches pile up quickly:
- Skewed scans and shadows
- Borderless tables and merged cells
- Headers/footers, watermarks, handwritten notes
- Mixed text and images, formulas, charts
Open‑source solutions may suffice for personal projects, but production environments encounter one pitfall after another.
2. After Surveying Options, I Tried a Commercial Solution
Initially the author tried to rely on open‑source tools: PyPDF2 (almost no table support), PaddleOCR (good quality but heavy, and it still breaks multi‑page tables), and emerging projects like marker / surya (still buggy). A recommendation in a technical chat pointed to TextIn xParse, a commercial document‑parsing engine from Hehe Information.
A comparison image showed the same multi‑page complex table rendered as four broken text blocks by open‑source tools, while TextIn produced a complete Markdown table with proper row/column alignment and preserved merged cells.
3. What TextIn xParse Does Differently
1. Intelligent stitching of multi‑page tables
TextIn detects that a table spans multiple pages, automatically concatenates the fragments, aligns headers, and merges rows. Output can be Markdown or JSON, ready for downstream processing.
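The stitching idea can be approximated in plain Python. The fragment format and both helper functions below are illustrative sketches, not TextIn's actual API: they assume each page yields a list of rows and that continuation pages may repeat the header row.

```python
def stitch_table_fragments(fragments):
    """Merge per-page table fragments into one logical table.

    Each fragment is a list of rows (lists of cell strings); a header
    row repeated on continuation pages is dropped before merging.
    """
    header = fragments[0][0]
    merged = [header]
    for frag in fragments:
        body = frag[1:] if frag[0] == header else frag
        merged.extend(body)
    return merged


def to_markdown(table):
    """Render a row list as a Markdown table for LLM-friendly ingestion."""
    header, *rows = table
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join("---" for _ in header) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in rows]
    return "\n".join(lines)


page1 = [["code", "meaning"], ["400", "bad request"]]
page2 = [["code", "meaning"], ["500", "server error"]]
full = stitch_table_fragments([page1, page2])
print(to_markdown(full))
```

A real engine also has to align columns that shift between pages and re-merge cells split by the page break, which is where the hard work lives.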
2. Restoration of complex table structures
Borderless tables, nested tables, and merged cells, all common in business manuals, are reconstructed by understanding the two‑dimensional layout instead of emitting a simple line‑by‑line dump.
3. Document pre‑processing correction
For scanned or phone‑captured PDFs, TextIn automatically corrects skew, removes shadows and noise before OCR, making it friendly to non‑standard PDFs.
4. Handwritten & formula recognition
Handwritten annotations and mathematical formulas are recognised more accurately than with open‑source alternatives.
5. Standardised output ready for downstream
TextIn can emit Markdown, JSON, or HTML. The author chose Markdown, preserving headings, tables, and lists, allowing zero‑clean‑up ingestion into LangChain or RAGFlow.
4. Code Demo: TextIn + LangChain for RAG
The following minimal code integrates TextIn with LangChain.

```python
from langchain_xparse import XParseLoader

loader = XParseLoader(
    file_path="technical_manual.pdf",
    api_key="your_api_key"
)
docs = loader.load()  # Directly get structured Markdown
# Then the normal RAG flow...
```

After adding a vector store, retrieval accuracy rose from below 30% to over 95%.
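The "normal RAG flow" that follows is embed, index, retrieve. A dependency-free sketch of that loop, with bag-of-words vectors standing in for real embeddings (a production system would use an embedding model and a vector store, not this toy scorer):

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; real pipelines use a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, chunks, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


chunks = [
    "| code | meaning |\n| 400 | bad request |",
    "Installation requires Python 3.10 or later.",
]
print(retrieve("what does error code 400 mean", chunks))
```

The point of clean Markdown input is visible even here: the table chunk keeps its row/column tokens together, so a code-lookup query lands on the right chunk.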
5. Advanced RAG Tips (Practical Tricks)
Tip 1: Metadata filtering – retrieve with conditions
Keep metadata (source file, chapter, page, table ID) during parsing and filter on it at query time.
```python
for doc in docs:
    doc.metadata["source"] = "2026_Q2_report.pdf"
    doc.metadata["table_id"] = "table_3_2"

retriever = vectorstore.as_retriever(
    search_kwargs={"filter": {"table_id": "table_3_2"}}
)
```

Effect: a query for “East China Q2 failure rate” searches only table 3‑2.
Tip 2: Multi‑stage routing – summary then detail
For very long manuals, first retrieve relevant chapters via their summary vectors, then perform fine‑grained search inside the selected chapter (e.g., using ParentDocumentRetriever or separate summary and chunk stores).
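The two stages can be sketched without any framework. Everything below is illustrative: the chapter data is invented, and the token-overlap scorer stands in for the summary-vector and chunk-vector similarity searches a real system would run:

```python
def overlap(query, text):
    """Crude relevance score: count of shared lowercase tokens.
    A real system would use embedding similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def route_then_search(query, chapters):
    # Stage 1: pick the chapter whose summary best matches the query.
    best = max(chapters, key=lambda ch: overlap(query, ch["summary"]))
    # Stage 2: fine-grained search only inside that chapter's chunks.
    return max(best["chunks"], key=lambda c: overlap(query, c))


chapters = [
    {"summary": "network configuration and firewall rules",
     "chunks": ["open port 443", "set firewall default deny"]},
    {"summary": "error codes and troubleshooting",
     "chunks": ["code 400 means bad request", "code 500 means server error"]},
]
print(route_then_search("what does error code 500 mean", chapters))
```

Routing first keeps the fine-grained index small per query, which matters when a manual has thousands of chunks but only a few chapters are ever relevant at once.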
Tip 3: Structured table retrieval – let LLM generate SQL
After TextIn outputs a Markdown table, store it as CSV/JSON in SQLite or DuckDB. Let the LLM produce a SQL statement for aggregation queries, execute it, and return precise results.
1. Insert table data into a database.
2. LLM generates SQL from the user question.
3. Execute SQL and return the answer.
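The three steps above can be sketched with the standard library's SQLite. The table contents are illustrative, and the hardcoded `llm_sql` string stands in for the statement an LLM would generate from the user's question:

```python
import sqlite3

# Step 1: load the parsed table into a database. In the real pipeline
# these rows come from TextIn's CSV/JSON output.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE failures (region TEXT, quarter TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO failures VALUES (?, ?, ?)",
    [("East China", "Q1", 2.8), ("East China", "Q2", 3.2),
     ("North China", "Q2", 4.1)],
)

# Step 2: stand-in for the LLM-generated SQL for the question
# "East China Q2 failure rate".
llm_sql = ("SELECT rate FROM failures "
           "WHERE region = 'East China' AND quarter = 'Q2'")

# Step 3: execute and return a precise, non-hallucinated number.
print(conn.execute(llm_sql).fetchone()[0])
```

Because the aggregation runs in SQL rather than in the LLM's head, numeric answers are exact and auditable.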
Tip 4: Rerank for higher top‑K precision
Apply a rerank model (e.g., Cohere Rerank, BGE‑reranker) to the top‑K retrieved chunks to improve relevance.
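A reranker re-scores each (query, chunk) pair with a stronger model than the first-stage retriever. The sketch below shows only the wiring; `token_overlap` is a deliberately crude stand-in for a cross-encoder such as BGE-reranker, and the candidate chunks are invented:

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Re-order first-stage candidates using a stronger scorer."""
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:top_n]


def token_overlap(query, text):
    """Placeholder scorer: shared lowercase tokens between query and chunk."""
    return len(set(query.lower().split()) & set(text.lower().split()))


hits = [
    "general troubleshooting checklist",
    "East China Q2 failure rate is 3.2 percent",
    "Q2 revenue summary for all regions",
]
print(rerank("East China Q2 failure rate", hits, token_overlap, top_n=1))
```

The usual pattern is to over-retrieve (say top 20) cheaply, then let the expensive reranker pick the final top 3 to 5 chunks for the prompt.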
6. Open‑source vs Commercial: A Real‑World Choice
Open‑source fits personal learning, clean PDFs, and when you have time to debug. Commercial solutions like TextIn suit production, messy documents (cross‑page tables, scans, handwritten notes), and when you prefer to focus on business logic rather than data‑cleaning.
Open‑source is not truly free; debugging time and error‑induced business loss become hidden costs.
7. Practical Advice for Fellow Engineers
- Dump and inspect parsed results; never trust a parser blindly.
- Prefer tools that output Markdown to preserve structure for LLMs.
- Keep table data in its two‑dimensional form; pure text makes precise row/column queries impossible.
- Align the chunking strategy with the document hierarchy; MarkdownHeaderTextSplitter works smoothly with Markdown output.
- In production, hand the parsing problem to specialised services to maximise ROI.
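The chunking advice above can be made concrete. The function below is a minimal sketch of header-aware splitting in the spirit of LangChain's MarkdownHeaderTextSplitter (the real splitter tracks nested header levels as metadata; this version keeps only the nearest heading):

```python
def split_by_headers(markdown_text):
    """Chunk Markdown at heading lines, attaching the heading to each chunk."""
    chunks, heading, body = [], None, []
    for line in markdown_text.splitlines():
        if line.startswith("#"):
            if body:  # flush the section accumulated so far
                chunks.append({"heading": heading,
                               "text": "\n".join(body).strip()})
            heading, body = line.lstrip("# ").strip(), []
        else:
            body.append(line)
    if body:  # flush the final section
        chunks.append({"heading": heading, "text": "\n".join(body).strip()})
    return chunks


doc = "# Install\npip install xparse\n## Usage\ncall the loader"
for chunk in split_by_headers(doc):
    print(chunk)
```

Splitting at headings keeps each chunk semantically whole, so a retrieved chunk carries its own context instead of a slice cut mid-sentence by a fixed character window.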
Conclusion
The RAG system now answers “Table 3‑2 East China Q2 failure rate” with “3.2 %” and a citation, simply because the underlying document parser was replaced.
