Turning PDFs and Word Docs into Searchable Knowledge for RAG Systems
This article explains why generic large language models struggle with domain‑specific data, introduces Retrieval‑Augmented Generation (RAG) as a solution, compares Word and PDF formats, outlines document‑parsing pipelines, reviews open‑source PDF tools, and presents Alibaba Cloud's rule‑based parsing architecture with performance results.
Background
General‑purpose large language models (LLMs) handle general knowledge questions well, but they often fail on domain‑specific queries because the specialized data is not publicly available and was never part of their training corpus. Fine‑tuning an LLM on private data is costly, so Retrieval‑Augmented Generation (RAG) instead retrieves the relevant private data at query time and adds it to the user query before passing it to a generic LLM, which improves answer quality.
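To make the retrieve‑then‑augment idea concrete, here is a minimal sketch. It assumes a toy word‑overlap ranking in place of a real vector search, and the names `retrieve`, `build_prompt`, and `knowledge_base` are hypothetical, not taken from any particular RAG framework.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(query, chunks, top_k=2):
    """Rank chunks by shared tokens with the query (toy stand-in for vector search)."""
    query_tokens = tokenize(query)
    ranked = sorted(chunks, key=lambda c: len(query_tokens & tokenize(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(query, chunks, top_k=2):
    """Prepend the retrieved chunks to the user question before calling an LLM."""
    context = "\n".join(f"- {c}" for c in retrieve(query, chunks, top_k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


# Hypothetical private knowledge base, already split into chunks.
knowledge_base = [
    "The X100 pump must be serviced every 500 operating hours.",
    "Office hours are 9:00 to 18:00 on weekdays.",
    "Replace the X100 pump seal if pressure drops below 2 bar.",
]
question = "How often should the X100 pump be serviced?"
prompt = build_prompt(question, knowledge_base)
print(prompt)
```

The resulting prompt would then be sent to a generic LLM; a production system would typically replace the overlap score with embedding similarity.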
Why Document Parsing Matters
RAG requires a searchable knowledge base. Most professional documents are stored as unstructured PDFs or Word files, whose titles, paragraphs, tables, and images are easy for humans to read but hard for programs to interpret. Converting these files into semi‑structured formats (e.g., markdown or HTML) enables slicing, vectorisation, and efficient retrieval.
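As an illustration of why a semi‑structured format helps slicing, the sketch below splits markdown into chunks at heading boundaries and keeps the heading path with each chunk. The function name `split_markdown_by_heading` is hypothetical, and the sketch ignores edge cases such as `#` lines inside fenced code.

```python
import re

# ATX-style markdown heading: 1-6 '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown_by_heading(markdown):
    """Split markdown into chunks, each tagged with its heading path."""
    chunks, path, buffer = [], [], []

    def flush():
        text = "\n".join(buffer).strip()
        if text:
            chunks.append({"path": " > ".join(path), "text": text})
        buffer.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()                      # close the chunk under the previous heading
            level = len(match.group(1))
            del path[level - 1:]         # drop headings at the same or deeper level
            path.append(match.group(2))
        else:
            buffer.append(line)
    flush()
    return chunks


sample = """# Manual
Intro text.
## Safety
Wear gloves.
## Maintenance
Service every 500 hours.
"""
for chunk in split_markdown_by_heading(sample):
    print(chunk)
```

Keeping the heading path alongside each chunk gives the retriever some context about where a passage came from, which plain fixed‑size slicing would lose.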
Word vs. PDF
Word (DOCX) is edit‑oriented. It follows the Office Open XML standard and stores content as XML with structural concepts such as headings, paragraphs, and tables, but it lacks explicit page or position information.
PDF is read‑oriented, stores drawing commands that fix the visual layout on a page, preserving exact positions but lacking structural concepts such as headings or tables.
Examples of DOCX and PDF internal structures are shown in the original code snippets.
DOCX Parsing
A DOCX file is a zip archive containing XML files. The main content resides in word/document.xml. Key tags include the root element (<w:document>), body (<w:body>), paragraph (<w:p>), run (<w:r>), text (<w:t>), and section properties (<w:sectPr>).
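The structure described above can be read with the Python standard library alone. The following sketch builds a tiny in‑memory archive that contains only word/document.xml (an assumption made for the demo; a real DOCX also carries content‑type and relationship files) and extracts the text of each paragraph. The helper name `docx_paragraphs` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(docx_file):
    """Return the text of every <w:p> in word/document.xml, in document order."""
    with zipfile.ZipFile(docx_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is the concatenation of the <w:t> nodes in its runs.
        text = "".join(t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t"))
        paragraphs.append(text)
    return paragraphs


# Demo input: a minimal document.xml wrapped in a zip archive.
DOCUMENT_XML = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    "<w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>"
    "<w:sectPr/></w:body></w:document>"
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)
buffer.seek(0)

paragraphs = docx_paragraphs(buffer)
print(paragraphs)
```

Note that `root.iter` also visits paragraphs nested inside table cells, so a fuller parser would walk the body's direct children to keep tables and paragraphs apart.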
DOC Parsing
Legacy DOC files are OLE compound documents. They lack high‑level structural tags, so the common practice is to convert DOC to DOCX (e.g., via LibreOffice) before parsing.
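One way to script the DOC‑to‑DOCX step is to call LibreOffice in headless mode. The sketch below assumes the `soffice` (or `libreoffice`) executable is on the PATH; the function names and the timeout value are illustrative choices.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(doc_path, out_dir, binary="soffice"):
    """Build the LibreOffice command line that converts one file to DOCX."""
    return [
        binary, "--headless",
        "--convert-to", "docx",
        "--outdir", str(out_dir),
        str(doc_path),
    ]


def convert_doc_to_docx(doc_path, out_dir, timeout=120):
    """Convert a legacy .doc file and return the expected .docx path."""
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        raise FileNotFoundError("LibreOffice executable not found on PATH")
    subprocess.run(
        build_convert_command(doc_path, out_dir, binary),
        check=True,
        timeout=timeout,
    )
    # LibreOffice writes <stem>.docx into the output directory.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")


# Usage (requires LibreOffice to be installed):
# docx_path = convert_doc_to_docx("report.doc", "converted")
command = build_convert_command("report.doc", "converted")
print(" ".join(command))
```

After conversion, the resulting file can go through the same DOCX parsing path as native DOCX input.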
Open‑Source PDF Tools
Popular Python libraries for PDF processing include:
PDFMiner – powerful but complex API.
PyPDF – lightweight, easy for basic tasks.
PyMuPDF (fitz) – fast, full‑featured.
PDFPlumber – built on PDFMiner, excels at table extraction.
Camelot – visual table extraction.
Papermage – wraps PDFPlumber and adds deep‑learning layout analysis.
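As a small example of what these libraries offer, the sketch below uses PDFPlumber's `extract_tables()` to pull tables from each page and renders them as markdown. The helper names and the markdown rendering are assumptions for illustration; PDFPlumber is imported lazily so the rendering helper can be tried without it.

```python
def table_to_markdown(rows):
    """Render a list of rows (first row is the header) as a markdown table."""
    def cell(value):
        # PDFPlumber returns None for empty cells and may keep line breaks.
        text = "" if value is None else str(value)
        return text.replace("\n", " ").replace("|", "\\|").strip()

    header, *body = rows
    lines = [
        "| " + " | ".join(cell(c) for c in header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(cell(c) for c in row) + " |" for row in body]
    return "\n".join(lines)


def extract_tables_as_markdown(pdf_path):
    """Return every table found in the PDF as a markdown string."""
    import pdfplumber  # third-party dependency: pip install pdfplumber

    tables = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for rows in page.extract_tables():
                if rows:
                    tables.append(table_to_markdown(rows))
    return tables


# The rendering helper on its own, with hand-written rows:
demo = table_to_markdown([["Name", "Qty"], ["Bolt\nM8", None]])
print(demo)
```

Default extraction settings tend to suit tables with ruling lines; borderless tables usually need tuned settings or a different tool.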
Papermage Pipeline
Pure Text Extraction: Use PDFPlumber to obtain words and detect lines.
Visual Annotation: Rasterise each page, run an EfficientDet object detector (via layoutparser) to obtain blocks with bounding boxes and labels (e.g., image, table).
Character‑Level Annotation: Feed words, block IDs, line IDs, and visual labels to an I‑VILA layout model, which predicts element types such as Title, Author, Abstract, Paragraph, Table, Figure, etc.
Sample JSON input to the model:
{
"words": ["word1", "word2", ...],
"block_ids": [0,0,0,1,...],
"line_ids": [0,1,1,2,...],
"labels": [0,0,0,1,...]
}
Sample output mapping IDs to element names is also provided.
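The parallel‑array layout of that input can be produced by flattening a page's blocks and lines. The sketch below is not Papermage's API; the nested `blocks` structure, the function name `flatten_page`, the numeric label values, and the use of page‑wide line IDs are all assumptions chosen to match the sample shape.

```python
import json


def flatten_page(blocks):
    """Flatten blocks -> lines -> words into four parallel, word-aligned lists."""
    payload = {"words": [], "block_ids": [], "line_ids": [], "labels": []}
    line_id = 0  # assumed to count lines across the whole page
    for block_id, block in enumerate(blocks):
        for line in block["lines"]:
            for word in line:
                payload["words"].append(word)
                payload["block_ids"].append(block_id)
                payload["line_ids"].append(line_id)
                # Every word inherits the visual label of its block.
                payload["labels"].append(block["label"])
            line_id += 1
    return payload


# Hypothetical page: label 0 and 1 are placeholder visual classes.
page = [
    {"label": 0, "lines": [["Deep", "Parsing"], ["of", "PDFs"]]},
    {"label": 1, "lines": [["Table", "1"]]},
]
payload = flatten_page(page)
print(json.dumps(payload))
```

Because all four lists share one index, the model can look up the block, line, and visual label of any word by position.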
Key Challenges
Layout Element Recovery: PDFs lack concepts like headings or tables, so recovering titles, paragraphs, superscripts/subscripts, headers, and footers is essential for downstream slicing.
Table Structure Recognition: Requires locating the table region, determining its grid, and extracting cell text, handling both full‑frame and partial‑frame tables.
Reading Order Restoration: After obtaining bounding boxes, reconstruct a human‑like reading sequence using rule‑based (e.g., xy‑cut) or deep‑learning methods such as LayoutReader.
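To illustrate the rule‑based option for reading order, here is a minimal recursive XY‑cut sketch. It is one common variant (cut at the widest empty gap on either axis), not necessarily the variant any particular system uses. Boxes are assumed to be `(x0, y0, x1, y1)` tuples with the origin at the top‑left, and `min_gap` is an illustrative threshold.

```python
def _largest_gap(boxes, lo, hi):
    """Return (gap_size, cut_position) of the widest empty interval on one axis."""
    intervals = sorted((b[lo], b[hi]) for b in boxes)
    best_gap, best_cut = 0.0, None
    reach = intervals[0][1]  # furthest coordinate covered so far
    for start, end in intervals[1:]:
        if start - reach > best_gap:
            best_gap, best_cut = start - reach, (start + reach) / 2
        reach = max(reach, end)
    return best_gap, best_cut


def xy_cut(boxes, min_gap=1.0):
    """Order boxes by recursively cutting the page at its widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    gap_x, cut_x = _largest_gap(boxes, 0, 2)  # vertical cut separates columns
    gap_y, cut_y = _largest_gap(boxes, 1, 3)  # horizontal cut separates rows
    gap, axis, cut = max((gap_y, 1, cut_y), (gap_x, 0, cut_x), key=lambda t: t[0])
    if cut is None or gap < min_gap:
        # No usable whitespace: fall back to top-to-bottom, left-to-right.
        return sorted(boxes, key=lambda b: (b[1], b[0]))
    first = [b for b in boxes if b[axis] < cut]
    second = [b for b in boxes if b[axis] >= cut]
    return xy_cut(first, min_gap) + xy_cut(second, min_gap)


# Hypothetical page: full-width title above two text columns, given out of order.
title = (50, 40, 550, 70)
left_1, left_2 = (50, 100, 290, 120), (50, 125, 290, 145)
right_1, right_2 = (310, 100, 550, 120), (310, 125, 550, 145)
ordered = xy_cut([right_2, left_1, title, right_1, left_2])
print(ordered)
```

Choosing the widest gap first means the column gutter is cut before the narrower line spacing, so the left column is read in full before the right one. Layouts without clean whitespace separators are where learned methods such as LayoutReader tend to help.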
Alibaba Cloud Search Document Parsing Architecture
In the architecture diagram, the DOC/DOCX processing paths are shown on the left and in the middle, and the PDF path on the right. For PDF, a rule‑based approach is chosen so that large‑scale, diverse document collections can be processed without depending on GPUs.
Supported output elements (markdown): multi‑level headings, natural paragraphs, images, tables (full/partial), superscripts/subscripts, headers/footers, reading order, OCR for images, and PPT‑type optimisation.
Performance
On a test set of 53 papers, average processing time is 8.4 s per document (0.5 s per page). Table extraction accuracy reaches 90 % precision and 94 % recall, with a 6 % error rate.
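Two derived figures can be read off these numbers with simple arithmetic. The sketch below computes the F1 score implied by the reported precision and recall, and the average document length implied by the two timings. The latter assumes both timings are averages over the same 53 documents, which the source does not state.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported above.
precision, recall = 0.90, 0.94
seconds_per_doc, seconds_per_page = 8.4, 0.5

f1 = f1_score(precision, recall)
implied_pages_per_doc = seconds_per_doc / seconds_per_page

print(f"F1 = {f1:.3f}")
print(f"Implied average length = {implied_pages_per_doc:.1f} pages per document")
```

The F1 value comes out at roughly 0.92, and the timings imply documents of about 17 pages on average under the stated assumption.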
Alibaba Cloud Big Data AI Platform
The Alibaba Cloud Big Data AI Platform builds on Alibaba’s leading cloud infrastructure, big‑data and AI engineering capabilities, scenario algorithms, and extensive industry experience to offer enterprises and developers a one‑stop, cloud‑native big‑data and AI capability suite. It boosts AI development efficiency, enables large‑scale AI deployment across industries, and drives business value.