
Turning PDFs and Word Docs into Searchable Knowledge for RAG Systems

This article explains why generic large language models struggle with domain‑specific data, introduces Retrieval‑Augmented Generation (RAG) as a solution, compares Word and PDF formats, outlines document‑parsing pipelines, reviews open‑source PDF tools, and presents Alibaba Cloud's rule‑based parsing architecture with performance results.

Alibaba Cloud Big Data AI Platform

Sep 2, 2024

Background

Although general‑purpose large language models (LLMs) excel at general knowledge question answering, they often fail on domain‑specific queries because the specialized data is not publicly available and the models never saw it during training. Fine‑tuning an LLM on private data is costly, so Retrieval‑Augmented Generation (RAG) instead retrieves the relevant private data and adds it to the user query before passing the query to a generic LLM, which improves answer quality.

Why Document Parsing Matters

RAG requires a searchable knowledge base. Most professional documents are stored as unstructured PDF or Word files, whose titles, paragraphs, tables, and images are easy for humans to read but hard for computers to interpret. Converting these files into a semi‑structured format (e.g., markdown or HTML) enables slicing, vectorisation, and efficient retrieval.
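To illustrate why a semi‑structured format helps with slicing, here is a minimal sketch that splits markdown into one chunk per heading and keeps the heading trail as context. The function name slice_markdown and the chunk layout are assumptions made for this example, not part of the system described in the article.

```python
import re
from typing import Dict, List

# Matches ATX-style markdown headings such as "## Install".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def slice_markdown(markdown: str) -> List[Dict[str, object]]:
    """Split markdown into chunks, one per heading, keeping the heading path."""
    chunks: List[Dict[str, object]] = []
    path: List[str] = []   # current heading trail, e.g. ["Guide", "Install"]
    body: List[str] = []   # lines collected under the current heading

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"path": list(path), "text": text})
        body.clear()

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]            # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk can then be embedded together with its heading path, so that a retrieved passage still carries the section it came from.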

Word vs. PDF

Word (DOCX) is edit‑oriented. It follows the Office Open XML standard and stores content as XML with structural concepts such as headings, paragraphs, and tables, but it lacks explicit page or position information.

PDF is read‑oriented. It stores drawing commands that fix the visual layout on a page, which preserves exact positions but provides no structural concepts such as headings or tables.

Examples of DOCX and PDF internal structures are shown in the original code snippets.

DOCX Parsing

A DOCX file is a zip archive containing XML files. The main content resides in word/document.xml. Key tags include the root element (<w:document>), body (<w:body>), paragraph (<w:p>), run (<w:r>), text (<w:t>), and section properties (<w:sectPr>).
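The tag structure above can be read with the Python standard library alone. The sketch below is illustrative only: it opens the archive, parses word/document.xml, and joins the <w:t> text of each <w:p>. The helper names are made up for this example, and the in‑memory archive built at the end is a stripped‑down stand‑in for a real DOCX file (it contains only word/document.xml, so Word itself would not open it).

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# WordprocessingML namespace bound to the "w:" prefix in document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_paragraphs(source) -> List[str]:
    """Return the plain text of every <w:p> in a DOCX (path or file object)."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread over runs (<w:r>) holding <w:t> nodes.
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W_NS}}}t")))
    return paragraphs


def make_minimal_docx(texts: List[str]) -> io.BytesIO:
    """Build a tiny in-memory archive for the demo (not a complete DOCX)."""
    body = "".join(f"<w:p><w:r><w:t>{t}</w:t></w:r></w:p>" for t in texts)
    xml = f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    buffer.seek(0)
    return buffer


paragraphs = docx_paragraphs(make_minimal_docx(["Background", "RAG needs parsing."]))
print(paragraphs)
```

A real parser would also inspect paragraph properties (for example the style name) to tell headings from body text, which this sketch leaves out.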

DOC Parsing

Legacy DOC files are OLE compound documents. They lack high‑level structural tags, so the common practice is to convert DOC to DOCX (e.g., via LibreOffice) before parsing.
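As a sketch of the conversion step, the snippet below wraps LibreOffice's headless converter. It assumes the soffice executable is on PATH (on some systems the binary is named libreoffice instead); the function names are chosen for this example.

```python
import subprocess
from pathlib import Path
from typing import List


def build_convert_command(doc_path: str, out_dir: str, soffice: str = "soffice") -> List[str]:
    """Build the LibreOffice command line that converts one file to DOCX."""
    return [soffice, "--headless", "--convert-to", "docx", "--outdir", out_dir, doc_path]


def convert_doc_to_docx(doc_path: str, out_dir: str) -> Path:
    """Run the conversion and return the expected output path."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    # LibreOffice keeps the base name and swaps the extension.
    return Path(out_dir) / (Path(doc_path).stem + ".docx")
```

The resulting DOCX can then go through the same XML‑based parsing path as native DOCX files.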

Open‑Source PDF Tools

Popular Python libraries for PDF processing include:

PDFMiner – powerful but complex API.

PyPDF – lightweight, easy for basic tasks.

PyMuPDF (fitz) – fast, full‑featured.

PDFPlumber – built on PDFMiner, excels at table extraction.

Camelot – visual table extraction.

Papermage – wraps PDFPlumber and adds deep‑learning layout analysis.
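For a feel of what these libraries return, here is a small sketch using PyMuPDF. It relies on page.get_text("blocks"), which yields tuples of the form (x0, y0, x1, y1, text, block_no, block_type), where block_type 0 marks text and 1 marks an image. The helper names are assumptions for this example, and PyMuPDF is imported lazily so the pure helper can be used without it installed.

```python
from typing import List, Tuple

# PyMuPDF block tuple: (x0, y0, x1, y1, text, block_no, block_type)
Block = Tuple[float, float, float, float, str, int, int]


def extract_blocks(pdf_path: str) -> List[Block]:
    """Collect the layout blocks of every page in a PDF."""
    import fitz  # PyMuPDF, imported here so the module loads without it

    blocks: List[Block] = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            blocks.extend(page.get_text("blocks"))
    return blocks


def text_blocks(blocks: List[Block]) -> List[str]:
    """Keep only non-empty text blocks (block_type 0), stripped of whitespace."""
    return [b[4].strip() for b in blocks if b[6] == 0 and b[4].strip()]
```

Note that the blocks carry positions but no notion of heading or table, which is exactly the gap the rest of the pipeline has to close.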

Papermage Pipeline

Pure Text Extraction: Use PDFPlumber to obtain words and detect lines.

Visual Annotation: Rasterise each page, run an EfficientDet object detector (via layoutparser) to obtain blocks with bounding boxes and labels (e.g., image, table).

Character‑Level Annotation: Feed words, block IDs, line IDs, and visual labels to an I‑VILA layout model, which predicts element types such as Title, Author, Abstract, Paragraph, Table, Figure, etc.

Sample JSON input to the model:

{
  "words": ["word1", "word2", ...],
  "block_ids": [0,0,0,1,...],
  "line_ids": [0,1,1,2,...],
  "labels": [0,0,0,1,...]
}

Sample output mapping IDs to element names is also provided.
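The parallel arrays in the sample input can be produced by flattening a block/line/word hierarchy. The sketch below assumes a hypothetical nested structure (each block has a numeric visual label and a list of lines, each line a list of words); it is not Papermage's actual data model, only an illustration of how the four arrays line up.

```python
from typing import Dict, List, Sequence


def build_model_input(blocks: Sequence[dict]) -> Dict[str, List]:
    """Flatten blocks -> lines -> words into aligned per-word arrays."""
    words: List[str] = []
    block_ids: List[int] = []
    line_ids: List[int] = []
    labels: List[int] = []
    line_id = 0  # line IDs run across the whole page, not per block
    for block_id, block in enumerate(blocks):
        for line in block["lines"]:
            for word in line:
                words.append(word)
                block_ids.append(block_id)
                line_ids.append(line_id)
                labels.append(block["label"])  # each word inherits its block's label
            line_id += 1
    return {"words": words, "block_ids": block_ids, "line_ids": line_ids, "labels": labels}


page = [
    {"label": 0, "lines": [["RAG"], ["needs", "parsing"]]},
    {"label": 1, "lines": [["Table", "1"]]},
]
model_input = build_model_input(page)
print(model_input)
```

Every array has one entry per word, so index i always describes the same word in all four lists.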

Key Challenges

Layout Element Recovery: PDFs lack concepts like headings or tables, so recovering titles, paragraphs, superscripts/subscripts, headers, and footers is essential for downstream slicing.

Table Structure Recognition: Requires locating the table region, determining its grid, and extracting cell text, for both tables with complete borders (full‑frame) and tables with only partial borders (partial‑frame).

Reading Order Restoration: After obtaining bounding boxes, reconstruct a human‑like reading sequence using rule‑based methods (e.g., xy‑cut) or deep‑learning methods such as LayoutReader.
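To make the xy‑cut idea concrete, here is a simplified recursive sketch. It looks for blank vertical strips (column gaps) first, then blank horizontal strips (row gaps), and recurses into each group. The min_gap threshold and the cut order are assumptions of this example; it is a textbook‑style simplification, not the rule set used by the system described below.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), y grows downward


def _split(boxes: List[Box], axis: int, min_gap: float) -> List[List[Box]]:
    """Group boxes separated by empty strips along one axis (0 = x, 1 = y)."""
    ordered = sorted(boxes, key=lambda b: b[axis])
    groups = [[ordered[0]]]
    reach = ordered[0][axis + 2]            # far edge covered so far
    for box in ordered[1:]:
        if box[axis] - reach >= min_gap:    # blank strip found: start a new group
            groups.append([box])
        else:
            groups[-1].append(box)
        reach = max(reach, box[axis + 2])
    return groups


def xy_cut(boxes: Sequence[Box], min_gap: float = 5.0) -> List[Box]:
    """Return the boxes in an approximate reading order."""
    boxes = list(boxes)
    if len(boxes) <= 1:
        return boxes
    for axis in (0, 1):                     # try column cuts, then row cuts
        groups = _split(boxes, axis, min_gap)
        if len(groups) > 1:
            return [b for group in groups for b in xy_cut(group, min_gap)]
    # No clean cut exists: fall back to top-to-bottom, left-to-right.
    return sorted(boxes, key=lambda b: (b[1], b[0]))
```

For a page with a full‑width title above two columns, the title blocks the column cut at the top level, so the first cut separates the title, and the recursion then splits the remainder into left and right columns. Layouts whose row gaps align across columns can still be read row by row by this sketch, which is one reason practical systems add further rules or learned models.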

Alibaba Cloud Search Document Parsing Architecture

In the architecture, DOC and DOCX files are handled by the left and middle branches and PDF files by the right branch. For PDF, a rule‑based approach was chosen so that large volumes of diverse documents can be processed without depending on GPUs.

Supported output elements (markdown): multi‑level headings, natural paragraphs, images, tables (full/partial), superscripts/subscripts, headers/footers, reading order, OCR for images, and PPT‑type optimisation.
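As an illustration of the final step, the sketch below renders a list of parsed elements as markdown. The element schema (dictionaries with type, level, text, rows, src, and alt keys) is hypothetical and chosen only for this example; the article does not describe the system's internal representation.

```python
from typing import Dict, List


def table_to_markdown(rows: List[List[str]]) -> str:
    """Render a grid of cell strings as a markdown table (first row = header)."""
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    return "\n".join(lines)


def render_markdown(elements: List[Dict]) -> str:
    """Join parsed elements, already in reading order, into one markdown string."""
    parts = []
    for element in elements:
        kind = element["type"]
        if kind == "heading":
            parts.append("#" * element["level"] + " " + element["text"])
        elif kind == "table":
            parts.append(table_to_markdown(element["rows"]))
        elif kind == "image":
            parts.append(f"![{element.get('alt', '')}]({element['src']})")
        else:  # treat everything else as a natural paragraph
            parts.append(element["text"])
    return "\n\n".join(parts)
```

Because headings become markdown heading lines, the output can be sliced by heading before vectorisation.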

Performance

On a test set of 53 papers, the average processing time is 8.4 s per document (0.5 s per page). Table extraction reaches 90% precision and 94% recall, with a 6% error rate.
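Two derived figures help put these numbers in context. Neither is reported in the source; both are simple arithmetic on the reported values, and the page count in particular is only a rough estimate that assumes both averages were taken over the same test set.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Reported table-extraction metrics.
f1 = f1_score(0.90, 0.94)

# Reported timings: 8.4 s per document and 0.5 s per page.
pages_per_document = 8.4 / 0.5

print(f"F1 = {f1:.3f}, about {pages_per_document:.0f} pages per document")
```

The F1 score works out to roughly 0.92, and the timings suggest documents of about 17 pages on average.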
