Open-Source PDF Toolkit Delivers High-Accuracy Layout and Formula Detection
PDF‑Extract‑Kit is an open‑source toolkit that combines high‑accuracy layout detection, formula detection, formula recognition, and OCR for PDFs. This article covers its model comparisons, its evaluation on academic‑paper and textbook datasets, and step‑by‑step instructions for running it on Windows or macOS, including Apple Silicon.
PDF-Extract-Kit is an open‑source toolkit that provides layout detection, formula detection, formula recognition, and OCR for PDF documents.
Usage Examples
Layout Detection
Formula Detection
Exam Paper Detection
Evaluation Metrics
Layout Detection
Models compared: DocXchain, Surya, two models from 360LayoutAnalysis, and LayoutLMv3‑SFT (fine‑tuned from LayoutLMv3‑base‑chinese). Validation sets: 402 academic‑paper pages and 587 textbook pages.
Formula Detection
Models evaluated: Pix2Text‑MFD (open‑source) and a fine‑tuned YOLOv8‑l model (YOLOv8‑Trained). Validation sets: 255 academic‑paper pages and 789 mixed pages (textbooks, books, etc.).
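The article does not reproduce the metric tables, so the exact scoring protocol is not known here. Detection benchmarks of this kind are commonly scored by matching predicted boxes to ground-truth boxes with an intersection-over-union (IoU) threshold, often 0.5. The sketch below illustrates that general idea only; the function names `iou` and `match_detections`, the box format, and the 0.5 default are assumptions for illustration, not PDF‑Extract‑Kit's evaluation code.

```python
from typing import List, Tuple

# Assumed box format: (x1, y1, x2, y2) in pixels, with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    preds: List[Tuple[Box, float]],
    truths: List[Box],
    iou_thres: float = 0.5,  # hypothetical default, a common convention
) -> Tuple[int, int, int]:
    """Greedily match predictions to ground truth, highest score first.

    Returns (true_positives, false_positives, false_negatives).
    """
    matched = set()
    tp = fp = 0
    for box, _score in sorted(preds, key=lambda p: p[1], reverse=True):
        best_j, best_iou = -1, 0.0
        for j, gt in enumerate(truths):
            if j in matched:
                continue  # each ground-truth box can be matched only once
            overlap = iou(box, gt)
            if overlap > best_iou:
                best_j, best_iou = j, overlap
        if best_j >= 0 and best_iou >= iou_thres:
            matched.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(truths) - len(matched)
    return tp, fp, fn
```

Precision and recall then follow as tp / (tp + fp) and tp / (tp + fn); sweeping the score threshold over these counts is how average-precision style metrics are usually built.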
Formula Recognition
Uses UniMERNet weights without additional fine‑tuning.
Quick Start (macOS M‑series)
Clone the repository
git clone https://github.com/opendatalab/PDF-Extract-Kit.git
Create and activate a virtual environment
python -m venv venv
source venv/bin/activate
Install dependencies
pip install -r requirements+cpu.txt
# Detectron2 requires compilation; see https://github.com/facebookresearch/detectron2/issues/5114
# or install the pre‑built wheel
pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl
Configure MPS acceleration by editing configs/model_configs.yaml and setting device to mps:
model_args:
  device: mps
  img_size: 1888
  conf_thres: 0.25
  iou_thres: 0.45
  pdf_dpi: 200
  layout_weight: models/Layout/model_final.pth
  mfd_weight: models/MFD/weights.pt
  mfr_weight: models/MFR/UniMERNet
Run the program
python pdf_extract.py --pdf demo/demo1.pdf
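For repeated setups, the device change in configs/model_configs.yaml can be scripted rather than edited by hand. The following is a minimal sketch using only the Python standard library. It assumes the file contains a single `device: <value>` line, as in the configuration shown earlier; the helper names `set_device` and `update_config_file` are hypothetical and not part of PDF‑Extract‑Kit.

```python
import re
from pathlib import Path

# Matches a line such as "  device: cuda" and captures everything before the value.
_DEVICE_LINE = re.compile(r"^([ \t]*device:[ \t]*)\S+", flags=re.MULTILINE)


def set_device(config_text: str, device: str) -> str:
    """Return config_text with the first `device:` value replaced by `device`."""
    new_text, count = _DEVICE_LINE.subn(
        lambda m: m.group(1) + device, config_text, count=1
    )
    if count == 0:
        raise ValueError("no 'device:' key found in config text")
    return new_text


def update_config_file(path: str, device: str = "mps") -> None:
    """Rewrite the config file in place with the new device value."""
    cfg = Path(path)
    cfg.write_text(
        set_device(cfg.read_text(encoding="utf-8"), device), encoding="utf-8"
    )


if __name__ == "__main__":
    # Path taken from the article; run from the repository root.
    update_config_file("configs/model_configs.yaml", "mps")
```

A plain text substitution is used here so that the rest of the file, including comments and key order, stays untouched. Setting the value back to `cpu` works the same way if MPS turns out to be unavailable on a given machine.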
References
PDF-Extract-Kit: https://github.com/opendatalab/PDF-Extract-Kit
UniMERNet: https://github.com/opendatalab/UniMERNet
