Open-Source PDF Toolkit Delivers High-Accuracy Layout and Formula Detection

PDF‑Extract‑Kit is an open‑source toolkit that combines layout detection, formula detection, formula recognition, and OCR for PDFs. This article lists the models it was compared against, describes the academic‑paper and textbook validation sets, and walks through a quick start on macOS with Apple Silicon.

Full-Stack Cultivation Path

Jul 17, 2024

PDF-Extract-Kit is an open‑source toolkit that provides layout detection, formula detection, formula recognition, and OCR for PDF documents.

Usage Examples

[Figures: example outputs for layout detection, formula detection, and exam paper detection]

Evaluation Metrics

Layout Detection

Models compared: DocXchain, Surya, two models from 360LayoutAnalysis, and LayoutLMv3‑SFT (fine‑tuned from LayoutLMv3‑base‑chinese). Validation sets: 402 academic‑paper pages and 587 textbook pages.

Formula Detection

Models evaluated: Pix2Text‑MFD (open‑source) and YOLOv8‑Trained, a fine‑tuned YOLOv8‑l model. Validation sets: 255 academic‑paper pages and 789 mixed pages (textbooks, books, etc.).
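The article does not say which metric was used to score these detectors. Detection models of this kind are commonly scored by matching predicted boxes to ground-truth boxes using intersection over union (IoU), so the sketch below shows that building block only. The box format, the function names, and the 0.5 matching threshold are assumptions for illustration and are not taken from PDF-Extract-Kit.

```python
from typing import Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def is_match(pred: Box, truth: Box, threshold: float = 0.5) -> bool:
    """Count a prediction as correct when IoU reaches the threshold.

    The 0.5 default is a common convention, assumed here.
    """
    return iou(pred, truth) >= threshold
```

A predicted formula box that covers the same width as the labeled box but only 80% of its height has an IoU of 0.8, so it would count as a match at a 0.5 threshold.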

Formula Recognition

Uses UniMERNet weights without additional fine‑tuning.

Quick Start (macOS M‑series)

Clone the repository

git clone https://github.com/opendatalab/PDF-Extract-Kit.git
cd PDF-Extract-Kit

Create and activate a virtual environment (use Python 3.10, because the pre‑built Detectron2 wheel in the install step is tagged cp310)

python -m venv venv
source venv/bin/activate

Install dependencies

pip install -r requirements+cpu.txt
# Detectron2 requires compilation; see https://github.com/facebookresearch/detectron2/issues/5114
# or install the pre‑built wheel
pip install https://github.com/opendatalab/PDF-Extract-Kit/raw/main/assets/whl/detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl
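The pre-built wheel's filename encodes what it was built for: `cp310` means CPython 3.10 and `macosx_11_0_arm64` means Apple Silicon macOS. A quick check before installing can save a confusing pip error. The following is a minimal sketch, not pip's real tag resolution: the function names are hypothetical, and the check ignores the macOS version and the Python implementation.

```python
import platform
import sys
from typing import Optional, Tuple

WHEEL = "detectron2-0.6-cp310-cp310-macosx_11_0_arm64.whl"


def wheel_tags(filename: str) -> Tuple[str, str, str]:
    """Return (python_tag, abi_tag, platform_tag) from a wheel filename."""
    stem = filename[: -len(".whl")]
    # The last three dash-separated fields are the compatibility tags.
    py_tag, abi_tag, plat_tag = stem.split("-")[-3:]
    return py_tag, abi_tag, plat_tag


def interpreter_matches(
    filename: str,
    version: Optional[Tuple[int, int]] = None,
    machine: Optional[str] = None,
    system: Optional[str] = None,
) -> bool:
    """Simplified check: same CPython minor version, macOS, same CPU arch."""
    version = version or sys.version_info[:2]
    machine = machine or platform.machine()  # "arm64" on Apple Silicon
    system = system or platform.system()     # "Darwin" on macOS
    py_tag, _abi_tag, plat_tag = wheel_tags(filename)
    return (
        py_tag == f"cp{version[0]}{version[1]}"
        and system == "Darwin"
        and plat_tag.endswith(machine)
    )


if __name__ == "__main__":
    ok = interpreter_matches(WHEEL)
    print("wheel looks compatible" if ok else "wheel does not match this interpreter")
```

If the check fails, the options are to recreate the virtual environment with Python 3.10 or to compile Detectron2 from source as the linked issue describes.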

Configure MPS acceleration by editing configs/model_configs.yaml and setting device to mps:

model_args:
  device: mps
  img_size: 1888
  conf_thres: 0.25
  iou_thres: 0.45
  pdf_dpi: 200
  layout_weight: models/Layout/model_final.pth
  mfd_weight: models/MFD/weights.pt
  mfr_weight: models/MFR/UniMERNet
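Setting `device: mps` only helps if the installed PyTorch build actually exposes the Metal backend. The sketch below picks `mps` when `torch.backends.mps.is_available()` reports it and falls back to `cpu` otherwise, then rewrites the `device:` line in the config text. The helper names are hypothetical, and the plain-text edit assumes the file has a single `device:` key laid out as shown above; the toolkit itself may read the file differently.

```python
import re
from pathlib import Path


def pick_device() -> str:
    """Return "mps" when PyTorch reports the Metal backend, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    mps = getattr(torch.backends, "mps", None)  # absent on old PyTorch builds
    return "mps" if mps is not None and mps.is_available() else "cpu"


def set_device(config_text: str, device: str) -> str:
    """Rewrite the first `device:` line, keeping its indentation."""
    pattern = re.compile(r"^([ \t]*device:[ \t]*).*$", flags=re.MULTILINE)
    new_text, count = pattern.subn(
        lambda m: m.group(1) + device, config_text, count=1
    )
    if count == 0:
        raise ValueError("no 'device:' key found in config text")
    return new_text


def update_config_file(path: str = "configs/model_configs.yaml") -> str:
    """Apply the detected device to the config file and return it."""
    config_path = Path(path)
    device = pick_device()
    text = config_path.read_text(encoding="utf-8")
    config_path.write_text(set_device(text, device), encoding="utf-8")
    return device
```

Calling `update_config_file()` from the repository root edits the file in place. Editing the text directly, instead of loading and dumping the YAML, leaves the comments and key order in the file untouched.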

Run the program

python pdf_extract.py --pdf demo/demo1.pdf
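The command above processes one file. To run a folder of PDFs, one option is a small wrapper that invokes the same script once per file. This is a sketch under assumptions: it uses only the `--pdf` flag shown above, expects to be run from the repository root inside the activated virtual environment, and the function names are hypothetical.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def build_command(pdf_path: Path, script: str = "pdf_extract.py") -> List[str]:
    """Build the same invocation as the single-file command in the article."""
    # sys.executable keeps the call inside the active virtual environment.
    return [sys.executable, script, "--pdf", str(pdf_path)]


def run_all(pdf_dir: str) -> Dict[str, int]:
    """Run the extractor on every PDF in a folder; map file name to exit code."""
    results: Dict[str, int] = {}
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        completed = subprocess.run(build_command(pdf))
        results[pdf.name] = completed.returncode
    return results


if __name__ == "__main__":
    for name, code in run_all("demo").items():
        status = "ok" if code == 0 else f"failed (exit code {code})"
        print(f"{name}: {status}")
```

Running each file in its own process reloads the models every time, which is slow but keeps one failing PDF from stopping the rest.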

References

PDF-Extract-Kit: https://github.com/opendatalab/PDF-Extract-Kit

UniMERNet: https://github.com/opendatalab/UniMERNet
