How POINTS-Reader Achieves State‑of‑the‑Art PDF Extraction Without Teacher Models
The POINTS-Reader paper, accepted at EMNLP 2025, introduces a two‑stage, fully automated data‑generation pipeline. It enables a lightweight vision‑language model to extract text, tables, and LaTeX formulas from diverse PDF layouts with strong accuracy and high throughput, all without relying on costly teacher‑model distillation.
Introduction
PDF documents are a dominant medium for information exchange, but extracting their rich content—including plain text, mathematical formulas, and tables—remains challenging. Existing approaches fall into three categories: (1) traditional parsers (e.g., PyMuPDF) that often lose complex structures; (2) pipeline solutions (e.g., MinerU, Mathpix) that depend on multiple proprietary models; and (3) end‑to‑end methods that require large, high‑quality training data, which is difficult to obtain.
POINTS-Reader proposes a highly scalable two‑stage data generation scheme consisting of a Uniform Format Warm‑up Stage (UWS) and an Iterative Self‑improvement Stage (ISS), dramatically improving extraction efficiency and providing a solid foundation for continual model improvement.
Model Performance
The open‑sourced model achieves leading results on OmniDocBench (overall score 0.133 for English and 0.212 for Chinese) and surpasses several larger proprietary models.
Model Highlights
Simplicity: POINTS-Reader retains the POINTS‑1.5 architecture, replacing its Qwen2.5‑7B‑Instruct LLM with Qwen2.5‑3B‑Instruct. The input is a fixed prompt plus a document image; the output is a single string of extracted content that requires no post‑processing.
Performance: Supports both Chinese and English documents with strong benchmark results.
High Throughput: Uses a mid‑sized vision encoder (the 600M‑parameter NaViT) to avoid an encoding bottleneck, combined with native SGLang support for very high inference speed; vLLM support is forthcoming.
Open‑source Solution: Introduces a two‑stage data‑generation strategy that first trains on synthetic data and then iteratively self‑improves on real data, and is applicable to any model.
Method
Stage 1: Uniform Format Warm‑up Stage (UWS)
The goal is to give the model a solid foundation for handling diverse document elements.
Unified Output Format:
Plain text – Markdown.
Tables – HTML (preserves merged cells).
Formulas – LaTeX, with inline $...$ and display $$...$$ syntax.
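A minimal sketch of what this unified format looks like in practice; the helper function and the sample content below are invented for illustration, not taken from the paper:

```python
# Illustrative example of the unified output format:
# Markdown for prose, HTML for tables, LaTeX for formulas,
# all concatenated into a single output string.

def build_unified_output(paragraph, table_rows, formula):
    """Assemble one extraction result in the unified format."""
    # Tables are emitted as HTML so merged cells (rowspan/colspan)
    # can be preserved; this toy version has no merged cells.
    cells = "".join(
        "<tr>" + "".join(f"<td>{c}</td>" for c in row) + "</tr>"
        for row in table_rows
    )
    table_html = f"<table>{cells}</table>"
    # A display formula is wrapped in $$...$$; inline math would use $...$.
    return f"{paragraph}\n\n{table_html}\n\n$${formula}$$"

example = build_unified_output(
    "## Results\nAccuracy improved across **all** settings.",
    [["Model", "Score"], ["POINTS-Reader", "0.133"]],
    r"F_1 = \frac{2PR}{P + R}",
)
print(example)
```

Because everything is one plain string, a downstream consumer only needs a Markdown renderer that tolerates embedded HTML and math delimiters; no structured parsing of model output is required.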
Large‑scale Synthetic Data Generation:
Use a powerful LLM (e.g., Qwen2.5‑72B) to generate varied content covering plain text, formulas, tables, and multi‑column layouts.
Apply rule‑based filtering to ensure syntactic correctness of formulas and tables.
Render the filtered content into images via HTML templates, pairing each image with its source text to create “image‑text” pairs.
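The filtering step can be sketched as follows. These two validity checks are simplified assumptions (the paper only specifies "rule‑based filtering"), and the actual HTML rendering to images, which requires a browser engine, is omitted:

```python
import re

def latex_delimiters_balanced(text):
    """Cheap syntactic check: $ signs come in pairs and braces balance.
    A stand-in for the paper's unspecified formula filter."""
    if text.count("$") % 2 != 0:
        return False
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:          # closing brace before any opening one
                return False
    return depth == 0

def table_tags_balanced(html):
    """Every <table>/<tr>/<td> opening tag must have a matching close."""
    for tag in ("table", "tr", "td"):
        opens = len(re.findall(f"<{tag}[ >]", html))
        if opens != html.count(f"</{tag}>"):
            return False
    return True

# Keep only LLM-generated samples that pass both syntactic checks;
# survivors would then be rendered to images via HTML templates.
candidates = ["$E = mc^{2}$", "$x", "<table><tr><td>a</td></tr></table>"]
kept = [s for s in candidates
        if latex_delimiters_balanced(s) and table_tags_balanced(s)]
```

Rejecting malformed samples before rendering is cheap insurance: a truncated formula or unclosed table would otherwise produce an image whose ground‑truth text is silently wrong.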
Model Fine‑tuning: Fine‑tune a generic vision‑language model (e.g., POINTS‑1.5) on the synthetic pairs, endowing it with initial document‑element extraction capabilities.
Stage 2: Iterative Self‑improvement Stage (ISS)
The aim is to adapt the pre‑trained model to real‑world documents and continuously improve both data quality and model performance.
Model Annotation: Use the Stage‑1 model to automatically label a large real‑document corpus (e.g., DocMatix).
Data Filtering: Apply rule‑based filters to the generated annotations:
Plain‑text filter – compute F1 against the output of a traditional OCR engine (e.g., PaddleOCR) and keep samples above a high threshold (e.g., 0.9) to reduce hallucinations.
Table filter – verify structural consistency (uniform column counts).
Formula filter – check LaTeX syntax validity.
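A sketch of the three filters. The character‑level F1 granularity and the simple structural checks are assumptions, and the OCR reference text would come from an engine such as PaddleOCR, which is not called here:

```python
from collections import Counter

def char_f1(pred, ref):
    """Character-bag F1 between the model's text and an OCR reference.
    (Character granularity is an assumption; the paper just says F1
    against a traditional OCR engine.)"""
    if not pred or not ref:
        return 0.0
    p, r = Counter(pred), Counter(ref)
    overlap = sum((p & r).values())       # multiset intersection size
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def table_is_rectangular(rows):
    """Structural consistency: every row has the same column count."""
    return len({len(r) for r in rows}) <= 1

def latex_syntax_ok(formula):
    """Minimal LaTeX validity check: braces must balance."""
    depth = 0
    for ch in formula:
        depth += ch == "{"
        depth -= ch == "}"
        if depth < 0:
            return False
    return depth == 0

def keep_sample(pred_text, ocr_text, tables, formulas, threshold=0.9):
    """Apply all three filters; reject the sample if any one fails."""
    if char_f1(pred_text, ocr_text) < threshold:
        return False
    if not all(table_is_rectangular(t) for t in tables):
        return False
    return all(latex_syntax_ok(f) for f in formulas)
```

Note the asymmetry in how the filters are used: the weak OCR engine is not good enough to produce labels, but it is good enough to veto hallucinated ones, which is all the F1 gate needs.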
Model Re‑training: Retrain the model on the high‑quality filtered real data.
Iterative Loop: Repeat the "annotate → filter → retrain" cycle, progressively enhancing data quality and model accuracy.
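The loop itself can be sketched with hypothetical stand‑ins for the three steps. The toy "model" below is not the authors' training code; it exists only to show the control flow and the dynamic in which each round keeps more data than the last:

```python
# Toy simulation of the annotate -> filter -> retrain loop.
# All three callables are hypothetical stand-ins for illustration.

def iss_loop(annotate, quality_ok, retrain, corpus, rounds=3):
    """Run the iterative self-improvement cycle for a fixed number of rounds."""
    kept_per_round = []
    for _ in range(rounds):
        labeled = [(doc, annotate(doc)) for doc in corpus]       # annotate
        kept = [(d, y) for d, y in labeled if quality_ok(d, y)]  # filter
        annotate = retrain(annotate, kept)                       # retrain
        kept_per_round.append(len(kept))
    return annotate, kept_per_round

# Toy setup: a "model" that correctly labels documents below its skill level.
state = {"skill": 50}
corpus = list(range(100))

def annotate(doc):
    return doc if doc < state["skill"] else -1   # wrong label above skill

def quality_ok(doc, label):
    return label == doc                          # stand-in for the filters

def retrain(model, kept):
    # More clean training data -> a slightly better model next round.
    state["skill"] = min(100, state["skill"] + len(kept) // 10)
    return model

_, kept_per_round = iss_loop(annotate, quality_ok, retrain, corpus)
print(kept_per_round)  # samples surviving the filter, per round
```

Even in this toy version the diminishing returns the paper reports are visible: each round's gain depends on the previous round's surviving data, so growth slows as the model saturates the corpus.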
Key Innovations
Eliminates distillation: No reliance on expensive or closed‑source teacher models.
Automated closed‑loop: Combines synthetic pre‑training with real‑data self‑improvement for continuous data‑model co‑evolution.
Rule‑based filtering: Simple yet effective quality control that enables fully self‑supervised improvement.
Experiments
Stronger Performance
POINTS‑Reader outperforms many larger proprietary models on the OmniDocBench and Fox benchmarks.
Ablation Studies
Uniform Format Warm‑up: richer element diversity and layout variety consistently improve performance.
Iterative Self‑improvement: filtering at each iteration (text, tables, formulas) markedly boosts data quality and model scores.
Increasing the iteration count raises both the F1 score of the annotated data and model performance, though gains diminish over time; the amount of data surviving the filters also grows with each round, confirming that annotation quality improves.
Conclusion
We present a fully automated two‑stage data construction pipeline that first equips the model with basic document‑parsing abilities via uniform‑format synthetic pre‑training, then drives continual performance gains through an iterative “annotate‑filter‑retrain” loop on real data. The resulting end‑to‑end document parser achieves state‑of‑the‑art results while being lightweight, high‑throughput, and broadly applicable.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.