How PaddleOCR‑VL‑1.6’s 0.9B Model Achieved 96.33% SOTA on OmniDocBench v1.6
PaddleOCR‑VL‑1.6 is a compact 0.9B vision‑language model. Its authors diagnose three types of weak regions in the previous version, add targeted data for each, and apply a three‑stage CPT‑SFT‑RL training pipeline, reaching a 96.33% overall score on OmniDocBench v1.6 and surpassing much larger models on the reported document‑parsing tasks.
Document Parsing and Model Overview
Document parsing converts scanned pages into machine‑readable structures: text, tables, formulas, charts, seals, reading order, and layout. PaddleOCR‑VL‑1.6, a compact 0.9B‑parameter model, scored 96.33% overall on OmniDocBench v1.6, the top result on the leaderboard.
Diagnosing Weak Points
Analysis of the predecessor PaddleOCR‑VL‑1.5 (94.93 % on OmniDocBench v1.5) showed that remaining errors cluster in three regions:
Boundary‑Fragile Regions: Minor visual perturbations (pixel shifts, JPEG compression, slight blur) cause large changes in the output, which indicates unstable decision boundaries.
Coverage‑Sparse Regions: Samples are mispredicted even though they appear in the training set, because the data around them is sparse and long‑tail patterns are overwhelmed by the dominant distributions.
Unreliable‑Supervision Regions: High‑confidence errors that trace back to incorrect labels. Three external expert models (Qianfan‑OCR, GLM‑OCR, MinerU2.5‑Pro) cross‑validate and correct these labels.
Targeted high‑value annotations were added for each region.
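The three‑way diagnosis above can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the function names, the thresholds, the use of `difflib` similarity as the instability measure, and the simple majority vote among experts are not taken from the PaddleOCR‑VL‑1.6 pipeline. In particular, the coverage‑sparse test here is a crude stand‑in (a stable error on a sample whose label the experts do not dispute), not a density estimate.

```python
import difflib
from collections import Counter
from typing import List, Optional


def instability(clean_output: str, perturbed_outputs: List[str]) -> float:
    """Mean dissimilarity between the clean prediction and the predictions
    obtained on perturbed copies of the same image (0 = perfectly stable)."""
    if not perturbed_outputs:
        return 0.0
    sims = [
        difflib.SequenceMatcher(None, clean_output, p).ratio()
        for p in perturbed_outputs
    ]
    return 1.0 - sum(sims) / len(sims)


def expert_consensus(expert_outputs: List[str]) -> Optional[str]:
    """Return the answer a strict majority of experts agree on, else None."""
    if not expert_outputs:
        return None
    answer, votes = Counter(expert_outputs).most_common(1)[0]
    return answer if votes * 2 > len(expert_outputs) else None


def diagnose(
    label: str,
    clean_output: str,
    perturbed_outputs: List[str],
    expert_outputs: List[str],
    confidence: float,
    fragile_threshold: float = 0.2,    # hypothetical value
    confident_threshold: float = 0.9,  # hypothetical value
) -> str:
    """Assign one sample to a weak region using illustrative heuristics."""
    consensus = expert_consensus(expert_outputs)
    # Confident "error" where the experts also reject the label:
    # the label itself is the likely culprit.
    if (
        clean_output != label
        and confidence >= confident_threshold
        and consensus is not None
        and consensus != label
    ):
        return "unreliable_supervision"
    # Output swings under small perturbations.
    if instability(clean_output, perturbed_outputs) > fragile_threshold:
        return "boundary_fragile"
    # Stable but wrong on a sample whose label is not disputed.
    if clean_output != label:
        return "coverage_sparse"
    return "ok"
```

In this sketch, a sample flagged `unreliable_supervision` would have its label replaced by the expert consensus before it is reused for training, mirroring the cross‑validation step described above.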
Three‑Step Training Pipeline
The optimization framework combines a model‑driven data engine with a progressive post‑training strategy:
CPT (Continued Pre‑Training): 16.8M samples, including long‑tail documents (ancient books, rare characters, industrial tables) and corrected annotations, broaden the model’s coverage.
SFT (Supervised Fine‑Tuning): 7.3M hard samples are selected via Uncertainty‑Aware Cluster Sampling (UACS), expert‑disagreement cases, and corrected unreliable‑supervision samples to sharpen performance on fragile regions.
RL (Reinforcement Learning): 49K samples are processed with GRPO. A high‑potential sample mining strategy evaluates candidates on learning potential, uncertainty, and reward variance, ensuring only informative samples influence the 0.9B model.
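To see why reward variance matters for the RL stage, recall that GRPO scores each sampled response relative to the other responses drawn for the same input. The sketch below shows that group‑relative normalization and a variance‑based filter. It is a simplified illustration, not the article's mining strategy: the function names and the `min_variance` threshold are hypothetical, and only reward variance is modeled (the learning‑potential and uncertainty criteria are omitted).

```python
import statistics
from typing import Dict, List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: each reward is normalized by the mean and
    standard deviation of its own group of sampled responses."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def select_informative(
    rollout_rewards: Dict[str, List[float]],
    min_variance: float = 1e-4,  # hypothetical threshold
) -> List[str]:
    """Keep samples whose rollouts disagree in reward, highest variance first.

    If every rollout of a sample earns the same reward, all of its
    advantages are zero and the sample contributes no gradient signal.
    """
    scored = [
        (statistics.pvariance(rewards), sample_id)
        for sample_id, rewards in rollout_rewards.items()
        if len(rewards) > 1
    ]
    ranked = sorted(scored, key=lambda item: -item[0])
    return [sample_id for variance, sample_id in ranked if variance >= min_variance]
```

For example, a sample whose four rollouts all score 1.0 is dropped, while one whose rollouts score 0, 1, 0, 1 is ranked first because the model is still undecided on it.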
Reward functions map complex document‑parsing outputs to verifiable signals that combine output validity, structural‑correctness constraints, and actual metric scores.
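As a rough illustration of such a verifiable reward, the sketch below scores a predicted HTML table against a reference. It assumes a three‑part design that is not confirmed by the source: a validity gate (malformed markup earns zero), a structure term (same number of cells in every row), and a content term (text similarity). The names `parse_table` and `table_reward`, the equal weighting, and the use of `difflib` in place of a real metric such as TEDS are all assumptions.

```python
import difflib
from html.parser import HTMLParser
from typing import List, Optional


class _TableParser(HTMLParser):
    """Collects rows of cell text and tracks whether table tags are balanced."""

    TRACKED = ("table", "tr", "td", "th")

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []
        self.rows: List[List[str]] = []
        self.balanced = True

    def handle_starttag(self, tag, attrs):
        if tag not in self.TRACKED:
            return
        self.stack.append(tag)
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag not in self.TRACKED:
            return
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()
        else:
            self.balanced = False

    def handle_data(self, data):
        in_cell = bool(self.stack) and self.stack[-1] in ("td", "th")
        if in_cell and self.rows and self.rows[-1]:
            self.rows[-1][-1] += data.strip()


def parse_table(html: str) -> Optional[List[List[str]]]:
    """Return rows of cell text, or None if the markup is not well-formed."""
    parser = _TableParser()
    parser.feed(html)
    parser.close()
    if not parser.balanced or parser.stack or not parser.rows:
        return None
    return parser.rows


def table_reward(prediction: str, reference: str) -> float:
    """Reward in [0, 1]: validity gate, then structure and content terms."""
    ref_rows = parse_table(reference)
    if ref_rows is None:
        raise ValueError("reference table must be well-formed")
    pred_rows = parse_table(prediction)
    if pred_rows is None:
        return 0.0  # validity gate: unparseable output earns nothing
    same_shape = [len(r) for r in pred_rows] == [len(r) for r in ref_rows]
    structure = 1.0 if same_shape else 0.0
    pred_text = "\n".join("\t".join(row) for row in pred_rows)
    ref_text = "\n".join("\t".join(row) for row in ref_rows)
    content = difflib.SequenceMatcher(None, pred_text, ref_text).ratio()
    return 0.5 * structure + 0.5 * content
```

Gating on validity first keeps the policy from collecting partial credit for output that downstream tools could not parse at all.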
Model Architecture
The system consists of two models: PP‑DocLayout V3 for layout analysis and PaddleOCR‑VL‑1.6‑0.9B for visual‑language understanding. The architecture mirrors PaddleOCR‑VL‑1.5, featuring a Native Resolution Visual Encoder, Adaptive MLP Connector, and ERNIE‑4.5‑0.3B language model. No architectural changes or parameter increases were made; gains stem solely from smarter data strategies and refined training.
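The division of labor between the two models can be sketched as a two‑stage function: a layout step proposes regions, and a recognition step turns each region into text or markup. This is only a structural illustration. `Region`, `parse_page`, `detect_layout`, and `recognize` are hypothetical names, not PaddleOCR APIs, and the assumption that the layout step supplies a reading‑order index for each region is made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels


@dataclass(frozen=True)
class Region:
    box: Box
    kind: str   # e.g. "text", "table", "formula"
    order: int  # reading-order index assumed to come from the layout step


def parse_page(
    page: object,
    detect_layout: Callable[[object], List[Region]],
    recognize: Callable[[object, Region], str],
) -> List[Tuple[str, str]]:
    """Run layout analysis, then recognize each region in reading order."""
    regions = sorted(detect_layout(page), key=lambda region: region.order)
    return [(region.kind, recognize(page, region)) for region in regions]
```

Passing the two stages in as callables keeps the sketch independent of any real model; stubs are enough to exercise the control flow.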
Benchmark Results
OmniDocBench v1.6 adds Multi‑Granularity Adaptive Matching (MGAM) and a 296‑page Hard subset covering nested tables, dense formulas, and unconventional structures. PaddleOCR‑VL‑1.6 achieved:
Total score: 96.33% (↑1.40 pts over PaddleOCR‑VL‑1.5)
Text edit distance: 0.033 (lower is better)
Formula CDM: 97.49%
Table TEDS: 94.76%
Table‑structure TEDS: 97.11%
Reading‑order edit distance: 0.127 (lower is better)
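The edit‑distance figures above are easier to read with the computation in front of you. The sketch below shows a character‑level Levenshtein distance divided by the length of the longer string, so 0 means an exact match and 1 means nothing in common. The normalization is an assumption; the benchmark's exact preprocessing and normalization may differ.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(prediction: str, reference: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(prediction), len(reference))
    if longest == 0:
        return 0.0
    return edit_distance(prediction, reference) / longest
```

Under this definition, a text edit distance of 0.033 corresponds to roughly 3 character edits per 100 characters of output.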
On Real5‑OmniDocBench, which simulates real‑world captures (scanning, bending, phone photos, lighting changes, and tilt), PaddleOCR‑VL‑1.6 scored 93.19% (↑1.14 pts), outranking Qwen3‑VL‑235B, InternVL3.5‑241B, the 1T‑parameter Kimi K2.5, and GPT‑5.2.
Task‑Specific Gains
Hard table recognition (1,258 samples, 20 table types): TEDS 91.71, structure TEDS 94.67 (≈+2 pts over MinerU2.5‑Pro).
Chart parsing (1,801 samples, 11 chart types): RMS‑F1 91.74, Chinese chart F1 93.37 (↑11 pts over the previous generation).
Text localization (9 dimensions): total 87.47, with improvements on ancient scripts, Japanese, and handwritten Chinese.
Seal recognition: NED 0.119 (lower is better), far ahead of Qwen3‑VL‑235B’s 0.382.
The methodology (diagnosing weak zones, augmenting data precisely, and applying progressive CPT‑SFT‑RL training) shows a viable path for continuing to improve compact models, which in turn enables low‑cost deployment on edge devices for document digitization, invoice processing, and archive management.
References
https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6
https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL-1.6
https://github.com/PaddlePaddle/PaddleOCR
https://arxiv.org/pdf/2606.03264
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
