How PaddleOCR‑VL‑1.6’s 0.9B Model Achieved 96.33% SOTA on OmniDocBench v1.6

PaddleOCR‑VL‑1.6, a compact 0.9B visual‑language model, diagnoses three types of weak regions, enriches targeted data, and applies a three‑stage CPT‑SFT‑RL training pipeline to reach a 96.33% overall score on OmniDocBench v1.6, surpassing much larger models across all document‑parsing tasks.

SuanNi

Document Parsing and Model Overview

Document parsing converts scanned pages into machine‑readable structures (text, tables, formulas, charts, seals, reading order, layout). A compact 0.9 B‑parameter model, PaddleOCR‑VL‑1.6, achieved a 96.33 % overall score on OmniDocBench v1.6, leading the leaderboard.

Diagnosing Weak Points

Analysis of the predecessor PaddleOCR‑VL‑1.5 (94.93 % on OmniDocBench v1.5) showed that remaining errors cluster in three regions:

Boundary‑Fragile Regions: Minor visual perturbations (pixel shift, JPEG compression, slight blur) cause large output changes, indicating unstable decision boundaries.

Coverage‑Sparse Regions: Some samples are mispredicted even though they appear in the training set, because similar examples are scarce and long‑tail patterns are overwhelmed by the dominant distributions.

Unreliable‑Supervision Regions: High‑confidence errors arise from incorrect labels. Three external expert models (Qianfan‑OCR, GLM‑OCR, MinerU2.5‑Pro) are used to cross‑validate and correct these labels.
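The cross‑validation of suspect labels can be pictured as a vote among expert outputs. The following is a minimal sketch, assuming exact‑match majority voting after whitespace normalization. The source does not describe the actual agreement rule, so the function names, the `min_agree` threshold, and the three status values are hypothetical.

```python
from collections import Counter


def normalize(text: str) -> str:
    """Collapse whitespace so trivial formatting differences are not counted as disagreement."""
    return " ".join(text.split())


def cross_validate_label(label: str, expert_outputs: list[str], min_agree: int = 2):
    """Return (final_label, status) for one training sample.

    status is "kept" when enough experts back the existing label,
    "corrected" when enough experts agree on a different answer,
    and "disputed" when there is no consensus (a candidate for manual review).
    The voting rule is an assumption, not the published procedure.
    """
    if not expert_outputs:
        return label, "disputed"
    votes = Counter(normalize(o) for o in expert_outputs)
    top, count = votes.most_common(1)[0]
    if count < min_agree:
        return label, "disputed"
    if top == normalize(label):
        return label, "kept"
    return top, "corrected"
```

With three experts and `min_agree=2`, a label is only overwritten when at least two experts agree on the same alternative; everything else is left alone or flagged.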

Targeted high‑value annotations were added for each region.
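The boundary‑fragile diagnosis described above can be approximated by comparing a model's output on a sample with its outputs on lightly perturbed copies. This is only a sketch under stated assumptions: `predict` and the perturbation callables are hypothetical stand‑ins for the real model and image transforms, and normalized edit distance is used as the change measure because the source does not say which measure was used.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling row of the DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def normalized_edit_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))


def fragility_score(sample: T,
                    predict: Callable[[T], str],
                    perturbations: Sequence[Callable[[T], T]]) -> float:
    """Largest output change caused by any single perturbation.

    A score near 0 means the prediction is stable; a score near 1 means
    a small input change rewrote most of the output.
    """
    base = predict(sample)
    changes = [normalized_edit_distance(base, predict(p(sample)))
               for p in perturbations]
    return max(changes) if changes else 0.0
```

Samples whose score exceeds some chosen threshold would be the candidates for targeted annotation; the threshold itself is a tuning decision the source does not specify.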

Three‑Step Training Pipeline

The optimization framework combines a model‑driven data engine with a progressive post‑training strategy:

CPT (Continued Pre‑Training): 16.8 M samples, including long‑tail documents (ancient books, rare characters, industrial tables) and corrected annotations, broaden the model’s coverage.

SFT (Supervised Fine‑Tuning): 7.3 M hard samples are selected via Uncertainty‑Aware Cluster Sampling (UACS), expert‑disagreement cases, and corrected unreliable‑supervision samples to sharpen performance on fragile regions.

RL (Reinforcement Learning): 49 K samples are processed with GRPO. A high‑potential sample mining strategy evaluates candidates on learning potential, uncertainty, and reward variance, ensuring only informative samples influence the 0.9 B model.
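To see why reward variance matters for the RL stage, recall that GRPO scores each sampled response relative to the other responses in its group. The sketch below shows that normalization and a simple variance filter. It covers only the reward‑variance criterion; the learning‑potential and uncertainty criteria are not modelled, and the `min_std` threshold and function names are assumptions rather than the published mining strategy.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: each reward standardized within its group.

    Population standard deviation is used here; implementations differ
    on population versus sample standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def has_learning_signal(rewards: list[float], min_std: float = 0.05) -> bool:
    """A group whose rewards are all (nearly) equal yields (nearly) zero advantages."""
    return pstdev(rewards) >= min_std


def mine_informative(prompt_rewards: dict[str, list[float]],
                     min_std: float = 0.05) -> list[str]:
    """Keep prompts whose sampled responses disagree enough in reward."""
    return [pid for pid, rewards in prompt_rewards.items()
            if has_learning_signal(rewards, min_std)]
```

A prompt the model always solves, or always fails, produces identical rewards and therefore zero advantages, so it contributes no gradient. Filtering such prompts out is one plausible reading of "only informative samples influence the model".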

Reward functions map complex document‑parsing outputs to verifiable signals along three dimensions: legality (whether the output is well formed), structural correctness constraints, and the actual task‑metric score.
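One way such a verifiable reward could be assembled for table outputs is sketched below. Everything here is illustrative: the legality check only tracks a few HTML table tags, the structure term compares row and cell counts, and `difflib.SequenceMatcher` stands in for a real metric such as TEDS. The weights and function names are assumptions, not the rewards used in the paper.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TableChecker(HTMLParser):
    """Tracks nesting of table tags and counts rows and cells."""
    TRACKED = {"table", "tr", "td", "th"}

    def __init__(self):
        super().__init__()
        self.stack, self.balanced = [], True
        self.rows = self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self.stack.append(tag)
            self.rows += tag == "tr"
            self.cells += tag in ("td", "th")

    def handle_endtag(self, tag):
        if tag in self.TRACKED:
            if self.stack and self.stack[-1] == tag:
                self.stack.pop()
            else:
                self.balanced = False


def inspect_table(html: str):
    """Return (is_legal, row_count, cell_count) for an HTML table string."""
    checker = _TableChecker()
    checker.feed(html)
    checker.close()
    legal = checker.balanced and not checker.stack and checker.rows > 0
    return legal, checker.rows, checker.cells


def table_reward(pred: str, ref: str, w_struct: float = 0.3, w_score: float = 0.7) -> float:
    """Legality gates the reward; structure and similarity are then blended."""
    legal, rows, cells = inspect_table(pred)
    if not legal:
        return 0.0  # malformed output earns nothing
    _, ref_rows, ref_cells = inspect_table(ref)
    struct = 1.0 if (rows, cells) == (ref_rows, ref_cells) else 0.0
    score = SequenceMatcher(None, pred, ref).ratio()  # stand-in for TEDS
    return w_struct * struct + w_score * score
```

Gating on legality first keeps the policy from collecting partial credit for outputs that a downstream parser could not consume at all.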

Model Architecture

The system consists of two models: PP‑DocLayout V3 for layout analysis and PaddleOCR‑VL‑1.6‑0.9B for visual‑language understanding. The architecture mirrors PaddleOCR‑VL‑1.5, featuring a Native Resolution Visual Encoder, Adaptive MLP Connector, and ERNIE‑4.5‑0.3B language model. No architectural changes or parameter increases were made; gains stem solely from smarter data strategies and refined training.
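The division of labour between the two models can be sketched as a two‑stage pipeline. The interfaces below are hypothetical: `detect_layout` stands in for the layout model and `recognize` for the visual‑language model, and the assumption that the layout stage supplies a reading‑order index is ours, not something the source states. Neither function reflects the real PaddleOCR API.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Region:
    kind: str                         # e.g. "text", "table", "formula"
    box: tuple[int, int, int, int]    # x0, y0, x1, y1 in page pixels
    order: int                        # reading-order index (assumed to come from layout)


def parse_page(page,
               detect_layout: Callable[[object], Iterable[Region]],
               recognize: Callable[[object, Region], str]) -> list[tuple[str, str]]:
    """Stage 1 finds regions; stage 2 recognizes each one in reading order."""
    regions = sorted(detect_layout(page), key=lambda r: r.order)
    return [(r.kind, recognize(page, r)) for r in regions]
```

Keeping the stages behind plain callables mirrors the point made above: the recognizer can be retrained on better data without touching the surrounding structure.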

Benchmark Results

OmniDocBench v1.6 adds Multi‑Granularity Adaptive Matching (MGAM) and a 296‑page Hard subset covering nested tables, dense formulas, and unconventional structures. PaddleOCR‑VL‑1.6 achieved:

Total score: 96.33 % (↑1.4 pts vs. PaddleOCR‑VL‑1.5)

Text edit distance: 0.033

Formula CDM: 97.49 %

Table TEDS: 94.76 %

Table‑structure TEDS: 97.11 %

Reading‑order score: 0.127
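As a rough sanity check on how the component metrics relate to the total, the sketch below averages the text, formula, and table results. It assumes v1.6 keeps the overall‑score formula used by OmniDocBench v1.5, namely the mean of (1 − text edit distance) × 100, formula CDM, and table TEDS; the source does not state this, so treat the formula as an assumption.

```python
def overall_score(text_edit: float, formula_cdm: float, table_teds: float) -> float:
    """Mean of three components on a 0-100 scale (assumed v1.5-style formula)."""
    return ((1.0 - text_edit) * 100.0 + formula_cdm + table_teds) / 3.0


# Values reported above for PaddleOCR-VL-1.6 on OmniDocBench v1.6.
score = overall_score(text_edit=0.033, formula_cdm=97.49, table_teds=94.76)
print(f"{score:.2f}")  # about 96.32
```

Under that assumption the result lands within about 0.01 of the reported 96.33 %, a gap small enough to be explained by the text edit distance being rounded to three decimals.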

On Real5‑OmniDocBench (simulating real‑world captures with scanning, bending, phone photos, lighting changes, and tilt), PaddleOCR‑VL‑1.6 scored 93.19 % (↑1.14 pts), outranking far larger models: Qwen3‑VL (235 B), InternVL3.5 (241 B), Kimi K2.5 (1 T), and GPT‑5.2.

Task‑Specific Gains

Hard table recognition (1 258 samples, 20 table types): TEDS 91.71, structure TEDS 94.67 (≈+2 pts over MinerU2.5‑Pro).

Chart parsing (1 801 samples, 11 chart types): RMS‑F1 91.74, Chinese chart F1 93.37 (↑11 pts over previous generation).

Text localization (9 dimensions): total 87.47, with improvements on ancient scripts, Japanese, and handwritten Chinese.

Seal recognition: NED 0.119, far better than Qwen3‑VL‑235B’s 0.382.

The methodology of diagnosing weak regions, augmenting data precisely where it is needed, and applying progressive CPT‑SFT‑RL training offers a viable path for continuously improving compact models. That in turn enables low‑cost deployment on edge devices for document digitization, invoice processing, and archive management.

References

https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.6

https://modelscope.cn/models/PaddlePaddle/PaddleOCR-VL-1.6

https://github.com/PaddlePaddle/PaddleOCR

https://arxiv.org/pdf/2606.03264
