PaddleOCR‑VL‑1.5: 0.9B Model Beats Billion‑Parameter OCR Models with 94.5% Accuracy

PaddleOCR‑VL‑1.5, the latest Baidu release, uses only 0.9 B parameters to achieve 94.5% accuracy on OmniDocBench v1.5, surpassing larger open‑source and commercial OCR models, while offering multi‑task, multi‑language support, lightweight deployment, and detailed performance benchmarks.

Old Zhang's AI Learning

Model Overview

PaddleOCR-VL-1.5 is a visual‑language model fine‑tuned from ERNIE‑4.5‑0.3B, containing 0.9 B parameters. It targets multi‑task document parsing in real‑world conditions.

PaddleOCR-VL-1.5 model architecture and task overview

Core Capabilities

94.5 % overall accuracy on OmniDocBench v1.5, with state‑of‑the‑art (SOTA) performance on table, formula, and text recognition.

Robustness across five adverse scenarios—scanning, skew, curvature, screen capture, and uneven lighting—each achieving SOTA on the Real5‑OmniDocBench benchmark.

Added multi‑task support: Text Spotting and Seal Recognition, both SOTA on their respective tasks.

Multi‑language support for Chinese, English, Tibetan, and Bengali, including rare characters and ancient scripts.

Long‑document handling with automatic cross‑page table merging and paragraph‑title recognition.
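To make the cross-page table merging idea concrete, here is a minimal stand-alone sketch. It is purely illustrative and is not PaddleOCR's actual implementation: it joins a Markdown table split across two pages by dropping the header row repeated on the second page.

```python
# Illustrative sketch only: PaddleOCR-VL-1.5 merges cross-page tables
# internally; this toy function shows the general idea of joining a Markdown
# table that was split across two pages by removing the duplicated header.

def merge_split_table(page1_md: str, page2_md: str) -> str:
    """Concatenate two Markdown table fragments, skipping a repeated header."""
    lines1 = page1_md.strip().splitlines()
    lines2 = page2_md.strip().splitlines()
    # If page 2 repeats the header row and its separator, drop both.
    if len(lines2) >= 2 and lines2[0] == lines1[0] and set(lines2[1]) <= set("|-: "):
        lines2 = lines2[2:]
    return "\n".join(lines1 + lines2)

page1 = "| Item | Qty |\n| --- | --- |\n| Apples | 3 |"
page2 = "| Item | Qty |\n| --- | --- |\n| Pears | 5 |"
print(merge_split_table(page1, page2))
```

The real model also has to handle tables whose column widths or spanning cells differ between pages, which this sketch deliberately ignores.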

Installation

Install PaddlePaddle 3.2.1+ (CUDA 12.6) and PaddleOCR with the doc‑parser extra:

# Install PaddlePaddle (CUDA 12.6)
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/
# Install PaddleOCR
python -m pip install -U "paddleocr[doc-parser]"

macOS requires a Docker environment.

Usage Options

Command‑line

paddleocr doc_parser -i your_document.png

Python API

from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL()
output = pipeline.predict("your_document.png")
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
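The JSON written by `save_to_json` can then be post-processed with the standard library. The schema below (a `"blocks"` list with `"type"` and `"text"` fields) is a hypothetical stand-in for illustration, not the pipeline's documented output format:

```python
# Hypothetical post-processing sketch: the field names "blocks", "type", and
# "text" are assumed for illustration and may differ from the real schema
# written by save_to_json.
import json

result = {
    "blocks": [
        {"type": "title", "text": "Quarterly Report"},
        {"type": "paragraph", "text": "Revenue grew 12% year over year."},
    ]
}

def extract_text(doc: dict) -> str:
    """Join the text of all recognized blocks into one plain-text string."""
    return "\n".join(block["text"] for block in doc["blocks"])

print(extract_text(result))
```

In practice you would load the saved file with `json.load` first and inspect its actual structure before writing extraction code.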

vLLM high‑performance inference

docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8080 --backend vllm

Then point the pipeline at the running server:

from paddleocr import PaddleOCRVL
pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
output = pipeline.predict("your_document.png")

Transformers direct loading

from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
image = Image.open("test.png").convert("RGB")
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "OCR:"}
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[-1]:]
)
print(result)

Requires transformers ≥ 5.0.0.
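A note on the `outputs[0][inputs["input_ids"].shape[-1]:]` slice in the example above: `generate()` returns the prompt tokens followed by the newly generated tokens, so decoding the full sequence would repeat the prompt. A minimal sketch with plain Python lists standing in for token tensors:

```python
# generate() hands back prompt tokens plus completion tokens; slicing off the
# prompt length keeps only the newly generated part. Token IDs below are
# arbitrary placeholders.
prompt_ids = [101, 2023, 2003, 102]        # tokens fed to the model
generated = prompt_ids + [7592, 2088, 103] # what generate() returns

new_tokens = generated[len(prompt_ids):]   # keep only the completion
print(new_tokens)  # [7592, 2088, 103]
```

This is exactly what `inputs["input_ids"].shape[-1]` accomplishes on the real tensor: it gives the prompt length used as the slice offset.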

Benchmark Results

Official OmniDocBench v1.5 results:

Overall accuracy: 94.5 % (SOTA)

Table, formula, text recognition: SOTA

Reading order: SOTA

Real5‑OmniDocBench (five adverse conditions) results: SOTA for scanning, skew, curvature, screen capture, and uneven lighting.

On a single A100 GPU, inference speed over a 512‑page PDF batch is reported to lead comparable models.

OmniDocBench v1.5 performance comparison
Inference speed comparison

Comparison with DeepSeek‑OCR

Higher robustness in real‑world scenarios.

Model size: 0.9 B vs. 7 B+, reducing deployment cost.

More comprehensive multi‑task abilities, including seal recognition and text spotting.

Limitations

Transformers interface supports only element‑level recognition; page‑level parsing is recommended via the official API.

macOS deployment requires Docker.

GPU acceleration yields the best performance.
