PaddleOCR‑VL‑1.5: 0.9B Model Beats Billion‑Parameter OCR Models with 94.5% Accuracy
PaddleOCR‑VL‑1.5, Baidu's latest release, reaches 94.5% accuracy on OmniDocBench v1.5 with only 0.9 B parameters, surpassing larger open‑source and commercial OCR models while offering multi‑task, multi‑language support and lightweight deployment; detailed benchmark results are included below.
Model Overview
PaddleOCR-VL-1.5 is a visual‑language model fine‑tuned from ERNIE‑4.5‑0.3B, containing 0.9 B parameters. It targets multi‑task document parsing in real‑world conditions.
Core Capabilities
94.5 % overall accuracy on OmniDocBench v1.5, with state‑of‑the‑art (SOTA) performance on table, formula, and text recognition.
Robustness across five adverse scenarios—scanning, skew, curvature, screen capture, and uneven lighting—each achieving SOTA on the Real5‑OmniDocBench benchmark.
Added multi‑task support: Text Spotting and Seal Recognition, both SOTA on their respective tasks.
Multi‑language support for Chinese, English, Tibetan, and Bengali, including rare characters and ancient scripts.
Long‑document handling with automatic cross‑page table merging and paragraph‑title recognition.
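To picture what cross‑page table merging involves, here is a simplified, standalone sketch. This is an illustration of the idea only, not PaddleOCR's actual implementation, and the helper name `merge_cross_page_tables` is hypothetical: it appends continuation rows from the next page to a Markdown table that ends the previous page, dropping a repeated header if one appears.

```python
def merge_cross_page_tables(pages):
    """Merge Markdown table fragments that continue across page breaks.

    `pages` is a list of per-page Markdown strings. If one page ends with
    table rows and the next page begins with rows of the same column count,
    the continuation rows are appended to the first table (dropping a
    repeated header + separator, if present). Simplified illustration only.
    """
    def is_row(line):
        s = line.strip()
        return s.startswith("|") and s.endswith("|")

    def cols(line):
        return line.strip().count("|") - 1

    merged = pages[0].rstrip("\n").split("\n")
    for page in pages[1:]:
        lines = page.rstrip("\n").split("\n")
        if merged and lines and is_row(merged[-1]) and is_row(lines[0]) \
                and cols(merged[-1]) == cols(lines[0]):
            # Drop a repeated header row plus its separator (e.g. | --- | --- |).
            if len(lines) >= 2 and set(lines[1].strip()) <= set("|-: "):
                lines = lines[2:]
            merged.extend(lines)
        else:
            merged.append("")  # paragraph break between unrelated pages
            merged.extend(lines)
    return "\n".join(merged)
```

A real implementation also has to reconcile column alignment and detect tables that merely happen to be adjacent; the sketch only captures the core row-stitching step.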
Installation
Install PaddlePaddle 3.2.1+ (CUDA 12.6) and PaddleOCR with the doc‑parser extra:
```shell
# Install PaddlePaddle (CUDA 12.6)
python -m pip install paddlepaddle-gpu==3.2.1 -i https://www.paddlepaddle.org.cn/packages/stable/cu126/

# Install PaddleOCR with the doc-parser extra
python -m pip install -U "paddleocr[doc-parser]"
```

On macOS, a Docker environment is required.
Usage Options
Command‑line

```shell
paddleocr doc_parser -i your_document.png
```

Python API
```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL()
output = pipeline.predict("your_document.png")
for res in output:
    res.print()
    res.save_to_json(save_path="output")
    res.save_to_markdown(save_path="output")
```

vLLM high‑performance inference
```shell
docker run \
    --rm \
    --gpus all \
    --network host \
    ccr-2vdh3abv-pub.cnc.baidubce.com/paddlepaddle/paddleocr-genai-vllm-server:latest-nvidia-gpu \
    paddleocr genai_server --model_name PaddleOCR-VL-1.5-0.9B --host 0.0.0.0 --port 8080 --backend vllm
```

Then point the pipeline at the running server:

```python
from paddleocr import PaddleOCRVL

pipeline = PaddleOCRVL(vl_rec_backend="vllm-server", vl_rec_server_url="http://127.0.0.1:8080/v1")
output = pipeline.predict("your_document.png")
```

Transformers direct loading
```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

model_path = "PaddlePaddle/PaddleOCR-VL-1.5"
image = Image.open("test.png").convert("RGB")

# Load the model in bfloat16 on the GPU.
model = AutoModelForImageTextToText.from_pretrained(
    model_path, torch_dtype=torch.bfloat16
).to("cuda").eval()
processor = AutoProcessor.from_pretrained(model_path)

messages = [{"role": "user", "content": [
    {"type": "image", "image": image},
    {"type": "text", "text": "OCR:"}
]}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, dropping special tokens.
result = processor.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(result)
```

Requires transformers ≥ 5.0.0.
Benchmark Results
Official OmniDocBench v1.5 results:
Overall accuracy: 94.5 % (SOTA)
Table, formula, text recognition: SOTA
Reading order: SOTA
Real5‑OmniDocBench (five adverse conditions) results: SOTA for scanning, skew, curvature, screen capture, and uneven lighting.
On a single A100 GPU, inference speed for a 512‑page PDF batch is reported to lead comparable models.
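Reported throughput figures are best validated on your own hardware. A minimal, generic timing harness (not part of PaddleOCR) can measure pages per second; `process_page` is any callable, for example a wrapper around `pipeline.predict`:

```python
import time

def measure_throughput(process_page, pages):
    """Run process_page over each page and return pages per second."""
    start = time.perf_counter()
    for page in pages:
        process_page(page)
    elapsed = time.perf_counter() - start
    # Guard against a zero-length timing window.
    return len(pages) / elapsed if elapsed > 0 else float("inf")

# Example with a trivial stand-in workload instead of a real OCR call.
rate = measure_throughput(lambda page: sum(range(1000)), list(range(64)))
print(f"{rate:.1f} pages/s")
```

For a fair comparison, warm up the model first (the initial call includes compilation and weight-loading overhead) and average over several runs.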
Comparison with DeepSeek‑OCR
Higher robustness in real‑world scenarios.
Model size: 0.9 B vs. 7 B+, reducing deployment cost.
More comprehensive multi‑task abilities, including seal recognition and text spotting.
Limitations
Transformers interface supports only element‑level recognition; page‑level parsing is recommended via the official API.
macOS deployment requires Docker.
GPU acceleration yields the best performance.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.