Why GLM-OCR Leads OCR Benchmarks: 0.9B Model Tops OmniDocBench
GLM-OCR, a 0.9B‑parameter multimodal OCR model from Zhipu, achieves the highest score (94.62) on OmniDocBench V1.5, offers lightweight deployment via vLLM, Ollama, API and SDK, and outperforms larger rivals like DeepSeek‑OCR and PaddleOCR in speed and accuracy.
Model Overview
GLM-OCR is an open‑source OCR model built on a GLM‑V encoder‑decoder architecture with 0.9 B parameters. It combines a CogViT visual encoder pretrained on large image‑text data, a lightweight cross‑modal connector that performs token down‑sampling, a GLM‑0.5B language decoder for text generation, and the PP‑DocLayout‑V3 layout analyzer (two‑stage layout analysis + parallel recognition pipeline). Multi‑token prediction (MTP) loss and stable full‑task reinforcement learning are used to improve training efficiency, accuracy and generalisation.
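The data flow can be pictured as a three‑stage pipeline: the visual encoder produces image tokens, the connector down‑samples them, and the language decoder generates text. The sketch below is purely illustrative; the CrossModalConnector module and the toy tensor shapes are hypothetical stand‑ins, not the actual GLM-OCR implementation.
# Illustrative only: hypothetical module sketching the encoder -> connector -> decoder flow
import torch
import torch.nn as nn

class CrossModalConnector(nn.Module):
    """Down-samples visual tokens before they reach the language decoder (hypothetical)."""
    def __init__(self, dim: int, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Linear(dim * downsample, dim)

    def forward(self, vis_tokens: torch.Tensor) -> torch.Tensor:
        b, n, d = vis_tokens.shape
        n = (n // self.downsample) * self.downsample          # drop any remainder tokens
        grouped = vis_tokens[:, :n].reshape(b, n // self.downsample, d * self.downsample)
        return self.proj(grouped)                             # fewer, wider tokens projected back to dim

# Toy shapes: one image, 1024 visual tokens of width 768 (stand-in for the CogViT encoder output)
vis_tokens = torch.randn(1, 1024, 768)
connector = CrossModalConnector(dim=768, downsample=4)
print(connector(vis_tokens).shape)                            # torch.Size([1, 256, 768]), fed to the 0.5B decoder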
Benchmark Performance
On OmniDocBench V1.5 the model achieves 94.62 points, ranking first among evaluated OCR systems. It surpasses DeepSeek‑OCR and Hunyuan OCR on formula, table and information‑extraction tasks.
Throughput on identical hardware (single replica, single concurrency):
PDF documents – 1.86 pages/second
Images – 0.67 images/second
The high speed is attributed to the small parameter count, which reduces inference overhead.
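As a quick back‑of‑the‑envelope reading of those rates (the workload sizes below are made up for illustration):
pdf_pages_per_s = 1.86
images_per_s = 0.67

# Hypothetical workloads, single replica, single concurrency
pdf_pages = 500
images = 200

print(f"{pdf_pages}-page PDF: ~{pdf_pages / pdf_pages_per_s:.0f} s")   # ~269 s
print(f"{images} images:      ~{images / images_per_s:.0f} s")         # ~299 s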
Key Features
Small footprint – 0.9 B parameters enable deployment on consumer‑grade GPUs with vLLM, SGLang or Ollama.
Real‑world optimisation – Handles complex tables, code documents and seal recognition that cause failures in other OCR models.
Open‑source SDK and inference toolchain – One‑line integration into existing pipelines.
Deployment Options
1. Cloud API (quick start)
# Install SDK
pip install zai-sdk
# Python call
from zai import ZaiClient
client = ZaiClient(api_key="your-api-key")
response = client.layout_parsing.create(
    model="glm-ocr",
    file="https://cdn.bigmodel.cn/static/logo/introduction.png"
)
print(response)
API keys can be obtained from https://open.bigmodel.cn.
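For batches of documents, the same create call can be wrapped in a small loop with basic error handling. This is a minimal sketch reusing only the SDK calls shown above; the file list, retry count and backoff are illustrative.
import time
from zai import ZaiClient

client = ZaiClient(api_key="your-api-key")

# Hypothetical list of documents to parse
files = [
    "https://cdn.bigmodel.cn/static/logo/introduction.png",
]

for url in files:
    for attempt in range(3):                      # simple retry on transient failures
        try:
            response = client.layout_parsing.create(model="glm-ocr", file=url)
            print(response)
            break
        except Exception as exc:                  # SDK-specific exception types not assumed here
            print(f"attempt {attempt + 1} failed for {url}: {exc}")
            time.sleep(2 ** attempt)              # exponential backoff: 1 s, 2 s, 4 s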
2. Ollama (local, one‑click)
# Run model
ollama run glm-ocr
# Recognise an image (drag‑and‑drop path)
ollama run glm-ocr "Text Recognition: ./image.png"
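Ollama also exposes a local REST API (port 11434 by default), so the same recognition can be scripted. A minimal sketch using the standard /api/generate endpoint with a base64‑encoded image; it assumes the glm-ocr model has already been pulled locally.
import base64
import requests

# Encode the image as base64, as required by Ollama's /api/generate endpoint
with open("./image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "glm-ocr",
        "prompt": "Text Recognition:",   # prompt format from the section below
        "images": [image_b64],
        "stream": False,
    },
)
print(resp.json()["response"])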
3. vLLM (production‑grade)
# Install vLLM
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly
# Install latest Transformers from source
pip install git+https://github.com/huggingface/transformers.git
# Start service
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080
Docker alternative:
docker pull vllm/vllm-openai:nightly
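Once the server is running (natively or via Docker), it exposes an OpenAI‑compatible API, so a standard OpenAI client can send images to it. A minimal sketch assuming the server from the command above on port 8080; whether GLM‑OCR expects exactly this prompt wording is an assumption based on the prompt formats described later.
from openai import OpenAI

# vLLM serves an OpenAI-compatible endpoint on the port passed to `vllm serve`
client = OpenAI(base_url="http://localhost:8080/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="zai-org/GLM-OCR",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://cdn.bigmodel.cn/static/logo/introduction.png"}},
            {"type": "text", "text": "Text Recognition:"},
        ],
    }],
)
print(response.choices[0].message.content)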
4. SGLang
# Docker install
docker pull lmsysorg/sglang:dev
# Or source install
pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
# Launch server
python -m sglang.launch_server --model-path zai-org/GLM-OCR --port 8080
SDK and CLI Usage
# Clone repository and install in editable mode
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr && pip install -e .
# Install Transformers from source
pip install git+https://github.com/huggingface/transformers.git
Command‑line examples:
# Parse a single image
glmocr parse examples/source/code.png
# Parse a directory
glmocr parse examples/source/
# Specify output directory
glmocr parse examples/source/code.png --output ./results/
Python API example:
from glmocr import GlmOcr, parse
# Simple parse
result = parse("image.png")
result.save(output_dir="./results")
# Class‑style usage
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()
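To process a whole folder from Python rather than the CLI, the class‑style interface can be reused in a loop. A minimal sketch; the ./scans directory is made up, and only the parse/save/json_result calls shown above are assumed.
from pathlib import Path
from glmocr import GlmOcr

# Hypothetical input folder containing scanned pages
images = sorted(Path("./scans").glob("*.png"))

with GlmOcr() as parser:
    for image in images:
        result = parser.parse(str(image))
        result.save(output_dir="./results")
        print(f"parsed {image.name}")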
Prompt Formats
Two supported scenarios:
Document parsing – extract raw text, formulas and tables using a simple JSON mapping, e.g.
{
  "text": "Text Recognition:",
  "formula": "Formula Recognition:",
  "table": "Table Recognition:"
}
Information extraction – output must follow a strict JSON schema, for example:
{
  "id_number": "",
  "last_name": "",
  "first_name": "",
  "date_of_birth": "",
  "address": {
    "street": "",
    "city": "",
    "state": "",
    "zip_code": ""
  },
  "dates": {
    "issue_date": "",
    "expiration_date": ""
  },
  "sex": ""
}
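How the schema reaches the model depends on the interface; with the OpenAI‑compatible endpoints shown earlier, one option is simply to embed the schema in the prompt text. A minimal sketch; the exact prompt wording here is an assumption, not the documented GLM‑OCR format.
import json

# Target schema for an ID-card style document (same structure as above, truncated for brevity)
schema = {
    "id_number": "",
    "last_name": "",
    "first_name": "",
    "date_of_birth": "",
}

# Hypothetical prompt wording: ask the model to fill the schema and return JSON only
prompt = (
    "Information Extraction: fill in the following JSON schema from the document "
    "and return only valid JSON.\n" + json.dumps(schema, indent=2)
)
print(prompt)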
Comparison with Other OCR Models
GLM-OCR – 0.9 B parameters, 94.62 on OmniDocBench V1.5, deployment via vLLM/SGLang/Ollama/API, MIT license.
DeepSeek-OCR – 8 B parameters, no OmniDocBench score reported, deployment via vLLM, MIT license.
Hunyuan OCR – multiple versions, no OmniDocBench score reported, deployment via vLLM/SGLang, Apache 2.0 license.
PaddleOCR‑VL – >2 B parameters, no OmniDocBench score reported, deployment via PaddlePaddle, Apache 2.0 license.
Model Downloads
Hugging Face: https://huggingface.co/zai-org/GLM-OCR
ModelScope: https://modelscope.cn/models/ZhipuAI/GLM-OCR
GitHub SDK: https://github.com/zai-org/GLM-OCR
Online Demo
Demo URL: https://ocr.z.ai/
API documentation: https://docs.z.ai/guides/vlm/glm-ocr