Why GLM-OCR Leads OCR Benchmarks: 0.9B Model Tops OmniDocBench

GLM-OCR, a 0.9B-parameter multimodal OCR model from Zhipu, achieves the highest score (94.62) on OmniDocBench V1.5, supports lightweight deployment via vLLM, Ollama, a cloud API and an SDK, and outperforms larger rivals such as DeepSeek-OCR and PaddleOCR in both speed and accuracy.

Old Zhang's AI Learning

Model Overview

GLM-OCR is an open-source OCR model built on a GLM-V encoder-decoder architecture with 0.9B parameters. It combines a CogViT visual encoder pretrained on large-scale image-text data, a lightweight cross-modal connector that down-samples visual tokens, a GLM-0.5B language decoder for text generation, and the PP-DocLayout-V3 layout analyzer, arranged as a two-stage pipeline: layout analysis first, then parallel recognition of the detected regions. Multi-token prediction (MTP) loss and stable full-task reinforcement learning improve training efficiency, accuracy and generalisation.
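The two-stage design can be sketched in pseudocode. The helper functions below are illustrative placeholders standing in for the model's components, not the actual GLM-OCR internals:

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative sketch of a two-stage layout-analysis + parallel-recognition
# pipeline. These helpers are placeholders, not the real GLM-OCR API.

def analyze_layout(page):
    # Stage 1: detect regions (text / formula / table) and their bounding boxes.
    return [{"type": "text", "bbox": (0, 0, 100, 20), "crop": page}]

def recognize_region(region):
    # Stage 2: run the encoder-decoder on a single cropped region.
    return {"type": region["type"], "content": "recognized text"}

def parse_page(page):
    regions = analyze_layout(page)
    # Regions are independent, so stage 2 can run in parallel across them.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(recognize_region, regions))
    # Merge results back in layout (reading) order.
    return results
```

The parallel second stage is what makes the small decoder's per-region cost, rather than whole-page decoding, the dominant factor in throughput.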

Benchmark Performance

On OmniDocBench V1.5 the model achieves 94.62 points, ranking first among evaluated OCR systems. It surpasses DeepSeek-OCR and Hunyuan-OCR on formula, table and information-extraction tasks.

Throughput on identical hardware (single replica, single concurrency):

PDF documents – 1.86 pages/second

Images – 0.67 images/second

The high speed is attributed to the small parameter count, which reduces inference overhead.
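At these rates, batch-job runtimes are easy to estimate. A quick back-of-envelope sketch using the figures above:

```python
# Throughput figures from the benchmark above (single replica, single concurrency).
PDF_PAGES_PER_SEC = 1.86
IMAGES_PER_SEC = 0.67

def eta_seconds(n_items, rate):
    """Estimated wall-clock time to process n_items at the given rate."""
    return n_items / rate

# A 500-page PDF takes roughly 4.5 minutes on one replica.
print(round(eta_seconds(500, PDF_PAGES_PER_SEC) / 60, 1))  # 4.5
```

Throughput scales with replicas and concurrency, so these numbers are a floor, not a ceiling.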

Key Features

Small footprint – 0.9B parameters enable deployment on consumer-grade GPUs with vLLM, SGLang or Ollama.

Real‑world optimisation – Handles complex tables, code documents and seal recognition that cause failures in other OCR models.

Open‑source SDK and inference toolchain – One‑line integration into existing pipelines.

Deployment Options

1. Cloud API (quick start)

# Install SDK
pip install zai-sdk

# Python call
from zai import ZaiClient
client = ZaiClient(api_key="your-api-key")
response = client.layout_parsing.create(
    model="glm-ocr",
    file="https://cdn.bigmodel.cn/static/logo/introduction.png"
)
print(response)

API keys can be obtained from https://open.bigmodel.cn.

2. Ollama (local, one‑click)

# Run model
ollama run glm-ocr

# Recognise an image (drag‑and‑drop path)
ollama run glm-ocr "Text Recognition: ./image.png"
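Beyond the CLI, Ollama also exposes a local HTTP API (port 11434 by default) that accepts base64-encoded images, which is handier for scripting. A minimal stdlib sketch, assuming the glm-ocr model has already been pulled as above:

```python
import base64
import json
import urllib.request

# Talks to the local Ollama HTTP API instead of the CLI.
# Assumes `ollama run glm-ocr` has already pulled the model.

def build_payload(image_path, prompt="Text Recognition:"):
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    return {
        "model": "glm-ocr",
        "prompt": prompt,
        "images": [image_b64],
        "stream": False,
    }

def ocr_image(image_path):
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps(build_payload(image_path)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

The prompt string mirrors the document-parsing prompt format described later in this article.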

3. vLLM (production‑grade)

# Install vLLM
pip install -U vllm --extra-index-url https://wheels.vllm.ai/nightly

# Install latest Transformers from source
pip install git+https://github.com/huggingface/transformers.git

# Start service
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080

Docker alternative:

docker pull vllm/vllm-openai:nightly
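Whether started natively or via Docker, the server speaks the OpenAI chat-completions protocol on the chosen port; the --allowed-local-media-path flag is what permits file:// image URLs. A minimal stdlib client sketch, with the prompt string taken from the document's own examples:

```python
import json
import urllib.request

# Builds an OpenAI-style chat-completions request for the vLLM server.
# The image can be a file:// path (enabled by --allowed-local-media-path)
# or an http(s) URL.

def build_chat_payload(image_url, prompt="Text Recognition:"):
    return {
        "model": "zai-org/GLM-OCR",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }

def ocr(image_url):
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(build_chat_payload(image_url)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

Any OpenAI-compatible client library works the same way against this endpoint.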

4. SGLang

# Docker install
docker pull lmsysorg/sglang:dev

# Or source install
pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python

# Launch server
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080

SDK and CLI Usage

# Clone repository and install in editable mode
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr && pip install -e .

# Install Transformers from source
pip install git+https://github.com/huggingface/transformers.git

Command‑line examples:

# Parse a single image
glmocr parse examples/source/code.png

# Parse a directory
glmocr parse examples/source/

# Specify output directory
glmocr parse examples/source/code.png --output ./results/

Python API example:

from glmocr import GlmOcr, parse

# Simple parse
result = parse("image.png")
result.save(output_dir="./results")

# Class‑style usage
with GlmOcr() as parser:
    result = parser.parse("image.png")
    print(result.json_result)
    result.save()

Prompt Formats

Two supported scenarios:

Document parsing – extract raw text, formulas and tables using a simple JSON mapping, e.g.

{
  "text": "Text Recognition:",
  "formula": "Formula Recognition:",
  "table": "Table Recognition:"
}

Information extraction – output must follow a strict JSON schema, for example:

{
  "id_number": "",
  "last_name": "",
  "first_name": "",
  "date_of_birth": "",
  "address": {
    "street": "",
    "city": "",
    "state": "",
    "zip_code": ""
  },
  "dates": {
    "issue_date": "",
    "expiration_date": ""
  },
  "sex": ""
}
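Because the information-extraction output must mirror the schema exactly, responses can be checked against the template before downstream use. A minimal validator sketch (this helper is an illustration, not part of the SDK):

```python
import json

# Recursively check that a model response has exactly the keys of the schema
# template, with nested objects matching; leaf values are free-form strings.
def matches_schema(response, template):
    if isinstance(template, dict):
        return (isinstance(response, dict)
                and set(response) == set(template)
                and all(matches_schema(response[k], template[k]) for k in template))
    return isinstance(response, str)

template = json.loads('{"id_number": "", "address": {"city": "", "state": ""}}')
good = {"id_number": "D123", "address": {"city": "Springfield", "state": "IL"}}
bad = {"id_number": "D123", "address": {"city": "Springfield"}}  # missing "state"
print(matches_schema(good, template), matches_schema(bad, template))  # True False
```

Rejecting malformed responses early makes it safe to retry the extraction rather than propagate partial records.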

Comparison with Other OCR Models

GLM-OCR – 0.9B parameters, 94.62 on OmniDocBench V1.5, deployment via vLLM/SGLang/Ollama/API, MIT license.

DeepSeek-OCR – 8B parameters, no OmniDocBench score reported, deployment via vLLM, MIT license.

Hunyuan-OCR – multiple versions, no OmniDocBench score reported, deployment via vLLM/SGLang, Apache 2.0 license.

PaddleOCR-VL – >2B parameters, no OmniDocBench score reported, deployment via PaddlePaddle, Apache 2.0 license.

Model Downloads

Hugging Face: https://huggingface.co/zai-org/GLM-OCR

ModelScope: https://modelscope.cn/models/ZhipuAI/GLM-OCR

GitHub SDK: https://github.com/zai-org/GLM-OCR

Online Demo

Demo URL: https://ocr.z.ai/

API documentation: https://docs.z.ai/guides/vlm/glm-ocr

Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
