DeepSeek-OCR 2 Enables AI to Read Images with Human‑Like Logical Flow

DeepSeek-OCR 2 introduces Visual Causal Flow and an LLM‑based visual encoder, reaching 91.09% overall accuracy on OmniDocBench v1.5. This article covers installation, the two inference modes (vLLM and Transformers), and the model's strengths and limitations for complex document processing.

Old Zhang's AI Learning

Overview

DeepSeek-OCR 2 introduces Visual Causal Flow, allowing the model to select visual regions via learnable query vectors instead of a fixed left‑to‑right raster scan. This addresses errors on multi‑column layouts, tables, and formulas.

Architecture

DeepEncoder V2 replaces the CLIP visual encoder with a compact LLM (modified Qwen2‑0.5B). The encoder rearranges visual tokens into a human‑logical order before feeding them to the downstream language model. Learnable queries reorder visual information, as illustrated in the architecture diagram.
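The reordering mechanism can be sketched as cross-attention from a bank of learnable queries to the raw visual tokens. This is an illustrative toy in PyTorch, not the released implementation; the class name, dimensions, and head count are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class CausalFlowQueries(nn.Module):
    # Sketch of the idea behind Visual Causal Flow: a fixed bank of
    # learnable query vectors cross-attends to unordered patch tokens,
    # emitting a fixed-length, logically reordered sequence for the decoder.
    def __init__(self, num_queries=256, dim=64, heads=4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens):  # visual_tokens: (batch, n_patches, dim)
        b = visual_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, visual_tokens, visual_tokens)
        return out  # (batch, num_queries, dim)

module = CausalFlowQueries()
tokens = torch.randn(2, 1024, 64)  # e.g. a 32x32 patch grid per page
flow = module(tokens)
print(flow.shape)  # torch.Size([2, 256, 64])
```

Because the queries, not the raster scan, determine the output order, a two-column page can be emitted column by column rather than row by row.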

DeepEncoder V2 architecture comparison
DeepEncoder V2 architecture comparison

Technical Highlights

Visual Causal Flow: Learnable query vectors let the model dynamically decide where to attend.

LLM as visual encoder: Qwen2‑0.5B provides inference capability within the encoder.

Efficient token compression: The visual token count is limited to 256–1120, balancing information richness against inference speed.

Performance: On OmniDocBench v1.5, overall accuracy reaches 91.09%, a 3.73% improvement over the previous generation, with a large lead in reading‑order accuracy.
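The 256–1120 token budget is easiest to appreciate with back-of-envelope arithmetic. The 16×16 patch size below is an assumption for illustration; the model's exact tiling may differ.

```python
# A 1024x1024 page at 16x16 patches yields (1024/16)^2 = 4096 raw patches.
# Compressing to the 256-1120 query budget gives the effective ratio.
def compression_ratio(img_side, patch, n_queries):
    raw = (img_side // patch) ** 2
    return round(raw / n_queries, 2)

print(compression_ratio(1024, 16, 256))   # 16.0  (most aggressive budget)
print(compression_ratio(1024, 16, 1120))  # 3.66  (most generous budget)
```

Even at the generous end of the budget, the decoder sees several times fewer tokens than a naive patch sequence would produce.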

Installation

Requirements: CUDA ≥ 11.8, PyTorch 2.6.0.

# 1. Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
cd DeepSeek-OCR-2

# 2. Create a conda environment
conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2

# 3. Install dependencies (pay attention to the vLLM version)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp312-cp312-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
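After installation, a quick sanity check catches version mismatches before you download the model weights; the expected strings below assume the cu118 wheels installed above.

```python
import torch

# Confirm the pinned PyTorch build and that CUDA is visible.
print(torch.__version__)          # should report 2.6.0+cu118
print(torch.cuda.is_available())  # should be True on a working GPU setup
```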

Usage

Two inference modes are provided: a fast vLLM backend (production) and a Transformers backend (debugging).

Transformers mode (debugging)

from transformers import AutoModel, AutoTokenizer
import torch, os

os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name = "deepseek-ai/DeepSeek-OCR-2"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)

prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = "your_image_path.jpg"
output_path = "./output"

res = model.infer(
    tokenizer,
    prompt=prompt,
    image_file=image_file,
    output_path=output_path,
    base_size=1024,
    image_size=768,
    crop_mode=True,
    save_results=True,
)
print(f"Results saved to {output_path}")

vLLM mode (production)

Run the provided scripts in the DeepSeek-OCR2-vllm directory:

run_dpsk_ocr2_image.py: streaming output for a single image.

run_dpsk_ocr2_pdf.py: high-concurrency batch processing of PDFs.

cd DeepSeek-OCR2-master/DeepSeek-OCR2-vllm
python run_dpsk_ocr2_image.py
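For PDF batches, the control flow amounts to rendering each page to an image and feeding pages through the backend. The sketch below keeps the loop separate from the model so it reads on its own; `ocr_pages` and the stub callable are illustrative names, and in a real run `infer` would wrap the `model.infer` call shown above (or the vLLM script's equivalent).

```python
def ocr_pages(image_paths, infer):
    # Run an OCR callable over rendered page images and join the
    # per-page markdown with form feeds as page separators.
    results = []
    for path in image_paths:
        results.append(infer(path))
    return "\f".join(results)

# Stub standing in for the real backend, to show the control flow:
pages = ["page_001.jpg", "page_002.jpg"]
merged = ocr_pages(pages, lambda p: f"# markdown for {p}")
print(merged.count("\f") + 1)  # 2 pages merged
```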

Empirical Evaluation

Official benchmark results on OmniDocBench v1.5 show 91.09 % overall accuracy, a 3.73 % gain over the previous generation, and a notable improvement in reading‑order accuracy.

Advantages and Limitations

Logical ordering: Visual Causal Flow mitigates column‑mixing errors common in earlier OCR systems.

Compatibility: The architecture resembles a pure LLM with a visual “glasses” module, facilitating future extensions.

Resource requirements: The 0.5B LLM encoder increases GPU memory and compute demand compared with pure CNN‑based OCR models.

Dependency constraints: Requires recent PyTorch and vLLM versions; older hardware may need additional configuration.

Conclusion

DeepSeek-OCR 2 demonstrates that a language model can serve as an effective visual encoder, enabling logical restructuring of visual tokens and improving OCR performance on complex document layouts.

Tags: LLM, OCR, vLLM, OmniDocBench, DeepSeek-OCR 2, DeepEncoder V2, Visual Causal Flow
Written by

Old Zhang's AI Learning

AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
