DeepSeek-OCR 2 Enables AI to Read Images with Human‑Like Logical Flow
DeepSeek-OCR 2 introduces Visual Causal Flow and an LLM-based visual encoder, reaching 91.09% accuracy on OmniDocBench v1.5. This article covers installation, the two inference modes (vLLM and Transformers), and the model's strengths and limitations for complex document processing.
Overview
DeepSeek-OCR 2 introduces Visual Causal Flow, allowing the model to select visual regions via learnable query vectors instead of a fixed left‑to‑right raster scan. This addresses errors on multi‑column layouts, tables, and formulas.
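The difference between a fixed raster scan and query-driven region selection can be illustrated with a toy example. This is pure NumPy with shapes and values of our own choosing, not the model's actual code:

```python
import numpy as np

# Illustrative contrast: fixed raster order vs. query-driven ordering.
rng = np.random.default_rng(0)
patches = rng.normal(size=(196, 64))   # 14x14 patch grid, row-major raster order

# A raster scan emits tokens strictly left-to-right, top-to-bottom:
raster_order = patches                  # identity; order is just the grid order

# Query vectors (learnable in training; random here) attend over all patches,
# so output slot i can pull from any region -- columns, table cells, formulas.
queries = rng.normal(size=(196, 64))
scores = queries @ patches.T / np.sqrt(64)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
reordered = weights @ patches           # (196, 64), in a learned order
```

During training, gradients shape the queries so that the emitted order follows the document's logical reading flow rather than its pixel layout.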
Architecture
DeepEncoder V2 replaces the CLIP visual encoder with a compact LLM (modified Qwen2‑0.5B). The encoder rearranges visual tokens into a human‑logical order before feeding them to the downstream language model. Learnable queries reorder visual information, as illustrated in the architecture diagram.
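A minimal sketch of the learnable-query mechanism as a cross-attention pooling module. The class name, dimensions, and query count here are our own illustrative choices; the released encoder is a modified Qwen2-0.5B, which this toy module does not replicate:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for DeepEncoder V2's query mechanism.
class QueryReorderEncoder(nn.Module):
    def __init__(self, dim=64, n_heads=4, n_queries=32):
        super().__init__()
        # Learnable query vectors: each one learns *where* in the page to look.
        self.queries = nn.Parameter(torch.randn(n_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patches):
        # patches: (batch, n_patches, dim) embeddings from the vision backbone
        q = self.queries.unsqueeze(0).expand(patches.size(0), -1, -1)
        out, _ = self.attn(q, patches, patches)  # queries attend over patches
        return out  # (batch, n_queries, dim), learned rather than raster order

enc = QueryReorderEncoder()
tokens = enc(torch.randn(2, 196, 64))  # 196 = 14x14 patch grid
```

The downstream language model then consumes these reordered tokens exactly as it would consume text tokens.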
Technical Highlights
Visual Causal Flow: Learnable query vectors let the model dynamically decide where to attend.
LLM as visual encoder: Qwen2-0.5B provides inference capability within the encoder.
Efficient token compression: The visual token count is limited to 256–1120, balancing information richness against inference speed.
Performance: On OmniDocBench v1.5, overall accuracy reaches 91.09%, a 3.73% improvement over the previous generation, with a large lead in reading-order accuracy.
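The 256–1120 token budget can be enforced by rescaling the image before patchification. A rough sketch of that idea, using a helper of our own (not the official preprocessing, which may tile and crop differently):

```python
import math

# Rough sketch of budget enforcement; not the official preprocessing.
def visual_token_budget(h, w, patch=16, lo=256, hi=1120):
    """Rescale (h, w) so the patch-token count lands within [lo, hi]."""
    tokens = (h // patch) * (w // patch)
    if lo <= tokens <= hi:
        return h, w, tokens
    target = hi if tokens > hi else lo
    scale = math.sqrt(target / tokens)
    # Snap the rescaled sides down to multiples of the patch size.
    nh = max(patch, int(h * scale) // patch * patch)
    nw = max(patch, int(w * scale) // patch * patch)
    return nh, nw, (nh // patch) * (nw // patch)
```

For example, a 2048x2048 page would be resized to 528x528 (1089 tokens), keeping inference cost bounded regardless of input resolution.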
Installation
Requirements: CUDA ≥ 11.8, PyTorch 2.6.0.
# 1. Clone the repository
git clone https://github.com/deepseek-ai/DeepSeek-OCR-2.git
cd DeepSeek-OCR-2
# 2. Create a conda environment
conda create -n deepseek-ocr2 python=3.12.9 -y
conda activate deepseek-ocr2
# 3. Install dependencies (pay attention to the vLLM version)
pip install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu118
pip install https://github.com/vllm-project/vllm/releases/download/v0.8.5/vllm-0.8.5+cu118-cp312-cp312-manylinux1_x86_64.whl
pip install -r requirements.txt
pip install flash-attn==2.7.3 --no-build-isolation
Usage
Two inference modes are provided: a fast vLLM backend (production) and a Transformers backend (debugging).
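Before running either mode, it can help to confirm that the pinned versions actually resolved. A dependency-free sketch (the helper names are ours):

```python
# Version sanity check: compare installed versions against the pins above.
def version_tuple(v):
    """Parse '2.6.0+cu118' -> (2, 6, 0), ignoring local build tags."""
    return tuple(int(p) for p in v.split("+")[0].split(".")[:3])

def meets_minimum(installed, required):
    return version_tuple(installed) >= version_tuple(required)

# The guide pins PyTorch 2.6.0 and requires CUDA >= 11.8:
assert meets_minimum("2.6.0+cu118", "2.6.0")
assert meets_minimum("12.1", "11.8")
```

In practice you would pass `torch.__version__` and `torch.version.cuda` into `meets_minimum` inside the activated environment.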
Transformers mode (debugging)
from transformers import AutoModel, AutoTokenizer
import torch, os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"
model_name = "deepseek-ai/DeepSeek-OCR-2"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, use_safetensors=True)
model = model.eval().cuda().to(torch.bfloat16)
prompt = "<image>\n<|grounding|>Convert the document to markdown. "
image_file = "your_image_path.jpg"
output_path = "./output"
res = model.infer(
tokenizer,
prompt=prompt,
image_file=image_file,
output_path=output_path,
base_size=1024,
image_size=768,
crop_mode=True,
save_results=True,
)
print(f"Results saved to {output_path}")
vLLM mode (production)
Run the provided scripts in the DeepSeek-OCR2-vllm directory:
run_dpsk_ocr2_image.py: streaming output for a single image.
run_dpsk_ocr2_pdf.py: batch processing of PDFs with high concurrency.
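The high-concurrency PDF path can be pictured as a page-level fan-out. In this toy sketch, `ocr_page` is a placeholder of ours standing in for a real call into the vLLM engine; the actual scripts manage batching internally:

```python
from concurrent.futures import ThreadPoolExecutor

def ocr_page(page_id):
    # Placeholder: would submit one rendered page image to the vLLM engine.
    return f"page-{page_id}: <markdown>"

def ocr_pdf(num_pages, workers=4):
    # Fan out pages to a worker pool; order of results matches page order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_page, range(num_pages)))
```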
cd DeepSeek-OCR2-master/DeepSeek-OCR2-vllm
python run_dpsk_ocr2_image.py
Empirical Evaluation
Official benchmark results on OmniDocBench v1.5 show 91.09 % overall accuracy, a 3.73 % gain over the previous generation, and a notable improvement in reading‑order accuracy.
Advantages and Limitations
Logical ordering : Visual Causal Flow mitigates column‑mixing errors common in earlier OCR systems.
Compatibility : The architecture resembles a pure LLM with a visual “glasses” module, facilitating future extensions.
Resource requirements : The 0.5 B LLM encoder increases GPU memory and compute demand compared with pure CNN‑based OCR models.
Dependency constraints : Requires recent PyTorch and vLLM versions; older hardware may need additional configuration.
Conclusion
DeepSeek-OCR 2 demonstrates that a language model can serve as an effective visual encoder, enabling logical restructuring of visual tokens and improving OCR performance on complex document layouts.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.