How DeepSeek-OCR Achieves 10× Context Compression with Vision Tokens
DeepSeek-OCR, a newly open-sourced 3B-parameter OCR model, pairs a novel DeepEncoder with a DeepSeek-3B-MoE decoder to compress long text contexts into vision tokens, achieving up to 10× compression at ~97% accuracy and strong practical performance on OCR benchmarks and multilingual documents.
Introduction
DeepSeek recently open-sourced a new OCR model called DeepSeek-OCR. The 3-billion-parameter model quickly attracted over 100 downloads after release and was built by three DeepSeek researchers who previously developed the GOT-OCR2.0 system.
Model Overview
DeepSeek‑OCR explores optical compression of long‑text contexts by converting text into visual tokens. The architecture consists of two core components: a visual encoder (DeepEncoder) and a decoder (DeepSeek‑3B‑MoE‑A570M).
DeepEncoder
DeepEncoder extracts image features, then tokenizes and compresses them into a small set of vision tokens. It chains an 80M-parameter SAM-base module (window attention) into a 300M-parameter CLIP-large module (global attention), roughly 380M parameters in total. The encoder handles high-resolution inputs with low activation memory, supports multiple input resolutions, and emits a compact token set: a 16× convolutional compressor between the two modules reduces, for example, 4096 patch tokens to 256 vision tokens.
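To make the token arithmetic concrete, here is a minimal PyTorch sketch of that 16× compression step. It is a shape-level illustration only: the layer sizes, kernel choices, and the `TokenCompressor` name are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """16x token compressor between the SAM and CLIP stages
    (hypothetical sketch; sizes are illustrative, not released weights)."""
    def __init__(self, dim=768):
        super().__init__()
        # Two stride-2 convolutions halve each spatial side twice:
        # a 64x64 patch grid (4096 tokens) becomes 16x16 (256 tokens).
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
            nn.GELU(),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens):                    # tokens: (B, 4096, dim)
        b, n, d = tokens.shape
        side = int(n ** 0.5)                      # 64 for a 1024x1024 input
        x = tokens.transpose(1, 2).reshape(b, d, side, side)
        x = self.conv(x)                          # (B, dim, 16, 16)
        return x.flatten(2).transpose(1, 2)       # (B, 256, dim)

print(TokenCompressor()(torch.randn(1, 4096, 768)).shape)  # (1, 256, 768)
```

Two stride-2 convolutions shrink a 64×64 patch grid to 16×16, which is exactly the 4096 → 256 reduction cited above.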
MoE Decoder
The decoder is DeepSeek-3B-MoE, a 3B-parameter Mixture-of-Experts model. During inference, the router activates 6 of 64 routed experts plus 2 shared experts per token, for about 570M active parameters. This design offers the expressive power of a 3B model at roughly the inference cost of a 500M-parameter dense model.
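The routing pattern can be sketched as a schematic top-k mixture-of-experts layer, assuming per-token top-6 routing over 64 experts plus 2 always-active shared experts; the dimensions and the naive dispatch loop are illustrative, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ffn(dim, hidden):
    return nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

class MoELayer(nn.Module):
    """Schematic MoE layer: per token, run 2 shared experts plus the
    top-6 of 64 routed experts (sizes are illustrative)."""
    def __init__(self, dim=64, hidden=128, n_experts=64, n_shared=2, top_k=6):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, n_experts, bias=False)
        self.experts = nn.ModuleList(ffn(dim, hidden) for _ in range(n_experts))
        self.shared = nn.ModuleList(ffn(dim, hidden) for _ in range(n_shared))

    def forward(self, x):                        # x: (n_tokens, dim)
        weights = F.softmax(self.router(x), -1)  # routing probabilities
        w, idx = weights.topk(self.top_k, -1)    # 6 experts chosen per token
        out = sum(e(x) for e in self.shared)     # shared experts always fire
        for t in range(x.size(0)):               # naive per-token dispatch
            for k in range(self.top_k):
                out[t] = out[t] + w[t, k] * self.experts[idx[t, k]](x[t])
        return out

print(MoELayer()(torch.randn(4, 64)).shape)      # torch.Size([4, 64])
```

With only 2 shared plus 6 routed experts firing per token, a small fraction of the 3B weights participates in each forward pass, which is where the ~570M active-parameter figure comes from.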
Data Engine
DeepSeek assembled diverse training data:
OCR 1.0 – traditional scene and document OCR.
OCR 2.0 – complex synthetic images such as charts, chemical formulas, and geometric diagrams.
General visual data – to endow the model with broad image understanding.
Training Process
Training proceeds in two stages. First, DeepEncoder is trained independently with a next-token prediction objective on OCR 1.0 data, OCR 2.0 data, and 100M images sampled from LAION (AdamW, cosine scheduler, 2 epochs, batch size 1280, learning rate 5e-5, sequence length 4096). Second, the full DeepSeek-OCR model is trained on the HAI-LLM platform with 4-stage pipeline parallelism across 20 nodes (8× A100-40G each), data parallelism 40, global batch size 640, and AdamW at learning rate 3e-5. Reported training throughput reaches 90B (9×10¹⁰) tokens/day on text-only data and 70B (7×10¹⁰) tokens/day on multimodal data.
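As a rough illustration of the stage-1 optimizer setup, the sketch below wires AdamW to a cosine schedule with the reported hyperparameters; `deep_encoder` is a placeholder module and the total step count is an assumption, since it depends on dataset size.

```python
import torch

deep_encoder = torch.nn.Linear(1024, 1024)  # stand-in for the real DeepEncoder
optimizer = torch.optim.AdamW(deep_encoder.parameters(), lr=5e-5)

total_steps = 10_000  # hypothetical; set by 2 epochs over the data at batch 1280
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    # ... forward pass and next-token prediction loss elided ...
    optimizer.step()       # update weights
    scheduler.step()       # decay LR along the cosine curve
    optimizer.zero_grad()
```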
Experimental Results
Visual‑Text Compression
On the Fox benchmark, DeepSeek-OCR achieves ~97% OCR decoding accuracy at a 10× compression ratio, i.e., 100 vision tokens standing in for roughly 1,000 text tokens. Even at 20× compression, accuracy remains around 60%.
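The compression ratio here is simply the count of original text tokens divided by the vision tokens used to encode them; a hypothetical helper makes the arithmetic explicit.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """How many text tokens each vision token stands in for."""
    return text_tokens / vision_tokens

# A ~1,000-token document rendered into 100 vision tokens is 10x compression,
# the regime where Fox-benchmark accuracy stays near 97%.
print(compression_ratio(1000, 100))  # 10.0
print(compression_ratio(2000, 100))  # 20.0 -> accuracy drops to ~60%
```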
Practical OCR Performance
With only 100 vision tokens (640×640 input), DeepSeek-OCR outperforms GOT-OCR2.0, and with 400 tokens (1280×1280) it matches state-of-the-art models. Using fewer than 800 tokens, it surpasses MinerU2.0, which requires ~7,000 tokens per page.
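The token budgets above map directly to input resolutions. A small lookup table summarizes the two operating points cited in this section (the mode labels are my own, not official names).

```python
# (width, height, vision tokens) for the two operating points cited above.
OPERATING_POINTS = {
    "100-token mode": (640, 640, 100),    # already beats GOT-OCR2.0
    "400-token mode": (1280, 1280, 400),  # on par with state-of-the-art models
}
for name, (w, h, n_tokens) in OPERATING_POINTS.items():
    print(f"{name}: {w}x{h} input -> {n_tokens} vision tokens")
```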
Qualitative Study
With a single prompt, the model can parse charts, geometric figures, chemical formulas, and natural images. It supports recognition in nearly 100 languages, demonstrated on Arabic and Sinhala PDFs, and retains general visual understanding capabilities.
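For readers who want to try the single-prompt behavior, a minimal usage sketch against the Hugging Face checkpoint might look like the following. The checkpoint name and the `model.infer(...)` entry point follow the public model card, but treat the exact call signature as an assumption and check the card before running.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "deepseek-ai/DeepSeek-OCR"  # public checkpoint name (assumed current)
tokenizer = AutoTokenizer.from_pretrained(name, trust_remote_code=True)
model = AutoModel.from_pretrained(name, trust_remote_code=True).eval()
model = model.cuda().to(torch.bfloat16)  # GPU inference assumed

# One prompt covers documents, charts, formulas, and natural images.
prompt = "<image>\nFree OCR."
result = model.infer(  # entry point per the model card; signature may differ
    tokenizer, prompt=prompt, image_file="page.png", output_path="./out"
)
```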
Source: Machine Heart (机器之心).
