Can Visual Tokens Compress Text? Inside DeepSeek-OCR’s Optical Compression
DeepSeek‑OCR introduces a novel visual encoder that represents text as images, achieving roughly 10× token compression while preserving OCR accuracy (and up to ~20× with some loss of fidelity). Its 3B‑parameter model performs strongly on OmniDocBench across multilingual and multimodal document tasks.
Title: DeepSeek-OCR: Contexts Optical Compression
Link: https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
Key Contributions
Proposes DeepEncoder, a visual encoding structure that efficiently extracts visual features at high resolution and dramatically reduces the number of visual tokens.
Analyzes the feasibility of visual‑text token compression, showing that representing text as images can achieve roughly 10× compression while the original text remains accurately recoverable.
Builds a large‑scale OCR training pipeline (~60 M image‑text pairs) that achieves a better token‑performance trade‑off on OmniDocBench and other benchmarks.
Core Idea: Contexts Optical Compression
Large language models struggle with long contexts because token length inflates memory and compute costs. DeepSeek’s approach is to compress textual information into visual tokens, allowing the model to store the same information with far fewer tokens. For example, a 10 k‑word document that normally requires 5‑6 k language tokens can be represented with roughly 500 visual tokens.
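The arithmetic behind that example, as a quick sanity check (the token counts below are the illustrative figures above, not measurements):

```python
# Back-of-the-envelope compression ratio for the example above (illustrative numbers only).
text_tokens = 5500     # ~5-6k language tokens for a 10k-word document
visual_tokens = 500    # visual tokens after optical compression

print(f"compression ratio: ~{text_tokens / visual_tokens:.0f}x")  # ~11x
```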
Model Architecture
The system consists of three main components:
DeepEncoder (visual encoder): Combines a SAM‑base module for local features and a CLIP‑large module for global semantics. A two‑layer convolutional block downsamples visual tokens by 16× (e.g., a 1024×1024 image is split into 4096 patches and reduced to 256 tokens); see the sketch after this list.
Mapping Layer: Projects visual tokens into the language model's embedding space.
Text Decoder (DeepSeek‑V2‑3B): A Mixture‑of‑Experts decoder with 64 experts (6 active per token), activating only ~570 M parameters per token during inference, which makes it more efficient than a dense 3 B model.
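A minimal sketch of the 16× token reduction described above, assuming a ViT‑style patch size of 16 and a made‑up hidden size; the real DeepEncoder wires SAM‑base and CLIP‑large around this convolutional step:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration; the real SAM/CLIP modules differ.
image_size, patch_size, hidden_dim = 1024, 16, 1024

patches_per_side = image_size // patch_size            # 64
num_patches = patches_per_side ** 2                    # 4096 patch tokens

# Two strided convolutions, each halving both spatial dims -> 4 * 4 = 16x fewer tokens.
downsample = nn.Sequential(
    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
)

# Fake SAM-style feature map: (batch, channels, 64, 64)
features = torch.randn(1, hidden_dim, patches_per_side, patches_per_side)
compressed = downsample(features)                      # (1, hidden_dim, 16, 16)
visual_tokens = compressed.flatten(2).transpose(1, 2)  # (1, 256, hidden_dim)

print(num_patches, "->", visual_tokens.shape[1])       # 4096 -> 256
```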
Multi‑Resolution Input
DeepEncoder supports six resolution modes—Tiny, Small, Base, Large, Gundam, and Gundam‑M—allowing flexible trade‑offs between detail and speed. Gundam modes combine multiple local views with a global view for ultra‑high‑resolution documents (e.g., newspapers, financial reports).
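The mode-to-token mapping follows from the same 16× downsampling rule as above; a small sketch (the per‑mode resolutions here are assumptions for illustration, so check the repository for the exact figures):

```python
# Visual-token budget per resolution mode, derived from the 16x downsampling rule:
# (side / patch_size)**2 / 16 tokens. Per-mode resolutions below are assumptions.
PATCH_SIZE = 16

def tokens_for(side: int) -> int:
    return (side // PATCH_SIZE) ** 2 // 16

modes = {"Tiny": 512, "Small": 640, "Base": 1024, "Large": 1280}
for name, side in modes.items():
    print(f"{name:6s} {side}x{side} -> {tokens_for(side)} visual tokens")

# Gundam modes combine several local tiles plus one global view, so their budget is
# roughly n_local * tokens_for(tile_side) + tokens_for(global_side).
```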
Training Data
The training corpus comprises four categories:
OCR 1.0: 43 M images of traditional documents and scene text.
OCR 2.0: 16 M images covering charts, chemical formulas, geometric diagrams, etc.
General visual data (≈20 %): to retain broad image understanding.
Pure text data (≈10 %): to preserve language modeling capability.
Training is performed in two stages on 20 nodes (8 × A100‑40G each) using pipeline parallelism.
Stage 1 – DeepEncoder Pre‑training
Data: OCR 1.0 + OCR 2.0 + 100 M sampled LAION images (≈160 M image‑text pairs).
Batch size: 1280
Epochs: 2
Optimizer: AdamW
Learning rate: 5e‑5 with cosine annealing (see the sketch after this list)
Maximum sequence length: 4096
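A minimal PyTorch sketch of this optimizer setup with a stand‑in model; the warmup behavior, weight decay, and step count are assumptions the list above does not specify:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(1024, 1024)   # stand-in; the real encoder is SAM-base + CLIP-large
total_steps = 10_000                  # assumption; the paper trains 2 epochs over ~160M pairs

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)  # weight decay assumed
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    # forward pass and loss.backward() would go here
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```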
Stage 2 – Joint Vision‑Language Training
Data composition: OCR 1.0, OCR 2.0, general visual data (20 %), pure text data (10 %).
Hardware: 20 nodes × 8 A100‑40G GPUs, pipeline‑parallel training.
Batch size: 640
Optimizer: AdamW
Initial learning rate: 3e‑5
Loss: Autoregressive language‑modeling loss, i.e., cross‑entropy over the text tokens decoded from the visual tokens.
This two‑stage strategy first teaches the encoder to produce stable visual tokens, then trains the language model to decode those tokens back into text.
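A hedged sketch of that objective: plain next‑token cross‑entropy on the text sequence, with the compressed visual tokens prepended as context that carries no labels. Tensor shapes and the concatenation scheme are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

batch, n_vis, n_txt, vocab = 2, 256, 128, 32000   # illustrative sizes

# Decoder logits over the full sequence: [visual tokens | text tokens]
logits = torch.randn(batch, n_vis + n_txt, vocab)
text_ids = torch.randint(0, vocab, (batch, n_txt))

# Each position predicts the next token; only text positions contribute labels.
text_logits = logits[:, n_vis - 1 : n_vis + n_txt - 1, :]   # shifted by one
loss = F.cross_entropy(text_logits.reshape(-1, vocab), text_ids.reshape(-1))
print(loss.item())
```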
Experimental Results
On the OmniDocBench benchmark, DeepSeek‑OCR achieves comparable or better accuracy with far fewer visual tokens than competing models (a quick ratio check follows the list):
GOT‑OCR2.0: 256 tokens, Edit Distance 0.287
MinerU2.0: 6 790 tokens, Edit Distance 0.133
DeepSeek‑OCR (Base): 256 tokens (182 effective), Edit Distance 0.137
DeepSeek‑OCR (Gundam‑M): 1 853 tokens, Edit Distance 0.123
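A quick ratio check over the reported numbers (lower edit distance is better):

```python
# Reported OmniDocBench numbers from the list above: (visual tokens, edit distance).
results = {
    "GOT-OCR2.0":            (256,  0.287),
    "MinerU2.0":             (6790, 0.133),
    "DeepSeek-OCR Base":     (256,  0.137),
    "DeepSeek-OCR Gundam-M": (1853, 0.123),
}

ratio = results["MinerU2.0"][0] / results["DeepSeek-OCR Base"][0]
print(f"Base mode uses ~{ratio:.0f}x fewer tokens than MinerU2.0")  # ~27x
# ...at nearly the same edit distance (0.137 vs 0.133).
```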
The model also handles tables, charts, formulas, and geometric diagrams, supports nearly 100 languages, and retains general visual abilities such as image captioning and object detection.
Future Directions
The authors liken optical compression to human memory decay: older context can be rendered as lower‑resolution images, gradually fading while new content remains sharp. This “controllable forgetting” concept could inspire new approaches to long‑context management in future large models.
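A toy illustration of that decay idea, reusing the resolution‑to‑token rule from above; the resolutions and the age schedule are invented for illustration:

```python
# Toy "controllable forgetting": re-render older context at lower resolution,
# so its visual-token cost shrinks with age. All numbers are illustrative.
def tokens_for(side: int, patch: int = 16) -> int:
    return (side // patch) ** 2 // 16

resolution_by_age = [1280, 1024, 640, 512]   # newest ... oldest (hypothetical schedule)
for age, side in enumerate(resolution_by_age):
    print(f"context block aged {age}: {side}x{side} -> {tokens_for(side)} tokens")
# 400 -> 256 -> 100 -> 64 tokens: older content fades but is never fully dropped.
```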
Conclusion
DeepSeek‑OCR is not a conventional OCR system; it is an experimental platform exploring how visual modalities can compress textual information. The results confirm that, with appropriate encoding, text can be efficiently represented by visual tokens without sacrificing readability, opening possibilities for more efficient long‑document processing and data generation pipelines.