Can Visual Tokens Compress Text? Inside DeepSeek-OCR’s Optical Compression
DeepSeek‑OCR introduces a novel visual encoder that represents text as images, achieving roughly 10× token compression while preserving OCR accuracy (and up to ~20× with some loss of fidelity). Its 3B‑parameter model performs strongly on OmniDocBench across multilingual and multimodal document tasks.
Title: DeepSeek-OCR: Contexts Optical Compression
Link: https://github.com/deepseek-ai/DeepSeek-OCR/blob/main/DeepSeek_OCR_paper.pdf
Key Contributions
Proposes DeepEncoder, a visual encoding structure that efficiently extracts visual features at high resolution and dramatically reduces the number of visual tokens.
Analyzes the feasibility of visual‑text token compression, showing that representing text as images can achieve roughly 10× compression while the original text remains accurately recoverable.
Builds a large‑scale OCR training pipeline (~60 M image‑text pairs) that achieves a better token‑performance trade‑off on OmniDocBench and other benchmarks.
Core Idea: Contexts Optical Compression
Large language models struggle with long contexts because token length inflates memory and compute costs. DeepSeek’s approach is to compress textual information into visual tokens, allowing the model to store the same information with far fewer tokens. For example, a 10 k‑word document that normally requires 5‑6 k language tokens can be represented with roughly 500 visual tokens.
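The arithmetic behind that example, as a quick sanity check (the token counts below are the illustrative figures above, not measurements):

```python
# Back-of-the-envelope compression ratio for the example above (illustrative numbers only).
text_tokens = 5500     # ~5-6k language tokens for a 10k-word document
visual_tokens = 500    # visual tokens after optical compression

print(f"compression ratio: ~{text_tokens / visual_tokens:.0f}x")  # ~11x
```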
Model Architecture
The system consists of three main components:
DeepEncoder (visual encoder): Combines a SAM‑base module for local features and a CLIP‑large module for global semantics. A two‑layer convolutional block downsamples visual tokens by 16× (e.g., a 1024×1024 image is split into 4096 patches and reduced to 256 tokens); see the sketch after this list.
Mapping Layer: Projects visual tokens into the language model's embedding space.
Text Decoder (DeepSeek‑V2‑3B): A Mixture‑of‑Experts decoder with 64 experts (6 active per token), activating only ~570 M parameters per token during inference, which makes it more efficient than a dense 3 B model.
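A minimal sketch of the 16× token reduction described above, assuming a ViT‑style patch size of 16 and a made‑up hidden size; the real DeepEncoder wires SAM‑base and CLIP‑large around this convolutional step:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration; the real SAM/CLIP modules differ.
image_size, patch_size, hidden_dim = 1024, 16, 1024

patches_per_side = image_size // patch_size            # 64
num_patches = patches_per_side ** 2                    # 4096 patch tokens

# Two strided convolutions, each halving both spatial dims -> 4 * 4 = 16x fewer tokens.
downsample = nn.Sequential(
    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
    nn.GELU(),
    nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3, stride=2, padding=1),
)

# Fake SAM-style feature map: (batch, channels, 64, 64)
features = torch.randn(1, hidden_dim, patches_per_side, patches_per_side)
compressed = downsample(features)                      # (1, hidden_dim, 16, 16)
visual_tokens = compressed.flatten(2).transpose(1, 2)  # (1, 256, hidden_dim)

print(num_patches, "->", visual_tokens.shape[1])       # 4096 -> 256
```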
Multi‑Resolution Input
DeepEncoder supports six resolution modes—Tiny, Small, Base, Large, Gundam, and Gundam‑M—allowing flexible trade‑offs between detail and speed. Gundam modes combine multiple local views with a global view for ultra‑high‑resolution documents (e.g., newspapers, financial reports).
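The mode-to-token mapping follows from the same 16× downsampling rule as above; a small sketch (the per‑mode resolutions here are assumptions for illustration, so check the repository for the exact figures):

```python
# Visual-token budget per resolution mode, derived from the 16x downsampling rule:
# (side / patch_size)**2 / 16 tokens. Per-mode resolutions below are assumptions.
PATCH_SIZE = 16

def tokens_for(side: int) -> int:
    return (side // PATCH_SIZE) ** 2 // 16

modes = {"Tiny": 512, "Small": 640, "Base": 1024, "Large": 1280}
for name, side in modes.items():
    print(f"{name:6s} {side}x{side} -> {tokens_for(side)} visual tokens")

# Gundam modes combine several local tiles plus one global view, so their budget is
# roughly n_local * tokens_for(tile_side) + tokens_for(global_side).
```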
Training Data
The training corpus comprises four categories:
OCR 1.0: 43 M images of traditional documents and scene text.
OCR 2.0: 16 M images covering charts, chemical formulas, geometric diagrams, etc.
General visual data (≈20 %): to retain broad image understanding.
Pure text data (≈10 %): to preserve language modeling capability.
Training is performed in two stages on 20 nodes (8 × A100‑40G each) using pipeline parallelism.
Stage 1 – DeepEncoder Pre‑training
Data: OCR 1.0 + OCR 2.0 + 100 M sampled LAION images (≈160 M image‑text pairs).
Batch size: 1280
Epochs: 2
Optimizer: AdamW
Learning rate: 5e‑5 with cosine annealing (see the sketch after this list)
Maximum sequence length: 4096
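A minimal PyTorch sketch of this optimizer setup with a stand‑in model; the warmup behavior, weight decay, and step count are assumptions the list above does not specify:

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(1024, 1024)   # stand-in; the real encoder is SAM-base + CLIP-large
total_steps = 10_000                  # assumption; the paper trains 2 epochs over ~160M pairs

optimizer = AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)  # weight decay assumed
scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)

for step in range(total_steps):
    # forward pass and loss.backward() would go here
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```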
Stage 2 – Joint Vision‑Language Training
Data composition: OCR 1.0, OCR 2.0, general visual data (20 %), pure text data (10 %).
Hardware: 20 nodes × 8 A100‑40G GPUs, pipeline‑parallel training.
Batch size: 640
Optimizer: AdamW
Initial learning rate: 3e‑5
Loss: Autoregressive language‑modeling loss, i.e., cross‑entropy over the text tokens decoded from the visual tokens.
This two‑stage strategy first teaches the encoder to produce stable visual tokens, then trains the language model to decode those tokens back into text.
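A hedged sketch of that objective: plain next‑token cross‑entropy on the text sequence, with the compressed visual tokens prepended as context that carries no labels. Tensor shapes and the concatenation scheme are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

batch, n_vis, n_txt, vocab = 2, 256, 128, 32000   # illustrative sizes

# Decoder logits over the full sequence: [visual tokens | text tokens]
logits = torch.randn(batch, n_vis + n_txt, vocab)
text_ids = torch.randint(0, vocab, (batch, n_txt))

# Each position predicts the next token; only text positions contribute labels.
text_logits = logits[:, n_vis - 1 : n_vis + n_txt - 1, :]   # shifted by one
loss = F.cross_entropy(text_logits.reshape(-1, vocab), text_ids.reshape(-1))
print(loss.item())
```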
Experimental Results
On the OmniDocBench benchmark, DeepSeek‑OCR achieves comparable or better accuracy with far fewer visual tokens than competing models (a quick ratio check follows the list):
GOT‑OCR2.0: 256 tokens, Edit Distance 0.287
MinerU2.0: 6 790 tokens, Edit Distance 0.133
DeepSeek‑OCR (Base): 256 tokens (182 effective), Edit Distance 0.137
DeepSeek‑OCR (Gundam‑M): 1 853 tokens, Edit Distance 0.123
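A quick ratio check over the reported numbers (lower edit distance is better):

```python
# Reported OmniDocBench numbers from the list above: (visual tokens, edit distance).
results = {
    "GOT-OCR2.0":            (256,  0.287),
    "MinerU2.0":             (6790, 0.133),
    "DeepSeek-OCR Base":     (256,  0.137),
    "DeepSeek-OCR Gundam-M": (1853, 0.123),
}

ratio = results["MinerU2.0"][0] / results["DeepSeek-OCR Base"][0]
print(f"Base mode uses ~{ratio:.0f}x fewer tokens than MinerU2.0")  # ~27x
# ...at nearly the same edit distance (0.137 vs 0.133).
```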
The model also handles tables, charts, formulas, and geometric diagrams, supports nearly 100 languages, and retains general visual abilities such as image captioning and object detection.
Future Directions
The authors liken optical compression to human memory decay: older context can be rendered as lower‑resolution images, gradually fading while new content remains sharp. This “controllable forgetting” concept could inspire new approaches to long‑context management in future large models.
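A toy illustration of that decay idea, reusing the resolution‑to‑token rule from above; the resolutions and the age schedule are invented for illustration:

```python
# Toy "controllable forgetting": re-render older context at lower resolution,
# so its visual-token cost shrinks with age. All numbers are illustrative.
def tokens_for(side: int, patch: int = 16) -> int:
    return (side // patch) ** 2 // 16

resolution_by_age = [1280, 1024, 640, 512]   # newest ... oldest (hypothetical schedule)
for age, side in enumerate(resolution_by_age):
    print(f"context block aged {age}: {side}x{side} -> {tokens_for(side)} tokens")
# 400 -> 256 -> 100 -> 64 tokens: older content fades but is never fully dropped.
```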
Conclusion
DeepSeek‑OCR is not a conventional OCR system; it is an experimental platform exploring how visual modalities can compress textual information. The results confirm that, with appropriate encoding, text can be efficiently represented by visual tokens without sacrificing readability, opening possibilities for more efficient long‑document processing and data generation pipelines.