How DeepSeek-OCR Achieves SOTA Using Ultra-Low Visual Token Counts

DeepSeek-OCR takes a visual-compression approach, pairing DeepEncoder with the DeepSeek3B-MoE-A570M decoder, to represent document text with far fewer visual tokens. It reaches up to 97% OCR accuracy and surpasses GOT-OCR2.0 and MinerU2.0 on OmniDocBench; the article also includes a one-click deployment tutorial.


Large language models (LLMs) face rapidly growing computation costs when processing long texts, which limits their efficiency in high-density text scenarios. DeepSeek-OCR proposes a different angle: using visual perception to "read" text, treating a document image as a compact representation of its textual content.

The system introduces a form of "optical compression" in which the visual modality serves as the compression medium, representing the same text with far fewer tokens than a traditional character-based encoding.

DeepSeek-OCR consists of two components: DeepEncoder, which extracts image features, tokenizes them, and compresses the visual representation; and DeepSeek3B-MoE-A570M, a decoder that generates the desired output from the visual tokens and a prompt. DeepEncoder is designed to keep activation memory low on high-resolution inputs while maintaining a high compression rate, so the number of visual tokens stays manageable.
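To make the two-stage pipeline concrete, here is a minimal inference sketch following the usage shown on the deepseek-ai/DeepSeek-OCR Hugging Face model card at the time of writing; the prompt string, file paths, and resolution arguments are illustrative and may need adjusting to your environment.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# trust_remote_code pulls in the custom DeepEncoder + MoE decoder code
# shipped alongside the weights.
model_name = "deepseek-ai/DeepSeek-OCR"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, use_safetensors=True
)
model = model.eval().cuda().to(torch.bfloat16)

# Prompt and arguments follow the model card; base_size/image_size select a
# resolution mode, trading visual-token count against accuracy.
res = model.infer(
    tokenizer,
    prompt="<image>\nConvert the document to markdown.",
    image_file="page.png",   # hypothetical input path
    output_path="./out",     # where the parsed result is written
    base_size=1024,
    image_size=640,
    crop_mode=True,
    save_results=True,
)
```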

Experiments show that when the number of text tokens is less than ten times the number of visual tokens (compression ratio < 10×), the model reaches 97% OCR decoding accuracy. Even at a 20× compression ratio, accuracy remains around 60%.
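To make the ratio concrete, here is a small illustrative helper (not from the paper) that computes the compression ratio from token counts and annotates the two operating points reported above:

```python
def compression_ratio(num_text_tokens: int, num_vision_tokens: int) -> float:
    """Ratio of the text tokens a page would cost to the visual tokens
    that stand in for them after optical compression."""
    return num_text_tokens / num_vision_tokens

# A page whose text would cost ~1,000 LLM tokens, encoded as 100 visual tokens:
print(compression_ratio(1000, 100))  # 10.0 -> ~97% decoding accuracy reported
# The same page squeezed into 50 visual tokens:
print(compression_ratio(1000, 50))   # 20.0 -> accuracy falls to ~60%
```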

Beyond OCR, the release highlights potential for long‑context compression and research into LLM memory‑forgetting mechanisms.

On the OmniDocBench benchmark, DeepSeek-OCR outperforms GOT-OCR2.0 (which uses 256 tokens per page) with only 100 visual tokens, and exceeds MinerU2.0 (averaging more than 6,000 tokens per page) while staying under 800 visual tokens. In production, a single A100-40G GPU can generate over 200,000 pages of training data per day for LLMs and VLMs.

One‑click deployment tutorial

1. Visit the HyperAI homepage, select the DeepSeek-OCR tutorial, and click “Run this tutorial online”.

2. After the page loads, click the top‑right “Clone” button to copy the tutorial into your own container (language can be switched between Chinese and English).

3. Choose the “NVIDIA GeForce RTX 5090” GPU and the “PyTorch” image, select a billing plan, and click “Continue job execution”.

4. Wait roughly three minutes for the container to start; once the status shows “Running”, click the arrow next to the API address to open the demo page.

In the demo, upload a document image and press “Extract Text”. The model first segments text and chart regions, then outputs the result in Markdown format.
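For programmatic access instead of the browser UI, and assuming the container exposes a standard Gradio app, a call might look like the sketch below; the URL placeholder, endpoint name, and argument are hypothetical, so check the demo's “Use via API” page for the real signature.

```python
from gradio_client import Client, handle_file

# The API address comes from the HyperAI container page; the endpoint
# name "/extract_text" is a guess, not confirmed by the tutorial.
client = Client("https://<your-container-api-address>")
result = client.predict(
    handle_file("page.png"),  # the document image to OCR
    api_name="/extract_text",
)
print(result)  # expected: Markdown text of the page
```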

The article concludes by inviting readers to try the tutorial and explore the capabilities of visual‑token compression for LLMs.

Tags: LLM, OCR, tutorial, visual compression, DeepEncoder, OmniDocBench, DeepSeek-OCR
Written by HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.