
RTPrune: Two‑Stage Reading‑Inspired Token Pruning for Efficient DeepSeek‑OCR Inference

The paper presents RTPrune, a token‑pruning technique for DeepSeek‑OCR that exploits a two‑stage reading behavior observed in LLM decoding. It first keeps high‑norm visual tokens, then fuses the rest via optimal‑transport matching, with a dynamic strategy that sets the pruning rate per image. The reported result is a 15.29% GFLOPs reduction and an 18.90% shorter pre‑fill latency while preserving over 99% OCR accuracy, with evaluation across multiple benchmarks.

JD Retail Technology

Jun 2, 2026

Background

Vision‑language models (VLMs) have achieved strong multimodal performance, but optical character recognition (OCR) remains difficult. DeepSeek‑OCR reduces the cost of long‑text processing by representing document images with a small set of visual tokens, yet many of these tokens still carry redundant textual and structural information.

Two‑Stage Reading Observation

Analysis of LLM decoding attention reveals a consistent two‑stage reading pattern. In shallow layers, attention concentrates on the visual tokens whose feature embeddings have the highest L2‑norm; these tokens encode core text and layout. In deeper layers, attention spreads to additional tokens, including the high‑norm tokens not attended to earlier, to supplement contextual cues. This observation motivates a two‑stage pruning strategy.
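One way to look for this pattern is to measure, layer by layer, how much attention mass lands on the highest‑norm visual tokens. The following is a minimal sketch, assuming the per‑layer attention weights over visual tokens have already been extracted and averaged over heads. The function name `attention_share_on_top_norm` and the toy numbers are hypothetical and not taken from the paper.

```python
def attention_share_on_top_norm(attn_by_layer, norms, k):
    """Return, for each layer, the fraction of attention on the k highest-norm tokens.

    attn_by_layer: list of per-layer attention weights, one weight per visual token.
    norms: L2 norm of each visual token's feature embedding.
    """
    # Indices of the k tokens with the largest norm.
    top = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)[:k]
    shares = []
    for attn in attn_by_layer:
        total = sum(attn)
        shares.append(sum(attn[i] for i in top) / total if total else 0.0)
    return shares


# Toy data (made up): a "shallow" layer focused on tokens 0 and 2,
# and a "deep" layer whose attention is spread more evenly.
norms = [5.0, 1.0, 4.0, 0.5]
attn_by_layer = [
    [0.60, 0.05, 0.30, 0.05],
    [0.30, 0.30, 0.20, 0.20],
]
shares = attention_share_on_top_norm(attn_by_layer, norms, k=2)
```

Under the two‑stage pattern described above, the share would be expected to be high in shallow layers and to fall in deeper ones, as it does in the toy data.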

RTPrune Method

Stage 1 – Dominant Token Selection: Compute the L2‑norm of each visual token’s feature vector. Rank tokens by norm and retain the top‑k tokens, where k is determined by the target pruning rate. The remaining tokens form a candidate pool.
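The selection step can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function name `select_dominant_tokens` is hypothetical, and the rule k = round(n × (1 − pruning rate)) is an assumed reading of "k is determined by the target pruning rate".

```python
import math


def select_dominant_tokens(tokens, pruning_rate):
    """Split token indices into (retained, candidates) by L2 norm.

    tokens: list of feature vectors (lists of floats).
    pruning_rate: fraction of tokens moved to the candidate pool, in [0, 1).
    """
    if not 0.0 <= pruning_rate < 1.0:
        raise ValueError("pruning_rate must be in [0, 1)")
    n = len(tokens)
    # Assumed rule: keep round(n * (1 - rate)) tokens, and never fewer than one.
    k = max(1, round(n * (1.0 - pruning_rate)))
    norms = [math.sqrt(sum(x * x for x in t)) for t in tokens]
    # Rank indices by descending norm.
    ranked = sorted(range(n), key=lambda i: norms[i], reverse=True)
    # Return both groups in original sequence order.
    retained = sorted(ranked[:k])
    candidates = sorted(ranked[k:])
    return retained, candidates


tokens = [[3.0, 4.0], [0.1, 0.1], [1.0, 0.0], [6.0, 8.0]]
retained, candidates = select_dominant_tokens(tokens, pruning_rate=0.5)
```

Returning indices rather than vectors keeps the original token order available, which matters if positional information is reused downstream.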

Stage 2 – Optimal Token Fusion: Construct an optimal‑transport (OT) matching matrix P between retained tokens and candidate tokens based on feature similarity. Candidate tokens that cannot be matched, which are treated as highly redundant, are sent to a “trash bin” and discarded. Matched candidate tokens are fused into their paired retained tokens by weighted averaging, preserving information while shortening the sequence.
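The fusion step can be illustrated with entropic OT solved by Sinkhorn iterations, with one extra column acting as the trash bin. This is a sketch under several assumptions that the summary does not specify: cost is taken as 1 minus cosine similarity, each candidate carries equal mass, and the bin's cost (`bin_cost`) and capacity (`bin_mass`) are hypothetical parameters. The paper's actual formulation may differ.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def sinkhorn(cost, row_mass, col_mass, eps=0.1, iters=100):
    """Entropic OT plan P whose row/column sums approximate the given masses."""
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    rows, cols = len(row_mass), len(col_mass)
    u, v = [1.0] * rows, [1.0] * cols
    for _ in range(iters):
        u = [row_mass[i] / sum(kernel[i][j] * v[j] for j in range(cols))
             for i in range(rows)]
        v = [col_mass[j] / sum(kernel[i][j] * u[i] for i in range(rows))
             for j in range(cols)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(cols)] for i in range(rows)]


def fuse_tokens(retained, candidates, bin_cost=0.5, bin_mass=0.5):
    """Fuse matched candidates into retained tokens; return (fused, dropped)."""
    m, k = len(candidates), len(retained)
    if m == 0:
        return [list(r) for r in retained], []
    # Rows are candidates; columns are retained tokens plus one trash-bin column.
    cost = [[1.0 - cosine(c, r) for r in retained] + [bin_cost] for c in candidates]
    row_mass = [1.0 / m] * m
    col_mass = [(1.0 - bin_mass) / k] * k + [bin_mass]
    plan = sinkhorn(cost, row_mass, col_mass)

    fused = [list(r) for r in retained]
    weight = [1.0] * k          # each retained token starts with weight 1
    dropped = []
    for i, cand in enumerate(candidates):
        j = max(range(k + 1), key=lambda col: plan[i][col])
        if j == k:              # strongest match is the trash bin
            dropped.append(i)
            continue
        w = plan[i][j] * m      # share of candidate i's mass sent to token j
        fused[j] = [f + w * x for f, x in zip(fused[j], cand)]
        weight[j] += w
    fused = [[x / weight[j] for x in row] for j, row in enumerate(fused)]
    return fused, dropped


retained = [[1.0, 0.0], [0.0, 1.0]]
candidates = [[1.0, 0.1], [0.1, 1.0], [-1.0, -1.0], [-1.0, -0.9]]
fused, dropped = fuse_tokens(retained, candidates)
```

In this toy case the first two candidates are close to one retained token each and get merged, while the last two resemble neither and fall into the bin. The output always has as many tokens as the retained set, which is where the sequence shortening comes from.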

Dynamic Pruning‑Rate Strategy: For each input image, compute two metrics: (1) token‑wise feature similarity φ (measuring non‑textual redundancy) and (2) text density ρ (estimated with Sobel edge detection). Images with dense text receive a conservative pruning rate; background‑rich images are pruned more aggressively, yielding an adaptive trade‑off between efficiency and OCR fidelity.
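A rough sketch of how the two metrics could feed a per‑image pruning rate follows. The summary does not give the formula that combines φ and ρ, so the mapping in `pruning_rate` (linear in φ·(1 − ρ), bounded by hypothetical `r_min` and `r_max`) is purely an assumption for illustration, as are the edge threshold and the definition of φ as mean pairwise cosine similarity.

```python
import math


def text_density(gray, threshold=0.5):
    """rho: fraction of interior pixels whose Sobel gradient magnitude exceeds threshold.

    gray: 2D list of grayscale intensities in [0, 1].
    """
    h, w = len(gray), len(gray[0])
    edges = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            if math.hypot(gx, gy) > threshold:
                edges += 1
    return edges / max(1, (h - 2) * (w - 2))


def mean_similarity(tokens):
    """phi: mean pairwise cosine similarity between token feature vectors."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb + 1e-12)
    pairs = [(i, j) for i in range(len(tokens)) for j in range(i + 1, len(tokens))]
    if not pairs:
        return 0.0
    return sum(cosine(tokens[i], tokens[j]) for i, j in pairs) / len(pairs)


def pruning_rate(phi, rho, r_min=0.1, r_max=0.6):
    """Hypothetical mapping: more redundancy raises the rate, more text lowers it."""
    score = max(0.0, min(1.0, phi)) * (1.0 - max(0.0, min(1.0, rho)))
    return r_min + (r_max - r_min) * score


# Toy images: a blank page and a page with a single vertical step edge.
blank = [[0.0] * 6 for _ in range(5)]
step = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
rate_blank = pruning_rate(phi=0.9, rho=text_density(blank))
rate_step = pruning_rate(phi=0.9, rho=text_density(step))
```

With the same φ, the blank image gets the higher pruning rate, which matches the qualitative behavior described above: background‑rich inputs are pruned more aggressively than text‑dense ones.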

Experimental Evaluation

Effectiveness: RTPrune was evaluated on OmniDocBench, OlmOCR‑Bench, and Ocean‑OCR Benchmark. Compared with prior token‑pruning baselines, it reduced GFLOPs by 15.29% and shortened pre‑fill latency by 18.90% while maintaining 99.47% OCR accuracy, the best accuracy‑efficiency balance among the compared methods.

Generalization: The method was further tested on recent end‑to‑end OCR models: DeepSeek‑OCR 2 [1], LightOnOCR [2], and GLM‑OCR [3]. Across all three, RTPrune retained more than 95% of the original performance while delivering substantial speed gains, which suggests it applies beyond DeepSeek‑OCR.

Ablation Studies: Component analysis shows that L2‑norm ranking outperforms variance‑based and other importance metrics; OT‑based merging yields the highest overall performance; and the dynamic pruning‑rate strategy improves accuracy by up to 13.5% in high‑pruning scenarios compared with a fixed pruning rate.

References

[1] DeepSeek‑OCR 2: Visual Causal Flow. arXiv 2026.

[2] LightOnOCR: A 1B End‑to‑End Multilingual Vision‑Language Model for State‑of‑the‑Art OCR. arXiv 2026.

[3] GLM‑OCR Technical Report. arXiv 2026.


