Tagged articles

token pruning

4 articles · Page 1 of 1

Jun 2, 2026 · Artificial Intelligence

RTPrune: Two‑Stage Reading‑Inspired Token Pruning for Efficient DeepSeek‑OCR Inference

The paper presents RTPrune, a token‑pruning technique for DeepSeek‑OCR that exploits a two‑stage reading behavior in LLM decoding, first keeping high‑norm visual tokens and then fusing the rest via optimal‑transport matching with a dynamic pruning‑rate strategy, achieving up to 15% GFLOPs reduction and 18.9% speedup while preserving over 99% OCR accuracy across multiple benchmarks.

DeepSeek-OCROCR efficiencydynamic pruning

0 likes · 9 min read

RTPrune: Two‑Stage Reading‑Inspired Token Pruning for Efficient DeepSeek‑OCR Inference

Data Party THU

Aug 22, 2025 · Artificial Intelligence

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

TwigVLM introduces a lightweight “twig” module that prunes visual tokens early and enables self‑speculative decoding, achieving up to 154% speedup on long‑text generation while preserving 96% of original LVLM accuracy, as demonstrated on LLaVA‑1.5‑7B and other benchmarks.

LVLMMultimodal AImodel acceleration

0 likes · 14 min read

TwigVLM: How Tiny Branches Accelerate Large Vision‑Language Models

AIWalker

Jul 15, 2025 · Artificial Intelligence

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

This article presents Dynamic Vision Mamba (DyVM), a method that tackles token and block redundancy in Mamba‑based visual models through a novel re‑ordering pruning strategy and dynamic block selection, achieving a 35.2% FLOPs reduction with only a 1.7% accuracy loss while demonstrating strong generalization across tasks and architectures.

Dynamic Block SelectionFLOPs ReductionModel Efficiency

0 likes · 22 min read

Dynamic Vision Mamba: Re‑ordering Pruning and Adaptive Block Selection Cut FLOPs by 35.2%

JD Cloud Developers

Apr 27, 2025 · Artificial Intelligence

Overcoming the Hourglass Effect in Residual Quantization for Generative Retrieval

This paper investigates the “hourglass” phenomenon in residual‑quantized semantic identifiers for generative search and recommendation, revealing that token concentration in intermediate codebooks causes path sparsity and long‑tail distributions, and proposes heuristic layer removal and adaptive token‑pruning strategies that markedly improve model performance.

Generative Retrievalhourglass phenomenonresidual quantization

0 likes · 13 min read

Overcoming the Hourglass Effect in Residual Quantization for Generative Retrieval