Unlimited OCR Achieves SOTA Long-Document Parsing in a Single Forward Pass

Unlimited OCR, Baidu's open‑source model built on DeepSeek OCR, pairs DeepEncoder's compressed visual tokens with a new Reference Sliding Window Attention (R‑SWA) that keeps the KV cache size constant during decoding. This enables end‑to‑end parsing of whole books, a 93.23% score on OmniDocBench v1.5, and stable latency across dozens of pages.

Machine Heart

Unlimited OCR is an open‑source OCR model released by Baidu that can process an entire book in a single forward pass under the standard 32K context limit, eliminating the need for page‑by‑page for‑loop processing or external schedulers.

On the mainstream document‑parsing benchmark OmniDocBench v1.5, Unlimited OCR achieves a total score of 93.23%, a full 6 percentage‑point improvement over DeepSeek OCR, establishing a new end‑to‑end SOTA.

The model is built directly on DeepSeek OCR. DeepEncoder compresses a 1024×1024 page image to only 256 visual tokens, an aggressive reduction that already eases the pre‑fill stage. However, the decoding side suffers from KV‑cache growth: each generated token adds new key/value entries, increasing memory usage, attention cost, and latency.
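
The token arithmetic behind this paragraph can be sketched as follows. The 1024×1024 resolution and the 256‑token figure come from the text above; the function names and the 16‑token prompt length are illustrative assumptions, and the cache length is counted in tokens per layer rather than in bytes.

```python
# Back-of-the-envelope token arithmetic; all names here are illustrative only.

TOKENS_PER_PAGE = 256   # from the article: one 1024x1024 page -> 256 visual tokens
PAGE_SIDE = 1024        # from the article


def pixels_per_visual_token(side: int = PAGE_SIDE, tokens: int = TOKENS_PER_PAGE) -> int:
    """How many input pixels each visual token has to summarize."""
    return (side * side) // tokens


def prefill_tokens(pages: int, prompt_tokens: int = 16) -> int:
    """Tokens present before decoding starts (the prompt length is an assumed value)."""
    return prompt_tokens + pages * TOKENS_PER_PAGE


def full_attention_cache_len(pages: int, generated: int, prompt_tokens: int = 16) -> int:
    """With ordinary causal attention, every generated token stays in the KV cache."""
    return prefill_tokens(pages, prompt_tokens) + generated


if __name__ == "__main__":
    print(pixels_per_visual_token())  # 4096
    for generated in (0, 1000, 6000):
        # The cache length grows by exactly one entry per generated token.
        print(generated, full_attention_cache_len(pages=10, generated=generated))
```

Under these assumptions the pre‑fill cost is set once by the page count, while the decoding‑side cache keeps growing with every output token, which is the problem the next paragraph addresses.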

To address this, Unlimited OCR replaces the standard multi‑head attention with Reference Sliding Window Attention (R‑SWA). R‑SWA splits the model’s visible information into two parts: (1) reference tokens, consisting of the visual tokens and the prompt, which remain visible at every decoding step; and (2) a recent‑output window (default size 128 tokens) that acts as a short‑term memory. Because older output tokens fall out of the window, the KV cache stops growing once the window is full, mirroring how a human copying a book only remembers the current page and the last few written characters.
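
The visibility rule described here can be sketched as a small function. This is a minimal illustration, not the released implementation: the function name, the position layout (reference tokens first, generated tokens after), the causal treatment of reference tokens, and the choice to count the current token inside the 128‑token window are all assumptions.

```python
from typing import List


def rswa_visible_positions(query_pos: int, num_reference: int, window: int = 128) -> List[int]:
    """Positions the token at `query_pos` may attend to under an R-SWA-style rule.

    Assumed layout: positions 0..num_reference-1 hold the reference tokens
    (visual tokens plus prompt); later positions hold generated output.
    """
    if query_pos < num_reference:
        # Assumption: reference tokens use ordinary causal attention among themselves.
        return list(range(query_pos + 1))

    # Part 1: every reference token stays visible at every decoding step.
    reference = list(range(num_reference))

    # Part 2: only the most recent `window` output tokens (current one included).
    window_start = max(num_reference, query_pos - window + 1)
    recent = list(range(window_start, query_pos + 1))
    return reference + recent


if __name__ == "__main__":
    num_reference = 16 + 256  # assumed 16-token prompt plus one 256-token page
    for generated in (1, 501, 5001):
        query_pos = num_reference + generated - 1
        print(generated, len(rswa_visible_positions(query_pos, num_reference)))
```

In this sketch the number of visible positions stops at `num_reference + window` no matter how many tokens have been generated, which is what bounds the cache.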

Experiments with Flash Attention v3 kernels show that Unlimited OCR’s per‑call latency remains essentially constant as decoding length grows, while DeepSeek OCR’s latency climbs as its KV cache expands. Correspondingly, GPU memory usage stays flat for Unlimited OCR but grows linearly for DeepSeek OCR. On accuracy, the per‑category analysis in Table 2 of the report shows consistent gains across complex layouts such as PPTs, newspapers, magazines, and notes.
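
The flat‑versus‑linear memory behavior follows from how many entries each cache holds. The toy simulation below counts cache entries only; it does not model the Flash Attention kernels or real tensor sizes, and the class and function names are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Tuple


class ReferenceWindowCache:
    """Toy cache: permanent reference entries plus a bounded window of recent outputs."""

    def __init__(self, reference_entries: Iterable, window: int = 128) -> None:
        self.reference = list(reference_entries)
        # deque(maxlen=...) silently evicts the oldest entry once the window is full.
        self.recent = deque(maxlen=window)

    def append(self, entry) -> None:
        self.recent.append(entry)

    def __len__(self) -> int:
        return len(self.reference) + len(self.recent)


def cache_sizes(num_reference: int, steps: int, window: int = 128) -> List[Tuple[int, int]]:
    """Return (full_attention_size, bounded_size) after each decoding step."""
    bounded = ReferenceWindowCache(range(num_reference), window)
    full = num_reference
    sizes = []
    for step in range(steps):
        bounded.append(("kv", step))  # placeholder for a real key/value pair
        full += 1                     # a full-attention cache never evicts
        sizes.append((full, len(bounded)))
    return sizes


if __name__ == "__main__":
    sizes = cache_sizes(num_reference=272, steps=1000)
    for step in (0, 127, 999):
        print(step + 1, sizes[step])
```

Since the attention work for one decoding step scales with the number of cached keys, a cache that plateaus is consistent with the flat per‑call latency reported above.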

Long‑document tests on an internal dataset ranging from 2 to 40+ pages show that Unlimited OCR maintains a low edit distance (< 0.11) and a high Distinct‑35 (about 97%) even at 40+ pages, confirming its robustness for multi‑page OCR.
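
For readers unfamiliar with the two metrics, here is one common way to compute them. The article does not give the report's exact definitions, so treat this as an assumption: edit distance is taken as Levenshtein distance normalized by the longer string, and Distinct‑n as the share of unique n‑grams among all n‑grams (Distinct‑35 would then use `n=35`). A low Distinct‑n signals repetitive, looping output.

```python
from typing import Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: Sequence, ref: Sequence) -> float:
    """Edit distance divided by the longer length; 0.0 means identical."""
    if not pred and not ref:
        return 0.0
    return edit_distance(pred, ref) / max(len(pred), len(ref))


def distinct_n(tokens: Sequence, n: int) -> float:
    """Unique n-grams divided by total n-grams; 1.0 means no n-gram repeats."""
    total = len(tokens) - n + 1
    if total <= 0:
        return 1.0  # convention chosen here for inputs shorter than n
    ngrams = {tuple(tokens[i:i + n]) for i in range(total)}
    return len(ngrams) / total


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))   # 3
    print(distinct_n(list("abcdef"), 3))        # 1.0
    print(distinct_n(list("abcabcabc"), 3))     # 3 unique out of 7 n-grams
```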

Throughput (TPS) comparisons show that at short output lengths (256 tokens) both models perform similarly, but as output length increases, DeepSeek OCR’s TPS declines sharply. At 6000 tokens Unlimited OCR is about 35% faster, matching the latency trends observed earlier.

The report also notes a mysterious technical director “YY” and speculates that the author may have come from DeepSeek, given the strong stylistic and technical continuity between the two projects.

Report title: Unlimited OCR Works

Report link: https://huggingface.co/baidu/Unlimited-OCR/blob/main/Unlimited-OCR.pdf

Project repository: https://github.com/baidu/Unlimited-OCR

Hugging Face model page: https://huggingface.co/baidu/Unlimited-OCR
