Frontier OCR Advances: DeepSeek, Tencent, and Baidu Push From Text Recognition to Structured Document Understanding

This weekly AI paper roundup reviews five cutting‑edge OCR studies—DeepSeek‑OCR 2, LightOnOCR‑2‑1B, HunyuanOCR, PaddleOCR‑VL, and GOT—detailing their novel visual‑language architectures, training data, benchmark evaluations, and performance gains over previous models.


Overview

Recent years have seen OCR evolve from simple character‑recognition tools into general‑purpose document‑understanding systems built around visual‑language models. Major players such as Microsoft, Google, Baidu, Tencent, and Alibaba Cloud are driving this shift toward intelligent document processing (IDP) that tackles complex layout, multimodal symbols, long‑context modeling, and end‑to‑end semantic understanding.

Paper 1: DeepSeek‑OCR 2 – Visual Causal Flow

DeepSeek researchers extend DeepSeek‑OCR with DeepSeek‑OCR 2, which introduces a new encoder, DeepEncoderV2, that dynamically reorders visual tokens based on semantic cues, establishing a causal reading order before the LLM performs content understanding. The training mix consists of OCR 1.0, OCR 2.0, and generic vision data, with OCR data comprising 80% of the mixture. Evaluation uses OmniDocBench v1.5, a benchmark of 1,355 multilingual pages spanning nine categories, including magazines, academic papers, and research reports.
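The paper's exact DeepEncoderV2 design is not reproduced here, but the core idea, scoring visual tokens and re‑sorting them before the decoder consumes them, can be sketched in a few lines. Everything below (the SemanticTokenReorder module, the linear scoring head, all dimensions) is a hypothetical illustration of the technique, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class SemanticTokenReorder(nn.Module):
    """Hypothetical sketch of semantics-driven visual token reordering.

    A small scoring head assigns each visual token a salience score;
    tokens are then re-sorted so the decoder consumes them in a
    semantically motivated (causal) order rather than raster order.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # one scalar score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from the vision encoder
        scores = self.score_head(tokens).squeeze(-1)     # (B, N)
        order = scores.argsort(dim=-1, descending=True)  # (B, N)
        # Gather tokens in score order before the LLM decoder sees them.
        idx = order.unsqueeze(-1).expand_as(tokens)
        return tokens.gather(dim=1, index=idx)

# Usage: reorder 256 visual tokens of width 1024 (illustrative sizes).
reorder = SemanticTokenReorder(dim=1024)
visual_tokens = torch.randn(2, 256, 1024)
ordered = reorder(visual_tokens)  # (2, 256, 1024)
```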

DeepSeek‑OCR 2 architecture example

Paper 2: LightOnOCR‑2‑1B

LightOn introduces LightOnOCR‑2‑1B, a compact 1‑billion‑parameter multilingual vision‑language model that directly extracts clean, ordered text from document images. It surpasses larger models in accuracy while adding image‑localization ability via reinforcement learning with verifiable rewards (RLVR) and improving robustness through checkpoint merging. The training set combines OCR 1.0, OCR 2.0, and generic visual data (80% OCR), plus teacher‑annotated pages, GPT‑4o‑labeled regions (paragraphs, titles, abstracts), blank‑page samples to suppress hallucination, and TeX‑derived supervision from arXiv via the nvpdftex pipeline.
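Checkpoint merging for robustness is commonly implemented as a uniform average of the weights of several fine‑tuned checkpoints ("model soup" style). The paper's exact recipe is not given here, so the following is a generic sketch of that technique; file names are placeholders.

```python
import torch

def merge_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameters of several checkpoints.

    Assumes all checkpoints share the same architecture and
    parameter names; takes the elementwise mean per parameter.
    """
    merged: dict[str, torch.Tensor] = {}
    for i, path in enumerate(paths):
        state = torch.load(path, map_location="cpu")
        for name, param in state.items():
            if i == 0:
                merged[name] = param.clone().float()
            else:
                merged[name] += param.float()
    return {name: p / len(paths) for name, p in merged.items()}

# Usage (hypothetical checkpoint files):
# soup = merge_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(soup)
```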

LightOnOCR architecture example

Paper 3: HunyuanOCR

Developed by Tencent and collaborators, HunyuanOCR is an open‑source 1‑billion‑parameter visual‑language model that unifies end‑to‑end OCR capabilities, including text localization, document parsing, information extraction, and translation, through a lightweight MLP adapter bridging a ViT encoder and an LLM. On OmniDocBench, HunyuanOCR achieves a total score of 94.10, outperforming all larger models and commercial APIs.
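A ViT‑to‑LLM MLP adapter of this kind is typically a small projection from the vision encoder's hidden size into the LLM's embedding width, so image tokens can be interleaved with text tokens. The sketch below shows this common pattern; all dimensions are chosen for illustration, not taken from HunyuanOCR.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Lightweight MLP bridge between a ViT encoder and an LLM.

    Projects visual features (vit_dim) into the LLM embedding space
    (llm_dim). Sizes here are illustrative, not HunyuanOCR's.
    """

    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim)
        return self.proj(vit_features)  # (batch, num_patches, llm_dim)

# Usage: map 1024-d ViT patch features into a 2048-d LLM token space.
adapter = MLPAdapter(vit_dim=1024, llm_dim=2048)
image_tokens = adapter(torch.randn(1, 196, 1024))  # (1, 196, 2048)
```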

HunyuanOCR architecture example

Paper 4: PaddleOCR‑VL

Baidu’s team presents PaddleOCR‑VL, a resource‑efficient visual‑language model that pairs a NaViT‑style dynamic‑resolution encoder with the ERNIE‑4.5‑0.3B language model. It delivers state‑of‑the‑art multilingual document parsing, accurately recognizing tables, formulas, and other complex elements while maintaining fast inference. On OmniDocBench v1.5, PaddleOCR‑VL records a best overall score of 92.86, surpassing MinerU2.5‑1.2B (90.67) and achieving superior metrics on text (edit distance 0.035), formulas (CDM 91.22), tables (TEDS 90.89, TEDS‑S 94.76), and reading order (0.043).
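The text metric quoted above is a normalized edit distance, where lower is better and 0.035 means roughly 3.5 edits per 100 characters. A minimal reference implementation of that metric, independent of OmniDocBench's exact tooling, looks like this:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length.

    0.0 means an exact match; 1.0 means nothing matches. The
    benchmark's exact normalization may differ; this is the
    standard formulation.
    """
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming Levenshtein with a rolling row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

# One substitution over 12 characters -> ~0.083.
print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR VL"))
```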

PaddleOCR‑VL table recognition example

Paper 5: GOT (Unified OCR‑2.0)

Researchers from StepFun, Megvii, the Chinese Academy of Sciences, and Tsinghua University propose GOT, a 580‑million‑parameter unified end‑to‑end OCR‑2.0 model. It uses a high‑compression encoder and a long‑context decoder to extend recognition from plain text to mathematical formulas, tables, charts, and geometric figures. The model supports sliced or full‑page input, formatted outputs (Markdown/TikZ/SMILES), interactive region‑level recognition, dynamic resolution, and multi‑page processing. Training on 8×8 L40S GPUs proceeds in three stages: pre‑training (3 epochs, batch size 128, lr 1e‑4), joint training (1 epoch, max tokens 6,000), and post‑training (1 epoch, max tokens 8,192, lr 2e‑5), with 80% of the data retained for the final stage. Benchmark results on ChartQA‑SE and PlotQA‑SE are presented as illustrative examples.
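The staged schedule reads naturally as a configuration table; the sketch below records the hyperparameters listed above in code. Field names are our own, and values not reported for a stage are left as None.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    """One stage of GOT's three-stage schedule, as reported above."""
    name: str
    epochs: int
    lr: Optional[float] = None
    batch_size: Optional[int] = None
    max_tokens: Optional[int] = None

GOT_SCHEDULE = [
    StageConfig("pre-training", epochs=3, lr=1e-4, batch_size=128),
    StageConfig("joint-training", epochs=1, max_tokens=6000),
    StageConfig("post-training", epochs=1, lr=2e-5, max_tokens=8192),
]

for stage in GOT_SCHEDULE:
    print(stage)
```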

GOT architecture example

Conclusion

The five papers illustrate a rapid transition from rule‑driven OCR toward integrated visual‑language systems that jointly model vision and language, handle complex layouts, and deliver end‑to‑end semantic understanding, marking a new era for intelligent document processing.

Tags: OCR, DeepSeek, Document Understanding, GOT, Vision Language Model, LightOnOCR, PaddleOCR-VL
Written by HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
