Frontier OCR Advances: DeepSeek, Tencent, and Baidu Push From Text Recognition to Structured Document Understanding

This weekly AI paper roundup reviews five cutting‑edge OCR studies—DeepSeek‑OCR 2, LightOnOCR‑2‑1B, HunyuanOCR, PaddleOCR‑VL, and GOT—detailing their novel visual‑language architectures, training data, benchmark evaluations, and performance gains over previous models.


Overview

Recent years have seen OCR evolve from simple character‑recognition tools into general‑purpose document‑understanding systems built around visual‑language models. Major players such as Microsoft, Google, Baidu, Tencent, and Alibaba Cloud are driving this shift toward intelligent document processing (IDP) that tackles complex layout, multimodal symbols, long‑context modeling, and end‑to‑end semantic understanding.

Paper 1: DeepSeek‑OCR 2 – Visual Causal Flow

DeepSeek researchers extend DeepSeek‑OCR with DeepSeek‑OCR 2, which introduces a new encoder, DeepEncoderV2, that dynamically reorders visual tokens based on semantic cues, establishing a causal reading order before the LLM performs content understanding. The training mix consists of OCR 1.0, OCR 2.0, and generic vision data, with OCR data comprising 80% of the mixture. Evaluation uses OmniDocBench v1.5, a benchmark of 1,355 multilingual pages spanning nine categories, including magazines, academic papers, and research reports.
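The paper's exact DeepEncoderV2 design is not reproduced here, but the core idea, scoring visual tokens and re‑sorting them before the decoder consumes them, can be sketched in a few lines. Everything below (the SemanticTokenReorder module, the linear scoring head, all dimensions) is a hypothetical illustration of the technique, not DeepSeek's implementation.

```python
import torch
import torch.nn as nn

class SemanticTokenReorder(nn.Module):
    """Hypothetical sketch of semantics-driven visual token reordering.

    A small scoring head assigns each visual token a salience score;
    tokens are then re-sorted so the decoder consumes them in a
    semantically motivated (causal) order rather than raster order.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # one scalar score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) from the vision encoder
        scores = self.score_head(tokens).squeeze(-1)     # (B, N)
        order = scores.argsort(dim=-1, descending=True)  # (B, N)
        # Gather tokens in score order before the LLM decoder sees them.
        idx = order.unsqueeze(-1).expand_as(tokens)
        return tokens.gather(dim=1, index=idx)

# Usage: reorder 256 visual tokens of width 1024 (illustrative sizes).
reorder = SemanticTokenReorder(dim=1024)
visual_tokens = torch.randn(2, 256, 1024)
ordered = reorder(visual_tokens)  # (2, 256, 1024)
```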

DeepSeek‑OCR 2 architecture example

Paper 2: LightOnOCR‑2‑1B

LightOn introduces LightOnOCR‑2‑1B, a compact 1‑billion‑parameter multilingual vision‑language model that directly extracts clean, ordered text from document images. It surpasses larger models in accuracy while adding image‑localization ability via reinforcement learning with verifiable rewards (RLVR) and improving robustness through checkpoint merging. The training set combines OCR 1.0, OCR 2.0, and generic visual data (80% OCR), plus teacher‑annotated pages, GPT‑4o‑labeled regions (paragraphs, titles, abstracts), blank‑page samples to suppress hallucination, and TeX‑derived supervision from arXiv via the nvpdftex pipeline.
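Checkpoint merging for robustness is commonly implemented as a uniform average of the weights of several fine‑tuned checkpoints ("model soup" style). The paper's exact recipe is not given here, so the following is a generic sketch of that technique; file names are placeholders.

```python
import torch

def merge_checkpoints(paths: list[str]) -> dict[str, torch.Tensor]:
    """Uniformly average the parameters of several checkpoints.

    Assumes all checkpoints share the same architecture and
    parameter names; takes the elementwise mean per parameter.
    """
    merged: dict[str, torch.Tensor] = {}
    for i, path in enumerate(paths):
        state = torch.load(path, map_location="cpu")
        for name, param in state.items():
            if i == 0:
                merged[name] = param.clone().float()
            else:
                merged[name] += param.float()
    return {name: p / len(paths) for name, p in merged.items()}

# Usage (hypothetical checkpoint files):
# soup = merge_checkpoints(["ckpt_a.pt", "ckpt_b.pt", "ckpt_c.pt"])
# model.load_state_dict(soup)
```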

LightOnOCR architecture example

Paper 3: HunyuanOCR

Developed by Tencent and collaborators, HunyuanOCR is an open‑source 1‑billion‑parameter visual‑language model that unifies end‑to‑end OCR capabilities, including text localization, document parsing, information extraction, and translation, through a lightweight MLP adapter bridging a ViT encoder and an LLM. On OmniDocBench, HunyuanOCR achieves a total score of 94.10, outperforming all larger models and commercial APIs.
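A ViT‑to‑LLM MLP adapter of this kind is typically a small projection from the vision encoder's hidden size into the LLM's embedding width, so image tokens can be interleaved with text tokens. The sketch below shows this common pattern; all dimensions are chosen for illustration, not taken from HunyuanOCR.

```python
import torch
import torch.nn as nn

class MLPAdapter(nn.Module):
    """Lightweight MLP bridge between a ViT encoder and an LLM.

    Projects visual features (vit_dim) into the LLM embedding space
    (llm_dim). Sizes here are illustrative, not HunyuanOCR's.
    """

    def __init__(self, vit_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vit_features: torch.Tensor) -> torch.Tensor:
        # vit_features: (batch, num_patches, vit_dim)
        return self.proj(vit_features)  # (batch, num_patches, llm_dim)

# Usage: map 1024-d ViT patch features into a 2048-d LLM token space.
adapter = MLPAdapter(vit_dim=1024, llm_dim=2048)
image_tokens = adapter(torch.randn(1, 196, 1024))  # (1, 196, 2048)
```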

HunyuanOCR architecture example

Paper 4: PaddleOCR‑VL

Baidu’s team presents PaddleOCR‑VL, a resource‑efficient visual‑language model that pairs a NaViT‑style dynamic‑resolution encoder with the ERNIE‑4.5‑0.3B language model. It delivers state‑of‑the‑art multilingual document parsing, accurately recognizing tables, formulas, and other complex elements while maintaining fast inference. On OmniDocBench v1.5, PaddleOCR‑VL records a best overall score of 92.86, surpassing MinerU2.5‑1.2B (90.67) and achieving superior metrics on text (edit distance 0.035), formulas (CDM 91.22), tables (TEDS 90.89, TEDS‑S 94.76), and reading order (0.043).
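The text metric quoted above is a normalized edit distance, where lower is better and 0.035 means roughly 3.5 edits per 100 characters. A minimal reference implementation of that metric, independent of OmniDocBench's exact tooling, looks like this:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Levenshtein distance divided by the longer string's length.

    0.0 means an exact match; 1.0 means nothing matches. The
    benchmark's exact normalization may differ; this is the
    standard formulation.
    """
    m, n = len(pred), len(ref)
    if max(m, n) == 0:
        return 0.0
    # Classic dynamic-programming Levenshtein with a rolling row.
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[n] / max(m, n)

# One substitution over 12 characters -> ~0.083.
print(normalized_edit_distance("PaddleOCR-VL", "PaddleOCR VL"))
```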

PaddleOCR‑VL table recognition example

Paper 5: GOT (Unified OCR‑2.0)

Researchers from StepFun, Megvii, the Chinese Academy of Sciences, and Tsinghua University propose GOT, a 580‑million‑parameter unified end‑to‑end OCR‑2.0 model. It uses a high‑compression encoder and a long‑context decoder to extend recognition from plain text to mathematical formulas, tables, charts, and geometric figures. The model supports sliced or full‑page input, formatted outputs (Markdown/TikZ/SMILES), interactive region‑level recognition, dynamic resolution, and multi‑page processing. Training on 8×8 L40S GPUs proceeds in three stages: pre‑training (3 epochs, batch size 128, lr 1e‑4), joint training (1 epoch, max tokens 6,000), and post‑training (1 epoch, max tokens 8,192, lr 2e‑5), with 80% of the data retained for the final stage. Benchmark results on ChartQA‑SE and PlotQA‑SE are presented as illustrative examples.
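The staged schedule reads naturally as a configuration table; the sketch below records the hyperparameters listed above in code. Field names are our own, and values not reported for a stage are left as None.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class StageConfig:
    """One stage of GOT's three-stage schedule, as reported above."""
    name: str
    epochs: int
    lr: Optional[float] = None
    batch_size: Optional[int] = None
    max_tokens: Optional[int] = None

GOT_SCHEDULE = [
    StageConfig("pre-training", epochs=3, lr=1e-4, batch_size=128),
    StageConfig("joint-training", epochs=1, max_tokens=6000),
    StageConfig("post-training", epochs=1, lr=2e-5, max_tokens=8192),
]

for stage in GOT_SCHEDULE:
    print(stage)
```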

GOT architecture example

Conclusion

The five papers illustrate a rapid transition from rule‑driven OCR toward integrated visual‑language systems that jointly model vision and language, handle complex layouts, and deliver end‑to‑end semantic understanding, marking a new era for intelligent document processing.

Tags: OCR, DeepSeek, Document Understanding, GOT, Vision Language Model, LightOnOCR, PaddleOCR-VL
Written by HyperAI Super Neural

Deconstructing the sophistication and universality of technology, covering cutting-edge AI for Science case studies.
