FireRed-OCR 2B: An Open‑Source VLM That Tackles Structural Hallucination
FireRed‑OCR‑2B, an open‑source 2‑billion‑parameter visual‑language model, addresses structural hallucination in document OCR through a geometry‑aware data factory and a three‑stage training pipeline. It achieves a 92.94 OmniDocBench v1.5 score, leading end‑to‑end performance, while remaining lightweight enough for consumer‑grade GPUs.
Introduction
FireRedTeam released the weights of FireRed-OCR-2B on 2026‑02‑28 and posted the technical report on arXiv on 2026‑03‑02. The model is positioned not as another generic OCR attempt but as a solution to the "structural hallucination" problem that plagues general‑purpose visual‑language models (VLMs) when parsing complex documents.
Why Structural Hallucination Matters
When using VLMs for PDF‑to‑Markdown conversion, users often find that plain text is recognized reasonably well, but tables, formulas, nested headings, and irregular layouts come out misaligned, with dropped brackets, or in the wrong reading order. This phenomenon, termed Structural Hallucination in the paper, makes the output unusable for downstream tasks such as RAG, knowledge‑base cleaning, or financial‑report extraction.
What FireRed-OCR Tries to Solve
The authors argue that the core of OCR is not merely character accuracy but the integrity of the extracted structure. FireRed-OCR therefore treats structure as a first‑class objective and builds a framework that forces a VLM to produce well‑formed, program‑consumable markup.
Three Key Contributions
Geometry + Semantics Data Factory: Instead of random sampling, the training data are curated with geometric clustering and multi‑dimensional labels to balance long‑tail layouts, rare document types, and challenging formats such as multi‑column text, nested tables, and scanned noise.
Three‑Stage Training Pipeline:
Multi‑task Pre‑alignment: Detection, region recognition, and layout‑to‑Markdown tasks give the model spatial grounding.
Specialized SFT: Fine‑tuning on high‑quality, standardized Markdown data stabilizes the “output a full page of structured results” format.
Format‑Constrained GRPO: Reinforcement learning with format‑aware rewards penalizes formula syntax errors, table closure mistakes, hierarchy violations, and text inaccuracies.
Structural Constraints as Objective: By rewarding correct LaTeX formulas, complete tables, and proper reading order, the model optimizes for outputs that can be directly consumed by downstream programs rather than just looking like text.
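To make the idea of a format‑aware reward concrete, here is a minimal sketch of what such a scorer could look like. The checks and penalty weights are purely illustrative assumptions, not the paper's actual GRPO reward; they only demonstrate the principle of penalizing unbalanced math delimiters, inconsistent table rows, and heading‑level jumps.

```python
import re

def format_reward(output: str) -> float:
    """Illustrative format-aware reward: start at 1.0 and subtract
    penalties for structural defects. Weights are made up for the sketch."""
    reward = 1.0

    # Formula syntax: $$ delimiters must be paired, braces must balance.
    if output.count("$$") % 2 != 0:
        reward -= 0.3
    if output.count("{") != output.count("}"):
        reward -= 0.3

    # Table closure: all rows of a page's tables should agree on cell count.
    rows = [ln for ln in output.splitlines() if ln.strip().startswith("|")]
    if rows:
        ncols = rows[0].count("|")
        if any(r.count("|") != ncols for r in rows):
            reward -= 0.2

    # Hierarchy: heading levels should not jump by more than one.
    levels = [len(m.group(1)) for m in re.finditer(r"^(#+)\s", output, re.M)]
    if any(b - a > 1 for a, b in zip(levels, levels[1:])):
        reward -= 0.2

    return max(reward, 0.0)
```

A well‑formed page keeps the full reward, while a page with an unclosed formula or a ragged table loses part of it, which is the signal GRPO‑style training can optimize against.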
Benchmark Results
On OmniDocBench v1.5, FireRed‑OCR‑2B scores 92.94, ranking first in the end‑to‑end track and surpassing DeepSeek‑OCR 2 (91.09) and OCRVerse (88.56). Compared with its base model Qwen3‑VL‑2B‑Instruct (81.87), the improvement is substantial. When pipeline methods are included, GLM‑OCR reaches 94.60 and PaddleOCR‑VL‑1.5 reaches 94.50, so FireRed‑OCR is not the overall leaderboard champion but is the best pure end‑to‑end 2B model.
Additional metrics on OmniDocBench v1.5 include:
Character edit distance: 0.032
Formula score: 91.71
Table TEDS: 90.31
Table TEDS_s: 93.81
Reading‑order edit distance: 0.041
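For readers unfamiliar with the edit‑distance metrics above: a score like 0.032 typically means the character‑level Levenshtein distance between prediction and reference, normalized by length (the exact normalization OmniDocBench uses may differ). A minimal sketch:

```python
def normalized_edit_distance(pred: str, ref: str) -> float:
    """Character-level Levenshtein distance divided by reference length.
    Lower is better; 0.0 means an exact match."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))  # dp[j] = distance between prefixes
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # deletion
                        dp[j - 1] + 1,        # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # substitution
            prev = cur
    return dp[n] / max(n, 1)
```

For example, `normalized_edit_distance("kitten", "sitting")` is 3/7, since three edits transform one word into the other.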
On the more challenging FireRedBench (wild‑field complex documents), FireRed‑OCR‑2B attains 74.62, beating the same base model (65.58) and DeepSeek‑OCR 2 (61.61), indicating robustness beyond benchmark‑specific tuning.
Installation
pip install transformers
pip install qwen-vl-utils
git clone https://github.com/FireRedTeam/FireRed-OCR.git
cd FireRed-OCR
The model is hosted on Hugging Face under an Apache‑2.0 license, with Qwen/Qwen3-VL-2B-Instruct as the backbone.
Usage Example
import torch
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from conv_for_infer import generate_conv  # helper script shipped in the FireRed-OCR repo
model = Qwen3VLForConditionalGeneration.from_pretrained(
"FireRedTeam/FireRed-OCR",
torch_dtype=torch.bfloat16,
device_map="auto",
)
processor = AutoProcessor.from_pretrained("FireRedTeam/FireRed-OCR")
image_path = "./examples/complex_table.png"
messages = generate_conv(image_path)
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt",
)
inputs = inputs.to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
# Trim the prompt tokens so only the newly generated tokens are decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
# post‑process to obtain structured Markdown
output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)
print(output_text)
For multi‑image or video scenarios the authors recommend enabling flash_attention_2 to improve speed and memory usage. They note that the current official inference script uses the transformers backend; large‑scale serving may require vLLM, SGLang, or a custom API server.
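Since structural hallucination is exactly what this model targets, a cheap downstream safeguard is to validate the generated Markdown before feeding it to RAG or extraction pipelines. The following is a hypothetical post‑processing check (not part of the FireRed‑OCR repo) that flags tables whose rows disagree on cell count, so malformed pages can be retried or logged:

```python
def validate_markdown_tables(md: str) -> list[str]:
    """Flag Markdown tables whose rows have inconsistent cell counts.
    Returns a list of human-readable problem descriptions."""
    problems: list[str] = []
    table: list[str] = []
    start = None
    def flush():
        if table and len({row.count("|") for row in table}) > 1:
            problems.append(f"table starting at line {start}: inconsistent column count")
    for lineno, line in enumerate(md.splitlines(), 1):
        if line.strip().startswith("|"):
            if start is None or not table:
                start = lineno
            table.append(line)
        else:
            flush()
            table, start = [], None
    flush()  # handle a table that runs to end of document
    return problems
```

Running it on the decoded `output_text` before indexing catches the most common table‑closure failures without any model in the loop.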
Author’s Assessment
The author concludes that the project is worth following, not because of raw scores alone but because of its engineering methodology: clear task definition, balanced data distribution, and a reward function that aligns with real‑world structural correctness. Limitations include that FireRed‑OCR‑2B is not the absolute top across all tracks, the ecosystem is still early, and production‑grade stability depends on community feedback on Chinese documents, scanned invoices, financial reports, and ultra‑long PDFs.
Old Zhang's AI Learning
AI practitioner specializing in large-model evaluation and on-premise deployment, agents, AI programming, Vibe Coding, general AI, and broader tech trends, with daily original technical articles.
