Why No Perfect VLM OCR Exists for Complex Financial Reports – An In‑Depth Model Comparison
The article evaluates several VLM‑based OCR models on complex financial statements, comparing speed, layout accuracy, and handling of irregular tables, and concludes that while some models excel in specific aspects, none yet deliver a flawless solution for all scenarios.
Paper Data
The paper "olmOCR 2: Unit Test Rewards for Document OCR" reports test results for many OCR models, highlighting DeepSeek-OCR, PaddleOCR-VL, Infinity-Parser, and MinerU-VLM as strong performers.
Overall Intuition
For complex‑layout financial reports, the author selected representative examples and observed:
Pipeline‑based models such as MinerU‑Pipeline and PaddleOCR-VL are relatively slow, and sometimes their layout recognition is inferior to end‑to‑end VLM models.
Purpose-trained OCR models such as DeepSeek-OCR and MinerU-VLM are fast and provide decent layout accuracy.
Models fine‑tuned from large VLMs (Infinity-Parser, Chandra OCR, olmOCR) are large and slow, but generally achieve good detail and layout results.
Thus, if speed is the priority, DeepSeek-OCR or MinerU-VLM is recommended; for a balance of speed and table accuracy, PaddleOCR-VL is a good choice; with ample compute, the models fine-tuned from Qwen-VL (Infinity-Parser, Chandra OCR, olmOCR) are worth trying.
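The recommendation above can be written down as a small lookup, shown here as a minimal sketch. The priority labels ("speed", "balanced", "quality") and the function name `recommend` are illustrative assumptions, not terms from the article; only the model groupings come from the text.

```python
# Model groupings taken from the article's recommendation.
# The priority keys are hypothetical labels chosen for this sketch.
RECOMMENDATIONS = {
    "speed": ["DeepSeek-OCR", "MinerU-VLM"],
    "balanced": ["PaddleOCR-VL"],
    "quality": ["Infinity-Parser", "Chandra OCR", "olmOCR"],
}


def recommend(priority):
    """Return the models the article suggests for a given priority."""
    try:
        return list(RECOMMENDATIONS[priority])
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDATIONS))
        raise ValueError(
            f"unknown priority {priority!r}; expected one of: {valid}"
        ) from None


if __name__ == "__main__":
    for priority in RECOMMENDATIONS:
        print(priority, "->", ", ".join(recommend(priority)))
```

The case studies suggest treating this only as a starting point: the model that wins on one table type can fail on another, so a shortlist still needs checking against the actual documents.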
Case Studies
Case 1 – Blocked Short Paragraphs
PaddleOCR-VL's layout output is chaotic.
DeepSeek-OCR shows minor issues with short-paragraph recognition.
MinerU-VLM also struggles with short paragraphs.
Infinity‑Parser performs reasonably well.
Case 2 – Mixed Text and Short Paragraphs
PaddleOCR‑VL layout is messy.
DeepSeek‑OCR shows layout problems with short paragraphs.
MinerU‑VLM is acceptable.
Infinity‑Parser remains solid.
Case 3 – Flowchart with Mixed Font Sizes
PaddleOCR-VL's layout output is confused.
DeepSeek‑OCR works reasonably.
MinerU‑VLM loses some small characters.
Infinity‑Parser remains acceptable.
Case 4 – Semi‑Open Table with Mixed Layout
PaddleOCR‑VL still has layout issues but can label tiny table text well.
DeepSeek-OCR has minor problems with short paragraphs.
MinerU-VLM also has problems with short paragraphs.
Infinity‑Parser’s short‑paragraph handling is slightly better, and it recognises small table text well.
Case 5 – Cross‑Row Semi‑Open Table
PaddleOCR‑VL recognises tables well but layout is weak.
DeepSeek‑OCR table recognition degrades significantly.
MinerU‑VLM table results are average.
Infinity‑Parser provides the best semantic column separation.
Case 6 – Frame‑Less Distant Table
PaddleOCR‑VL fails to recognise the table but keeps the overall order.
DeepSeek‑OCR also fails to recognise the table yet preserves order.
MinerU‑VLM does not recognise the table and the order becomes chaotic.
Infinity‑Parser surprisingly succeeds.
Case 7 – Frame‑Less Multi‑Column Complex Table
PaddleOCR-VL handles the table well despite layout issues.
DeepSeek‑OCR makes small table errors.
MinerU‑VLM performs acceptably.
Infinity‑Parser suffers major table errors.
Case 8 – Cross‑Column Table
PaddleOCR-VL handles cross-column tables correctly.
DeepSeek‑OCR has minor cross‑column errors, misrecognising the last row.
MinerU‑VLM shows no cross‑column issues.
Infinity‑Parser collapses completely on cross‑column tables.
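Cross-column failures like those in Case 8 often show up as rows whose effective widths disagree. The following is a minimal sketch of an automatic check, assuming the model emits tables as HTML (an assumption; output formats vary by model and configuration). The names `RowWidthParser`, `row_widths`, and `has_consistent_columns` are hypothetical helpers, and the check ignores `rowspan`, so cross-row tables like Case 5 would need extra handling.

```python
from html.parser import HTMLParser


class RowWidthParser(HTMLParser):
    """Collect the effective width (sum of colspans) of every <tr>."""

    def __init__(self):
        super().__init__()
        self.widths = []
        self._current = None  # running width of the row being parsed

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current = 0
        elif tag in ("td", "th") and self._current is not None:
            span = dict(attrs).get("colspan") or "1"
            # Fall back to a width of 1 if colspan is not a plain integer.
            self._current += int(span) if span.isdigit() else 1

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.widths.append(self._current)
            self._current = None


def row_widths(html):
    """Return the effective column count of each table row, in order."""
    parser = RowWidthParser()
    parser.feed(html)
    parser.close()
    return parser.widths


def has_consistent_columns(html):
    """True if at least one row exists and all rows span the same width."""
    widths = row_widths(html)
    return len(widths) > 0 and len(set(widths)) == 1
```

A table whose header cell carries `colspan="2"` over two data columns passes the check, while the same table with the `colspan` dropped is flagged. The check only catches structural mismatches; a table can have consistent widths and still place values in the wrong columns.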
Summary
Overall, VLM‑based OCR for complex financial reports still lacks a perfect solution. Infinity‑Parser demonstrates strong semantic understanding but frequently makes critical OCR mistakes. PaddleOCR‑VL struggles with complex layouts yet excels at table, small‑font, and formula recognition. MinerU‑VLM is fast and performs well in many scenarios, though its semantic grasp lags slightly. DeepSeek‑OCR runs extremely fast with decent layout handling, but fine‑grained details are sometimes missing.
The author plans to release the test code; the models run well on an RTX 4060, with PaddleOCR‑VL and MinerU‑VLM requiring modest attention to configuration.
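The author's test code is not shown here. As a rough illustration of how per-page speed could be compared across models, the sketch below times an arbitrary callable over a list of pages. The function `time_per_page` and its `run_ocr` parameter are hypothetical; `run_ocr` stands in for whatever inference call a given model exposes.

```python
import time
from statistics import mean


def time_per_page(run_ocr, pages, repeats=1):
    """Return the mean wall-clock seconds run_ocr takes per page.

    run_ocr: hypothetical callable that takes one page and returns its OCR output.
    pages:   non-empty sequence of page inputs (paths, images, ...).
    repeats: how many times each page is processed.
    """
    if not pages or repeats < 1:
        raise ValueError("need at least one page and repeats >= 1")
    timings = []
    for page in pages:
        for _ in range(repeats):
            start = time.perf_counter()
            run_ocr(page)
            timings.append(time.perf_counter() - start)
    return mean(timings)


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the harness running.
    def fake_ocr(page):
        return page.upper()

    print(f"{time_per_page(fake_ocr, ['page-1', 'page-2'], repeats=3):.6f} s/page")
```

On a GPU, the first call typically includes model loading and warm-up, so it is usually worth running a throwaway page before timing.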
AI2ML AI to Machine Learning
Original articles on artificial intelligence and machine learning, deep optimization. Less is more, life is simple! Shi Chunqi
