FinixDoc Tackles Hard Financial Document Parsing with a 4B Model and Open‑Source Benchmark

FinixDoc is an end‑to‑end financial document parsing system built on a 4‑billion‑parameter Qwen3‑VL model that outperforms open‑source baselines on the newly released FinixDocBench, handling low‑quality, complex and ultra‑large documents through a specialized training pipeline and evaluation matrix.

AntTech

FinixDoc is an end‑to‑end intelligent system for financial document parsing. Its core model, FinixDoc‑VL, is fine‑tuned from Qwen3‑VL‑4B and achieves an overall score of 81.43 on the FinixDocBench benchmark, surpassing the second‑place open‑source model by 5.13 points.

The system addresses the gap between clean, digitally native PDFs and real‑world financial documents such as phone‑captured receipts, insurance policies, and massive tables, where issues like shadows, blur, perspective distortion, and extreme aspect ratios make reliable structured output a challenge.

To characterize these challenges, the authors propose a Document Parsing Capability Matrix. The horizontal axis measures image quality (digital native → scanned → phone‑captured → blurred/occluded), while the vertical axis measures layout scale (single page → ultra‑long pages → large dense tables). The matrix defines four zones: Benchmark‑Converged, Low‑Quality, Underexplored Large‑Scale, and Ambiguous‑Unrecoverable.

Training focuses on the visual confusions specific to financial documents. From roughly 100k real financial pages, the team extracted about 4,500 high‑frequency characters and generated 5 to 20 visually similar alternatives for each. Every correct sample is paired with 10 to 30 hard‑negative samples, yielding approximately 2 million triplets for contrastive learning.
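As a rough illustration of the triplet setup described above, the sketch below pairs one correct label with its visually similar alternatives and scores a triplet with a standard margin loss. The article does not specify the loss, the embedding model, or the margin, so `triplet_margin_loss`, `build_triplets`, the margin of 0.2, and the example confusables are all assumptions made for illustration.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is a hypothetical choice, not taken from the article.
    """
    return max(0.0, l2_distance(anchor, positive)
               - l2_distance(anchor, negative) + margin)


def build_triplets(sample_id, correct, confusables):
    """Expand one correct sample into (sample, positive, hard negative) triplets.

    `confusables` stands in for the visually similar alternatives generated
    per character; the correct label itself is never used as a negative.
    """
    return [(sample_id, correct, neg) for neg in confusables if neg != correct]


# Hypothetical example: a crop whose correct reading is "0".
triplets = build_triplets("crop_001", "0", ["O", "D", "Q"])
```

The loss is zero once the negative is at least `margin` farther from the anchor than the positive, which is what pushes look-alike characters apart in embedding space.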

The second training stage applies reinforcement learning with the GRPO algorithm, using 100k pages of financial‑domain data and 100k pages of public‑domain data. The reward combines three dimensions: text fidelity (edit distance for text, TEDS for tables), detection quality (JSON categories and bounding‑box accuracy), and reading order (constraints for multi‑column, mixed‑layout, and long documents).
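The three reward dimensions can be sketched as follows, under stated assumptions. The article names the components but not how they are weighted or normalized, so the weights, the pairwise order score, and every function name here are hypothetical; TEDS for tables is omitted for brevity. The last function shows the group-relative normalization that gives GRPO its name: each sampled output is scored against the mean and standard deviation of its own group.

```python
from statistics import mean, pstdev


def edit_distance(a, b):
    """Levenshtein distance via a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def text_fidelity(pred, ref):
    """1 minus normalized edit distance, in [0, 1]."""
    if not pred and not ref:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / max(len(pred), len(ref))


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def order_score(pred_order, ref_order):
    """Fraction of element pairs that keep the reference relative order."""
    pos = {e: i for i, e in enumerate(pred_order)}
    common = [e for e in ref_order if e in pos]
    pairs = [(x, y) for i, x in enumerate(common) for y in common[i + 1:]]
    if not pairs:
        return 1.0
    return sum(pos[x] < pos[y] for x, y in pairs) / len(pairs)


def combined_reward(text, detection, order, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three dimensions; the weights are assumptions."""
    return weights[0] * text + weights[1] * detection + weights[2] * order


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward within its sample group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because advantages are computed relative to the group, a response is rewarded for being better than its siblings on the same page rather than for its absolute score.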

FinixDoc’s data pipeline, called Data Factory, operates in three stages. First, multi‑model collaborative pre‑labeling combines PP‑DocLayoutV3, FinixDoc‑VL, and Qwen3‑VL‑235B‑A22B‑Instruct, merging their results based on category compatibility, spatial overlap, and text similarity. Second, large‑model calibration and refinement uses Kimi‑K2.5 to correct OCR errors, structural inconsistencies, and hallucinations. Third, rule‑based verification and quality routing sorts samples into automatic‑pass, spot‑check, or expert‑review queues, ultimately yielding about 10k validated pages (≈1% of the raw material).
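A minimal sketch of the merge test and the routing step might look like the following. The article lists the three merge criteria and the three queues but gives no thresholds or schemas, so the region dictionaries, `COMPATIBLE_PAIRS`, the cutoff values, and the `route` inputs are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical category aliases treated as compatible across models.
COMPATIBLE_PAIRS = {frozenset(("title", "heading"))}


def categories_compatible(a, b):
    """Same label, or a pair explicitly listed as compatible."""
    return a == b or frozenset((a, b)) in COMPATIBLE_PAIRS


def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def same_region(r1, r2, iou_thr=0.5, text_thr=0.8):
    """Decide whether two models' predictions describe the same region.

    All three criteria from the article must agree; thresholds are assumed.
    """
    if not categories_compatible(r1["category"], r2["category"]):
        return False
    if iou(r1["bbox"], r2["bbox"]) < iou_thr:
        return False
    similarity = SequenceMatcher(None, r1["text"], r2["text"]).ratio()
    return similarity >= text_thr


def route(agreement, rule_violations):
    """Send a page to one of the three queues named in the article."""
    if rule_violations > 0:
        return "expert_review"
    if agreement >= 0.9:
        return "auto_pass"
    return "spot_check"
```

Requiring all three signals to agree is a conservative choice: it prefers leaving two near-duplicate regions unmerged over fusing regions that merely overlap.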

FinixDocBench comprises roughly 5,000 pages divided into four subsets: FinixDigital (digital‑native insurance clauses), FinixPhoto (phone‑captured medical receipts), FinixHuge (ultra‑long documents and massive tables), and FinixInner (internal workflow documents). The open‑source subset contains 742 pages with image, Markdown, and JSON annotations; 542 of these pages also provide structured JSON. The largest single page reaches 386 megapixels.

Evaluation results show FinixDoc‑VL scoring 81.43 overall, 93.19 on FinixDigital, 67.03 on FinixPhoto, and 84.08 on FinixInner. On the low‑quality FinixPhoto subset it improves over the base Qwen3‑VL‑4B by 12.75 points and outperforms both Kimi‑K2.5 and Qwen3‑VL‑235B‑A22B‑Instruct. For ultra‑large FinixHuge images, a split‑then‑merge strategy achieves a 92% success rate, compared with 68% and 34% for the two baselines.
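The article does not describe how split‑then‑merge is implemented, so the following is only one plausible shape for it: cut the page into overlapping tiles, parse each tile independently, then translate tile‑local boxes back into page coordinates. The tile size, the overlap, and all function names are assumptions, and deduplicating regions detected twice in the overlap bands is left out.

```python
def starts(length, tile, step):
    """Tile start offsets along one axis; the last tile is flush with the edge."""
    if length <= tile:
        return [0]
    offsets = list(range(0, length - tile, step))
    offsets.append(length - tile)
    return offsets


def tile_boxes(width, height, tile=2048, overlap=256):
    """Split a page into overlapping (x0, y0, x1, y1) tiles covering it fully."""
    step = tile - overlap
    return [(x, y, min(x + tile, width), min(y + tile, height))
            for y in starts(height, tile, step)
            for x in starts(width, tile, step)]


def to_page_coords(box, tile_origin):
    """Shift a tile-local bounding box back into full-page coordinates."""
    ox, oy = tile_origin
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)


# Hypothetical usage: a 5000 x 3000 page split into six overlapping tiles.
tiles = tile_boxes(5000, 3000)
```

The overlap exists so that a table row or text line cut by one tile boundary appears whole in a neighboring tile.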

In summary, FinixDoc demonstrates that a 4‑billion‑parameter vision‑language model, combined with domain‑specific contrastive learning, reinforcement learning, and a human‑in‑the‑loop data factory, can reliably parse low‑quality, complex, and ultra‑large financial documents, and the released FinixDocBench provides a reproducible benchmark for future research.
