
XYLayoutLM: Towards Layout-Aware Multimodal Networks for Visually-Rich Document Understanding

XYLayoutLM is a layout‑aware multimodal network for visually‑rich document understanding. It augments the XY‑Cut algorithm for robust reading‑order generation and introduces a Dilated Conditional Position Encoding (DCPE) to handle variable‑length inputs, achieving state‑of‑the‑art results on the XFUN and FUNSD benchmarks.

AntTech

The paper presents XYLayoutLM, a multimodal document understanding model designed to address errors caused by complex form structures and long text sequences in automated reading tasks. The work, from Ant Group and Shanghai Jiao Tong University, was accepted at CVPR 2022.

It first defines a "proper reading order" for documents, highlighting challenges where OCR‑detected text boxes cannot be simply sorted left‑to‑right or top‑to‑bottom due to hierarchical layouts and noisy OCR outputs.

The core contributions are two novel modules: (1) an Augmented XY Cut algorithm that generates multiple plausible reading orders by adding small random perturbations to the bounding boxes, improving robustness against OCR noise; and (2) a Dilated Conditional Position Encoding (DCPE) that processes text and image tokens separately, applying 1‑D convolutions to text tokens and 2‑D convolutions to image tokens, with dilation to capture long‑range dependencies at no extra computational cost.
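As a rough sketch of the Augmented XY Cut idea (the function names, gap‑finding logic, and the top‑left fallback ordering are our assumptions, not the paper's code): recursively split the boxes at whitespace gaps, alternating between horizontal and vertical cuts, after jittering the coordinates so that repeated calls yield several plausible reading orders.

```python
import random
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # OCR box: (x0, y0, x1, y1)

def _split(idxs: List[int], coords: List[Box], axis: int) -> List[List[int]]:
    """Group box indices by projection gaps along `axis`
    (0 = vertical cut on x, 1 = horizontal cut on y)."""
    ivals = sorted((coords[i][axis], coords[i][axis + 2], i) for i in idxs)
    groups, cur, end = [], [ivals[0][2]], ivals[0][1]
    for lo, hi, i in ivals[1:]:
        if lo > end:            # whitespace gap found: start a new group
            groups.append(cur)
            cur = [i]
        else:
            cur.append(i)
        end = max(end, hi)
    groups.append(cur)
    return groups

def augmented_xy_cut(coords: List[Box], jitter: float = 0.0) -> List[int]:
    """Return a reading order as a list of box indices. A nonzero `jitter`
    randomly perturbs the boxes first, so repeated calls produce multiple
    plausible orders (the augmentation idea, sketched)."""
    jit = [(x0 + random.uniform(-jitter, jitter),
            y0 + random.uniform(-jitter, jitter),
            x1 + random.uniform(-jitter, jitter),
            y1 + random.uniform(-jitter, jitter))
           for x0, y0, x1, y1 in coords]

    def recurse(idxs: List[int]) -> List[int]:
        if len(idxs) <= 1:
            return list(idxs)
        for axis in (1, 0):                 # prefer horizontal cuts first
            groups = _split(idxs, jit, axis)
            if len(groups) > 1:
                return [j for g in groups for j in recurse(g)]
        # no whitespace gap left: fall back to top-left-first ordering
        return sorted(idxs, key=lambda i: (jit[i][1], jit[i][0]))

    return recurse(list(range(len(coords))))
```

With `jitter=0` the result is deterministic; for the three boxes below (top‑left, bottom‑left, top‑right) it reads the top row left‑to‑right before moving down, which a naive top‑to‑bottom sort would get wrong.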

XYLayoutLM integrates visual features from a ResNeXt‑101 backbone, textual embeddings, and layout embeddings (bounding‑box coordinates together with position encodings) before feeding the concatenated tokens into a self‑attention Transformer. The model is evaluated on the XFUN and FUNSD datasets, where it surpasses the LayoutXLM baseline by roughly 2% F1 while keeping a comparable parameter count.
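The DCPE described above can be sketched in PyTorch as a pair of depth‑wise dilated convolutions, one 1‑D branch for variable‑length text tokens and one 2‑D branch for the image feature grid. The class name, kernel size, dilation rate, and residual formulation here are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class DilatedConditionalPosEnc(nn.Module):
    """Sketch of a DCPE-style module: positions are encoded conditionally
    on the tokens themselves via convolutions, so any sequence length or
    grid size is supported. Dilation widens the receptive field without
    adding parameters (kernel size 3, dilation 2 are our assumptions)."""

    def __init__(self, dim: int, dilation: int = 2):
        super().__init__()
        # depth-wise (groups=dim) convolutions keep the extra cost small
        self.text_conv = nn.Conv1d(dim, dim, kernel_size=3,
                                   padding=dilation, dilation=dilation,
                                   groups=dim)
        self.img_conv = nn.Conv2d(dim, dim, kernel_size=3,
                                  padding=dilation, dilation=dilation,
                                  groups=dim)

    def forward(self, text_tokens: torch.Tensor, img_tokens: torch.Tensor):
        # text_tokens: (B, L, C) with L varying across batches
        t = self.text_conv(text_tokens.transpose(1, 2)).transpose(1, 2)
        # img_tokens: (B, H, W, C) on the 2-D visual feature grid
        i = self.img_conv(img_tokens.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        # add the conditional position encoding as a residual
        return text_tokens + t, img_tokens + i
```

Because the encoding is produced by convolution rather than a fixed lookup table, no maximum sequence length has to be baked into the model, which is the property the paper exploits for variable‑length inputs.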

Ablation studies confirm the effectiveness of both the Augmented XY Cut and DCPE modules, and experiments on LayoutXLM demonstrate the impact of different reading‑order strategies. Visualizations illustrate how the augmented ordering yields more logical text sequences.

The authors conclude that XYLayoutLM significantly advances multimodal document understanding and is already deployed in Ant Group's automated form‑processing system. Future work includes predicting reading order directly, extending to mini‑program page understanding, and further enhancing visual feature modeling.

Tags: multimodal, Vision Transformer, Document Understanding, position-encoding, layout-aware, XYCut
Written by

AntTech

Technology is the core driver of Ant's future creation.
