
LayoutMask: Enhancing Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

LayoutMask introduces a novel multi-modal pre‑training model that replaces global 1D position with local 1D position and adds Whole Word Masking, Layout‑Aware Masking, and Masked Position Modeling, achieving state‑of‑the‑art results on various visually‑rich document understanding tasks.


The paper presents LayoutMask, a new multi‑modal pre‑training model designed to improve visually‑rich document understanding (VrDU) by addressing the reading‑order problem inherent in existing models that rely on global 1D positional encodings.

Instead of using a global 1D position, LayoutMask adopts a local 1D position that only encodes the order of text within each OCR‑detected segment, and combines it with segment‑level 2D positional information. This forces the model to infer global reading order from both semantic and spatial cues, making it more robust to layout disturbances such as segment swaps.
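The contrast between the two positional schemes can be sketched as follows. This is an illustrative snippet, not the authors' code; the segment boundaries and 1-based indexing are assumptions for the example.

```python
# Contrast global 1D positions (one fixed reading order over the page)
# with the local 1D positions LayoutMask uses (index restarts inside
# each OCR-detected segment). Hypothetical example segments below.
from typing import List


def global_1d_positions(segments: List[List[str]]) -> List[int]:
    """One running index across all tokens on the page."""
    pos, i = [], 1
    for seg in segments:
        for _ in seg:
            pos.append(i)
            i += 1
    return pos


def local_1d_positions(segments: List[List[str]]) -> List[int]:
    """Index restarts at 1 within every segment, so the model must
    infer cross-segment reading order from semantics and 2D layout."""
    pos: List[int] = []
    for seg in segments:
        pos.extend(range(1, len(seg) + 1))
    return pos


segments = [["Invoice", "No."], ["Total", "Due", ":"]]
print(global_1d_positions(segments))  # [1, 2, 3, 4, 5]
print(local_1d_positions(segments))   # [1, 2, 1, 2, 3]
```

Because local positions are identical no matter how segments are ordered in the input, swapping two segments leaves the positional input unchanged, which is why the scheme is robust to segment swaps.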

Two enhanced masked language modeling strategies are introduced: Whole Word Masking (WWM) and Layout‑Aware Masking (LAM). WWM masks whole words rather than individual tokens, increasing task difficulty, while LAM increases the masking probability of the first and last words of each segment to encourage cross‑segment context learning.
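A minimal sketch of how the two strategies combine: masking decisions are made per whole word (WWM), and boundary words get a higher masking probability (LAM). The probabilities here are illustrative placeholders, not the paper's values.

```python
# Hedged sketch of Whole Word Masking + Layout-Aware Masking:
# each decision covers a whole word (all of its subword tokens would
# be masked together), and words at segment boundaries are masked
# with a higher probability than interior words.
import random
from typing import List


def lam_mask(segments: List[List[str]],
             p_boundary: float = 0.3,   # illustrative value
             p_interior: float = 0.15,  # illustrative value
             seed: int = 0) -> List[List[bool]]:
    """Return one mask decision per word in each segment."""
    rng = random.Random(seed)
    masks = []
    for seg in segments:
        seg_mask = []
        for i in range(len(seg)):
            boundary = i == 0 or i == len(seg) - 1
            p = p_boundary if boundary else p_interior
            seg_mask.append(rng.random() < p)
        masks.append(seg_mask)
    return masks
```

With the boundary probability raised, the model is forced to predict segment-initial and segment-final words from neighboring segments, which is exactly the cross-segment context LAM is meant to encourage.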

A novel auxiliary pre‑training task, Masked Position Modeling (MPM), predicts masked 2D positions of words. To prevent information leakage, selected words are split into separate segments and their original 2D positions are replaced with pseudo positions; the model then learns to recover the true positions using a GIoU loss.
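The GIoU term used to score the recovered positions can be computed for axis-aligned boxes as below. This is a standalone sketch of the GIoU formula only, assuming `(x1, y1, x2, y2)` box coordinates; how MPM wires it into the model head is not shown.

```python
# Generalized IoU for two axis-aligned boxes (x1, y1, x2, y2).
# GIoU = IoU - (|C| - |union|) / |C|, where C is the smallest
# enclosing box; the training loss is then 1 - GIoU.
from typing import Tuple

Box = Tuple[float, float, float, float]


def giou(a: Box, b: Box) -> float:
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    # Intersection area (zero if the boxes do not overlap)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = ((ax2 - ax1) * (ay2 - ay1)
             + (bx2 - bx1) * (by2 - by1) - inter)
    # Smallest box enclosing both a and b
    c_area = ((max(ax2, bx2) - min(ax1, bx1))
              * (max(ay2, by2) - min(ay1, by1)))
    return inter / union - (c_area - union) / c_area
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes (it goes negative as they move apart), so the position-prediction head still receives a gradient when its guess misses the true box entirely.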

Extensive experiments on benchmark datasets (FUNSD, CORD, SROIE for form and receipt understanding; RVL‑CDIP for document classification) show that LayoutMask’s Base and Large variants achieve state‑of‑the‑art F1 scores and classification accuracy, often surpassing models that also use image modality.

Ablation studies demonstrate that using local 1D positions greatly improves robustness against layout perturbations such as segment swaps, confirming the advantage of the proposed positional encoding.

The authors report successful deployment of LayoutMask in multiple Ant Group business scenarios, including qualification parsing, ID extraction, and mini‑program page understanding, serving millions of users and reducing manual verification time.

Tags: AI, NLP, Document Understanding, Layout Masking, Multimodal Pretraining
Written by AntTech

Technology is the core driver of Ant's future.