Overview of Document Intelligence Models: StrucTexT, LayoutLMv3, and GraphDoc
This article reviews three representative document intelligence models: StrucTexT, LayoutLMv3, and GraphDoc. For each model, it details the input features, feature‑fusion strategy, self‑supervised pre‑training tasks, and underlying architecture, and explains how the model learns embeddings for Segments, Words, or Regions to enable classification and key‑value extraction.
1. Introduction
Document images contain multiple textual entries (Segments), words, or regions. The core challenges for document intelligence are (1) predicting the category of each Segment/Word/Region and (2) predicting the key‑value pairing relationship between them.
2. Problem Decomposition
Learn high‑quality embeddings for Segments (Words, Regions).
Use the learned embeddings for classification to predict categories.
Compute similarity between embeddings to predict pairing relationships, assuming paired Segments have high similarity.
3. Model Overview
The article selects three representative papers (StrucTexT, LayoutLMv3, and GraphDoc) and briefly introduces their core techniques. All three models use self‑supervised learning to pre‑train embeddings for Segments, Words, or Regions, followed by fine‑tuning on domain‑specific data for classification or key‑value prediction.
4. StrucTexT
Input Features
StrucTexT combines text, image, segment‑index, character‑length, and modality features into a single input sequence. The text is obtained via OCR, which provides both the string and the bounding‑box coordinates (x₀, y₀, x₁, y₁) of each Segment, from which width (w) and height (h) are derived.
Formulas
Segment‑index (S) encodes the order of Segments sorted by their top‑left coordinates. Layout encoding (L) embeds the coordinates of each Segment. Image features (V) are extracted using a ResNet‑50+FPN backbone applied to the image region corresponding to each Segment.
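As a concrete illustration, the segment‑index and layout inputs described above can be sketched in a few lines. This is a toy sketch: the real model passes these raw values through learned embedding tables.

```python
# Toy sketch of the segment-index and layout inputs: segments are ordered by
# their top-left corners, and each bounding box (x0, y0, x1, y1) yields the
# width and height used by the layout encoding. The real model feeds these
# numbers through learned embedding tables; here we only assemble raw values.

def layout_features(boxes):
    """Return (segment_index, x0, y0, x1, y1, w, h) per box, in reading order."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))  # sort by top-left corner
    feats = []
    for idx, (x0, y0, x1, y1) in enumerate(ordered):
        w, h = x1 - x0, y1 - y0
        feats.append((idx, x0, y0, x1, y1, w, h))
    return feats
```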
Feature Fusion
All features are concatenated into a sequence and processed by multiple Transformer layers with multi‑head self‑attention, producing a learned embedding for each token.
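The fusion step can be illustrated with a minimal single‑head self‑attention pass over the concatenated sequence. This is a pure‑Python toy; the actual model stacks many multi‑head layers with learned query/key/value projections.

```python
import math

# Minimal single-head self-attention over a fused token sequence (a toy
# simplification of the multi-layer, multi-head Transformer described above).
# Text, layout, and image features are assumed to have been combined into one
# sequence of plain feature vectors beforehand; here Q = K = V = input.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def self_attention(tokens):
    """Every token attends to every token; outputs are weighted sums."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in tokens])
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out
```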
Self‑Supervised Tasks
MLM (masked language modeling) – predict masked words.
SLP (segment length prediction) – predict the number of words in a Segment.
PBD (paired‑box direction prediction) – classify the relative direction between two Segments into one of eight categories.
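For instance, the eight‑way direction target can be derived by quantizing the angle between two Segment centers into sectors. The quantization below is a hypothetical sketch; the paper's exact binning may differ.

```python
import math

# Hypothetical sketch of an eight-way relative-direction target: compute the
# angle from one segment's center to another's and quantize it into 8 sectors
# of 45 degrees each, centered on the axis directions.

def direction_class(box_a, box_b):
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    angle = math.atan2(by - ay, bx - ax) % (2 * math.pi)
    # Shift by half a sector so class 0 is centered on "to the right".
    return int((angle + math.pi / 8) % (2 * math.pi) // (math.pi / 4))  # 0..7
```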
5. LayoutLMv3
Input Features
Text features – RoBERTa embeddings for OCR‑extracted words.
1D Layout – shared positional embeddings for both text and image tokens.
2D Layout – embeddings of the bounding box (x, y, w, h) shared by all words in a Segment.
Image features – Vision Transformer (ViT) patches extracted from the document image.
Text and image sequences are each augmented with their respective layout embeddings and then concatenated into a single sequence.
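Schematically, the sequence assembly might look like this (toy vectors and an invented `pos_emb` lookup; the real embeddings are learned tables):

```python
# Schematic assembly of the LayoutLMv3 input sequence: each word embedding is
# summed with its 1D position and segment-level 2D layout embeddings, image
# patch embeddings get the shared 1D positions, and the two sequences are
# concatenated into one sequence for the Transformer. All vectors are toys.

def add(*vecs):
    return [sum(xs) for xs in zip(*vecs)]

def build_input(word_embs, word_layout_embs, patch_embs, pos_emb):
    text = [add(w, b, pos_emb(i))
            for i, (w, b) in enumerate(zip(word_embs, word_layout_embs))]
    image = [add(p, pos_emb(i)) for i, p in enumerate(patch_embs)]
    return text + image  # one fused sequence
```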
Feature Fusion
Multi‑head self‑attention computes correlations between all tokens, with additional relative position parameters (1D, 2D‑x, 2D‑y) incorporated into the attention scores.
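A single attention logit with these relative‑position terms can be sketched as follows (toy bias tables indexed by pre‑bucketed distances, standing in for the trained parameters):

```python
import math

# Sketch of an attention logit with disentangled relative-position biases:
# the content score is augmented with learned scalars for the bucketed 1D
# token distance and the bucketed 2D x/y box distances. The bias tables here
# are placeholders for parameters learned during pre-training.

def biased_score(q, k, bias_1d, bias_2dx, bias_2dy, d1, dx, dy):
    content = sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))
    return content + bias_1d[d1] + bias_2dx[dx] + bias_2dy[dy]
```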
Self‑Supervised Tasks
MLM – masked language modeling as in StrucTexT, but with span masking whose span lengths follow a Poisson distribution.
MIM (masked image modeling) – mask image patches and predict their discrete token IDs, following the BEiT approach.
WPA (word‑patch alignment) – for each unmasked word, predict whether its corresponding image patches are masked.
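The span masking used by the MLM objective above can be sketched as follows. The λ = 3 and 30% mask ratio are assumptions based on common practice, not necessarily the paper's exact settings.

```python
import math
import random

# Toy span-masking sketch for the MLM objective: span lengths are drawn from
# a Poisson distribution and spans are placed at random starts until roughly
# 30% of tokens have been budgeted. Lambda and the mask ratio are assumptions.

def poisson(lam, rng):
    # Knuth's algorithm; avoids external dependencies.
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def span_mask(tokens, ratio=0.3, lam=3, seed=0):
    rng = random.Random(seed)
    masked = list(tokens)
    budget = int(len(tokens) * ratio)
    while budget > 0:
        length = max(1, min(poisson(lam, rng), budget))
        start = rng.randrange(len(tokens))
        for i in range(start, min(start + length, len(tokens))):
            masked[i] = "[MASK]"
        budget -= length
    return masked
```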
6. GraphDoc
Input Features
Text – Sentence‑BERT embeddings of each Region’s text, combined with layout encoding.
Image – Swin‑Transformer + FPN backbone; RoIAlign pools each Region’s visual features from the FPN’s P2 feature map.
Feature Fusion
Fusion occurs at two levels: (1) intra‑Region fusion using an attention gate to combine text and visual features, and (2) inter‑Region fusion via a Graph Neural Network (GNN) that propagates information across Regions using a learned adjacency matrix enriched with 2D positional encodings.
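The intra‑Region attention gate can be illustrated with a scalar gate that decides how much visual signal to mix into the textual feature. The gate weights below are invented placeholders; the real parameters are learned end to end.

```python
import math

# Toy sketch of an intra-region attention gate: a sigmoid over a linear
# function of the text and visual features produces a scalar gate g in (0, 1),
# which scales the visual contribution added to the text feature.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fuse(text_vec, vis_vec, w_text, w_vis, bias):
    g = sigmoid(sum(t * a for t, a in zip(text_vec, w_text)) +
                sum(v * b for v, b in zip(vis_vec, w_vis)) + bias)
    return [t + g * v for t, v in zip(text_vec, vis_vec)]
```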
Self‑Supervised Task
Randomly mask a Region’s text with a special token, forward it through the GNN, and compute a Smooth‑L1 loss between the GNN‑produced Region representation and the original Sentence‑BERT embedding.
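This objective reduces to a Smooth‑L1 (Huber‑style) regression between the GNN output and the frozen Sentence‑BERT target:

```python
# Smooth-L1 loss between a predicted region embedding and its Sentence-BERT
# target: quadratic for small errors, linear for large ones. beta sets the
# crossover point; 1.0 is a common default.

def smooth_l1(pred, target, beta=1.0):
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)
```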
7. Summary
The three surveyed models demonstrate that effective multimodal feature fusion, combined with modeling the relationships among Segments/Words/Regions, enables learning robust representations for downstream tasks such as classification and key‑value extraction in intelligent document processing.
These techniques form the backbone of Laiye’s document‑intelligence product, and future blog posts will detail internal adaptations of these architectures.
References
https://arxiv.org/abs/2108.02923
https://arxiv.org/abs/2204.08387
https://arxiv.org/abs/2203.13530
https://mp.weixin.qq.com/s/WrCDYuvHw-QPMzzRHdOHeA
https://arxiv.org/abs/1706.03762
https://arxiv.org/abs/2106.08254
https://arxiv.org/abs/1805.07445
https://github.com/ibm-aur-nlp/PubLayNet
https://zhuanlan.zhihu.com/p/73138740
https://arxiv.org/abs/2012.14740
https://arxiv.org/abs/2103.14470
https://baike.baidu.com/item/邻接矩阵/9796080
https://blog.csdn.net/luzaijiaoxia0618/article/details/104718146/
https://mage.laiye.com/
Laiye Technology Team
Official account of Laiye Technology, featuring its best tech innovations, practical implementations, and cutting‑edge industry insights.