How M2Doc Boosts Document Layout Analysis with Plug‑in Multimodal Fusion

This article introduces M2Doc, a plug‑in multimodal fusion approach that equips visual‑only object detectors with textual and semantic awareness, detailing its early‑ and late‑fusion modules, experimental validation on DocLayNet, M6Doc and PubLayNet, and future research directions.

Alibaba Cloud Big Data AI Platform

Introduction

Document layout analysis is a key task in document intelligence, but most existing methods rely solely on visual features and ignore textual cues. Pretrained document models have succeeded in many downstream tasks, yet for layout analysis they are still simply fine‑tuned together with purely visual object detectors.

Motivation

Because layout analysis targets text regions that inherently possess both visual and textual attributes, a multimodal modeling approach is more appropriate. Moreover, textual instances often have semantic relationships, and current visual‑only detectors struggle with complex logical layouts.

M2Doc Framework

The proposed plug‑in multimodal fusion method M2Doc endows single‑modality detectors with multimodal perception. It consists of two fusion modules: Early‑Fusion and Late‑Fusion.

Text Grid Representation

Given a document image and its OCR results, the recognized words are arranged in order and fed into a pretrained BERT model to obtain word embeddings. Each embedding is then written back into the region covered by its OCR box, forming a text‑grid input that is aligned with the image at the pixel level.
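The text‑grid idea can be illustrated with a minimal sketch. It assumes the word embeddings have already been computed (random vectors stand in for BERT outputs here), and the function name build_text_grid and the (x0, y0, x1, y1) box format are hypothetical choices for illustration, not taken from the M2Doc code.

```python
import numpy as np

def build_text_grid(height, width, ocr_words, embed_dim):
    """Write each word embedding into the pixels covered by its OCR box.

    ocr_words: list of ((x0, y0, x1, y1), embedding) pairs (assumed format).
    Pixels not covered by any box stay zero.
    """
    grid = np.zeros((height, width, embed_dim), dtype=np.float32)
    for (x0, y0, x1, y1), embedding in ocr_words:
        # Clip the box to the image bounds before filling.
        x0, x1 = max(0, x0), min(width, x1)
        y0, y1 = max(0, y0), min(height, y1)
        grid[y0:y1, x0:x1, :] = embedding  # broadcast over the box area
    return grid

# Toy data: random vectors stand in for BERT word embeddings.
rng = np.random.default_rng(0)
words = [
    ((2, 1, 6, 3), rng.normal(size=4).astype(np.float32)),
    ((8, 1, 12, 3), rng.normal(size=4).astype(np.float32)),
]
text_grid = build_text_grid(height=5, width=16, ocr_words=words, embed_dim=4)
```

The resulting array has the same height and width as the image, so the same spatial operations (cropping, resizing, convolution) can be applied to both inputs.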

Feature Extraction

A ResNet backbone extracts multi‑scale visual and textual features from the aligned inputs.

Early‑Fusion

A gate‑like mechanism fuses visual and textual features at each scale before proposal generation, followed by LayerNorm to normalize the fused features.
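One plausible form of such a gate is sketched below for a single feature scale. The article does not give the exact formula, so the details are assumptions: the gate is computed from the concatenated features with a learned projection (w_gate and b_gate are hypothetical parameters), the gated text features are added to the visual features, and LayerNorm is applied over the channel axis without learned scale or shift.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    # Normalize over the channel (last) axis at every spatial position.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def early_fusion(visual, textual, w_gate, b_gate):
    """Gated fusion of two (H, W, C) feature maps at one scale.

    Assumed form: gate = sigmoid([visual; textual] @ w_gate + b_gate),
    fused = LayerNorm(visual + gate * textual).
    """
    concat = np.concatenate([visual, textual], axis=-1)  # (H, W, 2C)
    gate = sigmoid(concat @ w_gate + b_gate)             # (H, W, C), values in (0, 1)
    return layer_norm(visual + gate * textual)

# Toy feature maps and randomly initialized gate parameters.
rng = np.random.default_rng(0)
H, W, C = 4, 6, 8
visual = rng.normal(size=(H, W, C))
textual = rng.normal(size=(H, W, C))
w_gate = rng.normal(scale=0.1, size=(2 * C, C))
b_gate = np.zeros(C)
fused = early_fusion(visual, textual, w_gate, b_gate)
```

With this form, a gate near zero leaves the visual features essentially unchanged, which is one reason a gated design suits a plug‑in module: regions without useful text can fall back to the visual‑only behavior.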

Late‑Fusion

After proposals are generated, a simple weighted addition combines the visual and text features of each box, effectively integrating multimodal information.
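The weighted addition can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: mean pooling inside each box stands in for the detector's RoI feature extraction, and the names pool_box, late_fusion, and text_weight are hypothetical.

```python
import numpy as np

def pool_box(feature_map, box):
    """Average an (H, W, C) feature map over one (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return feature_map[y0:y1, x0:x1].mean(axis=(0, 1))

def late_fusion(visual_map, text_map, boxes, text_weight=0.5):
    """Combine per-box visual and text features by weighted addition.

    Assumed form: fused = visual + text_weight * text, one vector per box.
    """
    fused = []
    for box in boxes:
        v = pool_box(visual_map, box)
        t = pool_box(text_map, box)
        fused.append(v + text_weight * t)
    return np.stack(fused)  # (num_boxes, C)

# Toy inputs: constant maps make the result easy to check by hand.
visual_map = np.ones((8, 8, 4))
text_map = 2.0 * np.ones((8, 8, 4))
boxes = [(0, 0, 4, 4), (2, 2, 8, 8)]
box_features = late_fusion(visual_map, text_map, boxes, text_weight=0.5)
```

In this toy case every fused value is 1 + 0.5 × 2 = 2, and setting text_weight to 0 recovers the visual‑only box features.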

Experiments

Extensive experiments on DocLayNet, M6Doc, and PubLayNet show that adding M2Doc to detectors such as Cascade Mask R‑CNN and DINO achieves state‑of‑the‑art results, outperforming existing multimodal baselines. Plug‑in experiments demonstrate consistent improvements on both two‑stage and end‑to‑end detectors, confirming M2Doc’s generalization and plug‑in capability. Ablation studies validate the effectiveness of each fusion component.

Conclusion and Future Work

M2Doc provides a universal, lightweight multimodal fusion solution that significantly enhances document layout analysis in complex logical scenarios. Future directions include designing more efficient unified multimodal models, exploring better fusion strategies, and simplifying dense text representation pipelines.

Figure: M2Doc Overview
Figure: Motivation Illustration
Figure: M2Doc Framework
Figure: Main Experiment Results