How Structured Input Boosts Multimodal LLMs in Document QA Without Retraining

This article presents a training‑free, architecture‑agnostic method that leverages LaTeX‑style structured inputs to preserve document hierarchy and spatial relationships, thereby improving multimodal large language model performance on document question answering tasks across multiple benchmarks.

vivo Internet Technology

Introduction

Multimodal large language models (MLLMs) have advanced rapidly, yet document understanding remains challenging because models must jointly process text, tables, and images. This work proposes a training-free, architecture-agnostic method that improves document question answering (DocQA) performance by feeding the model structured inputs that preserve document hierarchy and spatial relationships.

Core Challenges in Document Understanding

Existing approaches focus on expanding context windows or on retrieval-augmented generation, while ignoring how the input format itself affects model comprehension. Unstructured OCR text scatters the model's attention and can degrade accuracy: on MMLongBench, adding raw OCR text to the image input drops accuracy from 0.389 to 0.370.

Innovative Method: Structured Input and Attention Analysis

The proposed LaTeX-style encoding retains headings, tables, and image positions. The pipeline has three steps, sketched in code after the list:

Structured encoding: prompt the MLLM to transcribe each page into a LaTeX-style representation that keeps the document's structure.

Joint input: feed both the structured text and the original page images to the model.

Attention analysis: compare attention maps for image-only, image + unstructured text, and image + structured text inputs, showing that structured input reduces wasted attention and concentrates it on semantically relevant regions.
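
Here is a minimal sketch of the first two steps, assuming a recent Hugging Face transformers release with Qwen2.5-VL support. The prompt wording and the generate helper are illustrative assumptions, not the paper's exact implementation:

```python
# Minimal sketch of structured encoding + joint input (illustrative;
# the prompt wording and helper are assumptions, not the paper's code).
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen2.5-VL-7B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(MODEL_ID, device_map="auto")

STRUCTURE_PROMPT = (
    "Transcribe this page as LaTeX. Preserve the hierarchy: \\section{...} "
    "for headings, tabular environments for tables, and \\includegraphics "
    "placeholders marking where figures appear."
)

def generate(image: Image.Image, text: str) -> str:
    """One image+text round trip through the model."""
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": text},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[prompt], images=[image],
                       return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0]

page = Image.open("page_1.png")
# Step 1: structured encoding -- the model emits a LaTeX-style transcript.
latex_doc = generate(page, STRUCTURE_PROMPT)
# Step 2: joint input -- image AND structured transcript in one QA prompt.
question = "Which quarter had the highest revenue?"
answer = generate(page, f"Document structure:\n{latex_doc}\n\nQuestion: {question}")
```

The key design point is that the structured transcript rides alongside the image rather than replacing it, so the model can still ground its answer in pixels while the LaTeX markup tells it where headings, tables, and figures sit.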

Experimental Validation

On four benchmarks (MMLongBench, LongDocUrl, PaperTab, FetaTab) and four MLLMs (including Qwen2.5-VL-7B-Instruct and Phi-3.5-Vision-Instruct), structured input consistently raises accuracy: Qwen2.5-VL-7B-Instruct improves from 0.389 to 0.435 on MMLongBench, with gains of up to 20% on PaperTab.
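
As a rough picture of that protocol, a hypothetical harness would score each benchmark under the three input formats. It reuses generate and STRUCTURE_PROMPT from the sketch above; load_benchmark, run_ocr, and the crude exact-match scoring are stand-ins, not the paper's evaluation code:

```python
# Hypothetical evaluation harness mirroring the article's setup: each
# benchmark is scored under three input formats so deltas like
# 0.389 -> 0.435 on MMLongBench can be observed side by side.
BENCHMARKS = ["MMLongBench", "LongDocUrl", "PaperTab", "FetaTab"]
FORMATS = ["image-only", "image+ocr", "image+latex"]

def evaluate(benchmark: str, fmt: str) -> float:
    correct = total = 0
    for page, question, gold in load_benchmark(benchmark):  # hypothetical loader
        if fmt == "image-only":
            text = question
        elif fmt == "image+ocr":
            text = run_ocr(page) + "\n" + question           # any OCR engine
        else:
            text = generate(page, STRUCTURE_PROMPT) + "\n" + question
        pred = generate(page, text)
        correct += int(gold.lower() in pred.lower())         # crude match proxy
        total += 1
    return correct / total

for b in BENCHMARKS:
    print(b, {f: round(evaluate(b, f), 3) for f in FORMATS})
```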

Deep Insight via Attention Mechanism

Attention visualizations reveal that unstructured text causes diffuse attention across image borders, while structured text induces “structured attention” that concentrates on charts and relevant text blocks, markedly improving answer correctness.
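
One simple way to quantify that effect is to compare the entropy of the attention distribution under each input condition: lower entropy means the mass concentrates on fewer tokens, which is what "structured attention" predicts. The sketch below continues the earlier one's model, processor, page, latex_doc, and question objects; the build_inputs helper and the ocr_text variable are assumptions:

```python
# Sketch: quantify attention concentration via entropy (illustrative).
# If out.attentions comes back as None, reload the model with
# attn_implementation="eager" so attention weights are materialized.
import torch

def build_inputs(image, text):
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": text},
    ]}]
    prompt = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True)
    return processor(text=[prompt], images=[image],
                     return_tensors="pt").to(model.device)

def attention_entropy(inputs) -> float:
    """Mean entropy of the last layer's attention at the final position."""
    with torch.no_grad():
        out = model(**inputs, output_attentions=True)
    attn = out.attentions[-1][0, :, -1, :]                # (heads, key_len)
    entropy = -(attn * torch.log(attn + 1e-12)).sum(-1)   # per-head entropy
    return entropy.mean().item()

ocr_text = open("page_1_ocr.txt").read()  # raw OCR output (assumption)
for name, text in [("image-only", question),
                   ("image+ocr", ocr_text + "\n" + question),
                   ("image+latex", latex_doc + "\n" + question)]:
    print(name, attention_entropy(build_inputs(page, text)))
```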

Conclusion and Outlook

The study demonstrates that input formatting critically influences MLLM document understanding. A simple, training‑free structured input approach offers a practical way to enhance performance for intelligent document processing and automated QA, and suggests future work on advanced structure extraction or attention‑control plugins.

Tags: AI, multimodal LLM, document understanding, attention analysis, DocQA, structured input