How to Build a Multimodal Web Page Model for the LLM Era
This article examines the unique multimodal and multi‑granular nature of web pages, compares fusion strategies, proposes a cross‑modal attention approach, outlines fine‑ and coarse‑grained pre‑training tasks, and explores low‑cost adaptor methods for adapting large multimodal models to web‑page modeling in the LLM era.
Multimodal and Multi‑Granular Characteristics of Web Pages
Web pages consist of text, images, video, CSS styles and a hierarchical DOM tree. These modalities exist at different granularities: tokens → sentences → paragraphs → DOM nodes of varying depth. Elements across modalities are aligned many‑to‑many (e.g., a DOM node may correspond to a visual layer and to one or more sentences).
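To make the many‑to‑many alignment concrete, here is a minimal Python sketch of one way to record which sentences and visual layers a DOM node maps to; the class and field names are illustrative, not from any particular library.

```python
from dataclasses import dataclass, field

@dataclass
class DomNode:
    tag: str                   # e.g., "div", "p"
    depth: int                 # depth in the DOM hierarchy
    bbox: tuple                # rendered layout box: (x, y, width, height)
    sentence_ids: list = field(default_factory=list)  # aligned text spans
    image_ids: list = field(default_factory=list)     # aligned visual layers

# One DOM node can point to several sentences and several visual layers,
# and the same sentence id may appear under more than one node.
node = DomNode(tag="section", depth=3, bbox=(0, 120, 960, 480),
               sentence_ids=[4, 5, 6], image_ids=[2])
```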
Fusion Strategies for Multimodal Features
Four typical fusion designs are compared:
Bottom‑level fusion: Align modalities at the input stage, concatenate token embeddings, and feed them to a transformer. Requires careful weighting of tasks and may struggle to balance modalities.
Top‑level fusion: Encode each modality independently, then concatenate the high‑level vectors for downstream tasks. Lacks cross‑modal interaction.
Unified multimodal modeling (e.g., LayoutLMv2): Encode each modality to fixed‑length vectors and feed them to a shared transformer. Alignment information remains shallow.
Cross‑modal attention (e.g., DocFormer, BEiT‑3): Keep modality‑specific encoders and insert a cross‑attention layer that exchanges information between modalities. Works well when many‑to‑many alignments are present.
When a modality element corresponds to a sequence of elements in another modality, an aggregation function (average‑pooling, LSTM, etc.) can compress the sequence into a fixed‑length vector before the cross‑attention layer.
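The sketch below shows how the two pieces could fit together in PyTorch: average‑pooling compresses a variable‑length sequence into one vector, which then participates in a cross‑attention exchange. Module and variable names are illustrative assumptions, not a published model's API.

```python
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Queries from one modality attend over keys/values from another;
        # a symmetric block in the other direction completes the exchange.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq, other_seq):
        attended, _ = self.cross_attn(query=query_seq, key=other_seq, value=other_seq)
        return self.norm(query_seq + attended)  # residual keeps the unimodal signal

def aggregate(seq: torch.Tensor) -> torch.Tensor:
    # Average-pooling as the aggregation function: compress the sequence
    # aligned to a single element into one fixed-length vector.
    return seq.mean(dim=1, keepdim=True)

# A DOM node aligned to 7 sentence embeddings is pooled to one vector,
# then attends over 12 visual-layer embeddings.
sentences = torch.randn(1, 7, 256)   # (batch, seq_len, dim)
node_vec = aggregate(sentences)      # (1, 1, 256)
visual = torch.randn(1, 12, 256)
fused = CrossModalBlock(256)(node_vec, visual)
```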
Pre‑training Tasks for Web‑Page Modeling
Four categories of pre‑training objectives cover fine‑grained and coarse‑grained semantics, structure, and visual information.
Fine‑grained semantic: Mask an entire sentence and reconstruct it with a decoder (sentence‑level reconstruction).
Coarse‑grained semantic: Mask the page title, generate a pseudo query from click‑log data, and reconstruct both.
Fine‑grained structure/visual: Mask or reorder HTML tags and DOM nodes; the model learns to recover the original order (see the sketch after this list).
Coarse‑grained structure/visual: Use a large language model (e.g., GPT) to generate pseudo page‑type labels and train a classifier on the whole page.
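As an example of the fine‑grained structure objective, the sketch below shuffles DOM‑node embeddings and trains the model to predict each node's original position; the encoder depth, dimensions, and position‑classification head are illustrative assumptions.

```python
import torch
import torch.nn as nn

def make_reorder_batch(node_embs: torch.Tensor):
    """node_embs: (num_nodes, dim) in original document order."""
    perm = torch.randperm(node_embs.size(0))
    # shuffled[i] originally sat at position perm[i], so perm is the target.
    return node_embs[perm], perm

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
position_head = nn.Linear(256, 64)  # classify over at most 64 positions

node_embs = torch.randn(16, 256)
shuffled, targets = make_reorder_batch(node_embs)
logits = position_head(encoder(shuffled.unsqueeze(0))).squeeze(0)
loss = nn.functional.cross_entropy(logits, targets)  # order-recovery loss
```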
Adaptor Network for Large Multimodal LLMs
Directly feeding raw HTML (average length ≈ 160 k tokens) into a large model is infeasible. An adaptor network can convert HTML‑DOM structure, layout coordinates, and visual cues into a small set of fixed‑length vectors compatible with existing multimodal LLMs.
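One plausible form for such an adaptor is a learned‑query resampler (in the spirit of Perceiver‑ or Q‑Former‑style modules): a fixed set of query vectors cross‑attends over the long page‑feature sequence and emits a fixed‑size summary projected into the LLM's embedding dimension. All names and sizes below are illustrative.

```python
import torch
import torch.nn as nn

class WebPageAdaptor(nn.Module):
    def __init__(self, in_dim: int, llm_dim: int, num_queries: int = 16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, in_dim))
        self.attn = nn.MultiheadAttention(in_dim, 8, batch_first=True)
        self.proj = nn.Linear(in_dim, llm_dim)  # map into the LLM's space

    def forward(self, page_feats: torch.Tensor) -> torch.Tensor:
        # page_feats: (batch, n, in_dim), where n may be in the thousands;
        # returns (batch, num_queries, llm_dim), a fixed-size page summary.
        q = self.queries.unsqueeze(0).expand(page_feats.size(0), -1, -1)
        pooled, _ = self.attn(query=q, key=page_feats, value=page_feats)
        return self.proj(pooled)

adaptor = WebPageAdaptor(in_dim=512, llm_dim=4096)
summary = adaptor(torch.randn(2, 3000, 512))  # -> (2, 16, 4096)
```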
Training procedure:
Freeze the LLM or use a very low learning rate to preserve its generalization.
Insert special tokens (e.g., <ADAPTOR>) into the prompt to represent adaptor outputs.
Jointly train the adaptor so that its token embeddings align with the LLM’s semantic space.
Example prompt: “Here is a web‑page DOM node representation <ADAPTOR>, output CSS description: style='xxxxx'.”
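The wiring below sketches how the <ADAPTOR> placeholders could be spliced into the frozen LLM's input, assuming a HuggingFace‑style causal LM that accepts inputs_embeds; the helper and its arguments are illustrative, not a library API.

```python
import torch

def build_inputs(llm, prompt_ids, adaptor_vecs, adaptor_token_id):
    # Standard token embeddings for the prompt...
    embeds = llm.get_input_embeddings()(prompt_ids).clone()   # (seq, llm_dim)
    # ...with each <ADAPTOR> position overwritten by an adaptor output
    # vector, so the vectors live directly in the LLM's semantic space.
    slots = (prompt_ids == adaptor_token_id).nonzero(as_tuple=True)[0]
    embeds[slots] = adaptor_vecs[: len(slots)]
    return embeds.unsqueeze(0)  # (1, seq, llm_dim) for inputs_embeds=...

# Training-loop outline: the LLM stays frozen; only the adaptor gets gradients.
# for batch in loader:
#     vecs = adaptor(batch.page_feats).squeeze(0)          # (k, llm_dim)
#     inputs_embeds = build_inputs(llm, batch.ids, vecs, ADAPTOR_ID)
#     loss = llm(inputs_embeds=inputs_embeds, labels=batch.labels).loss
#     loss.backward()                                       # updates adaptor only
```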
This approach enables low‑cost adaptation of powerful multimodal models to downstream web tasks such as quality assessment, structural parsing, or CSS generation.