How NoteLLM-2 Boosts Multimodal Recommendations with In-Content Learning
NoteLLM-2 introduces multimodal In-Content Learning (mICL) and Late Fusion to counter the tendency of end‑to‑end fine‑tuned multimodal large representation models to neglect visual input, delivering gains over baseline multimodal models and traditional retrieval methods in recommendation tasks.
Introduction
Large language models (LLMs) excel at text understanding but are rarely applied to multimodal representation tasks such as item‑to‑item (I2I) recommendation. Two main approaches exist: (1) pre‑train a multimodal LLM, which requires massive high‑quality data and incurs high training cost; (2) fine‑tune an existing LLM together with a frozen vision encoder on domain‑specific data. Models built with the second approach, called Multimodal Large Representation Models (MLRMs), are efficient and customizable but often suffer from modality imbalance, favoring text over vision. NoteLLM‑2 addresses this imbalance with a multimodal In‑Content Learning (mICL) strategy and a Late Fusion mechanism.
Dataset Construction
A co‑occurrence note dataset is built from user behavior logs. For each user, if note A is viewed and note B is viewed within the following week, the co‑occurrence count of the pair (A, B) is incremented. For each note, the other notes are ranked by this count, pairs below a count threshold are discarded, and the top‑k remaining notes are kept as its related notes.
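The construction above can be sketched in a few lines of Python. This is only an illustration: the log format (user id, note id, day), the function name `build_related_notes`, and the parameters `window_days`, `min_count`, and `top_k` are assumptions, since the source does not give the actual values or schema.

```python
from collections import Counter, defaultdict

def build_related_notes(views, window_days=7, min_count=2, top_k=3):
    """views: iterable of (user_id, note_id, day) tuples (hypothetical log format)."""
    per_user = defaultdict(list)
    for user, note, day in views:
        per_user[user].append((day, note))

    # Count ordered pairs (A, B) where B is viewed within the window after A.
    pair_counts = Counter()
    for events in per_user.values():
        events.sort()
        for i, (day_a, note_a) in enumerate(events):
            for day_b, note_b in events[i + 1:]:
                if day_b - day_a > window_days:
                    break  # events are sorted, so later views are also outside the window
                if note_b != note_a:
                    pair_counts[(note_a, note_b)] += 1

    # Apply the count threshold, then rank candidates per note.
    related = defaultdict(list)
    for (a, b), count in pair_counts.items():
        if count >= min_count:
            related[a].append((count, b))
    # Highest count first, ties broken by note id, truncated to top_k.
    return {
        a: [b for _, b in sorted(cands, key=lambda c: (-c[0], c[1]))[:top_k]]
        for a, cands in related.items()
    }
```

Counting ordered pairs keeps the "A then B" direction of the description; a symmetric variant would simply add the reverse pair as well.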
Baseline MLRM Representation
The baseline compresses each note into a JSON‑style prompt where a placeholder token <IMG> is replaced by a visual embedding. The pipeline is:
Use a vision encoder to extract image features.
Map the visual features to the LLM token space with a connector, producing a visual token embedding.
Tokenize the textual part of the prompt.
Replace the <IMG> token embedding with the visual token embedding.
Feed the full prompt to the LLM and take the final hidden state as the note representation.
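Steps 2 to 4 of this pipeline can be sketched as follows. This is a toy illustration, not the authors' code: the single linear `connector` stands in for whatever connector the model uses (the experiments below mention a Q‑Former), and `splice_visual_embedding`, the embedding table, and the token strings are hypothetical names. The vision encoder and the LLM forward pass are left out.

```python
def connector(visual_features, weight):
    """Toy linear map from vision-feature space to LLM token space.

    weight has one row per output dimension (an assumption for this sketch).
    """
    return [sum(w * x for w, x in zip(row, visual_features)) for row in weight]

def splice_visual_embedding(tokens, embedding_table, visual_token, placeholder="<IMG>"):
    """Look up text embeddings and substitute the visual token at the placeholder."""
    return [
        list(visual_token) if tok == placeholder else list(embedding_table[tok])
        for tok in tokens
    ]

# Usage with made-up 2-dimensional token embeddings.
table = {"image:": [1.0, 1.0], "<IMG>": [0.0, 0.0], "title:": [2.0, 2.0]}
visual_token = connector([0.5, 1.0, 2.0], [[1, 0, 0], [0, 1, 1]])
inputs = splice_visual_embedding(["image:", "<IMG>", "title:"], table, visual_token)
```

The resulting `inputs` sequence is what would be passed to the LLM in place of ordinary token embeddings.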
Training uses contrastive learning on note pairs with a temperature‑scaled cosine similarity loss:
Loss = -\log \frac{\exp(sim(z_i, z_j)/\tau)}{\sum_k \exp(sim(z_i, z_k)/\tau)}
Only the connector and any fusion modules are updated; the vision encoder remains frozen.
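A minimal pure‑Python sketch of this loss is shown below. It assumes in‑batch negatives, meaning the sum over k runs over the related notes of every anchor in the batch; the source does not say how negatives are chosen, and the function names and the default `tau` are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchors, positives, tau=0.1):
    """Mean loss over the batch; positives[i] is the related note of anchors[i].

    The other positives in the batch act as negatives (an assumption).
    """
    total = 0.0
    for i, z_i in enumerate(anchors):
        logits = [cosine(z_i, z_k) / tau for z_k in positives]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(anchors)
```

A smaller τ sharpens the softmax, so hard negatives contribute more to the loss.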
Baseline Experiments
Zero‑shot evaluation shows that off‑the‑shelf LLMs underperform a BM25 retrieval baseline, confirming the need for fine‑tuning.
Three end‑to‑end MLRMs were trained:
MTomato‑Base: LLM Tomato + CLIP ViT‑B visual encoder + randomly initialized Q‑Former.
MQwen‑Base: replace Tomato with Qwen‑Chat.
MQwen‑bigG: replace the visual encoder with ViT‑bigG.
Two pre‑trained multimodal LLMs (BLIP‑2 and Qwen‑VL‑Chat) serve as baselines for the first approach.
Key findings:
End‑to‑end fine‑tuned LLMs outperform their pre‑trained counterparts.
Introducing visual signals into MLRMs yields measurable gains.
Performance improves with larger visual encoders.
Even with these gains, MLRMs still lag behind the strongest multimodal LLMs.
Attention‑score analysis reveals that baseline MLRMs allocate most attention to text in shallow layers, indicating that the visual modality is under‑used.
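One simple way to quantify this kind of imbalance is the share of attention mass that lands on visual positions in each layer. The sketch below is an assumption about how such an analysis could be set up; the source does not describe the exact aggregation, and `visual_attention_share` and its input layout are hypothetical.

```python
def visual_attention_share(attention_by_layer, is_visual):
    """Fraction of attention mass on visual positions, per layer.

    attention_by_layer[l][j]: attention the final token pays to position j in layer l
    (already averaged over heads, by assumption).
    is_visual[j]: True where position j holds a visual embedding.
    """
    shares = []
    for row in attention_by_layer:
        total = sum(row)
        visual = sum(w for w, v in zip(row, is_visual) if v)
        shares.append(visual / total)
    return shares

# Made-up example: layer 0 mostly ignores the single visual position.
shares = visual_attention_share(
    [[0.7, 0.1, 0.2], [0.25, 0.5, 0.25]],
    [False, True, False],
)
```

A low share in the early layers, as in the first row here, is the pattern the analysis describes for baseline MLRMs.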
NoteLLM‑2 Method
Multimodal In‑Content Learning (mICL)
The prompt explicitly separates visual and textual content. A special token <IMG_EMB> marks the position of the visual embedding. During the forward pass, the hidden state immediately before <IMG_EMB> is taken as the visual representation, while the final hidden state of the LLM is taken as the multimodal representation. Both representations are optimized during training, but only the final multimodal token is used at inference.
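The selection of the two representations can be sketched as below, given per‑token final‑layer hidden states. The function name `micl_representations` and the toy prompt layout are assumptions; only the rule (state before <IMG_EMB> for the visual side, final state for the multimodal side) comes from the description above.

```python
def micl_representations(tokens, hidden_states, marker="<IMG_EMB>"):
    """Return (visual_repr, multimodal_repr) from per-token hidden states."""
    if len(tokens) != len(hidden_states):
        raise ValueError("tokens and hidden_states must have the same length")
    pos = tokens.index(marker)  # raises ValueError if the marker is absent
    if pos == 0:
        raise ValueError("marker must be preceded by at least one token")
    visual_repr = hidden_states[pos - 1]   # state immediately before the marker
    multimodal_repr = hidden_states[-1]    # final state of the whole prompt
    return visual_repr, multimodal_repr

# Toy prompt with 1-dimensional hidden states for readability.
tokens = ["<IMG>", "word:", "<IMG_EMB>", "title", "word:"]
hidden = [[0.0], [1.0], [2.0], [3.0], [4.0]]
visual_repr, multimodal_repr = micl_representations(tokens, hidden)
```

With a causal LLM, the state before <IMG_EMB> has only seen the visual part of the prompt, which is why it can serve as a visual summary.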
Late Fusion
A small MLP predicts a scalar fusion weight from two inputs: the original visual embedding from the frozen vision encoder and the mICL‑derived visual embedding. The weight is then used to combine the original visual embedding with the mICL multimodal embedding, producing a fused multimodal representation that emphasizes visual information.
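A hedged sketch of such a gate follows. Several details are assumptions rather than facts from the source: the MLP is reduced to a single linear layer with a sigmoid, the weight is applied as a convex combination g·v + (1 − g)·m, and both embeddings are taken to have the same dimension. All names are illustrative.

```python
import math

def late_fusion(v_orig, v_micl, m_micl, gate_weights, gate_bias=0.0):
    """Blend the frozen-encoder visual embedding with the mICL multimodal embedding.

    v_orig: visual embedding from the frozen vision encoder
    v_micl: mICL-derived visual embedding (used only to compute the gate)
    m_micl: mICL multimodal embedding (same length as v_orig, by assumption)
    """
    gate_input = list(v_orig) + list(v_micl)
    logit = sum(w * x for w, x in zip(gate_weights, gate_input)) + gate_bias
    g = 1.0 / (1.0 + math.exp(-logit))  # scalar fusion weight in (0, 1)
    fused = [g * a + (1.0 - g) * b for a, b in zip(v_orig, m_micl)]
    return g, fused

# With all-zero gate parameters the gate is 0.5 and the result is the midpoint.
g, fused = late_fusion([1.0, 0.0], [0.0, 1.0], [0.0, 2.0], [0.0, 0.0, 0.0, 0.0])
```

A gate close to 1 passes the raw visual embedding through almost unchanged, which is one way a model can recover visual signal that the LLM layers would otherwise dilute.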
Training Objective
Note pairs are processed in each batch, and the contrastive loss defined above is applied to the fused multimodal embeddings. Only the connector, mICL, and fusion parameters are updated; the vision encoder stays frozen.
Experimental Validation
Comparison with Baselines
Both mICL and Late Fusion consistently improve retrieval metrics (e.g., recall@K, NDCG) over baseline MLRMs and zero‑shot LLMs.
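For reference, the two metrics named above can be computed as in the sketch below. It uses binary relevance, where a retrieved note either is or is not one of the related notes; the source does not state the exact evaluation protocol, so treat this as one common definition.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Share of the relevant notes that appear in the top-k of the ranking."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)

def ndcg_at_k(ranked, relevant, k):
    """Normalized discounted cumulative gain with binary relevance."""
    relevant = set(relevant)
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, note in enumerate(ranked[:k])
        if note in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0
```

Recall@K only checks whether related notes are retrieved, while NDCG also rewards placing them near the top of the list.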
Modality Importance Analysis
Late Fusion restores the contribution of visual input in the shallow layers of the network, which is consistent with mitigating the neglect of visual content observed in the baseline MLRMs.
Hyper‑parameter Studies
Experiments varying batch size, temperature τ, and fusion‑weight initialization demonstrate that the approach is robust across a range of settings.
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
