How Multimodal Large Models Transform Recommendation Systems: From Tags to Embeddings
This article explores how multimodal large models like Qwen2.5-VL enable high-dimensional tag generation and universal embeddings for recommendation systems, detailing data synthesis, model training, quantization, and fine-tuning, and the resulting improvements in click-through rate and exposure interaction rate.
Background
Recommendation systems have evolved from rule‑based methods and collaborative filtering to deep‑learning models. The rise of large language models brings new capabilities: powerful semantic understanding, multimodal perception, zero‑ and few‑shot learning, external knowledge integration, and cross‑domain generalization. Content platforms such as Zhihu face challenges in multimodal content understanding and cold‑start problems.
This work builds a full‑chain solution based on multimodal large models (e.g., Qwen2.5‑VL) to move from explicit tags to implicit representations, aiming to break modality barriers in recommendation.
Multimodal Content Understanding with Large Models
Two core outputs are produced: high‑dimensional tags and high‑dimensional vectors. Using models like Qwen2.5‑VL‑72B, we extract features from images, text, and video to generate explicit tags and, through contrastive learning and data synthesis, construct multimodal vectors.
Multimodal High‑dimensional Explicit Tags
These tags form a fine‑grained, high‑accuracy, open‑set label system that automatically updates by combining textual and visual information.
Tag Exploration
Various base models were evaluated; Qwen2.5‑VL‑72B‑Instruct performed best on benchmarks covering document understanding, visual QA, video understanding, and visual agents, showing strong zero‑shot capabilities.
Data Synthesis and Training
Qwen2.5-VL-72B was selected for data synthesis, generating data covering ~20k textual ideas and 6.5k videos, followed by manual relevance labeling; the labeled data is then used to fine-tune smaller models.
Annotations are categorized as strong relevance (2), weak relevance (1), and irrelevant (0). Fine-tuning Qwen2.5-VL-7B and Qwen2.5-VL-3B on this data shows that the 3B model reaches 80.23% accuracy on image-based ideas.
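The relevance-labeled data can be packaged into chat-style supervised fine-tuning samples. The sketch below shows one plausible format assuming a messages-style JSONL layout; the field names, prompt wording, file paths, and example rows are illustrative assumptions, not the authors' exact schema (and the production prompts would be in Chinese).

```python
import json

# Hypothetical labeled rows: an "idea" (image + text) paired with a candidate tag
# and a manual relevance label (2 = strong, 1 = weak, 0 = irrelevant).
labeled = [
    {"image": "idea_001.jpg", "text": "Sharing my pour-over technique", "tag": "pour-over coffee", "label": 2},
    {"image": "idea_002.jpg", "text": "Weekend hike snapshots", "tag": "coffee gear", "label": 0},
]

def to_sft_sample(row):
    """Convert one labeled row into a chat-style SFT record for a Qwen2.5-VL model."""
    prompt = (
        f"Content: {row['text']}\n"
        f"Candidate tag: {row['tag']}\n"
        "Rate the relevance of the tag to the content: 2 (strong), 1 (weak), 0 (irrelevant)."
    )
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": row["image"]},
                {"type": "text", "text": prompt},
            ]},
            {"role": "assistant", "content": str(row["label"])},
        ]
    }

with open("relevance_sft.jsonl", "w", encoding="utf-8") as f:
    for row in labeled:
        f.write(json.dumps(to_sft_sample(row), ensure_ascii=False) + "\n")
```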
Multimodal Universal Implicit Vectors
We adopt Qwen2‑VL as the base multimodal model, which integrates a visual encoder, a mapping module, and an autoregressive language model. By introducing 2D rotary position encoding and multimodal rotary position encoding (M‑RoPE), the model handles dynamic resolutions and aligns text, image, and video modalities. Since the original training objective focuses on language generation, we fine‑tune the model for embedding learning.
Training uses LoRA (rank = 16) on a 7B base model, updating parameters of the visual encoder, the mapping module, and the language model.
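A minimal sketch of such a LoRA setup with Hugging Face PEFT is shown below. Only the rank of 16 and the 7B base come from the article; the alpha, dropout, and target module names are assumptions.

```python
import torch
from transformers import Qwen2VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16
)

lora_cfg = LoraConfig(
    r=16,                # rank reported in the article
    lora_alpha=32,       # assumption
    lora_dropout=0.05,   # assumption
    bias="none",
    # Attention projections of the language model; vision-encoder and merger
    # modules can be added by name to also adapt those components.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()
```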
Data Synthesis Strategies
1) Generate Chinese query‑positive‑sample pairs from source images using Qwen2.5‑VL‑72B, prompting the model to describe the image, then create queries, positives, and hard negatives, followed by self‑checking and refinement.
2) Use the trained model (M1) to retrieve candidate items, then re-rank with Qwen2.5-VL-72B to generate query-positive-negative triples, using relevance scores to select positives. Triples from both strategies then feed a contrastive training objective, sketched after this list.
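A common way to consume such (query, positive, hard negative) triples is an InfoNCE-style contrastive objective with in-batch negatives plus the explicit hard negative, as in the minimal sketch below. The temperature, batch shapes, and random tensors standing in for the embedding model's outputs are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(q, pos, neg, temperature=0.05):
    """q, pos, neg: [batch, dim] embeddings for queries, positives, and hard negatives."""
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    # Diagonal of q @ pos.T holds the true positives; other in-batch positives
    # and the per-query hard negative act as negatives.
    logits_pos = q @ pos.T                          # [B, B]
    logits_neg = (q * neg).sum(-1, keepdim=True)    # [B, 1] hard negative per query
    logits = torch.cat([logits_pos, logits_neg], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random vectors standing in for model outputs.
B, D = 8, 1024
loss = info_nce(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D))
print(loss.item())
```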
Evaluation
We translated the MMEB benchmark into Chinese and evaluated on classification, retrieval, visual QA, and localization tasks. Compared with the GME 7B baseline, our approach improves the overall MMEB-eval-zh score from 49.7% to 53.4%, a relative gain of 7.4%.
Application in Recommendation Scenarios
High‑dimensional explicit tags are used for new‑content tag recall, increasing exposure interaction rate by 3.26%. The tags also serve as ID‑type features in ranking models, improving long‑tail behavior modeling.
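One plausible way to use open-set tags as ID-type features is to hash each tag string into a fixed bucket vocabulary and look it up in a learnable embedding table inside the ranking model. The sketch below illustrates this; the bucket count, embedding dimension, and mean-pooling are assumptions rather than the production setup.

```python
import hashlib
import torch
import torch.nn as nn

NUM_BUCKETS, EMB_DIM = 200_000, 32   # illustrative sizes

def tag_bucket(tag: str) -> int:
    """Stable hash of an open-set tag string into a fixed ID vocabulary."""
    return int(hashlib.md5(tag.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS

class TagIdFeature(nn.Module):
    def __init__(self):
        super().__init__()
        self.table = nn.Embedding(NUM_BUCKETS, EMB_DIM)

    def forward(self, tags):
        ids = torch.tensor([tag_bucket(t) for t in tags])
        return self.table(ids).mean(dim=0)   # mean-pool when an item carries multiple tags

feat = TagIdFeature()(["pour-over coffee", "latte art"])
print(feat.shape)   # torch.Size([32])
```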
Multimodal vectors are integrated as ID features after quantization via RQ‑VAE, preserving semantic information while being learnable in embedding layers. Quantization offers flexible clustering granularity, low‑cost integration, and easy updates.
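At the core of RQ-VAE is residual quantization: each stage snaps the residual left by the previous stage to its nearest codebook entry, so an item vector becomes a short tuple of discrete code IDs that embedding layers can learn over. The sketch below shows only this lookup step with random codebooks; in an actual RQ-VAE the codebooks and the encoder/decoder are trained jointly against a reconstruction objective.

```python
import torch

def residual_quantize(x, codebooks):
    """x: [dim] item vector; codebooks: list of [codebook_size, dim] tensors, one per stage."""
    codes, residual = [], x.clone()
    for cb in codebooks:
        dists = torch.cdist(residual.unsqueeze(0), cb).squeeze(0)  # distance to every code
        idx = int(dists.argmin())
        codes.append(idx)
        residual = residual - cb[idx]   # the next stage quantizes what is left over
    return codes

dim, levels, k = 1024, 3, 256
codebooks = [torch.randn(k, dim) for _ in range(levels)]
item_vec = torch.randn(dim)
print(residual_quantize(item_vec, codebooks))   # e.g. three code IDs for this item
```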
We also fine-tune multimodal representations using collaborative-filtering signals, in two ways:
1) Use multimodal features as the sole content-side input to CTR models.
2) Construct positive pairs from itemCF and dual-tower item representations, sample negatives, and apply contrastive learning (a pair-construction sketch follows this list).
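A minimal sketch of how such positive pairs might be mined from itemCF similarities, with randomly sampled negatives, is shown below; the similarity threshold and negative count are illustrative assumptions, and the resulting triples would feed a contrastive objective like the one sketched earlier.

```python
import random

def build_pairs(itemcf_sim, pos_threshold=0.3, num_neg=5):
    """itemcf_sim: dict mapping (item_a, item_b) -> itemCF co-occurrence similarity."""
    all_items = sorted({i for pair in itemcf_sim for i in pair})
    triples = []
    for (a, b), sim in itemcf_sim.items():
        if sim >= pos_threshold:                      # high co-occurrence => positive pair
            pool = [i for i in all_items if i not in (a, b)]
            negatives = random.sample(pool, k=min(num_neg, len(pool)))
            triples.append({"anchor": a, "positive": b, "negatives": negatives})
    return triples

toy_sim = {("q1", "q2"): 0.8, ("q1", "q3"): 0.05, ("q2", "q4"): 0.5}
print(build_pairs(toy_sim))
```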
Offline Evaluation
Recall@TopK shows that fine‑tuned representations maintain visual similarity while enhancing collaborative relevance.
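For reference, Recall@TopK can be computed by retrieving each item's K nearest neighbors in embedding space and measuring how many ground-truth relevant items (for example, co-interacted items) are recovered. The sketch below assumes a simple in-memory layout and is not the authors' evaluation harness.

```python
import numpy as np

def recall_at_k(item_emb, relevant, k=10):
    """item_emb: [n, dim] L2-normalized item embeddings;
    relevant: dict mapping item index -> set of ground-truth relevant item indices."""
    sims = item_emb @ item_emb.T
    np.fill_diagonal(sims, -np.inf)                  # never retrieve the query item itself
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = sum(len(set(topk[i]) & rel) for i, rel in relevant.items())
    total = sum(len(rel) for rel in relevant.values())
    return hits / max(total, 1)

emb = np.random.randn(100, 64)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
print(recall_at_k(emb, {0: {1, 2}, 3: {4}}, k=10))
```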
Online Gains
Applying fine‑tuned representations to new‑content neighbor recall yields a 2.8% CTR lift and a 9.2% increase in exposure interaction rate.
Conclusion and Outlook
Large models bring strong semantic understanding and cross‑domain generalization to recommendation systems. This study demonstrates the practical value of multimodal large models in Zhihu’s recommendation pipeline, achieving significant gains through multimodal tag generation, residual quantization, and dual‑driven fine‑tuning.