How Multimodal Large Models Transform Recommendation Systems: From Tags to Embeddings

This article explores how multimodal large models like Qwen2.5‑VL enable high‑dimensional tag generation and universal embeddings for recommendation systems, detailing data synthesis, model training, quantization, fine‑tuning, and the resulting improvements in click‑through rate and exposure interaction rate.


Background

Recommendation systems have evolved from rule‑based methods and collaborative filtering to deep‑learning models. The rise of large language models brings new capabilities: powerful semantic understanding, multimodal perception, zero‑ and few‑shot learning, external knowledge integration, and cross‑domain generalization. Content platforms such as Zhihu face challenges in multimodal content understanding and cold‑start problems.

This work builds a full‑chain solution based on multimodal large models (e.g., Qwen2.5‑VL) to move from explicit tags to implicit representations, aiming to break modality barriers in recommendation.

Multimodal Content Understanding with Large Models

Two core outputs are produced: high‑dimensional tags and high‑dimensional vectors. Using models like Qwen2.5‑VL‑72B, we extract features from images, text, and video to generate explicit tags and, through contrastive learning and data synthesis, construct multimodal vectors.

Multimodal High‑dimensional Explicit Tags

These tags form a fine‑grained, high‑accuracy, open‑set label system that automatically updates by combining textual and visual information.

Tag Exploration

Various base models were evaluated; Qwen2.5‑VL‑72B‑Instruct performed best on benchmarks covering document understanding, visual QA, video understanding, and visual agents, showing strong zero‑shot capabilities.

Data Synthesis and Training

Qwen2.5‑VL‑72B was selected for data synthesis, generating ~20k textual ideas and 6.5k videos, followed by manual relevance labeling. Training follows the pipeline illustrated below.

Annotations are categorized as strong relevance (2), weak relevance (1), and irrelevant (0). Fine‑tuning Qwen2.5‑VL‑7B and Qwen2.5‑VL‑3B shows that even the 3B model achieves 80.23% accuracy on image‑based ideas.
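The labeled data described above could be packaged into instruction‑tuning records along these lines. This is a hypothetical sketch: the chat‑message structure, field names, and prompt wording are assumptions, not the article's actual data format.

```python
import json

# Map the article's relevance labels to target strings for the model.
LABEL_NAMES = {2: "strong relevance", 1: "weak relevance", 0: "irrelevant"}

def to_sft_record(image_path, item_text, label):
    """Turn one manually labeled (image, text, label) triple into a
    chat-style supervised fine-tuning sample (hypothetical schema)."""
    return {
        "messages": [
            {"role": "user",
             "content": f"<image>{image_path}</image>\n"
                        f"Rate the relevance of this content: {item_text}"},
            {"role": "assistant", "content": LABEL_NAMES[label]},
        ]
    }

record = to_sft_record("imgs/001.jpg", "How to train a vision model", 2)
print(json.dumps(record, ensure_ascii=False, indent=2))
```

Each record pairs the multimodal input with the target relevance string, so the fine‑tuned model learns to emit the label directly.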

Multimodal Universal Implicit Vectors

We adopt Qwen2‑VL as the base multimodal model, which integrates a visual encoder, a mapping module, and an autoregressive language model. By introducing 2D rotary position encoding and multimodal rotary position encoding (M‑RoPE), the model handles dynamic resolutions and aligns text, image, and video modalities. Since the original training objective focuses on language generation, we fine‑tune the model for embedding learning.
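To make the M‑RoPE idea concrete, here is a minimal sketch of plain 1‑D rotary position embedding, the building block that M‑RoPE extends by giving each token separate temporal, height, and width position components. The function name and the use of NumPy are illustrative choices, not Qwen2‑VL's implementation.

```python
import numpy as np

def rope_1d(x, positions, base=10000.0):
    """Rotate each (x1, x2) channel pair by an angle proportional to the
    token position. M-RoPE applies this same mechanism per position axis;
    this sketch shows only the 1-D case."""
    seq, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    ang = positions[:, None] * freqs[None, :]   # (seq, half) rotation angles
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

x = np.random.default_rng(0).normal(size=(5, 8))
out = rope_1d(x, np.arange(5, dtype=float))
# Rotations preserve vector norms, so token magnitudes are unchanged.
print(np.allclose(np.linalg.norm(out, axis=-1), np.linalg.norm(x, axis=-1)))  # → True
```

Because the encoding is a pure rotation, attention scores between rotated queries and keys depend only on relative positions, which is what lets the model generalize across dynamic resolutions.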

Training uses LoRA (rank = 16) on the 7B base model, updating parameters of the visual encoder, the mapping module, and the language model.
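The LoRA mechanism can be sketched in a few lines: the frozen weight is augmented with a trainable low‑rank product scaled by alpha/r. The rank of 16 matches the text; the alpha value, dimensions, and initialization scales below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 16, 32   # rank 16 as in the text; alpha is assumed

W = rng.normal(size=(d_in, d_out))            # frozen pretrained weight
A = rng.normal(scale=0.01, size=(d_in, r))    # trainable down-projection
B = np.zeros((r, d_out))                      # zero init: adapter starts as a no-op

def lora_forward(x):
    """Adapted layer: frozen path plus scaled low-rank update."""
    return x @ W + (alpha / r) * (x @ A) @ B

x = rng.normal(size=(4, d_in))
# With B = 0 the adapted layer matches the frozen layer exactly.
print(np.allclose(lora_forward(x), x @ W))  # → True
```

Only A and B are updated during fine‑tuning, which is why LoRA keeps the memory cost of adapting a 7B model manageable.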

Data Synthesis Strategies

1) Generate Chinese query‑positive‑sample pairs from source images using Qwen2.5‑VL‑72B, prompting the model to describe the image, then create queries, positives, and hard negatives, followed by self‑checking and refinement.

2) Use the trained model (M1) to retrieve candidate items, then re‑rank with Qwen2.5‑VL‑72B to generate query‑positive‑negative triples, using relevance scores to select positives.
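Strategy 2 above can be sketched as a retrieve‑then‑judge loop. The function names `retrieve` and `judge_relevance`, the thresholds, and the return shape are hypothetical stand‑ins for the M1 recall model and the Qwen2.5‑VL‑72B re‑ranker.

```python
def mine_triples(query, retrieve, judge_relevance, pos_thresh=2, neg_thresh=0):
    """Mine (query, positive, negative) triples: recall candidates with the
    first-stage model, re-score them with a larger judge model, and split
    by relevance score (thresholds are illustrative assumptions)."""
    candidates = retrieve(query)                          # M1 embedding recall
    scored = [(c, judge_relevance(query, c)) for c in candidates]
    positives = [c for c, s in scored if s >= pos_thresh]
    negatives = [c for c, s in scored if s <= neg_thresh]
    return [(query, p, n) for p in positives for n in negatives]

# Toy stand-ins for the two models:
fake_scores = {"a": 2, "b": 1, "c": 0}
triples = mine_triples("q", lambda q: ["a", "b", "c"], lambda q, c: fake_scores[c])
print(triples)  # → [('q', 'a', 'c')]
```

Weak‑relevance candidates (score 1) are dropped from both sides, which keeps the contrastive signal clean.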

Evaluation

We translated the MMEB benchmark to Chinese and evaluated on classification, retrieval, visual QA, and localization tasks. Compared with the GME 7B baseline, our approach improves the overall MMEB‑eval‑zh score from 49.7% to 53.4%, a 7.4% relative improvement.

Application in Recommendation Scenarios

High‑dimensional explicit tags are used for new‑content tag recall, increasing exposure interaction rate by 3.26%. The tags also serve as ID‑type features in ranking models, improving long‑tail behavior modeling.

Multimodal vectors are integrated as ID features after quantization via RQ‑VAE, preserving semantic information while being learnable in embedding layers. Quantization offers flexible clustering granularity, low‑cost integration, and easy updates.
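The residual quantization at the heart of RQ‑VAE can be sketched greedily: at each level, pick the nearest code, subtract it, and quantize the remainder. In RQ‑VAE the codebooks are learned end to end; here they are fixed inputs, and the function name is an illustrative choice.

```python
import numpy as np

def residual_quantize(v, codebooks):
    """Greedy residual quantization: each level encodes what the previous
    levels left over, so the code IDs form a coarse-to-fine semantic ID."""
    residual = v.copy()
    ids, recon = [], np.zeros_like(v)
    for cb in codebooks:                                   # cb: (K, dim)
        idx = int(np.argmin(((residual[None] - cb) ** 2).sum(-1)))
        ids.append(idx)
        recon += cb[idx]
        residual -= cb[idx]
    return ids, recon

# Toy example where two levels reconstruct the vector exactly:
v = np.array([3.0, 4.0])
codebooks = [np.array([[3.0, 0.0], [0.0, 0.0]]),
             np.array([[0.0, 4.0], [0.0, 0.0]])]
ids, recon = residual_quantize(v, codebooks)
print(ids, recon)  # → [0, 0] [3. 4.]
```

The resulting ID sequence is what gets fed into the ranking model's embedding layers: clustering granularity is controlled by codebook size and depth, and updating items only requires re‑encoding their vectors.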

We also fine‑tune the multimodal representations using collaborative‑filtering signals, via two approaches:

1) Use the multimodal features as the sole content‑side input to CTR models.

2) Construct positive pairs from itemCF and dual‑tower item representations, sample negatives, and apply contrastive learning.
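The contrastive objective in the second approach is typically an InfoNCE‑style loss over in‑batch negatives; the sketch below assumes that setup (same‑index rows are positives, all other rows serve as negatives) and an illustrative temperature.

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.07):
    """InfoNCE over a batch: pull each anchor toward its same-index
    positive while pushing it away from every other row in the batch."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                  # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))       # NLL of the matched pairs

z = np.eye(4)
# Matched pairs yield a much lower loss than shuffled pairs.
print(info_nce_loss(z, z) < info_nce_loss(z, z[::-1]))  # → True
```

Here the anchors would be the multimodal embeddings and the positives the itemCF/dual‑tower neighbors, so the fine‑tuned space absorbs collaborative structure without discarding visual semantics.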

Offline Evaluation

Recall@TopK shows that fine‑tuned representations maintain visual similarity while enhancing collaborative relevance.
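The Recall@TopK metric used here can be computed as below. The sketch assumes a single ground‑truth item per query for simplicity; the function name and cosine‑similarity retrieval are illustrative choices.

```python
import numpy as np

def recall_at_k(query_vecs, item_vecs, relevant, k=10):
    """Fraction of queries whose ground-truth item (relevant[i] is the
    item index for query i) appears among the top-k cosine neighbors."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
    top = np.argsort(-(q @ d.T), axis=1)[:, :k]     # top-k item indices per query
    hits = [rel in row for rel, row in zip(relevant, top)]
    return float(np.mean(hits))

# Perfectly aligned query/item vectors give Recall@1 of 1.0.
print(recall_at_k(np.eye(3), np.eye(3), [0, 1, 2], k=1))  # → 1.0
```

Comparing this metric before and after fine‑tuning, over both visually similar and co‑clicked ground truths, is what shows the representations keeping visual similarity while gaining collaborative relevance.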

Online Gains

Applying fine‑tuned representations to new‑content neighbor recall yields a 2.8% CTR lift and a 9.2% increase in exposure interaction rate.

Conclusion and Outlook

Large models bring strong semantic understanding and cross‑domain generalization to recommendation systems. This study demonstrates the practical value of multimodal large models in Zhihu’s recommendation pipeline, achieving significant gains through multimodal tag generation, residual quantization, and dual‑driven fine‑tuning.

Tags: multimodal AI, large language models, Embedding, recommendation systems, content tagging
Written by

Zhihu Tech Column

Sharing Zhihu tech posts and exploring community technology innovations.
