Can Item Language Models Bridge LLMs and Collaborative Filtering for Conversational Recommendation?
This paper identifies three challenges in applying large language models to recommendation systems and proposes the Item Language Model (ILM), which combines an item encoder with a frozen LLM. Extensive experiments demonstrate that language-item alignment and interaction knowledge significantly improve conversational recommendation performance.
Research Background
Traditional recommendation systems rely on implicit interaction signals (e.g., watch history) rather than explicit natural‑language feedback, making it difficult to compare their performance directly with large language models (LLMs). Conversational recommendation seeks to bridge this gap by allowing users to query the system in natural language. Three key challenges arise when integrating LLMs into conversational recommendation:
Current LLMs are trained only on natural‑language data and cannot jointly model interaction signals and text.
Annotating all relevant items or users leads to excessively long contexts, inflating inference cost.
Mapping collaborative‑filtering embeddings to token embeddings introduces a modality gap that requires additional fine‑tuning.
Method Framework
The proposed Item Language Model (ILM) adopts a two-stage training paradigm, inspired by BLIP-2's lightweight query transformer (Q-Former), to align item representations with the LLM's token space.
Stage 1 – Q‑Former Pre‑training
In the first stage, the Q-Former encoder is pre-trained on two objectives (a sketch of both losses follows this list):
Item-text alignment loss: learns to map collaborative-filtering item embeddings into the same space as textual item descriptions.
Contrastive loss: either an item-item contrastive loss (ILM-IT-II) or a user-item contrastive loss (ILM-IT-UI) that regularizes the representations and injects collaborative browsing information. User entities are treated as special items, and their matrix-factorization embeddings are fed to the item encoder.
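Both objectives can be written as InfoNCE-style contrastive losses. The following is a minimal PyTorch sketch, not the paper's exact formulation: the temperature, the loss weight `alpha`, and the pooling of the Q-Former's query outputs into a single vector per item are all assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE: row i of `a` and row i of `b` form the positive pair."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def stage1_loss(item_emb, text_emb, co_item_emb, alpha: float = 1.0):
    """Item-text alignment plus item-item contrastive regularization (ILM-IT-II)."""
    l_it = info_nce(item_emb, text_emb)               # align items with descriptions
    l_ii = info_nce(item_emb, co_item_emb)            # co-browsed items as positives
    return l_it + alpha * l_ii
```

For the ILM-IT-UI variant, `co_item_emb` would be replaced by user embeddings passed through the same item encoder, since users are treated as special items.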
A linear projection adapter is then trained to map the Q‑Former outputs to the frozen LLM’s embedding dimension.
Stage 2 – Integration with a Frozen LLM
During multi‑task fine‑tuning on conversational recommendation datasets, only the Q‑Former parameters and the adapter are updated; the LLM backbone remains frozen, preserving its pre‑trained linguistic knowledge.
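A minimal sketch of this setup, assuming PyTorch throughout; the adapter's dimensions and the learning rate are illustrative choices, not values from the paper.

```python
import torch
import torch.nn as nn

class ILMAdapter(nn.Module):
    """Linear projection from the Q-Former output width to the LLM embedding
    width. Dimensions are illustrative, not taken from the paper."""
    def __init__(self, qformer_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(qformer_dim, llm_dim)

    def forward(self, query_outputs: torch.Tensor) -> torch.Tensor:
        # query_outputs: (batch, num_queries, qformer_dim)
        return self.proj(query_outputs)

def build_stage2_optimizer(llm: nn.Module, qformer: nn.Module,
                           adapter: nn.Module, lr: float = 1e-4):
    """Freeze the LLM backbone and optimize only the Q-Former and adapter."""
    for p in llm.parameters():
        p.requires_grad = False
    trainable = list(qformer.parameters()) + list(adapter.parameters())
    return torch.optim.AdamW(trainable, lr=lr)
```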
Experiments
ELM‑24 Benchmark
The ILM is evaluated on the ELM-24 task, where performance is measured by the SC metric: the cosine similarity between decoded text and reference text, computed with Sentence-T5-11B embeddings. ILM outperforms a strong baseline (CoLLM with a two-layer MLP) and a variant with a randomly initialized Q-Former (ILM-rand).
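Assuming the public sentence-transformers checkpoint `sentence-t5-xxl` as a stand-in for Sentence-T5-11B, the SC score for one prediction could be computed roughly as follows:

```python
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

# Assumption: sentence-t5-xxl approximates the Sentence-T5-11B encoder in the paper.
encoder = SentenceTransformer("sentence-transformers/sentence-t5-xxl")

def sc_score(decoded: str, reference: str) -> float:
    """Cosine similarity between embeddings of the decoded and reference texts."""
    emb = encoder.encode([decoded, reference], convert_to_tensor=True)
    return F.cosine_similarity(emb[0:1], emb[1:2]).item()
```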
OpenP5 Benchmark
On the OpenP5 dataset the same two‑stage pipeline is applied. Results show consistent superiority of ILM over baseline methods and over ILM‑rand, confirming the benefit of the item‑language representation learning stage.
Ablation Studies
Impact of Stage 1: Removing the pre-training stage (ILM-rand) degrades performance across all datasets, demonstrating that the Q-Former must first learn item-language alignment. Different loss combinations were tested:
ILM‑IT (item‑text only)
ILM‑IT‑II (item‑text + item‑item contrastive)
ILM‑IT‑UI (item‑text + user‑item contrastive)
For the ML‑1M dataset, adding either contrastive loss improves results, whereas for the Beauty and Clothing datasets the effect is marginal, likely due to differing data sparsity.
Effect of Query Token Number: ILM feeds multiple learned query tokens into the Q-Former, producing several embeddings per item. Experiments varying the number of query tokens show that using multiple tokens generally yields higher SC scores than a single-embedding MLP baseline.
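A minimal stand-in for the learned query tokens, using a single cross-attention layer (the actual Q-Former stacks full transformer blocks; `num_queries = 8` is an illustrative value, not the paper's setting):

```python
import torch
import torch.nn as nn

class QueryPooler(nn.Module):
    """Learned query tokens that each cross-attend to the item-encoder
    output and yield one embedding per query."""
    def __init__(self, num_queries: int = 8, dim: int = 768, heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, item_feats: torch.Tensor) -> torch.Tensor:
        # item_feats: (batch, seq, dim) features from the item encoder.
        q = self.queries.unsqueeze(0).expand(item_feats.size(0), -1, -1)
        out, _ = self.cross_attn(q, item_feats, item_feats)
        return out                      # (batch, num_queries, dim) item embeddings
```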
Conclusion
The paper introduces a two‑stage training paradigm that injects collaborative‑filtering knowledge into a frozen LLM for conversational recommendation. Stage 1 pre‑trains a Q‑Former to encode collaborative embeddings into item‑language aligned vectors; Stage 2 fine‑tunes only the Q‑Former and a linear adapter while keeping the LLM parameters fixed. Extensive experiments on ELM‑24, OpenP5, and several public recommendation datasets demonstrate consistent performance gains and state‑of‑the‑art results, validating the effectiveness of aligning item representations with large language models.
Code example
Inspired by BLIP-2's use of a lightweight query transformer (Q-Former) to bridge modality gaps, this work adopts a Q-Former to produce item-language aligned representations; the overall model architecture is shown in the accompanying figure.
Training proceeds in two stages. In the first stage, shown in panels (a) and (b) of the architecture figure, the Q-Former encoder is pre-trained following the BLIP-2 recipe. Beyond the original item-text objectives, an item-item contrastive objective is added; it acts as a regularizer and encodes co-browsing information into the resulting item-language representations. Panel (c) illustrates how item-item contrastive learning improves the alignment between items and text. Throughout the first stage, collaborative-filtering embeddings serve as input to the item encoder, and users are treated as special items.
In the second stage, shown in panel (d), the Q-Former is integrated into a pre-trained large language model through a linear projection adapter layer and fine-tuned on conversational recommendation tasks in a multi-task fashion. During fine-tuning, only the Q-Former and adapter parameters are updated; the pre-trained LLM is kept frozen to preserve its pre-trained capabilities.
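Putting the pieces together, a hedged end-to-end inference sketch: every module name is a placeholder for a component described above, and the Hugging Face-style `inputs_embeds` call is an assumption about the LLM wrapper.

```python
import torch

def ilm_forward(llm, tokenizer, qformer, adapter, cf_item_emb, prompt: str):
    """CF embedding -> Q-Former -> adapter -> frozen LLM.
    All arguments are placeholder modules, not the paper's released code."""
    # 1. Encode the collaborative-filtering item embedding into query outputs,
    #    then project them into the LLM's embedding space.
    item_tokens = adapter(qformer(cf_item_emb))        # (B, num_queries, llm_dim)
    # 2. Embed the natural-language prompt with the frozen LLM's input embeddings.
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    text_tokens = llm.get_input_embeddings()(ids)      # (B, T, llm_dim)
    # 3. Prepend the item tokens to the text tokens and run the frozen LLM.
    inputs = torch.cat([item_tokens, text_tokens], dim=1)
    return llm(inputs_embeds=inputs)
```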