Aligning Collaborative Filtering with LLM Token Generation: The TCA4Rec Breakthrough
This paper introduces TCA4Rec, a framework that directly aligns item‑level collaborative‑filtering preferences with the token‑level training objective of large language models. Its two modules, validated by experiments across multiple datasets and backbones, deliver consistent gains in generative recommendation.
Background
Large language models (LLMs) excel at semantic modeling but cannot directly learn from collaborative filtering (CF) behavioral signals: CF expresses preferences at the item level, while LLMs are trained with token‑level next‑token prediction (NTP). This granularity mismatch limits recommendation performance.
Method
TCA4Rec aligns item‑level CF logits with the LLM's token‑level objective via two modules: a Collaborative Tokenizer and Soft Label Alignment.
Collaborative Tokenizer
Given LLM generation step j, the tokenizer performs three steps (a code sketch follows this list):
1. Collect the candidate items whose textual representation shares the current token prefix, so that only feasible continuations are considered.
2. Apply softmax to the CF logits of these candidates to obtain an item‑level probability distribution.
3. Aggregate the probabilities of items that map to the same next token, producing a token‑level CF distribution directly compatible with the LLM vocabulary.
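To make these steps concrete, here is a minimal PyTorch sketch of one decoding step, written from the description above; the function name, the list‑based prefix matching, and all variable names are illustrative assumptions rather than the paper's implementation.

```python
import torch

def collaborative_token_distribution(cf_logits, item_token_ids, prefix, vocab_size):
    """Map item-level CF logits to a token-level distribution at one decoding step.

    cf_logits:      (num_items,) tensor of CF scores for all candidate items.
    item_token_ids: list of token-id lists, one per item (its textual form).
    prefix:         list of token ids already generated for the current item.
    vocab_size:     size of the LLM vocabulary.
    """
    k = len(prefix)
    # Step 1: keep only items whose tokenization continues the current prefix.
    feasible = [i for i, toks in enumerate(item_token_ids)
                if len(toks) > k and toks[:k] == prefix]
    token_dist = torch.zeros(vocab_size)
    if not feasible:
        return token_dist  # no feasible continuation -> no CF signal this step
    # Step 2: softmax over the CF logits of the feasible items only.
    item_probs = torch.softmax(cf_logits[feasible], dim=0)
    # Step 3: sum the probabilities of items that share the same next token.
    for prob, i in zip(item_probs, feasible):
        token_dist[item_token_ids[i][k]] += prob
    return token_dist
```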
Soft Label Alignment
The token‑level CF distribution P_cf(t) is combined with the one‑hot ground‑truth token label y_onehot(t) using a weighting factor α ∈ [0, 1]:

y_soft = (1 − α) · y_onehot + α · P_cf

The resulting soft label is used as the target in the cross‑entropy loss, allowing the model to balance semantic fluency (LLM) and collaborative consistency (CF).
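A minimal sketch of the blended target and the resulting soft‑target cross‑entropy, assuming per‑step token‑level CF distributions like those produced above; the function name and the default α are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def soft_label_loss(llm_logits, target_ids, cf_token_dist, alpha=0.3):
    """Cross-entropy against y_soft = (1 - alpha) * y_onehot + alpha * P_cf.

    llm_logits:    (seq_len, vocab_size) LLM logits over the item's tokens.
    target_ids:    (seq_len,) ground-truth token ids.
    cf_token_dist: (seq_len, vocab_size) token-level CF distribution per step.
    alpha:         blending weight; alpha = 0 recovers standard NTP training.
    """
    vocab_size = llm_logits.size(-1)
    y_onehot = F.one_hot(target_ids, num_classes=vocab_size).float()
    y_soft = (1.0 - alpha) * y_onehot + alpha * cf_token_dist
    log_probs = F.log_softmax(llm_logits, dim=-1)
    # Soft-target cross-entropy, averaged over the sequence positions.
    return -(y_soft * log_probs).sum(dim=-1).mean()
```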
The framework is model‑agnostic: any CF model (e.g., SASRec, BERT4Rec) can provide logits, and any decoder‑based generative recommender (e.g., TallRec, LLaRA, CoLLM, MSL) can consume the soft labels without architectural changes.
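In training‑loop terms, that claim amounts to swapping the loss target while leaving both models untouched. Below is a hypothetical wiring of the two sketches above for one training example; cf_model, llm, user_history, input_ids, and the logit slicing are all illustrative assumptions, not the paper's code:

```python
# Hypothetical training step: the CF model (e.g., a SASRec-style scorer) and the
# LLM backbone keep their architectures; only the training target changes.
cf_logits = cf_model(user_history)  # (num_items,) item-level scores
target = target_ids.tolist()
# Build the token-level CF distribution for every position of the target item.
cf_dist = torch.stack([
    collaborative_token_distribution(cf_logits, item_token_ids,
                                     target[:j], vocab_size)
    for j in range(len(target))
])
llm_logits = llm(input_ids).logits[0, -len(target):]  # logits over the item span
loss = soft_label_loss(llm_logits, target_ids, cf_dist, alpha=0.3)
loss.backward()
```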
Experiments
Evaluations were performed on three public recommendation datasets (Toys, Sports, Office) using four LLM‑based generative backbones (TallRec, LLaRA, CoLLM, MSL). The primary metrics were NDCG@5 and Hit@5.
Across all dataset‑model combinations, integrating TCA4Rec yielded consistent improvements (e.g., +3–5% absolute NDCG@5). Model‑agnosticism was verified by applying TCA4Rec to semantic‑ID generators (TIGER, LETTER), which also showed notable gains.
Two ablation studies were conducted:
1. Removing the Collaborative Tokenizer (training with only the one‑hot label) reduced performance, confirming the necessity of the token‑level CF distribution.
2. Removing Soft Label Alignment (training with only the CF distribution) also degraded results, demonstrating the importance of blending semantic and collaborative signals.
Conclusion
TCA4Rec provides a plug‑and‑play mechanism to inject structured CF supervision into LLM token‑level training, improving both semantic quality and collaborative relevance without modifying the underlying recommendation model. The approach demonstrates strong model‑independence and opens avenues for incorporating other non‑linguistic signals into generative systems.