How Hierarchical LLMs Are Transforming Recommendation Systems – A Deep Dive into HLLM
This article provides a comprehensive analysis of the HLLM paper, detailing the motivation behind using large language models for recommendation, the hierarchical architecture of Item and User LLMs, the training objectives, extensive offline and online experiments, scaling behavior, and practical deployment insights.
Background
Traditional recommendation relies on ID‑based embeddings for users and items, which suffer from cold‑start issues and limited capacity to model diverse user interests. Recent work explores three directions for LLM‑based recommendation: providing auxiliary information to recommenders, turning recommenders into dialogue systems, and feeding non‑text features (e.g., IDs) directly to LLMs.
LLM4Rec faces two main challenges: inputs that are longer and more complex than those of ID-based methods, and so far only modest performance gains over strong baselines.
Key Research Questions
Can the knowledge stored in pretrained LLM weights be activated without restricting inputs to plain text?
Is fine‑tuning on recommendation tasks necessary, or can the pretrained model be used directly?
Do LLM‑based recommenders exhibit scaling laws similar to other LLM applications?
Proposed Architecture: Hierarchical Large Language Model (HLLM)
HLLM consists of two separate LLMs:
Item LLM: Takes an item's text (title, tags, full description) plus a special [ITEM] token appended at the end, and outputs a single item embedding.
User LLM: Consumes the sequence of item embeddings produced by the Item LLM (no word embeddings at all) and predicts the embedding of the next item; see the sketch below.
Both modules can be initialized from existing pretrained models such as TinyLlama‑1.1B or Baichuan2‑7B.
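Below is a minimal PyTorch sketch of this two-level flow, using small generic transformer towers; all class names, dimensions, and layer counts here are illustrative placeholders, not the paper's released code. The point is the data flow: item text tokens → one embedding per item → a causal sequence model over item embeddings.

```python
import torch
import torch.nn as nn

class ItemLLM(nn.Module):
    """Encodes an item's text tokens (with a trailing [ITEM] token) into one embedding."""
    def __init__(self, vocab_size=32000, dim=512, layers=2, heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, token_ids):                  # (batch, seq_len); last token is [ITEM]
        h = self.encoder(self.embed(token_ids))    # (batch, seq_len, dim)
        return h[:, -1]                            # hidden state at the [ITEM] position

class UserLLM(nn.Module):
    """Runs causal attention over item embeddings and predicts the next item embedding."""
    def __init__(self, dim=512, layers=2, heads=8):
        super().__init__()
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, layers)

    def forward(self, item_embs):                  # (batch, hist_len, dim)
        mask = nn.Transformer.generate_square_subsequent_mask(
            item_embs.size(1)).to(item_embs.device)
        return self.encoder(item_embs, mask=mask)  # position t predicts item t+1
```

In HLLM itself both towers are full pretrained decoder-only LLMs rather than these toy encoders; only the interface between them matters here: exactly one embedding per item crosses the boundary.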
Optimization Objectives
Recommendation can be framed as either generative or discriminative. HLLM combines both:
Generative loss: An InfoNCE contrastive loss in which the User LLM's predicted embedding is pulled toward the target item's embedding (the positive) and pushed away from randomly sampled item embeddings (the negatives).
Discriminative loss: Binary classification of whether a user would interact with a candidate item (click, like, purchase).
The final loss is a weighted sum of the generative loss, the discriminative loss, and an auxiliary next-item prediction loss; the sketch below illustrates the two core terms.
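As a hedged illustration (the temperature, head shape, and loss weights below are assumptions, not values from the paper), the two core terms can be written as:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def infonce_loss(pred, pos, negs, temperature=0.07):
    """Generative term. pred/pos: (batch, dim); negs: (batch, num_neg, dim)."""
    pred = F.normalize(pred, dim=-1)
    cands = F.normalize(torch.cat([pos.unsqueeze(1), negs], dim=1), dim=-1)
    logits = torch.einsum('bd,bkd->bk', pred, cands) / temperature
    labels = torch.zeros(pred.size(0), dtype=torch.long,
                         device=pred.device)       # the positive sits at index 0
    return F.cross_entropy(logits, labels)

class InteractionHead(nn.Module):
    """Discriminative term: will this user interact with this candidate item?"""
    def __init__(self, dim=512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, user_emb, item_emb, label):  # label in {0.0, 1.0}
        logit = self.mlp(torch.cat([user_emb, item_emb], dim=-1)).squeeze(-1)
        return F.binary_cross_entropy_with_logits(logit, label)

def total_loss(gen, disc, aux, w_gen=1.0, w_disc=1.0, w_aux=0.1):
    """Weighted combination; these weights are placeholders, not tuned values."""
    return w_gen * gen + w_disc * disc + w_aux * aux
```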
Experiments
Datasets and Baselines
Public datasets: PixelRec and Amazon Books. Baselines: SASRec and HSTU.
Model Variants
Two sizes were evaluated:
HLLM‑1B built on TinyLlama‑1.1B.
HLLM‑7B built on Baichuan2‑7B.
Training used three stages: (1) end-to-end training of both the Item and User LLMs on user histories truncated to 150 items; (2) freezing the Item LLM, caching all item embeddings, and training the User LLM on longer histories of up to 1,000 items; (3) fixing both LLMs and training a downstream model on the extracted user and item features. A sketch of the stage-2 loop follows.
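Here is a minimal sketch of that stage-2 pattern, reusing the UserLLM and infonce_loss from the earlier sketches; the cache layout and ID-to-row indexing are assumptions for illustration.

```python
import torch

@torch.no_grad()
def build_item_cache(item_llm, all_item_tokens, batch_size=256):
    """Stage 2 precomputes every item embedding once with the frozen Item LLM."""
    item_llm.eval()
    chunks = [item_llm(all_item_tokens[i:i + batch_size])
              for i in range(0, len(all_item_tokens), batch_size)]
    return torch.cat(chunks)                       # (num_items, dim)

def stage2_step(user_llm, cache, hist_ids, target_ids, neg_ids, optimizer):
    """One User-LLM update over long histories; no item text is re-encoded."""
    hist_embs = cache[hist_ids]                    # (batch, hist_len, dim) table lookup
    pred = user_llm(hist_embs)[:, -1]              # predicted next-item embedding
    loss = infonce_loss(pred, cache[target_ids], cache[neg_ids])  # from the loss sketch
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The payoff is that the expensive text encoding runs exactly once per item, so histories can grow much longer without ever re-running the Item LLM.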
Offline experiments used the generative setup for a fair comparison with the baselines, while online A/B tests used the discriminative setup with Late Fusion for compatibility with the production ranking stack; a serving-side sketch follows.
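To make the Late Fusion point concrete: the user embedding is computed once per request, candidate item embeddings come from the offline cache, and only a small head runs per candidate. This hedged sketch reuses the InteractionHead above; the shapes and head are assumptions, not the production architecture.

```python
import torch

@torch.no_grad()
def score_candidates(user_llm, head, cache, hist_ids, candidate_ids):
    """Late Fusion serving: one User-LLM pass per request, then a cheap head per candidate."""
    user_emb = user_llm(cache[hist_ids])[:, -1]    # (batch, dim), computed once per request
    cand_embs = cache[candidate_ids]               # (batch, num_cand, dim), from the cache
    u = user_emb.unsqueeze(1).expand_as(cand_embs) # broadcast the user over candidates
    logits = head.mlp(torch.cat([u, cand_embs], dim=-1)).squeeze(-1)
    return logits                                  # (batch, num_cand) interaction scores
```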
Results show that HLLM consistently outperforms SASRec and HSTU on R@200, with a 0.705% lift in key online metrics.
Scaling Experiments
Both Item and User LLMs exhibit scaling behavior: larger model sizes yield higher performance, confirming the presence of a scaling law.
Ablation Studies
Item LLM: Best performance comes from the input format Tag + Title + Description with an item-text length of 256 tokens; pooling via the special [ITEM] token outperforms mean pooling.
User LLM: The optimal user history length is 50 items; LLM-derived item embeddings surpass traditional ID embeddings, and adding timestamp embeddings improves results further.
Industrial Setting: In large-scale production (e.g., TikTok), the 7B variant with a user sequence length of 1,000 gives the best trade-off between accuracy and latency. Caching item embeddings cuts inference cost without sacrificing quality, and the gap between caching and full fine-tuning narrows as pretraining data grows.
Conclusions
HLLM demonstrates that hierarchical LLMs can effectively extract item features and model user interests, integrating pretrained knowledge into recommendation pipelines. Fine‑tuning on the recommendation objective is crucial, and larger models show clear scaling benefits. Offline benchmarks and online A/B tests confirm that HLLM surpasses ID‑based methods while maintaining comparable serving efficiency, marking a significant step forward for LLM‑driven recommendation.