Can Hierarchical LLMs Transform Sequential Recommendation? A Deep Dive

This article provides a comprehensive analysis of the HLLM paper, detailing its hierarchical LLM architecture for item and user modeling, the training objectives, fusion strategies, extensive offline and online experiments, scaling behavior, ablation studies, and practical deployment insights in large‑scale recommendation systems.


Background

Traditional recommendation relies on ID‑based embeddings for users and items, which suffer from cold‑start problems and have limited capacity to model diverse interests. Recent LLM‑for‑recommendation (LLM4Rec) work explores using LLMs to generate auxiliary information, to power dialogue‑based recommendation, or to consume non‑textual features (e.g., IDs) directly. The main challenges so far are the long input sequences LLMs require and the modest performance gains reported.

Method

Hierarchical LLM (HLLM) architecture

HLLM consists of two independent LLMs:

Item LLM encodes item textual metadata (title, tags, description) together with a special [ITEM] token. The hidden state of [ITEM] is taken as the item embedding.

User LLM receives a sequence of item embeddings generated by the Item LLM and predicts the next item embedding.

Both LLMs are fine‑tuned from pretrained models such as Llama, Baichuan, or TinyLlama.
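
To make the two‑level design concrete, here is a minimal PyTorch sketch of the architecture. It assumes Hugging Face‑style decoder models that accept inputs_embeds and expose last_hidden_state; the class and method names are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class HLLMSketch(nn.Module):
    """Minimal sketch of the hierarchical Item LLM / User LLM pair."""

    def __init__(self, item_llm, user_llm, hidden_dim):
        super().__init__()
        self.item_llm = item_llm  # any decoder returning per-token hidden states
        self.user_llm = user_llm
        # Learnable embedding for the special [ITEM] token appended to each
        # item's text; its final hidden state becomes the item embedding.
        self.item_token = nn.Parameter(torch.randn(1, 1, hidden_dim))

    def encode_item(self, text_embeds):
        # text_embeds: (B, T, D) token embeddings of title/tags/description.
        item_tok = self.item_token.expand(text_embeds.size(0), -1, -1)
        inputs = torch.cat([text_embeds, item_tok], dim=1)
        hidden = self.item_llm(inputs_embeds=inputs).last_hidden_state
        return hidden[:, -1]  # hidden state at the [ITEM] position

    def encode_user(self, item_embeds):
        # item_embeds: (B, history_len, D) stacked outputs of encode_item.
        hidden = self.user_llm(inputs_embeds=item_embeds).last_hidden_state
        return hidden  # hidden[:, t] ≈ predicted embedding of item t+1
```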

Training objectives

Recommendation is modeled as both a generative and a discriminative task.

Generative loss: InfoNCE contrastive loss where the positive sample is the true next‑item embedding and negatives are randomly sampled item embeddings.

Discriminative loss: binary cross‑entropy on whether a candidate item is relevant.

An auxiliary next‑item prediction loss is also added. The total loss is L = L_InfoNCE + L_BCE + λ·L_next, where λ balances the auxiliary term (both main terms are sketched below).
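
A hedged PyTorch sketch of the two main terms follows; the temperature tau and the cosine‑normalized embeddings are assumptions, and the auxiliary L_next term would be weighted in by λ as in the formula above.

```python
import torch
import torch.nn.functional as F

def infonce_loss(pred, pos, negs, tau=0.07):
    # Generative term: pull the predicted next-item embedding toward the
    # true next item (pos) and away from K random negatives (negs).
    # pred: (B, D); pos: (B, D); negs: (B, K, D). tau is illustrative.
    pred, pos, negs = (F.normalize(t, dim=-1) for t in (pred, pos, negs))
    pos_logit = (pred * pos).sum(-1, keepdim=True)       # (B, 1)
    neg_logits = torch.einsum("bd,bkd->bk", pred, negs)  # (B, K)
    logits = torch.cat([pos_logit, neg_logits], dim=1) / tau
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive sits at index 0

def bce_loss(scores, is_relevant):
    # Discriminative term: candidate relevance as binary classification.
    return F.binary_cross_entropy_with_logits(scores, is_relevant.float())
```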

Fusion strategies

Two ways to combine item embeddings with the User LLM:

Early Fusion: concatenate item embeddings to the token sequence of the User LLM, allowing early interaction but increasing computation.

Late Fusion: cache item embeddings and combine them with the User LLM output only at the final classification layer (sketched below). The production system uses Late Fusion for efficiency.
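
Because the interaction happens only in a small head, Late Fusion lets the Item LLM run entirely offline. A sketch, with the caveat that the paper does not prescribe the head's exact shape, so this two‑layer MLP is an assumption:

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    # The candidate item embedding meets the user representation only
    # here, at the final scoring layer, so item embeddings can be
    # precomputed and cached instead of recomputed per request.
    def __init__(self, hidden_dim):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, user_repr, cached_item_embed):
        # user_repr: (B, D) final User LLM state;
        # cached_item_embed: (B, D) looked up from an offline table.
        fused = torch.cat([user_repr, cached_item_embed], dim=-1)
        return self.scorer(fused).squeeze(-1)  # (B,) relevance logits
```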

Training procedure

Training proceeds in three stages:

Stage 1: joint end‑to‑end training of the Item LLM and User LLM on truncated user histories (≤150 items) to speed up training.

Stage 2: the trained Item LLM generates embeddings for the entire item catalog; these embeddings are frozen while the User LLM continues training on longer histories (up to 1,000 items).

Stage 3: both LLMs are frozen; the extracted embeddings are used as features for downstream online models (a freezing sketch follows this list).
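
In code, the stages mostly amount to toggling which parameters receive gradients and precomputing the catalog embeddings once. A sketch reusing HLLMSketch from earlier; the helper names and the catalog mapping are hypothetical:

```python
import torch

def set_trainable(module, flag):
    # Freeze or unfreeze an entire submodule between stages.
    for p in module.parameters():
        p.requires_grad = flag

def precompute_item_embeddings(model, catalog):
    # Stage 2 prep: one offline Item LLM pass over the whole catalog.
    # catalog: dict of item_id -> text token embeddings, assumed (1, T, D).
    with torch.no_grad():
        return {iid: model.encode_item(text) for iid, text in catalog.items()}

# Stage 1: both LLMs trainable, histories truncated to ~150 items.
#   set_trainable(model.item_llm, True); set_trainable(model.user_llm, True)
# Stage 2: freeze the Item LLM and train the User LLM on cached embeddings.
#   set_trainable(model.item_llm, False)
#   item_table = precompute_item_embeddings(model, catalog)
# Stage 3: freeze both; export embeddings as features for online models.
#   set_trainable(model.user_llm, False)
```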

Model variants

HLLM‑1B: built on TinyLlama‑1.1B.

HLLM‑7B: built on Baichuan2‑7B.

Experiments

Datasets and baselines

Public benchmarks: PixelRec and Amazon Books. Baselines: SASRec and the earlier HSTU model.

Results

Offline evaluation shows that both HLLM‑1B and HLLM‑7B achieve higher Recall and NDCG than the baselines on both datasets. An online A/B test on the Douyin platform, using HLLM‑1B with discriminative training and Late Fusion, yields a 0.705% lift in key business metrics.

Scaling law

Performance improves monotonically with model size; the 7B variant consistently outperforms the 1B variant, confirming a scaling law for both the Item and User LLMs.

Ablation studies

Item LLM: the best results use title + tags + description, a maximum text length of 256 tokens, and the dedicated [ITEM] token, which outperforms mean‑pooling over token states.

User LLM: the optimal input sequence length is 50 items; adding timestamp embeddings improves accuracy, while concatenating raw ID embeddings degrades performance.

Industrial scenario: caching item embeddings reduces inference latency with negligible accuracy loss (see the cache sketch after this list); the larger 7B model and 1,000‑item user sequences further improve metrics.
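
A minimal sketch of the serving‑time cache the ablation refers to; the in‑process dict stands in for whatever key‑value store a production system would actually use:

```python
import torch

class ItemEmbeddingCache:
    # Item embeddings are computed once by the (expensive) Item LLM and
    # looked up afterwards, so only the User LLM runs per request.
    def __init__(self, item_encoder):
        self.item_encoder = item_encoder  # e.g. HLLMSketch.encode_item
        self._store = {}

    def get(self, item_id, text_embeds):
        if item_id not in self._store:
            with torch.no_grad():  # encode once, reuse on every request
                self._store[item_id] = self.item_encoder(text_embeds)
        return self._store[item_id]
```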

Conclusion

HLLM demonstrates that a hierarchical combination of two LLMs can effectively extract rich item features and model user interests, integrating pretrained knowledge into recommendation pipelines. Fine‑tuning on recommendation objectives is essential, and the approach scales well with model size. Empirical results on public datasets and real‑world A/B tests confirm that HLLM surpasses traditional ID‑based methods while maintaining comparable serving efficiency.

Paper: https://arxiv.org/abs/2409.12740

Code: https://github.com/bytedance/HLLM

Tags: Recommendation · LLM · Scaling Law · Sequential Modeling · Industrial Deployment