Can Hierarchical LLMs Transform Sequential Recommendation? A Deep Dive
This article provides a comprehensive analysis of the HLLM paper, covering its hierarchical LLM architecture for item and user modeling, the training objectives, fusion strategies, offline and online experiments, scaling behavior, ablation studies, and practical deployment insights for large‑scale recommendation systems.
Background
Traditional recommenders represent users and items with ID‑based embeddings, which suffer from cold‑start problems and limited capacity to model diverse interests. Recent LLM‑for‑recommendation (LLM4Rec) work explores LLMs for auxiliary information, dialogue‑based recommendation, or feeds non‑textual features (e.g., IDs) directly into LLMs. The main challenges are the long input sequences LLMs require and the modest performance gains reported so far.
Method
Hierarchical LLM (HLLM) architecture
HLLM consists of two independent LLMs:
Item LLM encodes item textual metadata (title, tags, description) together with a special [ITEM] token. The hidden state of [ITEM] is taken as the item embedding.
User LLM receives a sequence of item embeddings generated by the Item LLM and predicts the next item embedding.
Both LLMs are fine‑tuned from pretrained models such as Llama, Baichuan, or TinyLlama.
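The [ITEM]-token trick above can be sketched in a few lines. This is a toy stand-in, not the paper's code: the stand-in "forward pass" is just a random matrix of final-layer hidden states, and the function name is hypothetical.

```python
import numpy as np

def extract_item_embedding(hidden_states: np.ndarray) -> np.ndarray:
    """The special [ITEM] token is appended as the last input position,
    so its final-layer hidden state (the last row) is the item embedding."""
    return hidden_states[-1]

# Toy stand-in for an Item LLM forward pass over item text:
# (seq_len, hidden_dim) final-layer hidden states.
rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 64))   # 11 text tokens + trailing [ITEM] token
item_emb = extract_item_embedding(hidden)
```

Because the embedding depends only on the item's text, it can be precomputed once per item and cached, which is what enables the Late Fusion serving path described below.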
Training objectives
Recommendation is modeled as both a generative and a discriminative task.
Generative loss: an InfoNCE contrastive loss in which the positive is the embedding of the true next item and the negatives are randomly sampled item embeddings.
Discriminative loss: binary cross‑entropy on whether a candidate item is relevant.
When training discriminatively, next‑item prediction is kept as an auxiliary objective, giving the total loss L = L_BCE + λ·L_next, where λ weights the auxiliary term.
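A minimal NumPy sketch of the two objectives. The function names, cosine similarity, and temperature value are illustrative assumptions, not taken from the paper's released code:

```python
import numpy as np

def info_nce(pred, pos, negs, tau=0.07):
    """Generative loss: pull the User LLM's predicted embedding toward the
    true next-item embedding and push it away from sampled negatives."""
    def cos(a, b):
        return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(pred, pos)] + [cos(pred, n) for n in negs]) / tau
    logits -= logits.max()                        # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                      # positive sits at index 0

def bce(logit, label):
    """Discriminative loss: binary cross-entropy on a candidate's relevance logit."""
    p = 1.0 / (1.0 + np.exp(-logit))
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))
```

The InfoNCE term is low when the predicted embedding matches the true next item far better than the negatives; the BCE term scores a single candidate's relevance.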
Fusion strategies
Two ways to combine item embeddings with the User LLM:
Early Fusion: concatenate item embeddings into the token sequence of the User LLM, allowing early interaction at the cost of extra computation.
Late Fusion: cache item embeddings and combine them with the User LLM output only at the final classification layer. The production system uses Late Fusion for efficiency.
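Late Fusion reduces candidate scoring to a dot product against cached embeddings. A minimal sketch, assuming a simple inner-product score (the scoring head and dimensions are illustrative):

```python
import numpy as np

def late_fusion_score(user_state, item_embs):
    """Score cached candidate item embeddings against the User LLM's final
    state with a dot product - candidates never enter the LLM's context."""
    return item_embs @ user_state

rng = np.random.default_rng(1)
user_state = rng.normal(size=64)
catalog = rng.normal(size=(1000, 64))   # precomputed, cached Item LLM embeddings
scores = late_fusion_score(user_state, catalog)
top_k = np.argsort(scores)[::-1][:10]   # indices of the 10 highest-scoring items
```

Because the expensive LLM passes are decoupled, the catalog side can be refreshed offline while serving only runs the User LLM once per request.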
Training procedure
Training proceeds in three stages:
Stage 1: joint end‑to‑end training of the Item LLM and User LLM on truncated user histories (≤150 items) to speed up training.
Stage 2: the trained Item LLM generates embeddings for the entire item catalog; these embeddings are frozen while the User LLM continues training on longer histories (up to 1,000 items).
Stage 3 : both LLMs are frozen; the extracted embeddings are used as features for downstream online models.
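The three stages can be summarized as a small configuration table. This encoding (dataclass name, field names, history lengths per stage) is a hypothetical illustration of which components are trainable when, following the description above:

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    train_item_llm: bool   # whether Item LLM weights are updated
    train_user_llm: bool   # whether User LLM weights are updated
    max_history: int       # user-history length (in items) used in this stage

STAGES = {
    1: StageConfig(train_item_llm=True,  train_user_llm=True,  max_history=150),
    2: StageConfig(train_item_llm=False, train_user_llm=True,  max_history=1000),
    3: StageConfig(train_item_llm=False, train_user_llm=False, max_history=1000),
}
```

The pattern is progressive freezing: each stage fixes the expensive component trained in the previous one, so later stages can afford longer histories and cheaper serving.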
Model variants
HLLM‑1B : built on TinyLlama‑1.1B.
HLLM‑7B : built on Baichuan2‑7B.
Experiments
Datasets and baselines
Public benchmarks: PixelRec and Amazon Books. Baselines: SASRec and the earlier HSTU model.
Results
Offline evaluation shows that both HLLM‑1B and HLLM‑7B achieve higher Recall and NDCG than the baselines on both datasets. An online A/B test on the Douyin platform, using HLLM‑1B with discriminative training and Late Fusion, yields a 0.705% lift in key business metrics.
Scaling law
Performance improves monotonically with model size; the 7B variant consistently outperforms the 1B variant, consistent with a scaling trend for both the Item and User LLMs.
Ablation studies
Item LLM : best results use title + tags + description, max length = 256 tokens, and the dedicated [ITEM] token (outperforms mean‑pooling).
User LLM: optimal input sequence length = 50 items; adding timestamp embeddings improves accuracy, while concatenating raw ID embeddings degrades performance.
Industrial scenario : caching item embeddings reduces inference latency with negligible accuracy loss; larger 7 B models and 1 k user sequence length further improve metrics.
Conclusion
HLLM demonstrates that a hierarchical combination of two LLMs can effectively extract rich item features and model user interests, integrating pretrained knowledge into recommendation pipelines. Fine‑tuning on recommendation objectives is essential, and the approach scales well with model size. Empirical results on public datasets and real‑world A/B tests confirm that HLLM surpasses traditional ID‑based methods while maintaining comparable serving efficiency.
Paper: https://arxiv.org/abs/2409.12740
Code: https://github.com/bytedance/HLLM
Baobao Algorithm Notes
Author of the BaiMian large model, offering technology and industry insights.
