UQABench: A Personalized QA Benchmark for Evaluating User Embeddings in LLM‑Driven Recommendation Systems
UQABench introduces the first benchmark for assessing the high‑density user embeddings that serve as soft prompts in LLM‑driven recommendation. It features a three‑stage pre‑train–align–evaluate pipeline and seven personalized QA tasks, and finds that transformer encoders, side information, simple linear adapters, and larger models markedly improve accuracy while cutting LLM input tokens to about five percent of a pure‑text prompt.
Large language models (LLMs) are reshaping recommendation systems and personalized question‑answering by leveraging their strong semantic understanding. However, directly feeding long user click sequences into LLMs faces two major issues: (1) efficiency bottlenecks because sequences can contain tens of thousands of tokens, exceeding LLM context windows; (2) noise from redundant or erroneous clicks that degrades personalization.
The proposed solution compresses user behavior into high‑density user embeddings (soft prompts) that guide LLM generation. To systematically assess the quality of these embeddings, the authors introduce UQABench, the first benchmark that evaluates user‑embedding effectiveness for personalized QA.
UQABench follows a three‑step pipeline: pre‑training a user encoder (e.g., SASRec, HSTU) on massive click data; alignment, in which a lightweight adapter (a linear mapping or a Q‑Former) bridges the encoder output into the LLM's semantic space; and evaluation on a curated test set of 7,000 questions.
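The align stage can be sketched in a few lines. This is a minimal, dependency‑free illustration, not the paper's implementation: the encoder is a stand‑in, and all dimensions, weights, and function names are assumptions chosen only to make the data flow concrete.

```python
# Minimal sketch of the pre-train -> align flow described above.
# ENC_DIM / LLM_DIM and the toy encoder are illustrative assumptions.

ENC_DIM = 64    # user-encoder output dimension (assumed)
LLM_DIM = 128   # LLM hidden size (assumed)

def encode_behaviors(click_seq):
    """Stand-in for a pre-trained sequential encoder (e.g. SASRec/HSTU):
    maps each clicked item id to a dense state vector."""
    return [[(item * (d + 1)) % 7 / 7.0 for d in range(ENC_DIM)]
            for item in click_seq]

def linear_adapter(states, weight):
    """Align stage: mean-pool the encoder states, then project the pooled
    vector into the LLM embedding space with one linear map (the simplest
    adapter the benchmark considers)."""
    n = len(states)
    pooled = [sum(s[d] for s in states) / n for d in range(ENC_DIM)]
    return [sum(pooled[i] * weight[i][j] for i in range(ENC_DIM))
            for j in range(LLM_DIM)]

# Toy projection weights, for illustration only.
W = [[0.01 * ((i + j) % 5) for j in range(LLM_DIM)] for i in range(ENC_DIM)]

soft_prompt = linear_adapter(encode_behaviors([3, 14, 15, 92, 65]), W)
# The resulting vector would be prepended to the LLM's input embeddings
# as a soft prompt; here we only check it lives in the LLM's space.
print(len(soft_prompt))  # 128, the assumed LLM hidden size
```

The evaluate stage then asks the LLM the benchmark's personalized questions with this soft prompt prepended in place of the raw click history.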
The benchmark defines three task dimensions covering seven sub‑tasks: sequence understanding (recovering explicit or cross‑feature information), action prediction (next‑item or attribute prediction), and interest perception (short‑term, long‑term, and interest‑drift inference). These tasks reflect both traditional recommendation metrics and the broader vision of LLM‑enhanced personalization.
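The taxonomy above can be laid out as a small lookup table. The sub‑task names here are paraphrased from the summary in this post, not copied from the paper's exact terminology:

```python
# Three task dimensions covering seven sub-tasks (names paraphrased).
UQABENCH_TASKS = {
    "sequence understanding": ["explicit-information recovery",
                               "cross-feature recovery"],
    "action prediction": ["next-item prediction",
                          "next-attribute prediction"],
    "interest perception": ["short-term interest",
                            "long-term interest",
                            "interest drift"],
}

total = sum(len(v) for v in UQABENCH_TASKS.values())
print(total)  # 7 sub-tasks across 3 dimensions
```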
Key experimental findings include:
Transformer‑based encoders (e.g., HSTU) outperform recurrence‑style baselines (GRU4Rec, Mamba) at representing long‑term user interests.
Incorporating side information (category, brand, title) alongside IDs improves embedding quality.
Even a simple linear‑mapping + mean‑pooling adapter yields strong performance, while more complex adapters are sensitive to hyper‑parameters.
Embedding‑based prompts achieve comparable accuracy to pure‑text prompts while reducing LLM input tokens to ~5% of the original, dramatically lowering inference cost.
Scaling the encoder from 3 M to 1.2 B parameters shows a clear scaling law between model size and performance, suggesting that enlarging the offline encoder can continuously boost online personalization without extra inference overhead.
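The ~5% token figure in the findings above can be made concrete with back‑of‑the‑envelope arithmetic. Every number below is a hypothetical assumption chosen only to illustrate how compressing a textual click history into a handful of soft‑prompt vectors shrinks the LLM input:

```python
# Hypothetical token budgets: a click history rendered as text vs. the
# same history compressed into soft-prompt vectors. All values assumed.
TEXT_TOKENS_PER_ITEM = 20   # tokens to render one click as text (assumed)
HISTORY_LENGTH = 1000       # clicks in the user history (assumed)
QUESTION_TOKENS = 50        # tokens in the QA question itself (assumed)
SOFT_PROMPT_SLOTS = 950     # embedding vectors prepended to the LLM (assumed)

text_prompt_tokens = TEXT_TOKENS_PER_ITEM * HISTORY_LENGTH + QUESTION_TOKENS
embed_prompt_tokens = SOFT_PROMPT_SLOTS + QUESTION_TOKENS
ratio = embed_prompt_tokens / text_prompt_tokens
print(round(ratio, 3))  # 0.05, i.e. ~5% of the text-prompt budget
```

Because the soft‑prompt budget is fixed, the saving grows with history length: the longer the click sequence, the larger the fraction of the context window the embedding prompt reclaims.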
The dataset comprises 180 k active Taobao users and 1 M items, with privacy‑preserving anonymization. Resources are publicly available: paper (https://arxiv.org/abs/2502.19178), code repository (https://github.com/OpenStellarTeam/UQABench), and dataset on Kaggle.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.