
Universal Recommendation Model (URM): A General Large‑Model Recall System for Advertising

The article presents the Universal Recommendation Model (URM), a large‑language‑model‑based recall framework that integrates world knowledge and e‑commerce expertise through knowledge injection and prompt‑driven alignment, achieving significant offline recall gains and a 3.1% increase in ad consumption while meeting high‑QPS, low‑latency production constraints.

Alimama Tech

Recently, Alimama co‑hosted a tutorial on computational advertising algorithms at the International World Wide Web Conference (WWW), where it introduced the technical evolution of the field and unveiled the Universal Recommendation Model (URM), a general large‑model recall system in its LMA2 advertising model series.

1. Overview

With the rapid development of large language models (LLMs), capabilities such as world knowledge and logical reasoning open new possibilities for recommendation and advertising systems. Existing approaches often use LLMs merely as feature extractors or relevance classifiers, limiting their impact. URM directly employs LLMs in an end‑to‑end fashion, addressing two core challenges: (1) LLMs lack e‑commerce‑specific knowledge, and (2) representing user behavior as text leads to long, low‑density inputs.

URM solves these issues by injecting domain knowledge and aligning information, allowing the LLM to act as an expert that combines world and e‑commerce knowledge. Leveraging the LLM base directly gives URM natural advantages in multitask, multimodal, and long‑sequence understanding. Offline recall metrics show URM surpasses traditional models, and it can generate recall results guided by prompts. An asynchronous inference pipeline was designed to meet low‑latency, high‑QPS requirements.

Deployed in Alimama display advertising, URM has delivered a 3.1% increase in overall consumption and boosted exposure and conversion efficiency for long‑tail ads.

2. Universal Recommendation Model (URM)

The recall goal is to select the highest‑value subset from a candidate pool for each request.

In URM, user and candidate features are encoded as a mixture of text tokens and ID tokens. Text tokens are embedded by the LLM's own embedding table, while ID tokens are mapped to the same dimension via an additional hash table followed by an MLP projection. The LLM's Transformer layers then generate a user‑embedding token ([UM]), used to decode product IDs, and a text token ([LM]), used to decode textual predictions such as categories. By changing the prompt description, the same model can produce different recall sets.
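A minimal sketch of the two embedding paths described above may help. Everything here is a stand‑in: the table sizes, dimensions, the `hash_bucket` scheme, and the single linear layer standing in for the MLP are all illustrative assumptions, not the production design.

```python
import hashlib
import random

random.seed(0)
EMB_DIM = 8          # toy dimension; production models use far larger dims
NUM_BUCKETS = 1000   # size of the hypothetical ID hash table

# Stand-ins for the two embedding paths: the LLM's text-token table
# and a separate hash table for product-ID tokens.
text_table = {w: [random.gauss(0, 0.02) for _ in range(EMB_DIM)]
              for w in ["user", "clicked", "recommend"]}
id_table = [[random.gauss(0, 0.02) for _ in range(EMB_DIM)]
            for _ in range(NUM_BUCKETS)]
W_proj = [[random.gauss(0, 0.02) for _ in range(EMB_DIM)]
          for _ in range(EMB_DIM)]  # stand-in for the MLP projection

def hash_bucket(token: str) -> int:
    """Deterministically map an arbitrary product-ID token into the hash table."""
    return int(hashlib.md5(token.encode()).hexdigest(), 16) % NUM_BUCKETS

def linear(vec, weight):
    """Single linear layer used here as a stand-in for the MLP projection."""
    return [sum(v * w for v, w in zip(vec, row)) for row in weight]

def encode_token(token: str):
    """Route a token through the text path or the ID path."""
    if token.startswith("[") and token.endswith("]") and token[1:-1].isdigit():
        # ID path: hash table lookup + projection into the LLM's dimension
        return linear(id_table[hash_bucket(token)], W_proj)
    # Text path: the LLM's own embedding table
    return text_table.get(token, [0.0] * EMB_DIM)

sequence = [encode_token(t) for t in ["user", "clicked", "[7502]", "recommend"]]
```

Both paths land in the same dimension, so the Transformer can consume the interleaved sequence directly.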

2.1 Dataset Construction & Training: General Multitask Modeling

Inspired by GPT training, tasks are defined in natural language and represented as sequences. Various text templates are designed, and product IDs are treated as special tokens inserted into the sequence. A typical prompt therefore mixes ordinary text tokens with product‑ID tokens such as [7502].
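The article does not publish the actual templates, so the wording below is purely illustrative; a template builder in this style might look like:

```python
def build_prompt(task: str, behavior_ids, constraint: str = "") -> str:
    """Assemble a prompt mixing text tokens and product-ID special tokens.

    `behavior_ids` are product IDs rendered as special tokens like [7502];
    the template wording here is a hypothetical example, not the production one.
    """
    behaviors = " ".join(f"[{pid}]" for pid in behavior_ids)
    prompt = f"Task: {task}. The user recently interacted with {behaviors}."
    if constraint:
        prompt += f" Constraint: {constraint}."
    return prompt + " Recommend items:"

# A constrained (search-like) variant reuses the same template with a constraint.
p = build_prompt("recommendation", [7502, 8831], constraint="category=shoes")
```

Because the task description is just text, swapping templates changes the recall behavior without retraining a separate model per task.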

Training objectives include direct generation of product IDs for efficiency, as well as a text‑generation loss that aligns semantic spaces and incorporates external textual knowledge. Different recommendation tasks are expressed with distinct prompts. For example, search is treated as a constrained recommendation task, with positive samples adjusted according to the prompt's constraints.

The formal training loss combines a Noise‑Contrastive Estimation (NCE) loss for item recommendation with a negative log‑likelihood loss for text generation.
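A plausible form of the combined objective, reconstructed from the description above (the exact formulation and weighting are in the paper; the balancing weight $\lambda$ is an assumption here):

$$
\mathcal{L} \;=\; \underbrace{-\log \frac{\exp\!\big(s(u, i^{+})\big)}{\exp\!\big(s(u, i^{+})\big) + \sum_{j=1}^{K} \exp\!\big(s(u, i_{j}^{-})\big)}}_{\mathcal{L}_{\mathrm{NCE}}\;\text{(item recommendation)}} \;+\; \lambda \underbrace{\left(-\sum_{t} \log p_{\theta}\big(w_{t} \mid w_{<t}\big)\right)}_{\mathcal{L}_{\mathrm{text}}\;\text{(text generation)}}
$$

where $s(u, i)$ is the score between the user representation and item embedding, $i^{+}$ is the positive item, and $i_{1}^{-}, \dots, i_{K}^{-}$ are sampled negatives.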

2.2 Item Representation: Multimodal Fusion

To exploit both ID‑based product information and textual world knowledge, a simple yet efficient multimodal fusion module is designed. LLM‑derived item embeddings and ID embeddings are first projected to a shared space via MLPs, summed, normalized with RMSNorm, and then passed through another MLP to the user‑embedding space.
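The fusion pipeline reads directly as code. This is a minimal sketch under stated assumptions: each MLP is reduced to a single linear layer, the RMSNorm gain is omitted, and all dimensions and weights are toy values.

```python
import math
import random

random.seed(1)
DIM = 8  # toy dimension shared by both modalities in this sketch

def rms_norm(x, eps=1e-6):
    """Root-mean-square normalization, as in RMSNorm (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def linear(x, weight):
    """Single linear layer standing in for each MLP in the fusion module."""
    return [sum(v * w for v, w in zip(x, row)) for row in weight]

def rand_w():
    return [[random.gauss(0, 0.3) for _ in range(DIM)] for _ in range(DIM)]

W_text, W_id, W_out = rand_w(), rand_w(), rand_w()

def fuse(text_emb, id_emb):
    """Project both modalities to a shared space, sum, RMSNorm, project out."""
    shared = [a + b for a, b in zip(linear(text_emb, W_text),
                                    linear(id_emb, W_id))]
    return linear(rms_norm(shared), W_out)

item_vec = fuse([0.1] * DIM, [0.2] * DIM)  # fused item representation
```

The sum-then-normalize design keeps the module cheap: no cross-attention is needed, just two projections and an elementwise add.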

2.3 Efficient Item Generation: Sequence‑In‑Set‑Out

To retain rich user‑item modeling while reducing inference cost, URM generates multiple user‑representation tokens ([UM] tokens) in a single forward pass. Each token captures a different aspect of user interest, and their dot products with item embeddings are aggregated to produce final scores; among the aggregations tested, max pooling works best.
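The per-item scoring step is simple enough to sketch directly. The token values below are toy assumptions; only the dot-product-then-max-pool shape follows the article.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def score_item(um_tokens, item_emb):
    """Score one item against several [UM] interest tokens, then max-pool.

    Each [UM] token captures one facet of user interest; the item's final
    score is its best match across facets (max pooling, per the article).
    """
    return max(dot(tok, item_emb) for tok in um_tokens)

# Two toy interest facets, e.g. one "electronics-like", one "fashion-like".
um_tokens = [[1.0, 0.0], [0.0, 1.0]]
s = score_item(um_tokens, [0.9, 0.1])  # matches the first facet best
```

Because all [UM] tokens come from one forward pass, the LLM cost is amortized across facets; only the cheap dot products scale with the candidate pool.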

Recall is framed as selecting the highest‑value items from a millions‑scale candidate pool, analogous to next‑token generation in language models. To make this tractable, URM combines LLM inference with hierarchical HNSW indexing, computing probabilities over a sub‑matrix of item embeddings and iteratively refining the candidate set.
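The iterative refinement idea can be illustrated with a deliberately simplified graph walk. This is not the production HNSW integration: the neighbor graph, beam size, and round count below are toy stand-ins for the hierarchical index layers.

```python
def iterative_retrieve(score_fn, neighbors, seeds, rounds=3, beam=2):
    """Coarse-to-fine retrieval sketch: score a small frontier, keep the
    best candidates, and expand through a neighbor graph (a stand-in for
    HNSW layers) instead of scoring the full million-item pool."""
    frontier = set(seeds)
    for _ in range(rounds):
        best = sorted(frontier, key=score_fn, reverse=True)[:beam]
        frontier = set(best)
        for item in best:
            frontier.update(neighbors.get(item, []))
    return sorted(frontier, key=score_fn, reverse=True)[:beam]

# Toy chain graph where higher item IDs happen to score higher, so each
# round's expansion discovers strictly better candidates.
graph = {i: [i + 1] for i in range(10)}
top = iterative_retrieve(lambda x: x, graph, seeds=[0])
```

Each round only scores the frontier (a small sub-matrix of item embeddings), which is what keeps full-pool generation tractable.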

2.4 Offline Experiments

URM, trained with multitask learning, achieves an average 11.0% Recall improvement on production data and outperforms traditional Target‑Attention models on six sub‑tasks (out of nine tasks in total).

Ablation studies confirm the effectiveness of the fusion module and the positive impact of increasing the number of UM tokens on Recall.

URM also demonstrates strong generalization to unseen prompts and tasks, maintaining reasonable performance when transferred from recommendation to search scenarios.

3. Deployment Under High QPS & Low Latency Constraints

Because LLM inference latency is high, an asynchronous inference pipeline was built: user behavior triggers URM inference, results are persisted, and online recall reads the stored outputs.
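A toy version of this decoupling, using an in-process queue and dict where production would use a message bus and a persistent store (all names here are illustrative):

```python
import queue
import threading

result_store = {}  # stand-in for the persistent store the serving side reads

def urm_infer(user_id):
    """Placeholder for the expensive URM/LLM forward pass."""
    return [f"item_{user_id}_{k}" for k in range(3)]

def inference_worker(q):
    """Consumes behavior-triggered requests and persists recall results."""
    while True:
        user_id = q.get()
        if user_id is None:  # shutdown sentinel
            break
        result_store[user_id] = urm_infer(user_id)  # persist asynchronously
        q.task_done()

def online_recall(user_id, fallback=()):
    """Serving path: never blocks on the LLM, only reads persisted output."""
    return result_store.get(user_id, list(fallback))

q = queue.Queue()
threading.Thread(target=inference_worker, args=(q,), daemon=True).start()
q.put("u42")   # a user-behavior event triggers URM inference
q.join()       # wait for persistence (demo only; online serving never waits)
candidates = online_recall("u42")
q.put(None)    # stop the worker
```

The key property is that request latency is bounded by a key-value read, not by LLM inference; freshness is traded for latency.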

The service stack includes feature processing, URM request handling, and result persistence. Prompt templates are treated as complex feature operators, enabling seamless migration to online inference.

Inference runs on Alimama's HighService framework and vLLM, using FlashAttention and multi‑instance deployment to achieve up to a 200% QPS improvement, with overall latency comparable to generating a single token.

4. Conclusion

The URM model integrates large‑model world knowledge with e‑commerce domain expertise, delivering more accurate user interest predictions and higher ad efficiency, thereby benefiting consumers, merchants, and the platform alike. Further details are available in the accompanying arXiv paper: https://arxiv.org/pdf/2502.03041.

Tags: advertising, recommendation, prompt engineering, recall, large language model, multimodal, high QPS
Written by Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.