
How Alibaba’s Large User Model (LUM) Boosted CTR by 4.5% and Scaled to Billions of Parameters

This article analyzes the evolution from traditional modular recommendation models to a generative Large User Model (LUM), detailing its three‑stage paradigm, tokenization, training objectives, scaling‑law findings, offline and online experiments, and the AI‑infrastructure innovations that enabled a 4.5% CTR lift in production.

Alimama Tech

Background and Paradigm Shift

The rapid progress of large language models (LLMs) is driven by three aligned advances: (1) the Transformer self‑attention architecture (2017), which enables high parallelism; (2) scaling laws (2020), which show power‑law growth of performance with model size, data, and compute; and (3) engineering optimizations such as FlashAttention and DeepSpeed, which push model FLOPs utilization (MFU) above 30% on modern GPUs. Together these produce three core LLM traits (unified token representation, generative autoregressive objectives, and GEMM‑centric compute) that scale predictably with hardware.

Structural Dilemma of Traditional Recommendation Models

Conventional search‑advertising estimators evolved from FM/FFM feature crosses to DIN/DIEN attention and DeepFM/DCN high‑order interactions. All of them rely on manually engineered modular stacks, resulting in:

Very low arithmetic intensity: most cycles are spent on sparse embedding lookups, while the MLP and feature‑cross computations are comparatively negligible, leaving MFU in the single‑digit percentage range.

Highly fragmented kernels: thousands of tiny CUDA kernels cause frequent launches and non‑contiguous memory accesses.

Lack of predictable scaling: increasing model capacity does not yield stable gains, breaking the compute‑to‑performance relationship.

Consequently, modern Tensor‑Core‑centric GPUs are under‑utilized.
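To make the contrast concrete, here is a rough back‑of‑the‑envelope comparison of arithmetic intensity (FLOPs per byte moved) for a dense GEMM versus a sparse embedding gather. All sizes are illustrative assumptions, not figures from the article:

```python
# Back-of-the-envelope arithmetic intensity; sizes are illustrative.

def gemm_intensity(m: int, n: int, k: int, bytes_per_elem: int = 2) -> float:
    """Dense matmul C[m,n] = A[m,k] @ B[k,n]: FLOPs grow cubically,
    memory traffic only quadratically, so intensity is high."""
    flops = 2 * m * n * k
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved

def embedding_lookup_intensity(num_ids: int, dim: int, bytes_per_elem: int = 2) -> float:
    """Gathering num_ids rows of width dim: almost pure memory traffic,
    with at most a sum/pool over the gathered rows."""
    flops = num_ids * dim
    bytes_moved = bytes_per_elem * num_ids * dim
    return flops / bytes_moved

print(f"GEMM 4096^3:       {gemm_intensity(4096, 4096, 4096):8.1f} FLOPs/byte")
print(f"Embedding lookups: {embedding_lookup_intensity(10_000, 64):8.1f} FLOPs/byte")
```

The GEMM lands in the thousands of FLOPs per byte while the embedding gather sits below one, which is why Tensor Cores idle on lookup-dominated models.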

From Discriminative to Generative Modeling

To overcome these limits, researchers have reformulated the estimation task as generative next‑item prediction with a Transformer (E2E‑GR). Generative models can capture joint data distributions and benefit from parameter scaling, but they face:

Training‑inference inconsistency.

Latency bottlenecks that hinder real‑time serving.

Limited flexibility for new behavior types.

Incompatibility with existing feature pipelines.

Large User Model (LUM) – A Unified Three‑Stage Framework

LUM is a universal user base model for industrial recommendation. Its pipeline consists of three stages: Pre‑training → Triggering → Application:

Stage 1 – Pre‑training: User histories are tokenized into (condition token, item token) pairs. Condition tokens encode context (scene, query) while item tokens encode product attributes. The architecture contains a Token Encoder that densely embeds IDs, statistics, and content, and a User Encoder, an autoregressive Transformer over the token sequence. Training uses an in‑batch InfoNCE loss with cosine similarity, approximating a ~22k negative set.

Stage 2 – Triggering: Analogous to prompt engineering, different condition tokens are set to elicit task‑specific knowledge from LUM, enabling flexible preference extraction.

Stage 3 – Application: Generated token representations are either fed directly as features to downstream CTR models or used for similarity matching between target items and generated sequences, allowing seamless integration with existing pipelines.

Scaling‑law analysis shows power‑law relationships between performance and both model size (19M → 7B parameters) and sequence length (256 → 8192).
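As an illustration of how such a relationship is typically verified, the sketch below fits a power law L(N) = a · N^(−b) on log-log axes. The (size, loss) points are invented placeholders, not the paper's measurements:

```python
import numpy as np

# Hypothetical (model_size, eval_loss) points -- placeholders only --
# used to show how a power law is fit as a line in log-log space.
n = np.array([19e6, 120e6, 750e6, 2.5e9, 7e9])
loss = np.array([3.10, 2.74, 2.46, 2.28, 2.17])

b, log_a = np.polyfit(np.log(n), np.log(loss), 1)  # slope, intercept
print(f"L(N) ~= {np.exp(log_a):.2f} * N^{b:.3f}")  # negative exponent => power-law gain
```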

Pre‑training Technical Details

Tokenization: Each user action is represented as a pair ⟨condition token, item token⟩. Condition tokens carry contextual signals such as query or scene; item tokens contain product identifiers (category ID, brand ID, etc.).
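A minimal sketch of this pairing as data structures; the field names are illustrative assumptions, not the production schema:

```python
from dataclasses import dataclass, field

@dataclass
class ConditionToken:
    """Contextual signal for one action; fields are illustrative."""
    scene_id: int                                   # e.g. search vs. feed
    query_terms: list[str] = field(default_factory=list)

@dataclass
class ItemToken:
    """Product attributes for one interacted item; fields are illustrative."""
    item_id: int
    category_id: int
    brand_id: int

def interleave(history: list[tuple[ConditionToken, ItemToken]]) -> list:
    """Flatten (c, i) pairs into the c1, i1, c2, i2, ... sequence
    consumed by the autoregressive User Encoder."""
    seq: list = []
    for c, i in history:
        seq += [c, i]
    return seq
```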

Architecture:

Token Encoder: Concatenates raw IDs, statistical features, and content embeddings, then applies a linear projection to obtain a dense token vector.

User Encoder: A standard autoregressive Transformer processes the interleaved sequence c₁, i₁, c₂, i₂, … to model user preferences and collaborative signals. A minimal PyTorch sketch of both components follows.
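In the sketch below, all dimensions, vocabulary sizes, and layer counts are illustrative assumptions, not the production configuration:

```python
import torch
import torch.nn as nn

class TokenEncoder(nn.Module):
    """Concatenate ID embeddings, statistical features, and a content
    embedding, then project to the model width d_model."""
    def __init__(self, n_ids=100_000, id_dim=64, stat_dim=16, content_dim=128, d_model=512):
        super().__init__()
        self.id_emb = nn.Embedding(n_ids, id_dim)
        self.proj = nn.Linear(id_dim + stat_dim + content_dim, d_model)

    def forward(self, ids, stats, content):
        # ids: (B, T) int64; stats: (B, T, stat_dim); content: (B, T, content_dim)
        x = torch.cat([self.id_emb(ids), stats, content], dim=-1)
        return self.proj(x)                         # (B, T, d_model)

class UserEncoder(nn.Module):
    """Causal Transformer over the interleaved c1, i1, c2, i2, ... sequence."""
    def __init__(self, d_model=512, n_heads=8, n_layers=6):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, tokens):                      # tokens: (B, T, d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(tokens.size(1)).to(tokens.device)
        return self.encoder(tokens, mask=mask)      # causal mask = autoregressive
```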

Training Objective: Because the item catalog contains billions of items, LUM adopts an in‑batch contrastive loss (InfoNCE). For a batch of size B, each positive pair (c, i) is contrasted against all other items in the batch (≈22k negatives). The loss is:

loss = -log( exp(sim(z_c, z_i) / τ) / Σ_{j∈batch} exp(sim(z_c, z_j) / τ) )

where sim is cosine similarity and τ is a temperature hyper‑parameter.
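In code, the in‑batch variant reduces to a cross‑entropy over a B×B cosine‑similarity matrix. A minimal sketch follows; τ = 0.07 is a common default, not necessarily the paper's setting:

```python
import torch
import torch.nn.functional as F

def in_batch_infonce(z_c: torch.Tensor, z_i: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Row k of the (B, B) similarity matrix treats z_i[k] as the positive
    and the other B-1 items in the batch as negatives."""
    z_c = F.normalize(z_c, dim=-1)      # unit vectors, so the dot product
    z_i = F.normalize(z_i, dim=-1)      # below equals cosine similarity
    logits = z_c @ z_i.T / tau          # (B, B) scaled similarities
    labels = torch.arange(z_c.size(0), device=z_c.device)
    return F.cross_entropy(logits, labels)

# With a batch of ~22k pairs, each positive sees ~22k in-batch negatives.
loss = in_batch_infonce(torch.randn(128, 64), torch.randn(128, 64))
```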

Triggering (Prompt Engineering)

By setting different condition tokens (e.g., the current query in a search scenario), LUM can generate user preferences conditioned on that context. This mechanism enables a single pre‑trained model to serve multiple downstream tasks without retraining.
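As an illustration only, triggering could look like the following; `lum.generate` is a hypothetical interface assumed for the sketch, not a published API:

```python
def extract_preferences(lum, history, condition, max_items: int = 10):
    """Prompt the frozen, pre-trained model with a chosen condition token
    (e.g. the live search query, or a feed scene with no query) and let the
    autoregressive User Encoder roll out preferred items.
    `lum.generate` is an assumed method, not the published API."""
    return lum.generate(history, condition=condition, max_items=max_items)
```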

CTR Application

LUM outputs are incorporated into CTR models in two ways:

Direct Feature Incorporation: The generated token sequence is appended to the downstream CTR model's input as additional dense features.

Interest Matching: The similarity between a target item embedding and the generated item embeddings is computed to produce a preference score; a minimal scoring sketch follows this list.
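The sketch below implements the interest‑matching path; max‑pooling over cosine similarities is our assumption, since the article does not specify the pooling:

```python
import torch
import torch.nn.functional as F

def interest_match_score(target_emb: torch.Tensor, generated_embs: torch.Tensor) -> float:
    """Score one target item against the K item embeddings generated by LUM."""
    target = F.normalize(target_emb, dim=-1)      # (d,)
    gen = F.normalize(generated_embs, dim=-1)     # (K, d)
    return (gen @ target).max().item()            # best match across generated items

# Usage with random placeholder embeddings:
score = interest_match_score(torch.randn(64), torch.randn(10, 64))
```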

Experimental Results

Offline benchmarks on public and internal datasets show that LUM consistently outperforms traditional DLRM and E2E‑GR baselines on both retrieval (recall) and estimation (AUC) tasks. In a full‑scale online deployment within Alimama's core search‑advertising pipeline, LUM yields a 4.5% CTR increase and a 2% reduction in consumption.

AI Infrastructure for Production

Two systems enable LUM’s industrial rollout:

Blaze‑O1 Inference Engine: Heterogeneous operator orchestration (Torch sub‑graphs, custom CUDA kernels, TensorRT‑LLM), multi‑stream DAG scheduling, GPU‑accelerated HNSW vector retrieval, and a large KV cache with user‑level consistent hashing to meet low‑latency, high‑throughput requirements (a toy routing sketch follows this list).

RecIS Training Framework: A unified sparse‑dense training stack built on PyTorch that combines optimized embedding memory access with dense Transformer optimizations (FlashAttention, FSDP), delivering better training performance than legacy TensorFlow pipelines (see the training sketch below).
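To illustrate the user‑level routing idea in Blaze‑O1: the toy ring below always maps a given user to the same node, so that user's KV cache stays warm. The virtual‑node count and hash function are assumptions, not Blaze‑O1 internals:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring pinning each user to one cache node."""
    def __init__(self, nodes, vnodes: int = 64):
        # Each physical node owns `vnodes` points on the ring for balance.
        self._ring = sorted(
            (self._hash(f"{node}#{v}"), node)
            for node in nodes for v in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, user_id: str) -> str:
        idx = bisect.bisect(self._keys, self._hash(user_id)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["gpu-0", "gpu-1", "gpu-2"])
print(ring.node_for("user_42"))   # same user always routes to the same cache
```

And a minimal sketch of the unified sparse‑dense idea behind RecIS, using plain PyTorch primitives rather than RecIS's actual kernels: sparse embedding gradients get a sparse optimizer while dense Transformer weights get a dense one, in a single backward pass.

```python
import torch
import torch.nn as nn

emb = nn.EmbeddingBag(1_000_000, 64, sparse=True)            # sparse side
dense = nn.TransformerEncoderLayer(64, 4, batch_first=True)  # dense side

sparse_opt = torch.optim.SparseAdam(emb.parameters(), lr=1e-3)
dense_opt = torch.optim.AdamW(dense.parameters(), lr=1e-4)

ids = torch.randint(0, 1_000_000, (32, 20))   # (batch, ids per user)
out = dense(emb(ids).unsqueeze(1))            # pooled embeddings -> Transformer
loss = out.pow(2).mean()                      # placeholder objective
loss.backward()                               # sparse + dense grads in one pass
sparse_opt.step(); dense_opt.step()
```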

References

[1] Unlocking Scaling Law in Industrial Recommendation Systems with a Three‑step Paradigm based Large User Model, WSDM 26. URL: https://arxiv.org/abs/2502.08309

[2] UQABench: Evaluating User Embedding for Prompting LLMs in Personalized Question Answering, KDD 25.

[3] RecIS: Sparse to Dense, A Unified Training Framework for Recommendation Models.

[4] From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction.

[Figure: LUM architecture diagram]
[Figure: Offline experiment results]
[Figure: Scaling law plot]
Tags: CTR prediction, large language models, recommendation systems, generative modeling, scaling laws
Written by Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.