How Meta’s Generative Recommendation (GR) Is Redefining Feature Engineering

Meta’s new Generative Recommendation (GR) paper replaces a decade-old hierarchical feature paradigm with an ultra-long-sequence transformer that directly fuses user profiles, behaviors, and targets. The result is stronger feature crossing, richer information utilization, and massive compute gains, along with scaling-law effects in recommendation systems.

NewBeeNLP

Core Idea

The paper (arXiv:2402.17152, “Actions Speak Louder than Words”) proposes Generative Recommendation (GR): a transformer-based encoder that treats user-profile tokens, raw interaction sequences, and target information as a single ultra-long sequence. Multiple transformer layers (up to 24 in the reported configurations) jointly model these signals, removing the need for handcrafted cross-features.
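
To make the fusion concrete, here is a minimal sketch of building the single input sequence; the token IDs and grouping are hypothetical illustrations, not the paper’s actual encoding.

```python
# Hypothetical sketch: GR feeds one flat ultra-long sequence to the
# transformer instead of compressing each signal group separately.

def build_gr_sequence(profile_tokens, behavior_tokens, target_tokens):
    """Concatenate profile, behavior, and target tokens in order."""
    return list(profile_tokens) + list(behavior_tokens) + list(target_tokens)

seq = build_gr_sequence(
    profile_tokens=[101, 102],           # e.g. age bucket, region
    behavior_tokens=[5001, 5002, 5003],  # raw interaction history
    target_tokens=[9001],                # candidate item to score
)
```

Because all three groups share one attention context, the transformer can learn profile-by-target and behavior-by-target crosses directly, which is what replaces handcrafted cross-features.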

Advantages over Traditional Hierarchical Feature Paradigm

Stronger feature crossing: user-profile and target tokens are inserted directly into the raw behavior stream, avoiding the information loss of compressed user representations.

More complete information utilization: autoregressive next-item prediction provides richer gradient signals than standard sampled-item cross-entropy, similar to the auxiliary losses used in DIEN.

Richer behavior modeling: the architecture can ingest longer histories and additional signals such as exposures, not only clicks.

Cross‑Attention Sequence Construction

In a click‑through‑rate (CTR) scenario, exposure and click events are interleaved in a single sequence. Item‑level attributes (timestamp, category, etc.) are added as position‑like embeddings. An example token order is: item1, impression_no_click, item2, click, item3, click ... During training the model predicts the next click token while masking all other positions.
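
The interleaving above can be sketched as follows; the token encoding is a hypothetical illustration, not the paper’s actual format.

```python
# Hypothetical token kinds mirroring the example order in the text:
# item1, impression_no_click, item2, click, item3, click ...
ITEM, CLICK, IMPRESSION = "item", "click", "impression_no_click"

def interleave(events):
    """events: list of (item_id, clicked) pairs -> flat token list
    alternating each item token with the action that followed it."""
    tokens = []
    for item_id, clicked in events:
        tokens.append((ITEM, item_id))
        tokens.append((CLICK if clicked else IMPRESSION, item_id))
    return tokens

# item 1 was shown but not clicked; items 2 and 3 were clicked.
toks = interleave([(1, False), (2, True), (3, True)])
```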

Training Objective and Sampling

The autoregressive loss predicts whether the next action is a click. Because the recommendation vocabulary exceeds a billion tokens, full softmax is infeasible; the authors employ sampling techniques and additional algorithmic tricks (details omitted in the paper) to approximate the loss.
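
A hedged sketch of one standard approximation, sampled softmax over random negatives; the sampler and the logit values here are stand-ins, since the paper does not disclose its exact tricks.

```python
import math
import random

def sampled_softmax_loss(pos_logit, neg_logits):
    """Cross-entropy over the positive item vs. a few sampled negatives,
    approximating a full softmax over a billion-item vocabulary."""
    denom = math.exp(pos_logit) + sum(math.exp(x) for x in neg_logits)
    return -math.log(math.exp(pos_logit) / denom)

rng = random.Random(0)
negatives = [rng.uniform(-1.0, 1.0) for _ in range(16)]  # 16 sampled negatives
loss = sampled_softmax_loss(pos_logit=2.0, neg_logits=negatives)
```

Drawing more negatives tightens the approximation of the true softmax at proportionally higher compute cost.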

Scaling‑Law Observations

The authors observe a scaling‑law effect analogous to large language models: model quality improves with both parameter count and dataset size. Reported models use an embedding dimension of 512 and up to 24 transformer layers, reaching compute scales comparable to GPT‑3 175B or LLaMA‑2 70B.

Inference Efficiency

Since the entire candidate set can be encoded as a single sequence, inference processes hundreds of candidates in one forward pass, reducing compute relative to multi‑stage pipelines. The reduction in feature engineering also lowers preprocessing overhead.
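
A toy sketch of this idea, with a hypothetical `score_fn` standing in for the transformer forward pass: append all candidates to the shared user prefix and read off one score per candidate position.

```python
def score_candidates_single_pass(user_prefix, candidates, score_fn):
    """Build one sequence [user prefix + all candidates] and score each
    candidate at its own position, so the prefix is processed once."""
    sequence = list(user_prefix) + list(candidates)
    offset = len(user_prefix)
    return {cand: score_fn(sequence, offset + i)
            for i, cand in enumerate(candidates)}

# Dummy scorer: a position-indexed toy score instead of real logits.
scores = score_candidates_single_pass(
    user_prefix=[101, 5001, 5002],
    candidates=[9001, 9002, 9003],
    score_fn=lambda seq, pos: float(seq[pos] % 7),
)
```

The saving comes from amortizing the user-prefix computation across all candidates instead of rerunning the model once per candidate.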

Performance Results

Training compute increased by roughly 1,000×, matching the scale of leading LLMs.

Benchmarks on MovieLens and Amazon Reviews show 20.3%–65.8% improvement in NDCG@10 over strong baselines (e.g., SASRec).

Online A/B tests on a ranking page reported a 12.4% lift in the primary engagement metric (E‑Task); combined recall‑plus‑ranking gains reach ~18.6%.

The new encoder (HSTU) combined with algorithmic sparsity trains up to 15.2× faster than FlashAttention‑2.

The inference algorithm (M‑FALCON) achieves up to 700× acceleration (285× for complex models) with up to a 2.48× increase in queries per second.

Implementation Details

Embedding dimension: 512.

Transformer depth: up to 24 layers (e.g., 3 layers for ranking, 6 for recall, with deeper stacks used in the scaling experiments).

Sequence masking: mask m_i = 0 for action positions, enabling target‑aware cross‑attention in an autoregressive framework.
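
The m_i convention can be sketched as a simple per-position mask, using the token kinds named in the sequence-construction section; the string encoding is hypothetical.

```python
# Action tokens (click / impression_no_click) get m_i = 0; item tokens get 1.
ACTION_KINDS = {"click", "impression_no_click"}

def position_mask(token_kinds):
    """m_i = 0 at action positions, 1 at item positions."""
    return [0 if kind in ACTION_KINDS else 1 for kind in token_kinds]

m = position_mask(["item", "impression_no_click", "item", "click"])
```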

Sampling for large vocabularies is required during training; specific tricks are not disclosed in the paper.

References

Paper: https://arxiv.org/pdf/2402.17152.pdf

Discussion thread: https://www.zhihu.com/question/646766849/answer/3428951063

Tags: Recommendation Systems, generative models, scaling laws, Meta