How Meituan’s MTGR is Redefining Generative Recommendation at Scale
This article explains why Meituan introduced a generative recommendation model; describes the MTGR architecture, its data organization, and the training and inference engines built on TorchRec and TensorRT; reports the resulting performance gains and cost reductions; and outlines future directions such as simplifying the recommendation funnel and cross‑business heterogeneous modeling.
Background: Why Generative Recommendation?
Traditional recommendation systems have hit a performance ceiling: adding model depth, wider MLP layers, and more MoE experts no longer yields proportional gains. The scaling laws observed in large language models (e.g., LLaMA, DeepSeek) show that performance keeps improving with model size, data, and compute, which inspired a shift toward generative modeling in recommendation.
MTGR – Meituan Generative Ranking
MTGR (Meituan Generative Ranking) integrates generative‑modeling ideas into Meituan's delivery ranking pipeline. It treats user behavior (clicks and exposures) and user profiles as one unified token sequence and processes that long sequence with a simplified Transformer architecture based on the HSTU design.
Key innovations include:
Data organization: tokens are grouped into user_profile, lifelong_seq, rt_seq, and pv_items, all sharing a single feature space (first sketch after this list).
Model structure: Multi‑Query Attention and a large‑scale MoE (the Scaling User Module) replace shallow Target‑Attention compression, preserving richer user‑behavior representations (second sketch below).
Cross‑feature handling: instead of discarding cross features, MTGR uses scaling mechanisms to retain crucial signals such as merchant‑user distance.
Feature encoding: Group LayerNorm and bidirectional attention handle static features, while dynamic (causal) encoding of real‑time features prevents information leakage (third sketch below).
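The article does not publish MTGR's feature schema, but the four token groups suggest a straightforward organization. Below is a minimal PyTorch sketch, assuming a hypothetical vocabulary size, embedding width, and learned type embeddings, of how the four groups could be embedded into one shared space and concatenated into a single sequence for an HSTU‑style Transformer:

```python
# Minimal sketch of MTGR-style token organization. The group names come
# from the article; vocab size, d_model, and the type embeddings are
# assumptions for illustration, not the production schema.
import torch
import torch.nn as nn

D_MODEL = 256

class TokenOrganizer(nn.Module):
    def __init__(self, vocab_size: int = 100_000):
        super().__init__()
        # One embedding table: all token groups share a feature space.
        self.embed = nn.Embedding(vocab_size, D_MODEL)
        # Learned type embeddings distinguish the four groups.
        self.type_embed = nn.Embedding(4, D_MODEL)

    def forward(self, user_profile, lifelong_seq, rt_seq, pv_items):
        parts = []
        for type_id, ids in enumerate(
                [user_profile, lifelong_seq, rt_seq, pv_items]):
            tok = self.embed(ids) + self.type_embed(
                torch.full_like(ids, type_id))
            parts.append(tok)
        # One unified sequence: [profile | lifelong | rt | pv_items]
        return torch.cat(parts, dim=1)

org = TokenOrganizer()
seq = org(torch.randint(0, 100_000, (2, 4)),    # user_profile tokens
          torch.randint(0, 100_000, (2, 512)),  # lifelong behavior
          torch.randint(0, 100_000, (2, 64)),   # real-time behavior
          torch.randint(0, 100_000, (2, 30)))   # candidate (pv) items
print(seq.shape)  # torch.Size([2, 610, 256])
```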
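Multi‑Query Attention itself is a published technique (all query heads share a single key/value head, shrinking KV projections and memory traffic); how MTGR parameterizes it is not public. A self‑contained sketch with assumed dimensions:

```python
# Sketch of Multi-Query Attention: many query heads, one shared K/V head.
# d_model and n_heads are illustrative, not MTGR's actual settings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiQueryAttention(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.h, self.d = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * self.d)  # single shared K/V head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x, mask=None):
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.h, self.d).transpose(1, 2)  # B,h,T,d
        k, v = self.kv(x).split(self.d, dim=-1)                   # B,T,d each
        # Broadcast the single K/V head across all query heads.
        k = k.unsqueeze(1).expand(B, self.h, T, self.d)
        v = v.unsqueeze(1).expand(B, self.h, T, self.d)
        o = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
        return self.out(o.transpose(1, 2).reshape(B, T, -1))

mqa = MultiQueryAttention()
y = mqa(torch.randn(2, 10, 256))
print(y.shape)  # torch.Size([2, 10, 256])
```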
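The static/real‑time split implies a hybrid attention mask: bidirectional over static tokens, causal over the real‑time block so no token can peek at later behavior. This is an illustrative reconstruction, not MTGR's published masking scheme; the result can be passed as the boolean attn_mask of F.scaled_dot_product_attention (True = attend):

```python
# Hybrid mask sketch: static tokens are fully visible to everyone;
# real-time tokens are causally masked to prevent information leakage.
import torch

def hybrid_mask(n_static: int, n_rt: int) -> torch.Tensor:
    n = n_static + n_rt
    allowed = torch.zeros(n, n, dtype=torch.bool)
    allowed[:, :n_static] = True  # all tokens attend to static tokens
    # Causal within the real-time block: token i sees rt tokens j <= i.
    allowed[n_static:, n_static:] = torch.tril(
        torch.ones(n_rt, n_rt, dtype=torch.bool))
    return allowed

mask = hybrid_mask(n_static=4, n_rt=3)
print(mask.int())
```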
Challenges in Deploying Generative Recommendation
Existing infrastructure (TensorFlow 1.x) cannot efficiently support deep attention or large‑scale MoE. Training and inference costs rise sharply with model size, and simply removing cross features would demand roughly a hundred‑fold increase in compute to recover the lost performance.
MTGR‑Training Engine
Built on Meta's open‑source TorchRec, the engine adds three layers:
Bottom layer: a customized TorchRec core with dynamic hash tables for frequently updated sparse IDs (see the sketch after this list).
Middle layer: handles data loading, checkpointing, and consistency checks.
Top layer: provides flexible model interfaces for research.
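A dynamic hash table replaces a fixed vocabulary: raw IDs are admitted on first sight and mapped to embedding rows, so newly appearing IDs get fresh slots instead of hashing into collisions. Meituan's actual implementation lives in its customized TorchRec core and is not public; this toy sketch only illustrates the admission idea:

```python
# Toy dynamic hash embedding: grow an id -> row mapping on demand.
# Capacity and the simple wrap-around policy are assumptions; a real
# implementation would evict stale or low-frequency IDs to bound memory.
import torch
import torch.nn as nn

class DynamicHashEmbedding(nn.Module):
    def __init__(self, dim: int, capacity: int = 1_000_000):
        super().__init__()
        self.table = nn.Embedding(capacity, dim)
        self.slot_of = {}  # raw id -> row index, grown on first sight
        self.capacity = capacity

    def forward(self, raw_ids: torch.Tensor) -> torch.Tensor:
        rows = []
        for rid in raw_ids.tolist():
            if rid not in self.slot_of:
                # Admit a newly seen ID into the next free slot.
                self.slot_of[rid] = len(self.slot_of) % self.capacity
            rows.append(self.slot_of[rid])
        return self.table(torch.tensor(rows, device=raw_ids.device))

emb = DynamicHashEmbedding(dim=64)
vecs = emb(torch.tensor([12345678901, 42, 12345678901]))
print(vecs.shape)  # torch.Size([3, 64])
```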
Performance optimizations include dynamic hash tables, gradient accumulation, ID deduplication (45% throughput gain; sketched below), variable‑batch‑size load balancing (30% gain), Cutlass‑based HSTU kernels (2‑3× faster attention), and offloading GAUC computation to data‑loading threads (10% gain).
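Of these, ID deduplication is the easiest to illustrate: each unique ID in a batch is looked up once and the result scattered back, saving redundant embedding reads and gradient traffic. A minimal sketch (the 45% figure comes from the article, not from this toy):

```python
# ID deduplication for embedding lookup: fetch each unique ID once,
# then restore the original batch order via the inverse indices.
import torch
import torch.nn as nn

embed = nn.Embedding(10_000, 64)
ids = torch.tensor([7, 7, 42, 7, 42, 99])       # heavy duplication

unique_ids, inverse = torch.unique(ids, return_inverse=True)
unique_vecs = embed(unique_ids)                  # 3 lookups instead of 6
vecs = unique_vecs[inverse]                      # scatter back to batch order

assert torch.equal(vecs, embed(ids))             # same result, less work
```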
MTGR‑Inference Engine
The inference engine uses TensorRT behind the Triton Inference Server to deliver millisecond‑level latency. Optimizations cover reduced host‑to‑device (H2D) transfers, hash‑table pruning, FP16 computation, operator fusion, and graph‑level optimizations; a hedged build sketch follows.
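A sketch of the FP16 build step using the TensorRT 8.x Python API. "model.onnx" is a placeholder for an exported ranking graph; Meituan's actual export pipeline, fusion passes, and pruning logic are not public:

```python
# Build an FP16 TensorRT engine from an ONNX export (TensorRT 8.x API).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)

with open("model.onnx", "rb") as f:       # placeholder model path
    if not parser.parse(f.read()):
        raise RuntimeError(parser.get_error(0))

config = builder.create_builder_config()
config.set_flag(trt.BuilderFlag.FP16)     # enable half-precision kernels
engine_bytes = builder.build_serialized_network(network, config)

with open("model.plan", "wb") as f:
    f.write(engine_bytes)  # deployable via Triton's tensorrt backend
```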
Results
Scaling MTGR from small to large models consistently improves offline and online metrics. The large variant achieves a 65× increase in model complexity while delivering the best performance to date, and reduces inference cost by 44% compared with the previous DLRM‑based system. Retaining cross features yields far larger gains than pure model scaling.
Summary and Outlook
MTGR and its training/inference engines demonstrate that generative ranking can break the compute ceiling of traditional pipelines. Future work will explore simplifying the multi‑stage recommendation funnel and extending the token‑based design to heterogeneous, cross‑business scenarios.