How Meituan’s MTGR Framework Scaled Recommendation Models to 65× the FLOPs at 12% Lower Inference Cost

Meituan’s recommendation team introduced the MTGR framework, which aligns traditional DLRM features with a unified HSTU‑based Transformer to explore scaling laws, delivering a 65‑fold increase in per‑sample forward FLOPs, 12% lower inference cost, and significant online gains in CTR and order volume across its food‑delivery platform.

Meituan Technology Team

Introduction

Scaling laws describe how model performance (loss or task metrics) varies with model size and training compute. While extensively studied for large language models, their impact on recommendation systems is still emerging. Meituan’s team introduced MTGR (Meituan Generative Recommendation), built on the HSTU architecture, to investigate scaling effects in a real‑world food‑delivery recommendation scenario.
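As a point of reference, scaling laws are commonly written as a power law in training compute; a generic form (not a curve fitted by Meituan) is:

```latex
% Generic compute scaling law: loss decays as a power of training compute C
L(C) = L_{\infty} + a \cdot C^{-b}
```

where L_∞ is the irreducible loss and a and b are fitted constants. The question MTGR probes is whether recommendation metrics follow a similar trend as per‑sample FLOPs grow.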

Industrial Generative Recommendation Approaches

Generative architecture: Meta’s GR and Kuaishou’s OneRec adopt LLM‑style Transformers with FlashAttention. They discard cross‑features, which degrades performance in low‑CTR, high‑repeat‑purchase domains such as food delivery.

Stacked architecture: Alibaba’s LUM and ByteDance’s HLLM augment traditional pipelines with generative stages but require multi‑stage serial optimization, increasing iteration cost.

Hybrid architecture: Retains DLRM cross‑features while leveraging the modeling capacity of Transformers. MTGR follows this hybrid path.

Evolution of the DLRM Paradigm

Scaling the Cross Module (2018‑2022): Feature concatenation and complex non‑linear mappings (PLE, MoE, DCN, PEPNet) were applied, but performance gains plateaued as model size grew.

Scaling the User Module (2023): User profile and behavior sequences are encoded with multi‑query attention and a 16‑expert MoE, increasing FLOPs by 182% and delivering a 0.60% lift in order volume while keeping inference cost modest.
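For orientation, a minimal sketch of the multi‑query attention used for sequence encoding (all dimensions are illustrative, and the 16‑expert MoE stage is omitted; this is not Meituan’s code):

```python
import torch
import torch.nn as nn

class MultiQueryAttention(nn.Module):
    """Multi-query attention: per-head queries share a single K/V head,
    shrinking KV compute and memory versus full multi-head attention."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q = nn.Linear(d_model, d_model)
        self.kv = nn.Linear(d_model, 2 * self.d_head)  # one shared K/V head
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (B, T, d_model)
        B, T, _ = x.shape
        q = self.q(x).view(B, T, self.n_heads, self.d_head).transpose(1, 2)
        k, v = self.kv(x).chunk(2, dim=-1)               # (B, T, d_head) each
        scores = q @ k.transpose(-1, -2).unsqueeze(1)    # (B, H, T, T)
        att = torch.softmax(scores / self.d_head ** 0.5, dim=-1)
        return self.out((att @ v.unsqueeze(1)).transpose(1, 2).reshape(B, T, -1))
```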

MTGR Framework and Practice

Model Architecture

MTGR preserves all DLRM features, compresses training samples by grouping them per user, and eliminates padding through sparse storage and JaggedTensor. Three model sizes (small, middle, large) were built to verify scaling‑law effects.
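A minimal illustration of the padding‑free layout, using plain tensors in the spirit of TorchRec’s JaggedTensor (the data here is made up):

```python
import torch

# Padding-free ("jagged") storage: all users' tokens are packed into one
# flat values tensor, with per-user offsets marking sequence boundaries.
seq_lens = torch.tensor([3, 1, 5])                    # tokens per user
offsets = torch.cat([torch.zeros(1, dtype=torch.long),
                     seq_lens.cumsum(0)])             # [0, 3, 4, 9]
values = torch.arange(int(seq_lens.sum()))            # 9 packed tokens, zero padding

# Slice out user 1's sequence without materializing a padded batch.
user_1 = values[offsets[1]:offsets[2]]                # tensor([3])
```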

Data & Features: User Profile, Context, User Behavior Sequence, and Target Item are retained, preserving cross‑features to avoid information loss.

Tokenization: Each feature becomes a token; behavior tokens concatenate item ID, side‑info embeddings, and context before a non‑linear projection.
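A minimal sketch of this tokenization step, assuming PyTorch (the widths and the SiLU non‑linearity are our choices, not from the article):

```python
import torch
import torch.nn as nn

class BehaviorTokenizer(nn.Module):
    """Hypothetical sketch: embed the item ID, concatenate side-info and
    context embeddings, then apply a non-linear projection so each
    behavior event becomes one token."""
    def __init__(self, n_items, d_emb=32, d_side=16, d_ctx=16, d_model=128):
        super().__init__()
        self.item_emb = nn.Embedding(n_items, d_emb)
        self.proj = nn.Sequential(
            nn.Linear(d_emb + d_side + d_ctx, d_model),
            nn.SiLU(),
            nn.Linear(d_model, d_model),
        )

    def forward(self, item_ids, side_info, context):
        # item_ids: (B, T); side_info: (B, T, d_side); context: (B, T, d_ctx)
        x = torch.cat([self.item_emb(item_ids), side_info, context], dim=-1)
        return self.proj(x)  # (B, T, d_model): one token per behavior event
```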

Group LayerNorm: Separate LayerNorm parameters per token type improve training stability.
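A sketch of what per‑token‑type normalization could look like (the group count and ids are illustrative):

```python
import torch
import torch.nn as nn

class GroupLayerNorm(nn.Module):
    """Sketch: each token group (e.g. profile, behavior, target) gets its
    own LayerNorm scale/shift parameters instead of sharing one set."""
    def __init__(self, n_groups: int, d_model: int):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(d_model) for _ in range(n_groups))

    def forward(self, x, group_ids):
        # x: (B, T, d_model); group_ids: (T,) mapping each token to its type
        out = torch.empty_like(x)
        for g, norm in enumerate(self.norms):
            mask = group_ids == g
            if mask.any():
                out[:, mask] = norm(x[:, mask])
        return out
```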

Dynamic Mixed Mask: Tokens are divided into static (User Profile & Sequence), real‑time (Real‑Time Sequence), and target (Targets). Static tokens receive no mask, real‑time tokens use a causal mask with timestamp filtering, and target tokens use a time‑aware mask that respects causality while maximizing usable context.
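These rules might be implemented roughly as follows; the exact visibility between target tokens is our assumption, not stated in the article:

```python
import torch

def dynamic_mixed_mask(token_type, timestamps):
    """Sketch of the three-way mask (our reading of the rules above).
    token_type: (T,) ints, 0 = static, 1 = real-time, 2 = target.
    timestamps: (T,) event times (static tokens can carry 0).
    Returns a (T, T) bool mask where True = query row may attend to key col."""
    earlier = timestamps.unsqueeze(0) <= timestamps.unsqueeze(1)  # key <= query
    is_static = token_type == 0

    allow = torch.zeros(len(token_type), len(token_type), dtype=torch.bool)
    # Static tokens are unmasked: every token may attend to them.
    allow |= is_static.unsqueeze(0)
    # Real-time tokens attend causally, filtered by timestamp.
    allow |= (token_type.unsqueeze(1) == 1) & earlier
    # Target tokens see all earlier non-target tokens (we assume targets
    # do not attend to each other, preserving causality across candidates).
    allow |= (token_type.unsqueeze(1) == 2) & earlier & (token_type != 2).unsqueeze(0)
    return allow
```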

[Figure: MTGR model architecture diagram]

Training Engine

MTGR‑Training is built on Meta’s open‑source TorchRec and adds dynamic hash tables, gradient accumulation, and model‑parallelism support (FSDP/Megatron). Key optimizations:

Kernel optimization: A fused Cutlass‑based HSTU kernel reduces memory I/O and supports variable‑length sequences, achieving a 2‑3× speedup over the Triton implementation.
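For orientation, an unfused PyTorch reference of the pointwise, SiLU‑based attention that HSTU uses (our simplified reading; normalization details are glossed, and the real kernel operates on jagged variable‑length batches):

```python
import torch
import torch.nn.functional as F

def hstu_attention_reference(q, k, v, rab=None):
    """Unfused reference of HSTU-style pointwise attention.
    q, k, v: (B, H, T, d_head); rab: optional (B, H, T, T) relative
    attention bias. A fused kernel computes the same math in one pass,
    avoiding the materialized (T, T) intermediates."""
    scores = q @ k.transpose(-1, -2)        # (B, H, T, T)
    if rab is not None:
        scores = scores + rab
    weights = F.silu(scores) / q.shape[-2]  # pointwise gate, length-normalized
    return weights @ v                      # (B, H, T, d_head)
```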

Variable‑length batch balancing: Dynamic batch sizes per GPU equalize total token counts, preventing stragglers and improving throughput.
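A greedy token‑balancing sketch illustrates the idea; Meituan’s actual scheduler is not described in this detail:

```python
def balance_batches(seq_lens, n_gpus):
    """Assign each sample to the least-loaded GPU so every rank
    processes a similar token count per step (illustrative)."""
    buckets = [[] for _ in range(n_gpus)]
    loads = [0] * n_gpus
    # Longest-first assignment gives a tighter balance.
    for idx, length in sorted(enumerate(seq_lens), key=lambda p: -p[1]):
        g = loads.index(min(loads))
        buckets[g].append(idx)
        loads[g] += length
    return buckets, loads

buckets, loads = balance_batches([512, 64, 300, 280, 96, 410], n_gpus=2)
print(loads)  # roughly equal token counts: [856, 806]
```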

[Figure: MTGR‑Training system architecture]

Inference Engine

Inference uses NVIDIA TensorRT for kernel fusion and Triton Inference Server for deployment. Optimizations include:

Feature H2D optimization: A merge‑then‑split strategy reduces host‑to‑GPU transfer time from 7.5 ms to 12 µs, cutting latency by 37% and raising throughput by 38%.
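A sketch of the merge‑then‑split idea in PyTorch (assumed mechanics; the feature names and sizes are illustrative):

```python
import torch

# Pack all feature tensors into one pinned host buffer, issue a single
# H2D copy, then split back into per-feature views on the GPU.
feats = {"user": torch.randn(256, 64),
         "item": torch.randn(256, 128),
         "ctx":  torch.randn(256, 32)}

flat = torch.cat([t.reshape(-1) for t in feats.values()]).pin_memory()
device_flat = flat.to("cuda", non_blocking=True)   # one transfer, not len(feats)

sizes = [t.numel() for t in feats.values()]
views = {name: chunk.view(t.shape)
         for (name, t), chunk in zip(feats.items(), device_flat.split(sizes))}
```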

CUDA Graph: Improves throughput by 13% and reduces tail latency by up to 57%.
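In PyTorch terms, CUDA Graph capture and replay looks roughly like this (the model and shapes are placeholders; graphs require fixed shapes and static input/output buffers):

```python
import torch

# Capture the forward pass once, then replay it to amortize
# kernel-launch overhead across requests.
model = torch.nn.Linear(1024, 1024).cuda().eval()
static_in = torch.randn(64, 1024, device="cuda")

# Warm up on a side stream before capture, as PyTorch recommends.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        model(static_in)
torch.cuda.current_stream().wait_stream(s)

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out = model(static_in)

static_in.copy_(torch.randn(64, 1024, device="cuda"))  # new request data
g.replay()  # static_out now holds the result for the new input
```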

FP16 precision: Boosts throughput by 50% with negligible accuracy loss (difference < 0.006).
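A quick parity check of the kind implied here, comparing FP32 and FP16 outputs of the same module (toy model, not MTGR):

```python
import torch

# Run the same module in FP32 and FP16 and compare outputs, in the
# spirit of the "difference < 0.006" comparison above.
model = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.SiLU()).cuda().eval()
x = torch.randn(128, 256, device="cuda")

with torch.no_grad():
    ref = model(x)                        # FP32 reference
    half_out = model.half()(x.half())     # FP16 pass
print("max abs diff:", (ref - half_out.float()).abs().max().item())
```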

[Figure: MTGR‑Inference system architecture]

Scaling Effects

Three MTGR sizes (small, middle, large) were compared against the best online DLRM baseline (Scaling User Module). Despite training on only six months of data, MTGR achieved:

Offline CTCVR GAUC: +2.88 pp

Homepage order volume: +1.22%

PV‑CTR: +1.31%

Single‑sample forward FLOPs: 65× higher (55.76 GFLOPs), with 12% lower inference resource usage

[Figure: Scaling‑law results for different MTGR sizes]

Conclusion and Outlook

Scaling laws are now a cornerstone of deep learning, yet their application to recommendation systems is still nascent. MTGR demonstrates that a hybrid architecture preserving DLRM cross‑features, combined with Group LayerNorm and dynamic mixed masking, can unlock substantial performance gains while keeping training cost comparable to the baseline and reducing inference resources by 12%.

Future work includes:

Enhancing HSTU to better capture spatio‑temporal signals for location‑based services.

Extending MTGR to multi‑scenario, user‑centered generative recommendation with KV‑Cache for faster inference.

Integrating coarse‑ranking and fine‑ranking stages into a single MTGR pass for scenarios with limited supply.

Tags: Transformer, Inference Optimization, Recommendation Systems, Scaling Law, Large‑Scale Training, MTGR
Written by Meituan Technology Team

Over 10,000 engineers powering China’s leading lifestyle services e‑commerce platform, supporting hundreds of millions of consumers and millions of merchants across 2,000+ industries. This is the public channel for the tech teams behind Meituan, Dianping, Meituan Waimai, Meituan Select, and related services.