How Meta Scales User Modeling for Ads: Inside the SUM Framework
This article examines Meta's SUM (Scaling User Modeling) system, detailing its upstream‑downstream architecture, the SOAP online asynchronous serving platform, production optimizations, and extensive offline and online experiments that demonstrate significant gains in ad personalization performance.
Introduction
Personalized recommendation underpins modern online advertising, improving advertiser ROI while enhancing user experience. Traditional systems relied on handcrafted features and simple architectures, but deep‑learning‑based recommender systems now dominate, learning nuanced user representations at scale.
In practice, constraints such as training throughput, serving latency, and host memory prevent full exploitation of massive user data, especially at Meta's scale of billions of requests daily. The resulting challenges are sub‑optimal representations, feature redundancy, data scarcity for niche models, and custom per‑model architectures that do not scale.
SUM Overview
Meta proposes Scaling User Modeling (SUM), an online framework that reshapes user modeling for ad personalization. SUM combines advanced modeling techniques with real‑world constraints to enable effective, scalable representation sharing across hundreds of production ranking models.
Model Architecture
Preliminary
SUM adopts the DLRM architecture, consisting of a User Tower and a Mix Tower. The User Tower processes massive, heterogeneous user‑side features into compact embeddings, which the Mix Tower integrates with other features (e.g., ad attributes) for downstream prediction.
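A minimal sketch of this upstream/downstream split, with hypothetical layer sizes and module names (the production towers are far larger), might look like:

```python
import torch
import torch.nn as nn

class UserTower(nn.Module):
    """Upstream: compress heterogeneous user features into shared embeddings."""
    def __init__(self, user_feat_dim: int, emb_dim: int = 96, num_emb: int = 2):
        super().__init__()
        self.num_emb, self.emb_dim = num_emb, emb_dim
        self.encoder = nn.Sequential(
            nn.Linear(user_feat_dim, 512), nn.ReLU(),
            nn.Linear(512, emb_dim * num_emb),
        )

    def forward(self, user_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_emb, emb_dim): the SUM user embeddings shared downstream
        return self.encoder(user_feats).view(-1, self.num_emb, self.emb_dim)

class MixTower(nn.Module):
    """Downstream: combine SUM embeddings with ad-side features for prediction."""
    def __init__(self, emb_dim: int, num_emb: int, ad_feat_dim: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(emb_dim * num_emb + ad_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, user_emb: torch.Tensor, ad_feats: torch.Tensor) -> torch.Tensor:
        x = torch.cat([user_emb.flatten(1), ad_feats], dim=1)
        return torch.sigmoid(self.head(x))  # e.g., predicted click probability
```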
User Tower
The tower uses a hierarchical pyramid structure with residual connections to preserve information while compressing inputs. After initial embedding, the model produces many sparse embeddings and a smaller set of dense embeddings (dimension D), which are merged into a unified user embedding.
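One plausible rendering of the pyramid with residual connections (the dimensions below are illustrative, not the paper's):

```python
import torch
import torch.nn as nn

class PyramidBlock(nn.Module):
    """One pyramid level: compress the hidden size while keeping a skip path."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU())
        self.skip = nn.Linear(in_dim, out_dim)  # project residual to the new width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(x) + self.skip(x)

# Hierarchical compression of concatenated embeddings into a unified user embedding
dims = [4096, 1024, 256, 96]
pyramid = nn.Sequential(*(PyramidBlock(a, b) for a, b in zip(dims, dims[1:])))
unified = pyramid(torch.randn(8, 4096))  # (batch, 96)
```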
Attention‑Based Dimensionality Reduction
Point‑wise attention compresses the large set of feature embeddings into a smaller one, shrinking the dot‑product interaction matrix; residual connections preserve information so the model can learn expressive representations efficiently.
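A common way to realize this kind of compression is pooling by learnable queries; the sketch below is an assumption about the mechanism (including the residual placement), not Meta's exact layer:

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Compress N feature embeddings down to M << N via learnable queries."""
    def __init__(self, num_out: int, dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_out, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) -> attention scores: (batch, M, N)
        scores = torch.softmax(
            self.queries @ x.transpose(1, 2) / x.size(-1) ** 0.5, dim=-1)
        pooled = scores @ x          # (batch, M, dim)
        return pooled + self.queries # residual on the query side (one possible choice)

compress = AttentionPooling(num_out=16, dim=96)
out = compress(torch.randn(8, 600, 96))  # 600 embeddings compressed to 16
```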
Deep Cross Network (DCN)
DCN layers capture explicit and implicit feature interactions through learnable weights and biases, enabling high‑order cross features.
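Each cross layer follows the well‑known DCN recurrence x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, so stacking L layers yields interactions up to order L+1. A minimal sketch:

```python
import torch
import torch.nn as nn

class CrossLayer(nn.Module):
    """One DCN-v2-style cross layer: x_{l+1} = x0 * (W @ xl + b) + xl."""
    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, x0: torch.Tensor, xl: torch.Tensor) -> torch.Tensor:
        return x0 * self.linear(xl) + xl  # element-wise product forms explicit crosses

x0 = torch.randn(8, 96)
x = x0
for layer in (CrossLayer(96) for _ in range(3)):
    x = layer(x0, x)  # each pass raises the interaction order by one
```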
MLP‑Mixer
The MLP‑Mixer, originally designed for vision, provides an all‑MLP alternative for feature interaction: its channel‑mixing layers are equivalent to 1×1 convolutions, while its token‑mixing layers act like depth‑wise convolutions across features.
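A standard Mixer block, treating each feature embedding as a token (hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """Token-mixing MLP across features, then channel-mixing MLP across dims."""
    def __init__(self, num_tokens: int, dim: int, hidden: int = 256):
        super().__init__()
        self.token_mlp = nn.Sequential(
            nn.Linear(num_tokens, hidden), nn.GELU(), nn.Linear(hidden, num_tokens))
        self.chan_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); mix across tokens, then across channels
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        return x + self.chan_mlp(self.norm2(x))

out = MixerBlock(num_tokens=16, dim=96)(torch.randn(8, 16, 96))
```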
Mix Tower
The Mix Tower mirrors the DHEN architecture but receives only the SUM user embeddings (no raw user features), encouraging the upstream model to learn high‑quality representations. Multi‑task cross‑entropy loss balances several downstream tasks using task‑specific weights.
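The multi‑task objective can be written as a weighted sum of per‑task cross‑entropy losses; a sketch with hypothetical task names and weights:

```python
import torch
import torch.nn.functional as F

def multi_task_loss(logits: dict, labels: dict, weights: dict) -> torch.Tensor:
    """Weighted sum of per-task binary cross-entropy losses."""
    return sum(
        weights[t] * F.binary_cross_entropy_with_logits(logits[t], labels[t])
        for t in logits
    )

# Hypothetical tasks and weights; in production these are tuned per objective.
logits = {"click": torch.randn(8), "conversion": torch.randn(8)}
labels = {"click": torch.randint(0, 2, (8,)).float(),
          "conversion": torch.randint(0, 2, (8,)).float()}
loss = multi_task_loss(logits, labels, {"click": 1.0, "conversion": 0.5})
```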
Online Serving System: SOAP
Because user features change frequently, offline pipelines cause stale embeddings. SUM introduces the SUM Online Asynchronous Platform (SOAP), which generates fresh embeddings on each request within a 30 ms latency budget. SOAP separates write (embedding generation) from read (embedding retrieval), allowing asynchronous updates without blocking downstream inference.
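A toy sketch of the read/write split; the store, the stub, and all function names here are hypothetical stand-ins for SOAP's actual components:

```python
import asyncio

DEFAULT_EMBEDDING = [0.0] * 96
EMBEDDING_STORE: dict[int, list[float]] = {}  # stand-in for the real feature store

async def run_user_tower(user_features) -> list[float]:
    """Placeholder for the actual User Tower inference call."""
    return [0.1] * 96

async def read_path(user_id: int) -> list[float]:
    """Read: return the freshest stored embedding; never blocks on a write."""
    return EMBEDDING_STORE.get(user_id, DEFAULT_EMBEDDING)

async def handle_request(user_id: int, user_features) -> list[float]:
    emb = await read_path(user_id)  # downstream inference proceeds immediately

    async def refresh() -> None:
        # Write: recompute off the critical path; later requests read the result.
        EMBEDDING_STORE[user_id] = await run_user_tower(user_features)

    asyncio.create_task(refresh())  # asynchronous update, fire-and-forget
    return emb
```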
Productionization
Model Training
SUM models are trained offline in a cyclic fashion, preserving historical trends while adapting to dynamic user preferences. Snapshots are refreshed regularly and served online, ensuring up‑to‑date embeddings for downstream models.
Embedding Distribution Shift
Two strategies mitigate distribution shift: (1) align downstream models with the newest embeddings by adjusting training pipelines, and (2) apply average‑pooling of the two most recent embeddings to smooth changes, reducing performance degradation.
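The second strategy amounts to simple snapshot averaging; a one‑function sketch:

```python
import torch

def smooth_embedding(latest: torch.Tensor, previous: torch.Tensor) -> torch.Tensor:
    """Average the two most recent snapshots to damp day-over-day shift."""
    return 0.5 * (latest + previous)

served = smooth_embedding(torch.randn(2, 96), torch.randn(2, 96))
```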
Feature Storage Optimization
Each user model produces K=2 embeddings of dimension D=96. Quantizing these from fp32 to fp16 halves storage without harming downstream performance.
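The storage saving is mechanical; assuming the embeddings are stored as PyTorch tensors:

```python
import torch

emb_fp32 = torch.randn(2, 96)            # K=2 embeddings, D=96: 768 bytes
emb_fp16 = emb_fp32.to(torch.float16)    # 384 bytes: storage halved
assert emb_fp16.element_size() == 2      # 2 bytes per value instead of 4
```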
Distributed Inference
Distributed inference spreads the online inference load across multiple nodes, improving memory usage and latency for the User Tower.
Experiments
Dataset
All experiments use internal, industrial‑scale datasets (public datasets are unsuitable due to domain mismatch).
Evaluation Metric
Normalized Entropy (NE) measures offline prediction accuracy: the model's average log‑loss is normalized by the entropy of a baseline that always predicts the empirical average CTR. Lower NE indicates better performance.
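Concretely, for N impressions with labels $y_i \in \{0, 1\}$, predictions $p_i$, and empirical average CTR $\bar{p}$, the standard definition is:

$$
\mathrm{NE} = \frac{-\frac{1}{N}\sum_{i=1}^{N}\bigl[\,y_i \log p_i + (1 - y_i)\log(1 - p_i)\,\bigr]}{-\bigl[\,\bar{p}\log\bar{p} + (1 - \bar{p})\log(1 - \bar{p})\,\bigr]}
$$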
FB CTR SUM User Model
The Facebook CTR model processes ~60 billion daily events, with ~600 sparse and ~1,000 dense user features. The User Tower occupies ~160 GB and requires ~390 M FLOPs per inference.
Downstream Offline Results
SUM embeddings improve NE across multiple downstream tasks (ad click‑through, conversion, app install, offline conversion). Gains are larger on Facebook than on Instagram, highlighting domain differences. Adding the embeddings introduces no extra computational overhead for downstream models.
Online Performance
Deploying SUM across hundreds of Meta ad products yields a statistically significant 2.67 % lift in overall ad performance metrics, at the cost of a 15.3 % increase in serving capacity.
Async Serving
Four serving strategies were compared: frozen model, offline batch, online real‑time, and online asynchronous (SOAP). Moving from real‑time inference with a compact 20 M FLOP model to asynchronous serving sacrificed only about 10 % of the performance gain, a far smaller loss than moving from real‑time to offline batch serving.
Embedding Distribution Shift Study
Experiments measuring cosine similarity and L2 norm between consecutive daily embeddings show that average‑pooling across two snapshots markedly reduces shift and improves NE, confirming the effectiveness of the proposed mitigation technique.
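The shift metrics themselves are easy to reproduce; a sketch over same‑user embeddings from consecutive daily snapshots:

```python
import torch
import torch.nn.functional as F

def shift_metrics(day1: torch.Tensor, day2: torch.Tensor) -> tuple[float, float]:
    """Mean cosine similarity and L2 distance between same-user embeddings."""
    cos = F.cosine_similarity(day1, day2, dim=-1).mean().item()
    l2 = (day1 - day2).norm(dim=-1).mean().item()
    return cos, l2

cos_sim, l2_dist = shift_metrics(torch.randn(1000, 96), torch.randn(1000, 96))
```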