Unveiling Meta’s Wukong: How Scaling Laws Boost Large‑Scale Recommendation Performance

Meta’s new paper introduces the Wukong model and shows that expanding dense‑layer parameters and compute (FLOPs) in large‑scale recommendation systems follows a clear scaling law, yielding consistent gains on a massive internal dataset. This post covers the feature module, the effect of each scaling hyper‑parameter, and the experimental results.


Paper Information

Title: Wukong: Towards a Scaling Law for Large‑Scale Recommendation

Link: https://arxiv.org/abs/2403.02545

Authors: All authors are from Meta

Overall Summary

The authors investigate whether increasing the parameter count of the dense (non‑embedding) layers in a recommendation system yields continuous improvements in key metrics. Using an internal dataset with 146 billion examples and 720 features, they scale dense‑layer compute from 1 GFLOP/example to 100 GFLOP/example (comparable to GPT‑3) and dense‑layer parameters from 0.74 B to 17 B. Performance improves steadily across this roughly two‑order‑of‑magnitude range, yielding the reported scaling law: about a 0.1 % relative LogLoss gain for every four‑fold increase in compute.

The paper introduces a new feature‑cross architecture, Wukong, that achieves state‑of‑the‑art results on several offline benchmarks.

It also demonstrates a scaling law for recommendation systems: performance improves by roughly 0.1 % for each quadrupling of compute.

Wukong Architecture and Scaling Strategy

2.1 Feature Module

Each raw feature is split into smaller sub‑dimensions (e.g., a 32‑dimensional feature becomes four 8‑dimensional units). Important features receive more units, effectively increasing their dimensionality. Variable‑length features can be trained at full length and later pruned via feature dropout. During feature crossing, each unit is treated as an independent feature, simplifying high‑order interactions.
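As a concrete illustration, here is a minimal PyTorch sketch of the splitting step; the function name, shapes, and unit size are assumptions for illustration, not the paper’s code:

```python
import torch

def split_into_units(feature_emb: torch.Tensor, unit_dim: int = 8) -> torch.Tensor:
    """Split each feature embedding into fixed-size units so that downstream
    layers can treat every unit as an independent feature."""
    batch, num_features, feature_dim = feature_emb.shape
    assert feature_dim % unit_dim == 0, "feature_dim must be divisible by unit_dim"
    return feature_emb.reshape(batch, num_features * (feature_dim // unit_dim), unit_dim)

# A 32-dimensional feature becomes four 8-dimensional units:
emb = torch.randn(4, 10, 32)      # batch of 4, 10 features, 32 dims each
units = split_into_units(emb)     # shape: (4, 40, 8)
```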

2.2 Wukong Layer

A Wukong model follows a typical CTR pipeline: a Wukong Layer performs feature crossing, then an MLP produces the final prediction. Each Wukong Layer consists of:

Factorization Machine Block (FMB): a feature‑cross module, similar in spirit to DCNv2, that first compresses the set of feature embeddings and then computes high‑order interactions on the compressed set.

Linear Compress Block (LCB): a linear projection of the input embeddings using a weight matrix; a hyper‑parameter (n_L) determines how many embeddings the block outputs.

The outputs of FMB and LCB are concatenated, added to a residual connection, and passed through LayerNorm.
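Putting the pieces together, the following is a minimal PyTorch sketch of one Wukong layer. It makes simplifying assumptions: the paper’s exact interaction and compression math differs in detail, and n_F + n_L is assumed to equal the input embedding count n so the residual adds cleanly.

```python
import torch
import torch.nn as nn

class WukongLayer(nn.Module):
    """One Wukong layer: FMB and LCB in parallel, residual add, LayerNorm.
    Input: (batch, n, d) with n embeddings of dimension d."""

    def __init__(self, n: int, d: int, n_f: int, n_l: int, k: int, hidden: int = 512):
        super().__init__()
        assert n_f + n_l == n, "sketch assumes the output shape matches the input"
        self.compress = nn.Linear(n, k, bias=False)  # keep k embeddings before interaction
        self.mlp = nn.Sequential(                    # MLP inside the FMB
            nn.Linear(n * k, hidden), nn.ReLU(), nn.Linear(hidden, n_f * d),
        )
        self.lcb = nn.Linear(n, n_l, bias=False)     # Linear Compress Block
        self.norm = nn.LayerNorm(d)
        self.n_f, self.d = n_f, d

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b = x.shape[0]
        xc = self.compress(x.transpose(1, 2)).transpose(1, 2)  # (b, k, d) compressed set
        inter = torch.bmm(x, xc.transpose(1, 2))               # (b, n, k) pairwise dots
        fmb = self.mlp(inter.flatten(1)).view(b, self.n_f, self.d)
        lcb = self.lcb(x.transpose(1, 2)).transpose(1, 2)      # (b, n_l, d)
        return self.norm(torch.cat([fmb, lcb], dim=1) + x)     # concat, residual, LayerNorm
```

Stacking l such layers lets later layers form higher‑order interactions on top of those produced by earlier ones.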

[Figure: Wukong Layer diagram]

2.3 Scaling Parameters

Performance is scaled by adjusting the following hyper‑parameters:

l: number of interaction‑stack (Wukong) layers.

n_F: number of embeddings generated by the FMB.

n_L: number of embeddings generated by the LCB.

k: compression factor inside the FM (controls how many embeddings are kept after compression).

MLP: depth and hidden size of the MLP inside the FMB.

The authors first increase l (layer depth) and then tune the remaining parameters, as sketched below.
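A hypothetical scaling schedule in that spirit; the concrete values are purely illustrative and do not reproduce the paper’s configurations:

```python
# Grow depth (l) first, then widen the remaining knobs.
scaling_steps = [
    dict(l=2, n_f=16, n_l=16, k=8,  mlp_hidden=512),
    dict(l=4, n_f=16, n_l=16, k=8,  mlp_hidden=512),   # deepen the stack first
    dict(l=8, n_f=16, n_l=16, k=8,  mlp_hidden=512),
    dict(l=8, n_f=32, n_l=32, k=16, mlp_hidden=1024),  # then widen FMB/LCB and the MLP
]
```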

Experimental Results

3.1 Offline Comparison on Public Benchmarks

Wukong achieves competitive or superior results on several public recommendation datasets (details in the paper).

[Figure: Offline benchmark results]

3.2 Experiments on Meta’s Internal Dataset

The internal dataset contains 146 billion examples and 720 distinct features. Two prediction tasks are evaluated: click‑through rate (CTR) and conversion prediction.

Dataset and Training Settings

All embedding vectors have a fixed length of 160; embedding size does not grow with dense‑layer scaling.

Dense layers are optimized with Adam; embedding tables use Row‑wise Adagrad (see the sketch after this list).

Batch size: 262,144 examples per step, executed on 128–256 NVIDIA H100 GPUs.
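Below is a toy sketch of this split‑optimizer setup in plain PyTorch. The tiny model is invented for illustration, and torch’s element‑wise Adagrad stands in for Row‑wise Adagrad (which keeps one accumulator per embedding row):

```python
import torch
import torch.nn as nn

emb = nn.EmbeddingBag(1000, 16, mode="sum", sparse=True)   # stand-in embedding table
dense = nn.Sequential(nn.Linear(16, 1))                    # stand-in dense layers

dense_opt = torch.optim.Adam(dense.parameters(), lr=1e-3)      # dense: Adam
sparse_opt = torch.optim.Adagrad(emb.parameters(), lr=1e-2)    # embeddings: Adagrad

ids = torch.randint(0, 1000, (32, 5))                      # fake batch of id lists
label = torch.rand(32, 1)
loss = nn.functional.binary_cross_entropy_with_logits(dense(emb(ids)), label)
dense_opt.zero_grad(); sparse_opt.zero_grad()
loss.backward()
dense_opt.step(); sparse_opt.step()                        # one training step
```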

Metrics:

GFLOP/example: gigaFLOPs of compute required per training example.

PF‑days: total training compute, expressed as the number of days a 1 PFLOP/s machine would need (1 PF‑day ≈ 8.64 × 10^19 FLOPs); a worked example follows this list.

#Params: total model parameters (the embedding tables are fixed at 627 B parameters).

Relative LogLoss: improvement over a fixed baseline; a 0.02 % relative LogLoss gain is considered significant on this dataset.
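As a back‑of‑the‑envelope check on how these metrics relate, here is an illustrative PF‑days computation for the largest reported setting, assuming a single pass over the dataset (a reader’s estimate, not a figure from the paper):

```python
gflop_per_example = 100                            # largest reported compute per example
examples = 146e9                                   # internal dataset size
total_flops = gflop_per_example * 1e9 * examples   # ≈ 1.46e22 FLOPs
pf_days = total_flops / (1e15 * 86_400)            # 1 PFLOP/s machine for 86,400 s
print(f"≈ {pf_days:.0f} PF-days")                  # ≈ 169 PF-days
```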

Scaling Results

Training compute is increased from 1 GFLOP/example up to 100 GFLOP/example, corresponding to model parameter counts from 0.74 B to 17 B in the dense layers. LogLoss decreases monotonically as compute grows, confirming the scaling law:

Every four‑fold increase in model compute yields roughly a 0.1 % relative LogLoss improvement.

The relationship holds both when scaling by FLOPs and when scaling by total parameter count.
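Reading the trend literally, the implied cumulative gain across the full 100× compute range works out as follows (a reader’s arithmetic, not a figure quoted from the paper):

```python
import math

compute_ratio = 100                          # 1 -> 100 GFLOP/example
quadruplings = math.log(compute_ratio, 4)    # ≈ 3.32 four-fold steps
gain = 0.1 * quadruplings                    # ≈ 0.33 % relative LogLoss
print(f"{quadruplings:.2f} quadruplings -> ~{gain:.2f}% relative improvement")
```

Against the 0.02 % significance threshold noted above, an implied cumulative gain of roughly 0.33 % is substantial for this dataset.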

[Figure: LogLoss vs. compute]

Parameter ablation studies show that increasing interaction‑related parameters (l and n_F) yields the largest gains. Combinations of k, n_F, and n_L also improve performance, while increasing n_L alone has minimal effect.

[Figure: Parameter ablation results]

Appendix: Observations

The study demonstrates a clear scaling law for recommendation systems: allocating more compute to feature interaction leads to consistent performance improvements. The authors note that latency constraints are much stricter for recommendation than for large language models, making it a critical engineering challenge to pack more compute into a fixed latency budget. Sequence‑modeling components were not explored in this work.

Tags: deep learning, recommendation systems, scaling law, large-scale AI, Meta, CTR models, Wukong
Written by NewBeeNLP