Artificial Intelligence 20 min read

How AdaSparse Boosts Multi‑Scenario CTR Prediction with Adaptive Sparse Networks

AdaSparse introduces an adaptive sparse network that learns a dedicated sub‑network for each advertising scenario, balancing shared and specific knowledge while keeping computational cost low, and achieves +4.63% CTR and -3.82% CPC improvements in Alibaba’s external ad system, as validated on both public and massive production datasets.

Alimama Tech

May 10, 2023

How AdaSparse Boosts Multi‑Scenario CTR Prediction with Adaptive Sparse Networks

Abstract

CTR (click‑through‑rate) prediction is a fundamental task in recommendation and advertising. AdaSparse addresses two industrial challenges—cross‑scenario generalization, especially for sparse scenarios, and bounded computational cost—by learning an adaptive sparse sub‑network for each scenario. Deployed in a large‑scale ad system, the method yields +4.63% CTR and -3.82% CPC.

Background

In practice, CTR data come from many business units (e.g., Tmall, Ele.me, Ant) and ad placements, each exhibiting distinct user behavior and label distributions. Treating each distinct feature that causes large behavior differences as a “scenario” leads to a multi‑scenario setting where a single model must capture both shared knowledge and scenario‑specific patterns.

Related Work

Early multi‑scenario approaches (e.g., SAML, SAR‑NET) manipulate low‑level embeddings or intermediate representations. Later methods such as BiasNet, PPNET, and STAR push scenario information to higher layers or core parameters, but they often cause parameter explosion and struggle with extremely sparse scenarios. Recent works (M2M, APG) generate scenario‑aware parameters for the main network, yet still suffer from high memory and compute overhead.

AdaSparse Method

Model Overview

The backbone is a fully‑connected DNN (or DCNv2) that receives concatenated scenario‑related and scenario‑agnostic embeddings. For each hidden layer, a lightweight auxiliary network—called the domain‑aware pruner—takes the layer’s hidden representation together with scenario embeddings and outputs a weight‑factor vector that scales (or masks) the neurons. Three generation strategies are explored:

Binarization : hard 0‑1 masking.

Scaling : soft weighting without hard pruning.

Fusion : combination of binarization and scaling.

The overall workflow is illustrated in Figure 3.

Scene‑Adaptive Pruner

The pruner is a shallow MLP ending with a sigmoid activation followed by a soft‑threshold function. The sigmoid produces values in (0,1); a learnable scaling parameter makes the curve steeper during training, encouraging near‑binary outputs. The soft‑threshold forces values below a learned threshold to zero, effectively pruning unimportant neurons. The three weight‑factor generation methods are visualized in Figure 4.

Sparsity Regularization

To let each scenario automatically settle on an appropriate sparsity level, a regularizer is added to the loss. Let s denote the sparsity of a layer, L_1 the L1 norm of the hidden vector, and d the hidden dimension. The regularizer activates only when s falls outside a preset interval [s_{min}, s_{max}] and its strength grows over training epochs, allowing the model to first fit labels and later enforce sparsity. This balances accuracy, parameter reduction, and training stability.

Experiments

Offline Evaluation

Two datasets were used:

Public IAAC dataset: 10 M samples, 300 scenarios.

Proprietary Production dataset: 2.2 B samples, >5 000 scenarios.

Baselines include MAML, STAR, and APG. AdaSparse consistently outperforms baselines, with larger gains as the number of scenarios increases. Ablation studies show that neuron‑level pruning and scaling each contribute positively, and their fusion yields the best performance.

Further Analysis

Cross‑scenario generalization : Similar domains share more neurons, confirming that AdaSparse learns common structures while preserving scenario‑specific ones.

Sparse‑scenario performance : Binning scenarios by exposure shows AdaSparse retains advantage on long‑tail (sparse) domains.

Number of scenarios vs. performance : Accuracy improves with scenario granularity up to a point, after which it degrades, indicating the need for careful scenario design.

Sparsity‑effect trade‑off : Varying the sparsity control range reveals an optimal region (≈0.35–0.40) where AdaSparse matches a baseline DNN; excessive pruning harms accuracy.

Detailed analysis of generalization, sparsity, and scenario count

Online A/B Test

The Fusion version of AdaSparse was deployed in production. Evaluation metrics—CTR, CPC (cost per click), and UPR (percentage of scenarios with positive uplift)—showed +4.63% CTR, -3.82% CPC, and >90% UPR, confirming real‑world effectiveness.

Conclusion and Future Work

AdaSparse demonstrates that combining network pruning with scenario‑aware weight generation yields a lightweight, high‑performing multi‑scenario CTR model. Future directions include mitigating the dominance of large scenarios in the auxiliary network and exploring how small scenarios can assist larger ones.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Advertising deep learning CTR Prediction multi‑scenario modeling network pruning adaptive sparsity

Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.