How Multi-Attribution Learning Boosts Conversion Rate Prediction in Display Ads
This article introduces Multi-Attribution Learning (MAL), a novel paradigm that jointly models multiple attribution labels to overcome the single-attribution bottleneck in conversion rate (CVR) prediction, detailing its architecture, auxiliary tasks, extensive offline and online experiments, and significant business gains.
One‑Sentence Summary
To address the limitations of single‑attribution mechanisms, Alibaba’s advertising platform proposes the Multi‑Attribution Learning (MAL) paradigm, which jointly models First‑Click, Linear, MTA and other attribution labels, upgrading from single‑objective fitting to multi‑value learning.
Abstract
In advertising systems, conversion results are assigned to user touchpoints via attribution mechanisms to evaluate and optimize ad performance. Although many attribution methods (First‑Click, Last‑Click, Linear, MTA, etc.) exist, most production systems optimize for a single target attribution, usually Last‑Click.
The system generates conversion labels based on the target attribution and trains a CVR (conversion‑rate) prediction model, whose outputs are used for bidding and auction.
This single‑attribution training paradigm, called SAL (Single‑Attribution Learning), ensures accuracy under the target attribution but fails to capture the evolution of user intent across the conversion journey, resulting in weaker AUC performance.
To overcome this bottleneck, Alibaba’s advertising team introduces MAL (Multi‑Attribution Learning), which jointly learns labels from multiple attribution perspectives, shifting from fitting a single target to learning multi‑dimensional traffic value.
Multi‑Attribution Learning: Leverages rich labels from various attribution mechanisms as auxiliary targets to maximize prediction accuracy under the primary attribution.
The proposed architecture combines an Attribution Knowledge Aggregator (AKA) with a Primary Target Predictor (PTP). AKA extracts representations from the CVR labels of each attribution, while PTP uses these representations to predict the primary attribution label. An additional Cartesian‑Product Auxiliary Training (CAT) task models high‑order interactions among the different attribution views.
These innovations yield a +0.5% offline GAUC and +2.6% online ROI improvement, especially in industries with long conversion paths such as large appliances and jewelry.
The MAL paradigm is applicable to all conversion‑prediction models, offering a new technical route for CVR estimation.
1. Background: Single‑Attribution Bottleneck in CVR Models
CVR models estimate the probability of conversion after a click, crucial for traffic allocation and user experience. A fundamental challenge is label ambiguity caused by multiple touchpoints, which must be allocated by an attribution mechanism.
Accurate attribution is the foundation for evaluating touchpoint contribution and optimizing ad spend. Common attribution types include:
Last‑Click Attribution: Credits the conversion to the last relevant click before conversion; all other clicks are treated as negative samples.
First‑Click Attribution: Credits the conversion to the first relevant click in the attribution window.
Linear Attribution: Distributes conversion credit evenly across all relevant clicks.
Data‑Driven Multi‑Touch Attribution (MTA): Uses causal inference models (e.g., Alibaba’s CausalMTA, LinkedIn’s LiDDA) to allocate credit.
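The rule‑based mechanisms above can be made concrete with a small sketch. The function below derives per‑click soft labels for one converting user journey; the MTA credits are treated as an externally supplied weight list (standing in for a learned model such as CausalMTA), since those are not rule‑based. Function and argument names are illustrative, not from the paper.

```python
from typing import Dict, List, Optional

def attribution_labels(
    num_clicks: int,
    converted: bool,
    mta_weights: Optional[List[float]] = None,
) -> Dict[str, List[float]]:
    """Per-click conversion labels under common attribution rules.

    Returns one label list per mechanism, index-aligned with the
    click sequence. `mta_weights` stands in for credits produced by a
    data-driven MTA model; if absent we fall back to linear credit.
    """
    if not converted or num_clicks == 0:
        zeros = [0.0] * num_clicks
        return {k: list(zeros) for k in ("last_click", "first_click", "linear", "mta")}

    last = [0.0] * num_clicks
    last[-1] = 1.0                      # all credit to the final click
    first = [0.0] * num_clicks
    first[0] = 1.0                      # all credit to the first click
    linear = [1.0 / num_clicks] * num_clicks  # even split across clicks
    mta = mta_weights if mta_weights is not None else list(linear)
    return {"last_click": last, "first_click": first, "linear": linear, "mta": mta}
```

Note how the same journey yields four different supervision signals: under Last‑Click the early clicks become negatives, while Linear and MTA keep them as weak positives, which is exactly the label diversity MAL exploits.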
Despite the variety of attribution mechanisms, most CVR models in practice train only on a single attribution label, leading to a “single‑attribution bottleneck” that limits the model’s holistic understanding of touchpoint value.
Examples of the bottleneck:
Training on Last‑Click over‑emphasizes “harvest” value and treats “seed” touchpoints as negative, biasing value measurement.
Training on MTA inherits MTA’s own prediction errors and cannot exploit the explicit positional information of First‑Click or Last‑Click.
To mitigate this, we propose the MAL paradigm, which fits multiple attribution labels simultaneously, reducing bias and improving prediction performance.
2. Solution: Multi‑Attribution Learning (MAL)
2.1 Optimization Objective and Overall Framework
Objective: Maximize prediction accuracy under the primary attribution mechanism.
Online advertising platforms provide reports for multiple attribution windows, but the CVR model used for ranking must output predictions for a specific “production‑window” target. We treat the target’s conversion label as the primary target and all other attribution labels as auxiliary targets . MAL’s goal is to exploit these auxiliary signals to improve the primary target’s accuracy, thereby enhancing traffic allocation efficiency, advertiser ROI, and user experience.
The overall framework consists of the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). Additionally, we introduce the Cartesian‑Product Auxiliary Training (CAT) task to provide a higher‑order auxiliary signal.
2.2 Model Structure Innovations
Although MAL can be viewed as a multi‑task learning problem, standard multi‑task architectures such as MMoE or PLE aim to optimize all tasks jointly, which conflicts with MAL’s objective of improving only the primary target. Therefore, we design a custom architecture:
Each auxiliary attribution has its own MLP head (the “auxiliary task predictor”) that shares the bottom embedding and feature‑extraction layers with the primary head. The penultimate layer of each auxiliary head produces a knowledge vector (e.g., for First‑Click, Last‑Click, Linear, MTA). These vectors are concatenated into a unified knowledge embedding, which is injected into the primary head via a “knowledge plug‑in” mechanism, enabling one‑way knowledge transfer from auxiliaries to the primary task.
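A minimal PyTorch sketch of this layout is below. Layer sizes, task names, and the use of `detach()` to realize the one‑way transfer are assumptions for illustration; the article does not specify the exact stop‑gradient mechanism or dimensions.

```python
import torch
import torch.nn as nn

class MALNet(nn.Module):
    """Sketch of MAL: shared bottom, per-attribution auxiliary heads
    (the AKA), and a primary head (PTP) fed with the concatenated
    knowledge vectors via the knowledge plug-in."""

    def __init__(self, in_dim=64, hid=32, know=16,
                 aux_tasks=("first_click", "linear", "mta")):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        # One MLP trunk per auxiliary attribution; its output is that
        # view's "knowledge vector" (the penultimate layer of the head).
        self.aux_trunks = nn.ModuleDict(
            {t: nn.Sequential(nn.Linear(hid, know), nn.ReLU()) for t in aux_tasks}
        )
        self.aux_outs = nn.ModuleDict({t: nn.Linear(know, 1) for t in aux_tasks})
        # Primary head consumes shared features plus plugged-in knowledge.
        self.primary = nn.Sequential(
            nn.Linear(hid + know * len(aux_tasks), hid), nn.ReLU(), nn.Linear(hid, 1)
        )

    def forward(self, x):
        h = self.bottom(x)
        aux_logits, knowledge = {}, []
        for t, trunk in self.aux_trunks.items():
            k = trunk(h)
            aux_logits[t] = self.aux_outs[t](k)
            # detach() is one plausible way to keep the transfer one-way:
            # the primary loss cannot push gradients into auxiliary heads.
            knowledge.append(k.detach())
        primary_logit = self.primary(torch.cat([h] + knowledge, dim=-1))
        return primary_logit, aux_logits
```

Each auxiliary logit is trained against its own attribution label, while the primary head is trained only on the primary label, matching MAL's asymmetric objective.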
2.3 Auxiliary Task Design
Observations from experiments show that adding any single auxiliary target improves the primary metric, and combining multiple auxiliaries yields larger gains. Inspired by this, we create the Cartesian‑Product Auxiliary Training (CAT) task, whose label is the Cartesian product of the binary conversion labels from all attribution mechanisms; with four attributions this yields a 16‑class classification problem.
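The CAT label itself is just a bit‑packing of the per‑attribution binary labels. The helper below is a sketch of one natural encoding (the article does not specify the bit order):

```python
def cat_label(labels) -> int:
    """Encode per-attribution binary conversion labels into a single
    Cartesian-product class index in [0, 2**n - 1].

    The order of `labels` fixes the bit positions (an arbitrary but
    consistent choice); with four attributions, e.g.
    (first_click, last_click, linear, mta), this gives a 16-way target.
    """
    idx = 0
    for bit in labels:
        idx = idx * 2 + int(bit)  # shift left, append this view's bit
    return idx

# A click credited by First-Click, Last-Click, and Linear but not MTA:
cat_label((1, 1, 1, 0))  # -> class 14 (binary 1110)
```

Training a softmax head on this index forces the model to distinguish combinations of attribution outcomes, capturing the high‑order interactions among views that separate binary heads cannot express.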
3. Offline and Online Experiment Results
3.1 Offline Gains and Ablation Studies
We evaluated MAL on Alibaba’s display‑ad data using First‑Click, Last‑Click, Linear, and MTA labels. When Last‑Click or MTA served as the primary target, MAL achieved a GAUC lift of +0.51% and +0.75% respectively, outperforming the best multi‑task baseline (Shared‑Bottom or PLE) by 0.20–0.28%.
Ablation studies confirm that:
CAT provides an additional GAUC lift of +0.13% without harming AUC.
The performance gain stems from the richer supervision of multi‑attribution labels rather than merely increased parameter count.
3.2 Online A/B Experiment
In a live budget A/B test on Taobao’s display‑ad system, the MAL model delivered +2.7% GMV, +2.6% ROI, and +1.2% order volume over the production baseline. Gains were especially pronounced in high‑ticket‑price categories (e.g., large appliances +11.6% orders, jewelry +9.7%).
4. Conclusion and Outlook
Existing CVR models suffer from the single‑attribution bottleneck, limiting their ability to capture the full user conversion journey. MAL introduces a multi‑attribution learning paradigm that shifts modeling from fitting a single target to learning multi‑dimensional traffic value, achieving substantial offline and online improvements.
Future directions include expanding the value system beyond purchases (e.g., detail‑page interactions, refunds), applying generative Transformer‑based modeling to serialized user‑touchpoint sequences, and integrating large language models for deeper semantic understanding of user intent.
References
Yao, D. et al. "CausalMTA: Eliminating the user confounding bias for causal multi‑touch attribution." KDD 2022.
Bencina, J. et al. "LiDDA: Data Driven Attribution at LinkedIn." arXiv preprint arXiv:2505.09861 (2025).
Ma, J. et al. "Modeling task relationships in multi‑task learning with multi‑gate mixture‑of‑experts." KDD 2018.
Tang, H. et al. "Progressive layered extraction (PLE): A novel multi‑task learning model for personalized recommendations." RecSys 2020.
Zhang, Y. et al. "KEEP: An industrial pre‑training framework for online recommendation via knowledge extraction and plugging." CIKM 2022.
Zhai, J. et al. "Actions Speak Louder than Words: Trillion‑Parameter Sequential Transducers for Generative Recommendations." ICML 2024.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.