How Multi-Attribution Learning Boosts Conversion Rate Prediction in Display Ads
This article introduces Multi-Attribution Learning (MAL), a novel paradigm that jointly models multiple attribution labels to overcome the single-attribution bottleneck in conversion rate (CVR) prediction, detailing its architecture, auxiliary tasks, extensive offline and online experiments, and significant business gains.
One‑Sentence Summary
To address the limitations of single‑attribution mechanisms, Alibaba’s advertising platform proposes the Multi‑Attribution Learning (MAL) paradigm, which jointly models First‑Click, Linear, MTA and other attribution labels, upgrading from single‑objective fitting to multi‑value learning.
Abstract
In advertising systems, conversion results are assigned to user touchpoints via attribution mechanisms to evaluate and optimize ad performance. Although many attribution methods (First‑Click, Last‑Click, Linear, MTA, etc.) exist, most production systems optimize for a single target attribution, usually Last‑Click.
The system generates conversion labels based on the target attribution and trains a CVR (conversion‑rate) prediction model, whose outputs are used for bidding and auction.
This single‑attribution training paradigm, called SAL (Single‑Attribution Learning), ensures accuracy under the target attribution but fails to capture the evolution of user intent across the conversion journey, resulting in weaker AUC performance.
To overcome this bottleneck, Alibaba’s advertising team introduces MAL (Multi‑Attribution Learning), which jointly learns labels from multiple attribution perspectives, shifting from fitting a single target to learning multi‑dimensional traffic value.
Multi‑Attribution Learning: Leverages rich labels from various attribution mechanisms as auxiliary targets to maximize prediction accuracy under the primary attribution.
The proposed architecture combines an Attribution Knowledge Aggregator (AKA) with a Primary Target Predictor (PTP). AKA extracts representations from the CVR labels of each attribution, while PTP uses these representations to predict the primary attribution label. An additional Cartesian‑Product Auxiliary Training (CAT) task models high‑order interactions among the different attribution views.
These innovations yield a +0.5% offline GAUC and +2.6% online ROI improvement, especially in industries with long conversion paths such as large appliances and jewelry.
The MAL paradigm is applicable to all conversion‑prediction models, offering a new technical route for CVR estimation.
1. Background: Single‑Attribution Bottleneck in CVR Models
CVR models estimate the probability of conversion after a click, crucial for traffic allocation and user experience. A fundamental challenge is label ambiguity caused by multiple touchpoints, which must be allocated by an attribution mechanism.
Accurate attribution is the foundation for evaluating touchpoint contribution and optimizing ad spend. Common attribution types include:
Last‑Click Attribution: Credits the conversion to the last relevant click before conversion; all other clicks are treated as negative samples.
First‑Click Attribution: Credits the conversion to the first relevant click in the attribution window.
Linear Attribution: Distributes conversion credit evenly across all relevant clicks.
Data‑Driven Multi‑Touch Attribution (MTA): Uses causal inference models (e.g., Alibaba’s CausalMTA, LinkedIn’s LiDDA) to allocate credit.
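The rule‑based mechanisms above can be made concrete with a small sketch. The function below derives per‑click soft labels for one converting user journey; the MTA credits are treated as an externally supplied weight list (standing in for a learned model such as CausalMTA), since those are not rule‑based. Function and argument names are illustrative, not from the paper.

```python
from typing import Dict, List, Optional

def attribution_labels(
    num_clicks: int,
    converted: bool,
    mta_weights: Optional[List[float]] = None,
) -> Dict[str, List[float]]:
    """Per-click conversion labels under common attribution rules.

    Returns one label list per mechanism, index-aligned with the
    click sequence. `mta_weights` stands in for credits produced by a
    data-driven MTA model; if absent we fall back to linear credit.
    """
    if not converted or num_clicks == 0:
        zeros = [0.0] * num_clicks
        return {k: list(zeros) for k in ("last_click", "first_click", "linear", "mta")}

    last = [0.0] * num_clicks
    last[-1] = 1.0                      # all credit to the final click
    first = [0.0] * num_clicks
    first[0] = 1.0                      # all credit to the first click
    linear = [1.0 / num_clicks] * num_clicks  # even split across clicks
    mta = mta_weights if mta_weights is not None else list(linear)
    return {"last_click": last, "first_click": first, "linear": linear, "mta": mta}
```

Note how the same journey yields four different supervision signals: under Last‑Click the early clicks become negatives, while Linear and MTA keep them as weak positives, which is exactly the label diversity MAL exploits.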
Despite the variety of attribution mechanisms, most CVR models in practice train only on a single attribution label, leading to a “single‑attribution bottleneck” that limits the model’s holistic understanding of touchpoint value.
Examples of the bottleneck:
Training on Last‑Click over‑emphasizes “harvest” value and treats “seed” touchpoints as negative, biasing value measurement.
Training on MTA inherits MTA’s own prediction errors and cannot exploit the explicit positional information of First‑Click or Last‑Click.
To mitigate this, we propose the MAL paradigm, which fits multiple attribution labels simultaneously, reducing bias and improving prediction performance.
2. Solution: Multi‑Attribution Learning (MAL)
2.1 Optimization Objective and Overall Framework
Objective: Maximize prediction accuracy under the primary attribution mechanism.
Online advertising platforms provide reports for multiple attribution windows, but the CVR model used for ranking must output predictions for a specific “production‑window” target. We treat the target’s conversion label as the primary target and all other attribution labels as auxiliary targets . MAL’s goal is to exploit these auxiliary signals to improve the primary target’s accuracy, thereby enhancing traffic allocation efficiency, advertiser ROI, and user experience.
The overall framework consists of the Attribution Knowledge Aggregator (AKA) and the Primary Target Predictor (PTP). Additionally, we introduce the Cartesian‑Product Auxiliary Training (CAT) task to provide a higher‑order auxiliary signal.
2.2 Model Structure Innovations
Although MAL can be viewed as a multi‑task learning problem, standard multi‑task architectures such as MMoE or PLE aim to optimize all tasks jointly, which conflicts with MAL’s objective of improving only the primary target. Therefore, we design a custom architecture:
Each auxiliary attribution has its own MLP head (the “auxiliary task predictor”) that shares the bottom embedding and feature‑extraction layers with the primary head. The penultimate layer of each auxiliary head produces a knowledge vector (e.g., for First‑Click, Last‑Click, Linear, MTA). These vectors are concatenated into a unified knowledge embedding, which is injected into the primary head via a “knowledge plug‑in” mechanism, enabling one‑way knowledge transfer from auxiliaries to the primary task.
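A minimal PyTorch sketch of this layout is below. Layer sizes, task names, and the use of `detach()` to realize the one‑way transfer are assumptions for illustration; the article does not specify the exact stop‑gradient mechanism or dimensions.

```python
import torch
import torch.nn as nn

class MALNet(nn.Module):
    """Sketch of MAL: shared bottom, per-attribution auxiliary heads
    (the AKA), and a primary head (PTP) fed with the concatenated
    knowledge vectors via the knowledge plug-in."""

    def __init__(self, in_dim=64, hid=32, know=16,
                 aux_tasks=("first_click", "linear", "mta")):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU())
        # One MLP trunk per auxiliary attribution; its output is that
        # view's "knowledge vector" (the penultimate layer of the head).
        self.aux_trunks = nn.ModuleDict(
            {t: nn.Sequential(nn.Linear(hid, know), nn.ReLU()) for t in aux_tasks}
        )
        self.aux_outs = nn.ModuleDict({t: nn.Linear(know, 1) for t in aux_tasks})
        # Primary head consumes shared features plus plugged-in knowledge.
        self.primary = nn.Sequential(
            nn.Linear(hid + know * len(aux_tasks), hid), nn.ReLU(), nn.Linear(hid, 1)
        )

    def forward(self, x):
        h = self.bottom(x)
        aux_logits, knowledge = {}, []
        for t, trunk in self.aux_trunks.items():
            k = trunk(h)
            aux_logits[t] = self.aux_outs[t](k)
            # detach() is one plausible way to keep the transfer one-way:
            # the primary loss cannot push gradients into auxiliary heads.
            knowledge.append(k.detach())
        primary_logit = self.primary(torch.cat([h] + knowledge, dim=-1))
        return primary_logit, aux_logits
```

Each auxiliary logit is trained against its own attribution label, while the primary head is trained only on the primary label, matching MAL's asymmetric objective.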
2.3 Auxiliary Task Design
Observations from experiments show that adding any single auxiliary target improves the primary metric, and combining multiple auxiliaries yields larger gains. Inspired by this, we create the Cartesian‑Product Auxiliary Training (CAT) task, whose label is the Cartesian product of the binary conversion labels from all attribution mechanisms; with four attributions this yields a 16‑class classification problem.
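The CAT label itself is just a bit‑packing of the per‑attribution binary labels. The helper below is a sketch of one natural encoding (the article does not specify the bit order):

```python
def cat_label(labels) -> int:
    """Encode per-attribution binary conversion labels into a single
    Cartesian-product class index in [0, 2**n - 1].

    The order of `labels` fixes the bit positions (an arbitrary but
    consistent choice); with four attributions, e.g.
    (first_click, last_click, linear, mta), this gives a 16-way target.
    """
    idx = 0
    for bit in labels:
        idx = idx * 2 + int(bit)  # shift left, append this view's bit
    return idx

# A click credited by First-Click, Last-Click, and Linear but not MTA:
cat_label((1, 1, 1, 0))  # -> class 14 (binary 1110)
```

Training a softmax head on this index forces the model to distinguish combinations of attribution outcomes, capturing the high‑order interactions among views that separate binary heads cannot express.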
3. Offline and Online Experiment Results
3.1 Offline Gains and Ablation Studies
We evaluated MAL on Alibaba’s display‑ad data using First‑Click, Last‑Click, Linear, and MTA labels. When Last‑Click or MTA served as the primary target, MAL achieved a GAUC lift of +0.51% and +0.75% respectively, outperforming the best multi‑task baseline (Shared‑Bottom or PLE) by 0.20–0.28%.
Ablation studies confirm that:
CAT provides an additional GAUC lift of +0.13% without harming AUC.
The performance gain stems from the richer supervision of multi‑attribution labels rather than merely increased parameter count.
3.2 Online A/B Experiment
In a live budget A/B test on Taobao’s display‑ad system, the MAL model delivered +2.7% GMV, +2.6% ROI, and +1.2% order volume over the production baseline. Gains were especially pronounced in high‑ticket‑price categories (e.g., large appliances +11.6% orders, jewelry +9.7%).
4. Conclusion and Outlook
Existing CVR models suffer from the single‑attribution bottleneck, limiting their ability to capture the full user conversion journey. MAL introduces a multi‑attribution learning paradigm that shifts modeling from fitting a single target to learning multi‑dimensional traffic value, achieving substantial offline and online improvements.
Future directions include expanding the value system beyond purchases (e.g., detail‑page interactions, refunds), applying generative Transformer‑based modeling to serialized user‑touchpoint sequences, and integrating large language models for deeper semantic understanding of user intent.
References
Yao, D. et al. "CausalMTA: Eliminating the user confounding bias for causal multi‑touch attribution." KDD 2022.
Bencina, J. et al. "LiDDA: Data Driven Attribution at LinkedIn." arXiv preprint arXiv:2505.09861 (2025).
Ma, J. et al. "Modeling task relationships in multi‑task learning with multi‑gate mixture‑of‑experts." KDD 2018.
Tang, H. et al. "Progressive layered extraction (PLE): A novel multi‑task learning model for personalized recommendations." RecSys 2020.
Zhang, Y. et al. "KEEP: An industrial pre‑training framework for online recommendation via knowledge extraction and plugging." CIKM 2022.
Zhai, J. et al. "Actions Speak Louder than Words: Trillion‑Parameter Sequential Transducers for Generative Recommendations." ICML 2024.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.