
Model-Based Reinforcement Learning Auto‑Bidding Algorithms for Online Advertising

The paper introduces a model‑based reinforcement‑learning auto‑bidding framework that learns a neural‑network environment model from real logs, generates confidence‑aware virtual data fused with real data, and employs the COMBO+MICRO stabilizer plus a Lagrange‑dual method for ROI‑constrained bidding. On Alibaba’s platform it delivers up to 5.0 % GMV growth and 3.7 % ROI improvement in budget‑constrained bidding, and 6.8 % higher consumption with 3.8 % GMV growth in the full‑site Target‑ROAS scenario.

Alimama Tech

1. Overview

Reinforcement‑learning (RL) auto‑bidding has become a flagship technology in intelligent ad delivery, yet it suffers from offline–online inconsistency and limited coverage of online data. This work proposes a Model‑Based RL (MBRL) training paradigm that mitigates these issues by learning a neural‑network environment model from real‑world data and generating high‑confidence virtual data for policy training.

2. Paradigm Shift of RL Auto‑Bidding

Historically, RL bidding moved from Simulation‑Based RL (SBRL) using offline simulators to Offline RL (ORL) that directly trains on online data. SBRL suffers from large simulation gaps, while ORL is constrained to the data distribution of a single online model. MBRL combines the strengths of both by building a learned environment model that can generate diverse, high‑quality virtual experiences.

3. MBRL for Budget‑Constrained Bidding (BCB)

3.1 Pipeline

1) Fit a neural‑network environment model on real data.
2) Interact the bidding policy with the model to produce virtual data, applying confidence‑based penalties to reduce over‑optimism.
3) Mix virtual and real data for “virtual‑real fusion” policy training with the COMBO+MICRO algorithm, which curbs Q‑value instability.
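The virtual‑real fusion step can be sketched as a simple batch sampler. The buffers, the batch size, and the real/virtual mixing ratio below are illustrative assumptions, not values reported in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_fusion_batch(real_buffer, virtual_buffer, batch_size=256, real_frac=0.5):
    """Draw a training batch that mixes real logged transitions with
    model-generated virtual transitions. The 50/50 split is a
    hypothetical choice for illustration."""
    n_real = int(batch_size * real_frac)
    n_virt = batch_size - n_real
    real_idx = rng.integers(0, len(real_buffer), size=n_real)
    virt_idx = rng.integers(0, len(virtual_buffer), size=n_virt)
    return [real_buffer[i] for i in real_idx] + [virtual_buffer[i] for i in virt_idx]

# toy buffers: integers stand in for (state, action, reward, next_state) tuples
batch = sample_fusion_batch(list(range(1000)), list(range(1000, 3000)))
```

In a real pipeline each buffer entry would be a full transition tuple; the ratio of real to virtual data is typically a tuned hyperparameter.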

3.2 Key Modules

Neural‑Network Environment Model: Takes state (or history) and bid action as input, outputs reward and next‑state Gaussian parameters. Trained by maximum likelihood on real logs and then frozen.
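Training by maximum likelihood on a Gaussian output head amounts to minimizing the Gaussian negative log‑likelihood. The sketch below shows that objective in isolation; it is a generic formulation, not the paper’s exact architecture or loss:

```python
import numpy as np

def gaussian_nll(mu, log_var, target):
    """Per-dimension negative log-likelihood of targets under a diagonal
    Gaussian with predicted mean mu and log-variance log_var (constant
    terms dropped). Minimizing this is the maximum-likelihood objective
    for the environment model's reward/next-state head."""
    return 0.5 * np.mean(log_var + (target - mu) ** 2 / np.exp(log_var))

# dummy predictions: the model maps (state, bid) -> (mu, log_var) in practice
mu = np.zeros(4)
log_var = np.zeros(4)
target = np.ones(4)
loss = gaussian_nll(mu, log_var, target)
```

In the full model, `mu` and `log_var` would be produced by a neural network conditioned on the state (or history) and the bid action.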

Confidence Processing: Penalizes virtual rewards proportionally to the model’s predictive variance, encouraging conservative updates.
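A common form of this penalty subtracts the model’s predictive standard deviation, scaled by a coefficient, from the predicted reward. The coefficient `beta` and the exact penalty shape here are assumptions for illustration:

```python
import numpy as np

def penalized_reward(r_pred, sigma_pred, beta=1.0):
    """Uncertainty-penalized virtual reward: the more uncertain the
    environment model is about a transition (larger predictive std),
    the more its reward is discounted. beta is a hypothetical knob."""
    return r_pred - beta * sigma_pred

# high-confidence prediction keeps most of its reward; low-confidence loses more
r = penalized_reward(np.array([1.0, 1.0]), np.array([0.1, 0.5]))
```

This pushes the policy to rely on virtual experience only where the model is confident, which is the conservatism the section describes.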

COMBO+MICRO: COMBO uniformly samples out‑of‑distribution (s,a) pairs and suppresses their Q‑values; MICRO samples the model’s next‑state distribution and uses the minimum Q‑value to avoid optimistic bias.

3.3 Experimental Results

Online tests on Alibaba’s advertising platform show consumption +1.3 %, GMV +5.0 %, ROI +3.7 % for BCB scenarios. Ablation studies confirm that the environment model reduces offline–online inconsistency (state‑MAE −68.3 %, reward‑MAE −90 %) and that COMBO+MICRO improves training stability across nine metrics.

4. Lagrange‑MBRL for Target‑ROAS

4.1 Overview

The Target‑ROAS problem adds ROI constraints to the bidding objective. A Lagrange‑dual gradient method is employed to guarantee convergence while satisfying ROI bounds, replacing heuristic reward‑shaping.
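In the usual primal‑dual formulation (the notation here is illustrative, not necessarily the paper’s), the ROI constraint is folded into the objective via a non‑negative multiplier:

```latex
\max_{\theta}\; V(\theta) \quad \text{s.t.} \quad \mathrm{ROI}(\theta) \ge L_{\mathrm{roi}}
\;\;\Longrightarrow\;\;
\mathcal{L}(\theta,\lambda) \;=\; V(\theta) \;+\; \lambda\bigl(\mathrm{ROI}(\theta) - L_{\mathrm{roi}}\bigr),
\qquad \lambda \ge 0,
```

where $V(\theta)$ is the policy’s expected value and $L_{\mathrm{roi}}$ the ROI lower bound. Gradient ascent on $\theta$ then alternates with projected updates on $\lambda$.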

4.2 Lagrange‑Dual Gradient Method

The primal‑dual framework alternates updates of the bidding‑policy parameters and the Lagrange multipliers. The multiplier update automatically adjusts to excess or deficit ROI, providing a theoretically grounded constraint‑handling mechanism.
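The multiplier update is a projected gradient step: raise the multiplier when ROI falls short of the target (constraint violated), lower it when ROI exceeds the target, and clip at zero. The step size below is an illustrative assumption:

```python
def dual_update(lmbda, roi, roi_target, step=0.01):
    """Projected gradient step on the Lagrange multiplier: lmbda grows
    when the achieved ROI is below target (deficit) and shrinks when
    above (excess), clipped at zero. step is a hypothetical learning
    rate, not a value from the paper."""
    return max(0.0, lmbda + step * (roi_target - roi))

# deficit ROI (1.8 < 2.0) -> the multiplier increases, tightening the constraint
lam = dual_update(0.5, roi=1.8, roi_target=2.0)
```

This self‑adjusting behavior is what replaces hand‑tuned reward shaping: the penalty strength is learned from the observed constraint violation rather than fixed in advance.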

4.3 Results

Online experiments on the full‑site scenario achieve consumption +6.8 %, GMV +3.8 % with unchanged compliance rates. Offline ablations show that the Lagrange method steadily reduces both excess and deficit ROI errors, unlike reward‑shaping.

5. Conclusion

MBRL, together with confidence‑aware virtual data generation and the COMBO+MICRO stabilizer, delivers significant performance gains over traditional SBRL/ORL pipelines. The Lagrange‑MBRL extension further enables ROI‑constrained bidding with provable convergence.

Tags: auto-bidding, budget constrained bidding, model-based RL, online advertising, reinforcement learning, target ROAS
Written by Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.