
Sustainable Online Reinforcement Learning for Auto-bidding (SORL)

The Sustainable Online Reinforcement Learning (SORL) framework tackles offline inconsistency in auto‑bidding by iteratively gathering safe online data from real ad systems with a Lipschitz‑based exploration method and training a variance‑suppressed conservative Q‑learning policy, achieving safer, more stable, and higher‑performing bids on Alibaba’s platform.


Abstract: Auto-bidding has become crucial for advertisers. Existing reinforcement-learning-based bidding strategies are trained in virtual advertising systems (VAS), simulators that differ from the real-world advertising system (RAS), leading to offline inconsistency. This paper defines the offline inconsistency problem, analyzes its causes, and proposes the Sustainable Online Reinforcement Learning (SORL) framework, which trains bidding policies directly by interacting with the RAS.

Motivation

Real-world ad auctions are multi-stage, run complex mechanisms, and involve dynamic competitor behavior. A VAS simplifies all three, causing mismatches between VAS and RAS in the bidding phase, the auction mechanism, and competitor influence.

Problem Modeling

The bidding task is modeled as a constrained Markov Decision Process (CMDP) with a budget constraint. The state includes the remaining time, the remaining budget, and the consumption rate; the action is the bid price, bounded above and below. The objective is to maximize the total value of won impressions without exceeding the budget.
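
In symbols, with v_i and c_i the value and cost of impression i, x_i indicating whether the bid wins it, and B the budget (notation of our own choosing; the article does not fix one), the objective reads:

```latex
\max_{b}\; \sum_{i} v_i \, x_i(b)
\quad \text{s.t.} \quad
\sum_{i} c_i \, x_i(b) \le B,
\qquad b_{\min} \le b \le b_{\max},
```

where the bids b produced by the policy determine the win indicators x_i through the auction.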

Method: SORL Framework

SORL consists of two algorithms:

Safe and Efficient Exploration (SER): uses the Lipschitz smoothness of the Q-function to derive a safe action domain around the current policy's bid, and samples exploratory bids within that domain, guaranteeing bounded performance loss while keeping exploration efficient.
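
As a rough illustration, the sketch below samples a bid inside a Lipschitz-derived safe radius around the greedy bid; the function names, the uniform sampling, and the eps-fraction safety budget are our assumptions, not the paper's exact procedure:

```python
import numpy as np

def ser_explore(q_fn, state, a_star, lipschitz_L, eps=0.05,
                a_min=0.0, a_max=10.0, rng=None):
    """Sample a bid inside a Lipschitz-derived safe radius around the
    greedy bid a_star (illustrative sketch). Because Q is Lipschitz in
    the action, Q(s, a) >= Q(s, a_star) - L * |a - a_star|, so any bid
    within `radius` sacrifices at most eps * Q(s, a_star) of value."""
    rng = rng or np.random.default_rng()
    q_star = q_fn(state, a_star)                   # value of the greedy bid
    radius = eps * max(q_star, 0.0) / lipschitz_L  # safe half-width
    lo = max(a_min, a_star - radius)
    hi = min(a_max, a_star + radius)
    return rng.uniform(lo, hi)                     # bid inside the safe domain
```

The key property exploited here is that a Lipschitz Q-function cannot drop faster than L per unit change in bid, so every bid inside the radius is provably near-greedy.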

Variance-suppressed Conservative Q-learning (V-CQL): an offline RL algorithm that augments the TD loss with regularization terms that suppress overestimation of out-of-distribution actions and push the Q-function toward a quadratic shape over bids, reducing variance across training runs.
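
A minimal PyTorch-style sketch of such a loss follows; the conservative term is standard CQL, and a second-difference penalty stands in for the paper's quadratic-shape constraint (all names, weights, and the SARSA-style target are our simplifications):

```python
import torch

def v_cql_loss(q_net, batch, actions_grid, alpha=1.0, beta=0.1, gamma=0.99):
    """Illustrative V-CQL-style loss, not the paper's exact objective:
    TD error + a CQL-style conservative penalty + a curvature penalty
    nudging Q(s, .) toward a concave, quadratic-like shape in the bid.
    `actions_grid` is assumed to be an evenly spaced, sorted bid grid."""
    s, a, r, s_next, a_next = batch  # tensors from the offline buffer
    # Standard TD loss (SARSA-style target for simplicity).
    q_sa = q_net(s, a)
    with torch.no_grad():
        target = r + gamma * q_net(s_next, a_next)
    td_loss = torch.mean((q_sa - target) ** 2)
    # Conservative term: push Q down on a grid of candidate bids and
    # up on the bids actually logged in the dataset (standard CQL).
    q_grid = torch.stack(
        [q_net(s, torch.full_like(a, b)) for b in actions_grid], dim=1)
    conservative = torch.mean(torch.logsumexp(q_grid, dim=1) - q_sa)
    # Shape term (our stand-in for the quadratic constraint): penalize
    # positive second differences so Q(s, .) stays concave in the bid.
    second_diff = q_grid[:, 2:] - 2 * q_grid[:, 1:-1] + q_grid[:, :-2]
    shape_penalty = torch.mean(torch.relu(second_diff))
    return td_loss + alpha * conservative + beta * shape_penalty
```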

The two components alternate: SER collects online data from the RAS, V-CQL retrains the bidding policy offline on the accumulated data, and the cycle repeats.
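
Put together, the outer loop looks roughly like this (a sketch with our own function names; collect_fn and train_fn stand in for SER data collection and V-CQL training):

```python
def sorl_loop(collect_fn, train_fn, init_q, n_iters=5):
    """The SORL alternation as a minimal sketch (function names are ours).
    collect_fn(q_net) -> list of transitions gathered safely from the RAS
    via SER; train_fn(dataset) -> (q_net, policy) retrained with V-CQL."""
    dataset, q_net, policy = [], init_q, None
    for _ in range(n_iters):
        dataset.extend(collect_fn(q_net))  # safe online exploration (SER)
        q_net, policy = train_fn(dataset)  # offline policy update (V-CQL)
    return policy
```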

Experiments

Both simulation and online experiments on Alibaba's advertising platform show that SER maintains safety (at most a 5% performance drop) while improving exploration efficiency, and that V-CQL outperforms baselines such as USCB, BCQ, and CQL. Ablation studies confirm SER's safety across different Q-functions and V-CQL's reduced variance across random seeds.

Conclusion

The SORL framework addresses offline inconsistency by enabling direct interaction with real ad systems, delivering safer and more stable auto‑bidding policies.

Tags: reinforcement learning, auto-bidding, online advertising, offline inconsistency, safe exploration, variance reduction
Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
