Sustainable Online Reinforcement Learning for Auto-bidding (SORL)
The Sustainable Online Reinforcement Learning (SORL) framework tackles offline inconsistency in auto‑bidding by iteratively collecting safe online data from the real advertising system via a Lipschitz‑based exploration method, then training a variance‑suppressed conservative Q‑learning policy on that data. On Alibaba's platform, this yields safer, more stable, and higher‑performing bidding.
Abstract: Automatic bidding has become crucial for advertisers. Existing reinforcement‑learning‑based bidding strategies are trained in virtual advertising systems (VAS), which differ from the real advertising system (RAS), leading to offline inconsistency. This paper defines the offline‑inconsistency problem, analyzes its causes, and proposes the Sustainable Online Reinforcement Learning (SORL) framework, which trains bidding policies by interacting with RAS directly.
Motivation
Real‑world ad auctions are multi‑stage, use complex mechanisms, and feature dynamic competitor behavior; VAS simplify all three, causing mismatches in the bidding phase, the auction mechanism, and competitor influence.
Problem Modeling
The bidding task is modeled as a constrained Markov Decision Process (CMDP) with budget constraints. State includes remaining time, budget, and consumption rate; action is the bid price bounded by upper and lower limits. The objective is to maximize total value under budget constraints.
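Under this modeling, the objective takes the standard CMDP form below (the symbols are assumed here for illustration, not taken verbatim from the paper):

```latex
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=1}^{T} v_t\right]
\quad \text{s.t.}\quad \sum_{t=1}^{T} c_t \le B,
\qquad a_t \in [a_{\min},\, a_{\max}]
```

where $v_t$ is the value of impressions won at step $t$, $c_t$ the corresponding spend, $B$ the total budget, and $a_t$ the bid price bounded by the upper and lower limits.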
Method: SORL Framework
SORL consists of two algorithms:
Safe and Efficient Exploration (SER): constructs a safe exploration domain around the current policy's bids using the Lipschitz property of the Q‑function, and samples bids within this domain, guaranteeing safety while keeping exploration efficient.
Variance‑suppressed Conservative Q‑learning (V‑CQL): an offline RL algorithm that adds regularization terms to the TD loss to suppress out‑of‑distribution overestimation and enforce a quadratic shape on the Q‑function, improving stability.
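The general shape of such a loss can be sketched as follows. This is a minimal illustration of a conservative, variance‑suppressed TD objective, not the paper's exact formulation: the function name, arguments, and the specific variance term are assumptions, and the quadratic‑shape regularizer on the Q‑function is omitted for brevity.

```python
import numpy as np

def v_cql_style_loss(q_data, q_target, q_ood, alpha=1.0, beta=0.1):
    """Illustrative V-CQL-style loss (sketch, not the paper's exact form).

    q_data:   Q-values of (state, action) pairs seen in the dataset
    q_target: TD targets, r + gamma * max_a' Q(s', a')
    q_ood:    Q-values of out-of-distribution (sampled) actions
    alpha:    weight of the conservative (CQL-style) penalty
    beta:     weight of a variance-suppressing regularizer
    """
    td_loss = np.mean((q_data - q_target) ** 2)       # standard TD error
    conservative = np.mean(q_ood) - np.mean(q_data)   # push down OOD Q-values
    variance_reg = np.var(q_ood)                      # suppress Q-value spread
    return td_loss + alpha * conservative + beta * variance_reg
```

The conservative term penalizes high Q‑values on actions the dataset never observed, while the variance term discourages erratic value estimates, mirroring the stability goal described above.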
The two components are alternated iteratively: SER collects online data from RAS, V‑CQL trains the bidding policy offline, and the process repeats.
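The alternation above can be sketched as a simple outer loop. Everything here is a stub under stated assumptions: the Lipschitz bound gives a safe bid interval (any bid within `radius` of the greedy bid `a_star` loses at most `max_drop_frac * q_star` in Q‑value if Q is Lipschitz in the action), while the data logging and policy update stand in for RAS interaction and V‑CQL training.

```python
import random

def safe_bid_interval(a_star, q_star, lipschitz_const, max_drop_frac=0.05):
    """Safe exploration interval around the greedy bid a_star (illustrative).

    If Q is Lipschitz in the action with constant `lipschitz_const`, bids
    within `radius` of a_star drop Q by at most max_drop_frac * q_star.
    """
    radius = max_drop_frac * q_star / lipschitz_const
    return a_star - radius, a_star + radius

def sorl_iteration(policy, n_rounds=3):
    """Sketch of the SORL outer loop: SER collects data, V-CQL retrains (stubs)."""
    dataset = []
    for _ in range(n_rounds):
        lo, hi = safe_bid_interval(a_star=policy["bid"], q_star=10.0,
                                   lipschitz_const=2.0)
        bid = random.uniform(lo, hi)   # SER: explore only inside the safe domain
        dataset.append(bid)            # stands in for logging (state, bid, reward) from RAS
        policy = {"bid": sum(dataset) / len(dataset)}  # stands in for a V-CQL update
    return policy, dataset
```

The 5% default for `max_drop_frac` mirrors the safety threshold reported in the experiments below, but the constants here are placeholders.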
Experiments
Both simulation and online experiments on Alibaba’s advertising platform show that SER maintains safety (≤5% performance drop) and improves efficiency, while V‑CQL outperforms baselines such as USCB, BCQ, and CQL. Ablation studies confirm SER’s safety across different Q‑functions and V‑CQL’s reduced variance across random seeds.
Conclusion
The SORL framework addresses offline inconsistency by enabling direct interaction with real ad systems, delivering safer and more stable auto‑bidding policies.
Alimama Tech