Can Few-Shot Reinforcement Learning Supercharge Budget-Constrained Auto-Bidding?
This paper introduces ABPlanner, a few‑shot, context‑aware budget planner that enhances budget‑constrained auto‑bidding in online advertising by hierarchically allocating budgets across short‑term stages and training a sequential decision‑maker with deep reinforcement learning, achieving significant gains in simulated and real‑world A/B tests.
Abstract
This work studies the automatic bidding problem in online advertising, focusing on how to obtain high‑level budget allocation strategies through context‑based learning so that advertisers can quickly adapt personalized bidding models with only a few samples (few‑shot).
We propose ABPlanner (Adaptable Budget Planner), a few‑shot‑adaptable budget planner that improves the effectiveness of budget‑constrained auto‑bidding. ABPlanner builds on a hierarchical bidding framework that splits the bidding horizon into multiple short‑term stages, allocating budget to each stage and guiding the underlying bidding model to follow the plan. By treating the planner as a sequential decision‑maker, it adjusts the budget plan each round based on historical data, achieving high sample efficiency and rapid adaptation to new advertisers.
1. Introduction
Real‑time bidding (RTB) is crucial in online advertising: each impression triggers an instantaneous auction. Advertisers use platform‑provided auto‑bidding services to maximize cumulative display value under budget constraints. However, the bidding environment varies across advertisers, making it challenging to build personalized bidding models. Existing approaches treat the problem as an online stochastic knapsack, aiming to win high‑value impressions while dealing with unknown values and prices.
2. Hierarchical Bidding Framework
We consider the budget‑constrained auto‑bidding problem, where an advertiser participates in sequential auctions to maximize the total value of won impressions within a budget. Let N denote the (typically unknown) number of impressions; each impression i has a value v_i and a market price p_i. The auto‑bidder submits a bid for each impression; if the bid exceeds the market price, the impression is won, yielding value v_i and incurring cost p_i. The objective is to maximize the expected cumulative value of won impressions subject to the total cost staying within the budget.
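As a concrete illustration, the per‑impression mechanics can be written out in a few lines. This is a minimal sketch with hypothetical values, prices, and bids, not the paper's simulator:

```python
# Minimal sketch of budget-constrained bidding: an impression is won
# when the bid exceeds the market price, paying that price, until the
# budget runs out. All numbers below are hypothetical.

def run_auctions(values, prices, bids, budget):
    """Return (total value won, total spend) over a stream of auctions."""
    total_value, spend = 0.0, 0.0
    for v, p, b in zip(values, prices, bids):
        if b > p and spend + p <= budget:
            total_value += v
            spend += p
    return total_value, spend

# Example: three impressions, a flat bid of 1.0, and a budget of 1.5.
value, spend = run_auctions(
    values=[2.0, 1.0, 3.0], prices=[0.8, 1.2, 0.5],
    bids=[1.0, 1.0, 1.0], budget=1.5,
)
```

In this toy run the bidder wins the first and third impressions and is priced out of the second, which is exactly the knapsack tension the paper describes: spend on cheap, high‑value impressions while the budget lasts.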
Key challenges include the stochastic nature of values and prices and the long, high‑frequency decision horizon. To mitigate these issues, we introduce a hierarchical framework: a high‑level budget planner allocates budget across K stages (defined by time or impression count), providing a budget plan for each stage. The low‑level auto‑bidder operates within the stage using the allocated budget as a constraint or reference.
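The split between the high‑level planner and the low‑level bidder can be sketched as follows; the stage count and the fractions are illustrative assumptions, since the paper leaves the plan representation abstract:

```python
# Sketch of the hierarchical interface: the planner emits per-stage
# budget fractions, and each stage's bidder is capped by (or guided
# toward) its allocation. K and the fractions here are illustrative.

def allocate_stage_budgets(total_budget, fractions):
    """Turn a vector of fractions (summing to 1) into per-stage caps."""
    assert abs(sum(fractions) - 1.0) < 1e-9, "plan must sum to 1"
    return [total_budget * f for f in fractions]

# K = 4 stages with a front-loaded plan.
caps = allocate_stage_budgets(100.0, [0.4, 0.3, 0.2, 0.1])
```

The low‑level auto‑bidder then only has to solve a short‑horizon problem inside each stage, with `caps[k]` as its constraint or reference.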
3. Adaptive Budget Planner (ABPlanner)
Simple budget‑allocation methods that fit a budget‑return curve per stage require abundant historical data and suffer from low sample efficiency. ABPlanner addresses this by modeling the planner as a sequential decision‑maker in a Markov Decision Process (MDP). At each bidding round, the planner observes a state consisting of total remaining budget, historical allocations, accumulated reward, and cost, then selects an action representing the direction of budget adjustment. The environment transition is driven by the underlying auto‑bidder’s actions, which generate stage‑level reward and cost. The reward function encourages higher cumulative reward over bidding rounds.
ABPlanner is trained with deep reinforcement learning (e.g., Proximal Policy Optimization). The training details follow the original paper.
State: total budget, historical budget distribution, cumulative reward, and cost.
Action: adjustment direction for the budget plan.
Transition: the low‑level auto‑bidder executes bids, producing stage reward and cost.
Reward: encourages increasing cumulative reward across rounds.
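One bidding round of this MDP can be sketched end to end. Everything concrete here (the state encoding, the stubbed low‑level bidder, the stage ROI range) is an illustrative assumption, not the paper's implementation:

```python
import random

# Illustrative sketch of the planner's MDP: the state tracks remaining
# budget, allocation history, and cumulative reward/cost; the
# transition is driven by the low-level bidder, stubbed here with a
# random stage outcome.

K = 4  # number of stages per bidding round (assumed)

def run_stage(stage_budget, rng):
    """Stub for the low-level auto-bidder: returns (reward, cost)."""
    cost = stage_budget * rng.uniform(0.8, 1.0)    # spend most of the cap
    reward = cost * rng.uniform(0.5, 2.0)          # hypothetical stage ROI
    return reward, cost

def bidding_round(plan, total_budget, rng):
    remaining = total_budget
    cum_reward = cum_cost = 0.0
    history = []
    for k in range(K):
        stage_budget = min(plan[k] * total_budget, remaining)
        reward, cost = run_stage(stage_budget, rng)
        cum_reward += reward
        cum_cost += cost
        remaining -= cost
        history.append(stage_budget)
        # State observed by the planner before the next adjustment.
        state = (remaining, tuple(history), cum_reward, cum_cost)
    return cum_reward, state

rng = random.Random(0)
reward, final_state = bidding_round([0.25] * K, 100.0, rng)
```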
The planner’s objective is to maximize the expected cumulative reward over T bidding rounds, where the expectation is taken over the advertiser’s stochastic environment.
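The paper trains this policy with PPO; as a self‑contained stand‑in, the update can be sketched with a plain REINFORCE‑style policy gradient on a toy softmax policy over three adjustment directions. The PPO machinery, the networks, and the environment are all replaced with stand‑ins here:

```python
import math
import random

# REINFORCE-style stand-in for the paper's PPO training: a softmax
# policy over three adjustment directions ("shift earlier", "keep",
# "shift later"), updated with a baselined policy gradient. The reward
# model is a toy assumption.

ACTIONS = [-1, 0, +1]
logits = [0.0, 0.0, 0.0]

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def toy_reward(action, rng):
    # Hypothetical environment: shifting budget earlier (-1) is best.
    return {-1: 1.0, 0: 0.5, 1: 0.2}[action] + rng.gauss(0, 0.1)

rng = random.Random(0)
lr, baseline = 0.5, 0.0
for round_idx in range(200):
    probs = softmax(logits)
    a = rng.choices(range(3), weights=probs)[0]
    r = toy_reward(ACTIONS[a], rng)
    baseline += 0.05 * (r - baseline)        # running reward baseline
    for i in range(3):                       # grad of log pi(a|logits)
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += lr * (r - baseline) * grad

best = max(range(3), key=lambda i: logits[i])  # learned preference
```

After a few hundred rounds the policy concentrates on the direction with the highest expected reward, which is the sample‑efficiency behavior the planner relies on across rounds.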
4. Experiments
4.1 Simulated Experiments
We evaluate ABPlanner in two simulated environments: one using synthetic data and another semi‑simulated environment built from real data. Results show that as bidding cycles progress, ABPlanner increasingly leverages historical information, substantially improving advertisers’ cumulative returns. In the semi‑simulated setting, ABPlanner also learns to identify high‑ROI time windows and tilt the budget accordingly.
4.2 Online Experiments
In the online experiment, a day is divided into K hourly stages, with each day constituting a bidding round. The baseline auto‑bidder uses linear programming. ABPlanner continuously collects data during inference and updates its policy.
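The production LP bidder's formulation is not given here; as a generic illustration of LP‑style allocation, the fractional (LP‑relaxed) knapsack view of bidding has a closed‑form greedy solution, which buys impressions in decreasing value‑to‑price order until the budget is exhausted:

```python
# Greedy solution of the fractional (LP-relaxed) knapsack: a generic
# stand-in for an LP-based allocation baseline, not the production
# bidder. x[i] is the fraction of impression i's traffic to buy.

def fractional_knapsack(values, prices, budget):
    order = sorted(range(len(values)),
                   key=lambda i: values[i] / prices[i], reverse=True)
    total_value, remaining = 0.0, budget
    x = [0.0] * len(values)
    for i in order:
        take = min(1.0, remaining / prices[i])
        x[i] = take
        total_value += take * values[i]
        remaining -= take * prices[i]
        if remaining <= 0:
            break
    return total_value, x

# Hypothetical example: budget 2.0, ratios 3.0, 1.0, 1.0.
value, x = fractional_knapsack([3.0, 2.0, 1.0], [1.0, 2.0, 1.0], 2.0)
```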
Results indicate that ABPlanner outperforms the baselines on most metrics, notably boosting conversions in the final two days and reducing total cost on most days, thereby increasing platform revenue. Conversion numbers continue to rise from the fourth day onward, confirming the effectiveness of optimization driven by predicted conversion rates.
5. Conclusion
ABPlanner is a few‑shot‑adaptable budget planner that enhances budget‑constrained auto‑bidding in online advertising. By decomposing the bidding horizon into stages and generating a high‑level budget plan, ABPlanner captures temporal dynamics and simplifies the low‑level bidding decision. Modeled as an MDP, it treats each budget adjustment as an action and leverages limited historical rounds to dynamically optimize allocation. Extensive simulations and real‑world A/B tests validate its effectiveness and adaptability, and deployment in a production ad system demonstrates practical feasibility. Future work includes jointly optimizing the planner and the underlying auto‑bidder for tighter coordination and exploring finer‑grained, stage‑wise dynamic budget adjustments.
Alimama Tech
