How Generative Reinforcement Learning is Revolutionizing Real-Time Bidding
This article explains the core challenges of real‑time bidding (RTB), traces the evolution from PID to MPC to reinforcement learning, and details how generative reinforcement‑learning techniques such as GAVE and CBD combine diffusion models, value‑guided exploration, and score‑based return‑to‑go to improve ad‑bid efficiency and revenue.
Real‑Time Bidding Challenges
In a real‑time bidding (RTB) ad system, the bidding module must translate advertisers' goals (e.g., conversion rate, ROI) into dynamic bid decisions, directly affecting ad ranking and platform traffic allocation.
The main challenges are:
Spend wisely: keep daily spend under budget while minimizing cost per conversion.
Uncertain future: traffic volume and competitor behavior cannot be predicted exactly, requiring on‑the‑fly adjustments.
Sequential impact: each bid influences future budget and ad exposure, forming a complex sequential decision problem.
Algorithmic Evolution
Kuaishou’s bidding algorithm has progressed through three generations:
PID (first generation): analogous to cruise control; adjusts bids based only on current speed versus target speed, simple but inflexible.
MPC (second generation): similar to adaptive cruise; predicts short‑term traffic conditions but still suffers from local optima.
Reinforcement Learning (third generation): like an AI driver trained on massive offline data to maximize cumulative reward, offering higher safety and better exploitation of hidden patterns.
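The first-generation idea is easiest to see in code. Below is a minimal sketch of PID-style bid pacing; the function and parameter names are illustrative assumptions, not Kuaishou's implementation. The controller nudges the bid so that the actual spend rate tracks the target spend rate, with no model of the future at all — exactly the "cruise control" limitation described above.

```python
# Illustrative PID bid-pacing sketch (assumed names, not a production system).
def pid_bid(base_bid, target_rate, actual_rate, state, kp=0.5, ki=0.1, kd=0.2):
    """Return an adjusted bid given the pacing error; `state` carries the
    integral term and previous error between calls."""
    error = target_rate - actual_rate          # positive: spending too slowly
    state["integral"] += error
    derivative = error - state["prev_error"]
    state["prev_error"] = error
    adjustment = kp * error + ki * state["integral"] + kd * derivative
    return max(0.0, base_bid * (1.0 + adjustment))

state = {"integral": 0.0, "prev_error": 0.0}
# Spending at 8% of budget/hour against a 10% target: the bid is raised.
bid = pid_bid(base_bid=1.0, target_rate=0.10, actual_rate=0.08, state=state)
```

Because the controller only reacts to the current error, a sudden traffic shift forces it to lag behind — the weakness that MPC and RL generations address.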
Since 2025, generative reinforcement learning (RL) has been fully deployed, boosting ad revenue by over 3%.
Why Add Generative Methods?
Traditional RL in bidding is “one‑dimensional,” conditioning only on the current state. Generative models (e.g., Transformers, diffusion models) excel at modeling complex sequences but depend heavily on data quality and struggle to align with the optimization objective. Combining the two yields a “multidimensional” bid model.
Generative‑RL Framework
The framework consists of two directions:
Generative model as a world model: a digital sandbox that simulates ad outcomes under different bid strategies, generating synthetic training data.
Generative model as policy: directly models the bidding policy to better exploit sequential state information.
Decision Transformer (DT) and Diffusion Model are the two primary implementations.
Decision Transformer (DT)
DT treats bidding as next‑token prediction: given the history of states, actions, and returns‑to‑go, it autoregressively predicts the next optimal bid.
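The key idea is how the trajectory is laid out as a token sequence. The sketch below shows the standard Decision Transformer input layout (return‑to‑go, state, action triples); the helper name and toy values are assumptions for illustration, not the production tokenizer.

```python
# Toy sketch of the Decision Transformer sequence layout (illustrative only).
def build_dt_sequence(states, actions, rewards, target_return):
    """Interleave return-to-go, state, and action into one token sequence.
    The model conditions on this sequence to predict the next action token."""
    seq = []
    rtg = target_return
    for s, a, r in zip(states, actions, rewards):
        seq.extend([("rtg", rtg), ("state", s), ("action", a)])
        rtg -= r                     # remaining reward still to be collected
    return seq

seq = build_dt_sequence(
    states=[0.9, 0.7], actions=[1.2, 1.1], rewards=[5.0, 3.0],
    target_return=10.0,
)
# After the first step's reward of 5.0, the return-to-go drops to 5.0.
```

Conditioning on the return‑to‑go is what lets the model be steered toward a desired total reward at inference time, rather than merely imitating logged behavior.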
Diffusion Model
During inference, the model starts from noise conditioned on past states, actions, and rewards, iteratively denoises it into future state trajectories, and then infers the current optimal bid from the generated trajectory.
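The denoising loop itself has a simple shape. The sketch below is a toy stand‑in: `denoise_step` here is a hand‑written placeholder that pulls the sample toward the conditioning history, whereas a real diffusion model would use a learned noise predictor.

```python
import numpy as np

def denoise_step(x, t, condition):
    # Placeholder: a trained model would predict noise from (x, t, condition);
    # here we simply shrink the sample toward the conditioning history's mean.
    return x + 0.5 * (condition.mean() - x)

def generate_trajectory(condition, horizon=4, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(horizon)          # start from pure noise
    for t in reversed(range(steps)):          # iterative denoising
        x = denoise_step(x, t, condition)
    return x

history = np.array([1.0, 1.2, 0.8])           # observed state history
traj = generate_trajectory(history)           # generated future states
```

The real model replaces the placeholder with a network trained to reverse a forward noising process, but the inference-time structure — noise in, trajectory out, conditioned on history — is the same.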
Key Challenges of Generative Bidding
Data quality dependence: offline exploration can encounter out‑of‑distribution (OOD) issues.
Objective misalignment: generative models do not directly maximize the overall sequence reward.
GAVE Algorithm (Score‑Based RTG + Value‑Guided Exploration)
GAVE introduces a score‑based Return‑to‑Go (RTG) module that incorporates cost‑rate constraints at every timestep, making the model adaptable to objectives such as CPA and ROI. It also adds a value‑function‑guided action‑exploration mechanism that estimates the long‑term value of both the original action and an exploratory action, then updates the policy toward whichever scores higher.
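The value‑guided exploration step can be sketched as follows. This is a simplified illustration under assumed names: the toy value function below stands in for GAVE's learned value estimator, and the additive perturbation stands in for its exploration mechanism.

```python
# Minimal sketch of value-guided action exploration (illustrative only).
def value_guided_action(state, policy_action, value_fn, noise=0.1):
    """Compare the policy's action with a perturbed exploratory action and
    keep whichever the value function scores higher."""
    explore_action = policy_action + noise
    if value_fn(state, explore_action) > value_fn(state, policy_action):
        return explore_action   # pull the policy toward the better action
    return policy_action

# Toy value function: prefers bids near 1.5 (an assumed optimum).
value_fn = lambda s, a: -(a - 1.5) ** 2
chosen = value_guided_action(state=None, policy_action=1.2, value_fn=value_fn)
# 1.3 scores higher than 1.2, so the exploratory action is kept.
```

Screening exploratory actions through a value estimate is what keeps offline exploration from drifting into the out‑of‑distribution regions mentioned above.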
Offline experiments on the AuctionNet dataset show GAVE consistently outperforms baselines across budgets and data conditions. Online A/B tests in two scenarios (Nobid and Costcap) demonstrate significant improvements in consumption, expected consumption, and CPA compliance.
CBD Algorithm (Completer‑Aligner Diffusion)
CBD addresses the two challenges by introducing:
Completer: a diffusion‑based trajectory completer that, given a random decision step t, fills in the missing future states conditioned on the observed history.
Aligner: a trajectory‑reward model R(x) that predicts the total reward of a generated trajectory and adjusts it via gradient updates to align with the target objective; inverse dynamics then yields the final bid.
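The Aligner's gradient guidance can be sketched in a few lines. The quadratic reward model below is an assumed stand‑in for the learned R(x); the point is only the mechanics of nudging a trajectory along the reward gradient before the bid is extracted.

```python
import numpy as np

# Sketch of gradient-based trajectory alignment (toy reward model, not CBD's).
def align_trajectory(x, reward_grad, lr=0.1, steps=20):
    """Nudge a generated trajectory along the reward gradient so it better
    matches the target objective before inverse dynamics extracts the bid."""
    for _ in range(steps):
        x = x + lr * reward_grad(x)
    return x

target = np.array([1.0, 0.8, 0.6])
# Toy R(x) = -||x - target||^2, so its gradient points toward `target`.
reward_grad = lambda x: -2.0 * (x - target)
aligned = align_trajectory(np.zeros(3), reward_grad)
```

Because alignment happens at inference time, the same completer can be steered toward different objectives (CPA, ROI) by swapping the reward model, without retraining the diffusion backbone.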
Offline results show CBD achieving the highest total conversion value under various budgets and reward sparsity levels. Ablation studies confirm the necessity of both Completer and Aligner. Online tests reveal CBD adds only ~6 ms inference latency while delivering a 2.0% lift in expected consumption under equal spend.
Future Directions
Two major evolution paths are envisioned: (1) a foundational large‑model for bidding trained on multi‑scenario, multi‑objective data using DT or Diffusion architectures, and (2) a large‑language‑model‑based inference engine to enhance explainability and reasoning in bid decisions.
Achievements
The team’s work has been published at KDD, ICLR, ICML, NeurIPS and earned Best Paper nominations and awards, including the 2024 NeurIPS large‑scale auto‑bidding competition double‑track champion.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.