How Tencent’s Bidding Algorithms Evolved from GMPC to GRB: A Deep Dive into Generative RL for Ads
The article reviews the 2025 evolution of Tencent advertising’s bidding system—from the second‑generation GMPC control algorithm through the third‑generation MRB reinforcement‑learning model to the fourth‑generation generative RL GRB—detailing architectural upgrades, multi‑channel modeling, training pipelines, and experimental gains, and outlines the 2026 AI‑agent roadmap.
Problem Definition
The advertising bidding module aims to maximize conversion value under cost and budget constraints. For each request i the system considers:
e_i: expected conversions if the bid wins
b_i: bid price
w_i: win probability
c_i: cost incurred on a win
Optimization is performed via a Lagrangian dual formulation that balances volume growth against cost‑overrun penalties.
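The article does not spell out the exact formulation; a plausible sketch, using the per-request quantities defined above and an assumed daily budget B, is:

```latex
\max_{b}\;\sum_i w_i(b_i)\,e_i
\quad\text{s.t.}\quad \sum_i w_i(b_i)\,c_i \le B
```

Relaxing the budget constraint with a multiplier \(\lambda \ge 0\) yields the dual objective

```latex
\mathcal{L}(b,\lambda) \;=\; \sum_i w_i(b_i)\,e_i \;-\; \lambda\left(\sum_i w_i(b_i)\,c_i - B\right)
```

where the first term drives volume growth and the second penalizes cost overruns, matching the trade-off described above.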
Algorithm Evolution Roadmap
Second‑generation GMPC (General Model Predictive Control)
GMPC extends the baseline PID controller with four key upgrades:
Future volume‑price prediction using the last K data points and exponential assumptions.
Historical return‑flow delay estimation to correct delayed conversions.
Model‑Predictive Control that selects the bid maximizing volume while respecting a full‑day cost cap.
Personalized exploration budget that merges a budget‑optimization layer with cost‑control, improving sensitivity during the exploration phase.
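The MPC step above can be sketched as follows. This is an illustrative reconstruction, not Tencent's implementation: it fits the exponential volume-price curve from the last K observations, then picks the candidate bid that maximizes predicted volume while the projected full-day cost stays under the cap. All function and variable names are assumptions.

```python
import math

def fit_exponential(bids, volumes):
    """Least-squares fit of volume ~ a * exp(k * bid) via log-linear regression."""
    n = len(bids)
    xs, ys = bids, [math.log(max(v, 1e-9)) for v in volumes]
    mx, my = sum(xs) / n, sum(ys) / n
    k = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = math.exp(my - k * mx)
    return a, k

def select_bid(bids, volumes, candidates, cost_cap, spent, slots_left):
    """Choose the candidate bid with the highest predicted volume whose
    projected full-day cost respects the cap."""
    a, k = fit_exponential(bids, volumes)
    best_bid, best_vol = None, -1.0
    for b in candidates:
        vol = a * math.exp(k * b)                  # predicted volume per slot
        proj_cost = spent + b * vol * slots_left   # projected full-day cost
        if proj_cost <= cost_cap and vol > best_vol:
            best_bid, best_vol = b, vol
    return best_bid
```

The exponential fit mirrors the "last K data points and exponential assumptions" upgrade; a real system would refit on every control tick.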
Online experiments on short‑video, mini‑games and e‑commerce mini‑stores reported up to +30% consumption and +9% achievement over the PID baseline.
Third‑generation MRB (Model‑based Reinforcement Learning Bidding)
MRB replaces deterministic control with a policy network that interacts N times with a simulated environment, generating diverse trajectories. Major upgrades over GMPC:
Trial‑and‑Error exploration: the policy samples multiple bid trajectories; a reward function aggregates cumulative gains.
Reward function adds an exploration‑budget term and a cost‑overrun penalty, encouraging volume growth without exceeding budget.
Neural‑network‑based volume‑price estimation replaces the exponential fit, allowing richer contextual features (e.g., sparse game‑payment data).
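The article names the reward terms but not their functional form; one plausible additive sketch combines cumulative conversion value, an exploration-budget bonus, and a cost-overrun penalty. The weights and structure here are assumptions for illustration.

```python
def trajectory_reward(values, costs, budget, explore_budget=0.0,
                      explore_weight=0.1, overrun_weight=2.0):
    """Hypothetical MRB-style trajectory reward: cumulative value, plus a
    bonus for spend inside the dedicated exploration budget, minus a
    penalty linear in the amount spent beyond the budget."""
    gain = sum(values)                                # cumulative conversion value
    total_cost = sum(costs)
    explore_bonus = explore_weight * min(total_cost, explore_budget)
    overrun = max(0.0, total_cost - budget)
    return gain + explore_bonus - overrun_weight * overrun
```

Aggregating this reward over the N simulated rollouts gives the signal the policy network is trained against.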
Experimental gains on the first‑day payment scenario were +24.7% consumption, +10.2% non‑cost‑over ratio, and +17.1% project volume‑rate versus GMPC.
Fourth‑generation GRB (Generative Reinforcement Learning Bidding)
GRB introduces a generative pre‑training stage that learns from high‑quality historical trajectories and builds an independent generative RewardModel. Innovations include:
Iterative‑OnlinePolicy: an online learning loop that continuously refines the policy with newly generated samples, using a mean‑squared‑error loss instead of a pure RL loss for faster convergence.
Multi‑Channel modeling: simultaneous prediction of five placement actions (WeChat Moments, Video Channels, Official Accounts, Mini‑Programs, PCAD/Alliance) with bid factors in the range [0.5, 2.5]. Offline training is performed every 10 minutes, covering 144 time slices per day.
Reward design: placement‑specific bias penalties, a Gaussian‑shaped over‑cost penalty, and three weighted loss components: ReturnLoss, StateLoss, and ActionNllLoss (5‑dimensional).
Training data consist of ~10 million 14‑day real trajectories plus synthetic trajectories generated by interacting with the generative environment. An experience pool stores only the best trajectory per project; replacement occurs only when a new trajectory outperforms the current best, and sampling is biased toward projects that struggle to produce good trajectories.
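The "Gaussian-shaped over-cost penalty" and the weighted loss combination admit several readings; one plausible sketch is a penalty that is zero at or under budget and saturates smoothly as the over-cost ratio grows, combined with a weighted sum of the three named losses. The shape, `sigma`, and weights below are assumptions, not published values.

```python
import math

def over_cost_penalty(cost, budget, sigma=0.2):
    """Gaussian-shaped penalty: zero at or under budget, rising smoothly
    toward 1 as the over-cost ratio grows (one plausible reading)."""
    ratio = max(0.0, cost / budget - 1.0)
    return 1.0 - math.exp(-(ratio / sigma) ** 2)

def total_loss(return_loss, state_loss, action_nll_loss,
               w_return=1.0, w_state=1.0, w_action=1.0):
    """Weighted combination of the three GRB loss components:
    ReturnLoss, StateLoss, and ActionNllLoss."""
    return (w_return * return_loss
            + w_state * state_loss
            + w_action * action_nll_loss)
```

A smooth penalty like this keeps gradients informative near the budget boundary, unlike a hard step at the cost cap.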
Multi‑Channel Modeling Details
Action design outputs five placement‑specific bid factors. State design aggregates global project metrics (remaining traffic ratio, cumulative prior/post pricing rates, normalized consumption) and per‑placement metrics (consumption share, RTG score, pricing deviations). The reward penalizes placement‑specific bias while allowing each placement to converge to its own optimal pricing curve.
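The action head above can be sketched minimally: the policy emits one raw score per placement, squashed into the stated bid-factor range [0.5, 2.5]. The placement names come from the text; the sigmoid squashing and the identifiers are illustrative assumptions.

```python
import math

PLACEMENTS = ["moments", "video_channels", "official_accounts",
              "mini_programs", "pcad_alliance"]
LOW, HIGH = 0.5, 2.5  # bid-factor range stated in the article

def to_bid_factors(raw_scores):
    """Map raw policy outputs to per-placement bid factors via a sigmoid
    rescaled into [LOW, HIGH]."""
    assert len(raw_scores) == len(PLACEMENTS)
    factors = {}
    for name, score in zip(PLACEMENTS, raw_scores):
        sig = 1.0 / (1.0 + math.exp(-score))       # squash to (0, 1)
        factors[name] = LOW + (HIGH - LOW) * sig   # rescale to (0.5, 2.5)
    return factors
```

A zero raw score maps to the midpoint factor 1.5, so an uninitialized head starts near neutral pricing for every placement.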
Iterative Online Policy Training
The experience pool is implemented as a priority dictionary: for each project only the best trajectory is retained, and replacement occurs only when a new trajectory outperforms the stored one. Sampling rates are increased for projects that generate low‑quality trajectories, accelerating convergence. This approach mirrors the Iterative‑SFT method proposed by Danqi Chen's team at Princeton.
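The priority-dictionary pool described above can be sketched directly: a dict keyed by project id holds the single best trajectory, replacement happens only on improvement, and the sampling weight grows for projects whose new trajectories keep being rejected. Class and method names are illustrative.

```python
import random

class ExperiencePool:
    """Best-trajectory-per-project pool with struggle-biased sampling."""

    def __init__(self):
        self.best = {}      # project_id -> (reward, trajectory)
        self.weight = {}    # project_id -> sampling weight

    def offer(self, project_id, trajectory, reward):
        """Keep the trajectory only if it beats the stored best."""
        current = self.best.get(project_id)
        self.weight.setdefault(project_id, 1.0)
        if current is None or reward > current[0]:
            self.best[project_id] = (reward, trajectory)
            self.weight[project_id] = 1.0   # good samples arriving: reset bias
            return True
        self.weight[project_id] += 1.0      # struggling project: sample it more
        return False

    def sample(self, rng=random):
        """Draw one project, biased toward those with few accepted trajectories."""
        ids = list(self.best)
        weights = [self.weight[i] for i in ids]
        pid = rng.choices(ids, weights=weights, k=1)[0]
        return pid, self.best[pid][1]
```

Keeping exactly one trajectory per project keeps the pool small and makes the "replace only on improvement" rule a single comparison.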
Online Experiment Results (2025)
GMPC : +30% average consumption, +9% achievement.
MRB : +24.7% consumption, +10.2% non‑cost‑over ratio, +17.1% project volume‑rate.
GRB : +43.3% consumption, +23.4% non‑cost‑over ratio, +4.84 pp achievement, –4.78 pp over‑cost, +19.8% project volume‑rate, +21.25% weekly consumption growth.
2026 Outlook
Planned upgrades include a three‑dimensional experiment metric (ad‑split × traffic × budget buckets) to isolate bidding gains more precisely, and a shift toward AI‑Agent‑driven bidding. The roadmap envisions scaling the GRB policy network via scaling‑law experiments, integrating large‑language‑model knowledge into the generative component, and extending the agent's memory to support reflection and reaction capabilities.
Tencent Advertising Technology
Official hub of Tencent Advertising Technology, sharing the team's latest cutting-edge achievements and advertising technology applications.