Dual‑Phase RL‑LLM Framework DARA for Few‑Shot Online Advertising Budget Allocation

The DARA framework splits online advertising budget allocation into a few‑shot LLM reasoning stage and a fine‑grained optimizer stage, both fine‑tuned with an RL algorithm (GRPO‑Adaptive) whose reference model is updated periodically. In both real and simulated environments it achieves markedly lower marginal ROI variance than traditional baselines.

Alimama Tech

Abstract

Online advertising requires maximizing cumulative exposure value under a fixed budget, but advertisers often have only limited historical data, making traditional reinforcement‑learning (RL) methods ineffective in few‑shot scenarios. Large language models (LLMs) can generalize from few examples but lack numerical precision. DARA (Dual‑phase Adaptive Reasoning and Allocation) decomposes budget allocation into two stages: a few‑shot reasoner that generates a high‑level plan from limited history, and a fine‑grained optimizer that iteratively adjusts the plan using real‑time feedback. An RL‑fine‑tuning strategy called GRPO‑Adaptive periodically updates a reference model to improve LLM inference accuracy. Experiments on real and simulated environments show that DARA markedly reduces the variance of marginal ROI, outperforming traditional baselines.

Introduction

In real‑time bidding (RTB), advertisers must allocate a fixed budget across multiple time slots while the marginal return diminishes with increased spend. Hand‑crafted heuristics and single‑model RL both struggle to generalize in dynamic settings. Recent LLMs excel at few‑shot reasoning via contextual prompts but are numerically insensitive, which limits precise budget optimization.

Budget planning naturally comprises two sub‑problems: (1) extracting patterns from limited historical data to produce a high‑level plan, and (2) refining that plan with real‑time feedback. A single model cannot simultaneously satisfy both, motivating a split‑task design.

DARA adopts a dual‑stage architecture. Early decisions rely on few‑shot generalization, while later adjustments require sensitivity to feedback. To address LLM numerical insensitivity, a dynamic KL‑regularized RL fine‑tuning algorithm (GRPO‑Adaptive) periodically refreshes the reference model, preventing degradation. Both a real‑data environment and a controllable simulation environment are constructed for diverse training scenarios.

Preliminaries

Budget Allocation Model

Given a total budget B divided across T time slots, let b_i be the spend in slot i and r_i the marginal ROI of that slot at spend b_i. Overall ROI is the total return across all slots divided by B. Assuming each slot's return is a differentiable, strictly increasing, and concave function of b_i (that is, it exhibits diminishing marginal returns), the optimality condition equalizes marginal ROI across slots: otherwise, moving spend from a low‑marginal‑ROI slot to a high one would raise total return. Consequently, the objective can be approximated as minimizing the variance of marginal ROI across slots.
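The equal‑marginal‑ROI condition can be sketched with a Lagrangian argument. The per‑slot return function V_i and the multiplier \lambda are symbols introduced here for illustration only; the derivation assumes each V_i is differentiable, increasing, and concave, and that the optimum is interior (every b_i > 0).

```latex
\begin{aligned}
&\max_{b_1,\dots,b_T \ge 0} \ \sum_{i=1}^{T} V_i(b_i)
\quad \text{s.t.} \quad \sum_{i=1}^{T} b_i = B, \\[4pt]
&\mathcal{L}(b,\lambda) = \sum_{i=1}^{T} V_i(b_i) - \lambda\Big(\sum_{i=1}^{T} b_i - B\Big), \\[4pt]
&\frac{\partial \mathcal{L}}{\partial b_i} = 0
\ \Longrightarrow\
r_i := V_i'(b_i) = \lambda \quad \text{for all } i, \\[4pt]
&\operatorname{Var}(r) = \frac{1}{T}\sum_{i=1}^{T} (r_i - \bar r)^2 = 0
\quad \text{at the optimum}, \qquad
\bar r = \frac{1}{T}\sum_{i=1}^{T} r_i .
\end{aligned}
```

Under these assumptions the variance of marginal ROI is zero exactly when all slots share the same marginal ROI, which is why it serves as a proxy objective.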

Related Work

Traditional budget allocation methods include hand‑crafted rules and RL‑based models, which often ignore inter‑slot interactions or adapt slowly. Hierarchical RL (HiBid) separates high‑level allocation from low‑level bidding but lacks online adaptation. LSTM‑based ABPlanner suffers from data scarcity and dynamic environments. Recent LLMs demonstrate strong few‑shot learning via prompts, yet a single LLM lacks numerical sensitivity for complex decision tasks. These gaps motivate a dual‑stage architecture with RL fine‑tuning.

Method

Environment Modeling and Problem Definition

Two training environments are built: (1) a real‑data environment constructed from enterprise‑level advertising logs that mimics actual market cost and ROI dynamics, and (2) a simulation environment that generates budget‑return curves using controllable polynomial or exponential functions. In both environments, the task is formalized as: given a few‑shot historical dataset {(b_i, r_i)}_{i=1}^N, generate a new budget vector b' under the total budget constraint B such that the variance of marginal ROI is minimized.
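A minimal sketch of what a simulated environment of this kind might look like, assuming a saturating exponential return curve per slot. The curve form, the parameter values, and the function names are illustrative assumptions, not taken from the original work.

```python
import math
import statistics


def make_exponential_curve(scale, rate):
    """Hypothetical return curve V(b) = scale * (1 - exp(-rate * b))."""
    def value(b):
        return scale * (1.0 - math.exp(-rate * b))

    def marginal(b):
        # Derivative of V: positive and decreasing in b (diminishing returns).
        return scale * rate * math.exp(-rate * b)

    return value, marginal


def marginal_roi_variance(budgets, marginals):
    """Population variance of marginal ROI across slots for one allocation."""
    rois = [m(b) for b, m in zip(budgets, marginals)]
    return statistics.pvariance(rois)


# Three slots with made-up curve parameters (scale, rate).
curves = [make_exponential_curve(s, k)
          for s, k in [(100.0, 0.02), (80.0, 0.05), (120.0, 0.01)]]
marginals = [m for _, m in curves]

B = 300.0
uniform = [B / 3] * 3  # naive equal split, used as a reference point
print(marginal_roi_variance(uniform, marginals))
```

A candidate budget vector b' can then be scored by calling `marginal_roi_variance(b_prime, marginals)`; lower values indicate a more balanced allocation.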

Figure 1
Figure 2

Few‑Shot Prompting

A structured prompt template concatenates the task description, few‑shot examples, historical attempt records, and the desired output format. The LLM receives this prompt and produces a budget vector together with an explanatory rationale, enhancing interpretability.
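The prompt assembly described above could be sketched as follows. The section labels, the JSON reply format, and the helper names `build_prompt` and `parse_reply` are assumptions made for illustration; the original template is not reproduced in this article.

```python
import json


def build_prompt(task, examples, attempts, total_budget):
    """Concatenate task description, few-shot examples, past attempts, and output format."""
    lines = ["Task:", task, "", "Few-shot examples:"]
    for budgets, rois in examples:
        lines.append(f"budgets={budgets} marginal_roi={rois}")
    lines += ["", "Previous attempts:"]
    if attempts:
        for budgets, variance in attempts:
            lines.append(f"budgets={budgets} roi_variance={variance:.4f}")
    else:
        lines.append("(none)")
    lines += [
        "",
        "Output format:",
        'Reply with JSON: {"budgets": [...], "rationale": "..."}',
        f"The budgets must be non-negative and sum to {total_budget}.",
    ]
    return "\n".join(lines)


def parse_reply(reply, total_budget, tol=1e-6):
    """Parse a JSON reply and enforce the total budget constraint."""
    data = json.loads(reply)
    budgets = [float(b) for b in data["budgets"]]
    if any(b < 0 for b in budgets) or abs(sum(budgets) - total_budget) > tol:
        raise ValueError("budget vector violates the total budget constraint")
    return budgets, data.get("rationale", "")
```

Validating the reply outside the model is one way to compensate for numerical slips: a vector that does not sum to B is rejected rather than silently used.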

Prompt Template

Dual‑Stage Collaborative Agents

DARA splits the task into two stages:

Few‑shot Reasoner: Generates the initial budget plan for the first day based on limited historical records, focusing on global trends.

Fine‑grained Optimizer: From the second day onward, adjusts the budget locally using marginal ROI feedback. It maintains a sliding window of recent decisions and feedback to dynamically update its strategy.

Algorithm 1 (not reproduced) summarizes the workflow: the reasoner first proposes a plan, then the optimizer iteratively refines it each cycle until a termination condition is met. This decouples generalization from precise optimization, allowing different LLMs to specialize.
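Since Algorithm 1 is not reproduced, the control flow can only be sketched under assumptions. In the sketch below both agents are replaced by simple rule‑based stubs (`reasoner_stub`, `optimizer_stub`) standing in for the LLM calls, and the termination condition (variance below a tolerance or a day limit) is a guess at what such a condition might be.

```python
from collections import deque
import statistics


def reasoner_stub(history, total_budget):
    """Stand-in for the few-shot reasoner: split budget in proportion to mean observed ROI."""
    n = len(history[0][1])
    avg = [sum(rois[i] for _, rois in history) / len(history) for i in range(n)]
    total = sum(avg)
    return [total_budget * a / total for a in avg]


def optimizer_stub(window, base_step=0.2):
    """Stand-in for the optimizer: shift spend from the lowest- to the highest-ROI slot.

    The step is halved for every variance increase seen inside the sliding window.
    """
    variances = [statistics.pvariance(r) for _, r in window]
    worsened = sum(1 for a, b in zip(variances, variances[1:]) if b > a)
    step = base_step * (0.5 ** worsened)
    budgets, rois = window[-1]
    new = list(budgets)
    lo = min(range(len(rois)), key=rois.__getitem__)
    hi = max(range(len(rois)), key=rois.__getitem__)
    shift = step * new[lo]
    new[lo] -= shift
    new[hi] += shift
    return new


def run_dual_stage(history, total_budget, feedback_fn,
                   max_days=20, tol=1e-4, window_size=3):
    """Day 1: reasoner proposes a plan. Later days: optimizer refines it from feedback."""
    window = deque(maxlen=window_size)
    budgets = reasoner_stub(history, total_budget)
    rois = feedback_fn(budgets)
    window.append((budgets, rois))
    for _ in range(max_days - 1):
        if statistics.pvariance(rois) < tol:
            break
        budgets = optimizer_stub(window)
        rois = feedback_fn(budgets)
        window.append((budgets, rois))
    return budgets, statistics.pvariance(rois)
```

Here `feedback_fn` is a hypothetical callable that maps a budget vector to the observed marginal ROI per slot. The point of the sketch is the split of responsibilities: the reasoner is called once, while the optimizer only ever sees a bounded window of recent decisions and feedback.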

Dual‑Stage Architecture

RL Fine‑Tuning Strategy: GRPO‑Adaptive

Standard GRPO (Group Relative Policy Optimization) stabilizes policy updates via group‑wise advantage estimation and KL regularization against a reference policy. Experiments reveal that keeping the reference policy static throughout training causes the model’s numerical reasoning to degrade: as the current policy diverges, the KL term increasingly pulls the policy back toward an outdated baseline, harming performance on multi‑step numerical tasks like budget allocation.

GRPO‑Adaptive addresses this by periodically snapshotting the current policy and replacing the static reference model, then resetting the KL baseline. This dynamic reference prevents the “pull‑back” effect and balances reward maximization with policy stability.

Dynamic Reference Update Procedure

Snapshot current policy parameters (deep copy).

Replace the static reference model with this snapshot, creating a dynamic reference.

Reset the KL divergence constraint to compute from the new reference point.
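The three steps above can be sketched with a toy policy object. `ToyPolicy` and `maybe_refresh_reference` are hypothetical names, the parameter update is a placeholder for a real gradient step, and the interval of 60 steps is only the example value mentioned in the sensitivity analysis.

```python
import copy


class ToyPolicy:
    """Hypothetical stand-in for an LLM policy; holds a flat parameter list."""

    def __init__(self, params):
        self.params = list(params)


def maybe_refresh_reference(policy, reference, step, update_interval):
    """Return a fresh deep-copied snapshot every `update_interval` steps, else the old reference."""
    if step > 0 and step % update_interval == 0:
        # The snapshot is independent of the live policy, so later updates do not leak into it.
        return copy.deepcopy(policy)
    return reference


policy = ToyPolicy([0.0, 0.0])
reference = copy.deepcopy(policy)  # initial (static) reference
for step in range(1, 181):
    policy.params = [p + 0.01 for p in policy.params]  # placeholder for a gradient update
    reference = maybe_refresh_reference(policy, reference, step, update_interval=60)
```

Because the KL term is always computed against whatever `reference` currently holds, swapping in the snapshot is what resets the constraint: immediately after a refresh the policy and reference coincide, so the divergence between them starts again from zero.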

Clipped Advantage‑Weighted Objective

GRPO‑Adaptive retains GRPO’s clipped advantage‑weighted loss. For each sampled trajectory, advantages are normalized within groups, enabling the model to identify superior outputs without a global critic. PPO‑style clipping prevents overly large single‑step updates.
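A minimal sketch of the two ingredients named here, group‑normalized advantages and a PPO‑style clipped surrogate, written in plain Python for readability. It omits the KL penalty and works on per‑sample log‑probabilities rather than per‑token ones; the function names and the clip value 0.2 are assumptions, not values from the original work.

```python
import math
import statistics


def group_normalized_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(logp_new, logp_old, advantages, clip_eps=0.2):
    """Mean clipped surrogate over one group (a quantity to maximize)."""
    total = 0.0
    for ln, lo, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(ln - lo)  # probability ratio new / old
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Taking the minimum removes any incentive to push the ratio outside the clip range.
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)


# Example: rewards could be the negative marginal ROI variance of each sampled plan (an assumption).
rewards = [-0.30, -0.10, -0.20, -0.05]
advantages = group_normalized_advantages(rewards)
```

Since advantages are computed relative to the group mean, a plan is rewarded only for beating its siblings sampled from the same prompt, which is what removes the need for a separate critic.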

Training Flow Design

To ensure generalization across diverse scenarios, a multi‑environment rotation sampling mechanism is introduced: after a fixed number of steps, training switches to a new simulated environment with a fresh marginal ROI curve, preventing over‑fitting to a single distribution. The full training pipeline is illustrated in Algorithm 2 (shown as a diagram).
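The rotation mechanism might be sketched as a generator that resamples the environment on a fixed schedule. The parameter ranges, the exponential curve parameterization, and the names `sample_environment` and `rotating_environments` are illustrative assumptions.

```python
import random


def sample_environment(rng, num_slots):
    """Draw hypothetical (scale, rate) curve parameters for each slot."""
    return [(rng.uniform(50.0, 150.0), rng.uniform(0.01, 0.05))
            for _ in range(num_slots)]


def rotating_environments(total_steps, steps_per_env, num_slots, seed=0):
    """Yield (step, env_id, env), switching to a fresh environment every `steps_per_env` steps."""
    rng = random.Random(seed)
    env, env_id = None, -1
    for step in range(total_steps):
        if step % steps_per_env == 0:
            env_id += 1
            env = sample_environment(rng, num_slots)
        yield step, env_id, env


for step, env_id, env in rotating_environments(total_steps=6, steps_per_env=3, num_slots=2):
    print(step, env_id, env)
```

Seeding the generator keeps the sequence of environments reproducible across runs, which makes it easier to compare training configurations on identical curves.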

Training Pipeline

Experiments

The primary metric is marginal ROI variance (lower is better, indicating more balanced allocation). Results are reported on both a real advertising platform dataset and the simulated environment.

On real data, DARA outperforms baselines DPO, HiBid, Q‑MCKP, and ABPlanner at every step; the fine‑grained optimizer further reduces variance in later stages.

Single‑stage LLMs (with or without RL fine‑tuning) exhibit higher variance; the dual‑stage architecture yields substantial gains even without RL.

Adding RL fine‑tuning to the dual‑stage model lowers variance further, showing that RL helps the reasoner produce more strategic initial plans.

Sensitivity analysis shows DARA is robust to the number of time slots (2–10) with 10.6%–12.2% revenue improvement. The frequency of reference model updates significantly impacts performance; a moderate update interval (e.g., every 60 steps) achieves the lowest variance, while too frequent or too sparse updates degrade results.

Experimental Results

Conclusion and Future Work

DARA demonstrates that decomposing budget allocation into a few‑shot reasoning stage and a fine‑grained optimization stage, coupled with a dynamically updated RL fine‑tuning algorithm, significantly improves numerical reasoning and stability in online advertising. Key contributions include the dual‑agent architecture, the GRPO‑Adaptive dynamic reference update, and the construction of both real and simulated environments for robust training.

Future directions involve extending the framework to multi‑platform budget coordination, incorporating attention mechanisms for temporal feature capture, testing transferability across LLM scales, and reducing training costs while maintaining deployment efficiency.
