
How Multi-Agent Reinforcement Learning Boosts Ad Computation Allocation

This article presents MaRCA, a multi‑agent reinforcement‑learning framework that allocates computation resources across the full ad‑serving chain, modeling user value, compute cost, and action rewards to maximize ad revenue while keeping system load stable under fluctuating traffic.

JD Cloud Developers

Background

As full‑chain optimization of search‑wide ad delivery matures, the marginal benefit of model‑driven growth diminishes. Hundreds of billions of user requests per day must be processed within sub‑hundred‑millisecond latency budgets, making compute a scarce resource. Traffic volume and value vary across time slots, media platforms, and user groups, and most requests generate no revenue, so compute must be allocated at fine granularity toward high‑quality traffic.

Problem Modeling

Definition: At time t, the system state is s_t, and each module m has a load constraint C_m. The goal is to choose an action combination a_t that maximizes the reward R(s_t, a_t) while keeping the compute consumption C(s_t, a_t) within the load constraints.

State space (S) : user features, traffic features, IDC information, etc.

Action space (A) : three categories of full‑chain compute actions—link‑selection, switch, and queue decisions.

Action value : expected ad consumption for a given state‑action pair.

Compute consumption : compute cost of the action.

Action reward: R(s,a) = Q(s,a) − λ·C(s,a), where λ is a Lagrange multiplier (the compute‑weight factor).

The optimization reduces to a constrained linear program:

<code>max        Σ_i Σ_a  x_{i,a}·Q(s_i,a)
subject to Σ_i Σ_a  x_{i,a}·C(s_i,a) ≤ C_m
           Σ_a x_{i,a} ≤ 1 for each request i,   x_{i,a} ∈ {0,1}</code>

Solving via the Lagrangian dual yields the optimal per‑request action a_t* = arg max_a ( Q(s_t,a) − λ_t·C(s_t,a) ).
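The dual‑based arg‑max above can be sketched in a few lines. This is a minimal illustration, not the production implementation; the per‑action estimates `q_values` and `costs` and the function name `select_action` are hypothetical placeholders for the system's real estimators.

```python
def select_action(q_values, costs, lam):
    """Pick a* = argmax_a (Q(s,a) - lam * C(s,a)) over legal actions.

    q_values: expected ad consumption Q(s, a) per candidate action
    costs:    estimated compute cost  C(s, a) per candidate action
    lam:      the compute-weight factor (Lagrange multiplier)
    """
    best_a, best_r = None, float("-inf")
    for a, (q, c) in enumerate(zip(q_values, costs)):
        r = q - lam * c  # lambda-priced reward for action a
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r

# Example: three candidate action combinations for one request.
# The highest-Q action is not chosen because its compute cost,
# priced by lambda, outweighs its extra value.
a_star, reward = select_action([5.0, 8.0, 12.0], [1.0, 3.0, 10.0], lam=1.0)
```

Raising λ shifts selections toward cheaper actions; lowering it spends more compute per request, which is exactly the knob the load‑aware feedback loop turns.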

System Architecture

User‑Value Estimation Module

Predicts per‑request ad revenue and assigns value buckets. Uses a Deep Crossing Network (DCN) with Poisson loss to handle long‑tail, sparse data and performs value bucketing based on cumulative spend.
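To make the two ingredients above concrete, here is a minimal sketch of a Poisson loss for long‑tail revenue targets and of bucketing by cumulative spend, so that high‑value requests occupy their own buckets even though they are few. Both function names are illustrative assumptions, not the article's actual code.

```python
import math

def poisson_nll(y_true, y_pred):
    """Mean Poisson negative log-likelihood, up to the constant log(y!).
    Suited to sparse, long-tail, count-like revenue targets."""
    return sum(p - y * math.log(p) for y, p in zip(y_true, y_pred)) / len(y_true)

def value_buckets(values, n_buckets=4):
    """Assign bucket ids so each bucket covers roughly an equal share of
    cumulative spend (not an equal share of requests)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total = sum(values)
    buckets, acc, b = [0] * len(values), 0.0, 0
    for i in order:           # walk requests from highest to lowest value
        acc += values[i]
        buckets[i] = b
        # advance to the next bucket once this one holds its spend share
        if acc >= (b + 1) * total / n_buckets and b < n_buckets - 1:
            b += 1
    return buckets
```

With spend‑balanced buckets, one high‑revenue request can fill a whole bucket while thousands of zero‑revenue requests share another, matching the long‑tail traffic profile described above.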

Compute‑Estimation Module

Predicts compute consumption for each action combination. Because real compute labels are scarce, the problem is split into two sub‑tasks:

Predict action results at request granularity using a DCN+MMoE model.

Estimate compute cost given the predicted results, using queue‑type measurements, switch‑type measurements, and link‑selection aggregation. Monotonic polynomial regression maps queue length to compute cost.
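The monotonic polynomial regression in step 2 can be obtained by constraining the coefficients to be non‑negative: every term a·qᵏ with a ≥ 0 is non‑decreasing for q ≥ 0, so their sum is too. A sketch under that assumption, using non‑negative least squares (the specific fitting procedure is an assumption, not confirmed by the article):

```python
import numpy as np
from scipy.optimize import nnls

def fit_monotone_poly(queue_len, cpu_cost, degree=3):
    """Fit cost ~ sum_k w_k * q^k with w_k >= 0.  Non-negative coefficients
    guarantee the fitted curve is monotone non-decreasing for q >= 0,
    so a longer queue can never predict a lower compute cost."""
    X = np.vander(np.asarray(queue_len, dtype=float), degree + 1, increasing=True)
    w, _residual = nnls(X, np.asarray(cpu_cost, dtype=float))
    return w

def predict_cost(w, q):
    """Evaluate the fitted polynomial at queue length q."""
    return sum(c * q ** k for k, c in enumerate(w))

# Example: measurements that follow cost = 1 + q^2.
w = fit_monotone_poly([0, 1, 2, 3, 4], [1, 2, 5, 10, 17], degree=3)
```

An unconstrained `polyfit` could oscillate and predict costs that drop as queues grow; the non‑negativity constraint rules that out by construction.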

Action‑Value Estimation Module

Models the collaborative relationship between recall and ranking agents. Uses a Multi‑Agent Reinforcement Learning (MARL) approach with two components:

Adaptive Weighted Ensemble DRQN : DRQN (GRU‑based) handles partial observability; multiple Q‑heads are adaptively weighted based on prediction error.

Mixing Network : Aggregates individual agent Q‑values into a global Q using a Softplus‑based network, ensuring monotonicity.
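The monotonicity property of the mixing network can be illustrated with a single mixing layer: passing the raw weights through Softplus keeps them strictly positive, so the global Q is increasing in every agent's Q‑value. This is a simplified one‑layer sketch (the real mixing network is a learned multi‑layer model; `mix_q_values` and its arguments are illustrative names):

```python
import numpy as np

def softplus(x):
    """Softplus(x) = log(1 + e^x), always > 0."""
    return np.log1p(np.exp(x))

def mix_q_values(agent_qs, w_raw, b):
    """One mixing layer: Q_tot = softplus(w_raw) . agent_qs + b.
    Because every mixed weight is positive, dQ_tot/dQ_i > 0: improving
    any single agent's Q can only raise the global Q (the monotonicity
    condition that makes per-agent greedy choices consistent globally)."""
    w = softplus(np.asarray(w_raw, dtype=float))  # positive mixing weights
    return float(w @ np.asarray(agent_qs, dtype=float) + b)

# If the recall agent raises its Q-value, Q_tot increases as well,
# even though one raw weight is negative before Softplus.
q_low  = mix_q_values([1.0, 2.0], [0.5, -0.3], b=0.1)
q_high = mix_q_values([1.5, 2.0], [0.5, -0.3], b=0.1)
```

This is the same structural idea used in QMIX‑style value factorization: monotone mixing lets each agent act greedily on its own Q while remaining consistent with the global objective.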

Load‑Aware Decision Module

Given state s and the set of legal actions A, this module selects the action a* that maximizes reward while keeping each module's CPU load near its target C_m. System pressure is computed as the sum of the elastic‑degradation factor (converted to equivalent CPU load) and the estimated compute cost Ĉ(s,a). A feedback loop adjusts the compute‑weight factor λ using a learning rate α and an exponent k ≥ 1 that adapts the step size to the load deviation.
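The article does not give the exact update rule, so the following is a hedged sketch of one plausible form: scale λ multiplicatively by a step that grows as the k‑th power of the relative load deviation, so small deviations barely move λ while sustained overload reacts quickly. All names and the specific formula are assumptions for illustration.

```python
def update_lambda(lam, load, target, alpha=0.05, k=2):
    """Feedback update of the compute-weight factor lambda.

    load / target: current and target CPU load of a module (e.g. percent)
    alpha:         learning rate
    k >= 1:        exponent shaping the step size vs. load deviation
    """
    dev = (load - target) / target          # relative load deviation
    step = alpha * (abs(dev) ** k)          # super-linear in |deviation|
    # Overload -> raise lambda (price compute higher, pick cheaper actions);
    # underload -> lower lambda (spend spare compute on more value).
    lam_new = lam * (1 + step) if dev > 0 else lam * (1 - step)
    return max(lam_new, 0.0)
```

With k = 2, a 20% overload nudges λ by only α·0.04 per step, while a 100% spike moves it by the full α, which matches the stated goal of stability under normal traffic and fast reaction to surges.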

Experimental Results

MaRCA achieved a 14.93% increase in ad consumption across diverse traffic scenarios without additional system resources. System reliability and intelligence improved markedly, easing pressure during traffic spikes and large‑scale promotions.

Future Outlook

Load‑Aware Decision Optimization: Incorporate Model Predictive Control (MPC) to predict future states and proactively adjust the compute‑weight factor λ under sudden traffic or budget changes.

Action‑Space Expansion : Introduce new decision variables such as model selection, filtering strategies, and execution paths to further refine compute scheduling, acknowledging the increased complexity.

Broader Adoption : The solution is applicable to other recommendation pipelines with tight compute constraints, offering significant revenue potential.

Tags: AI, Load Balancing, Reinforcement Learning, Multi-Agent, Ad Optimization, Computation Allocation
Written by

JD Cloud Developers

JD Cloud Developers (the developer platform of JD Technology) is a JD Technology Group channel for technical sharing and communication among AI, cloud‑computing, IoT, and related developers. It publishes JD product technical information, industry content, and tech‑event news.
