
How Multi-Agent Reinforcement Learning Boosts Ad Computation Allocation

This article presents MaRCA, a multi‑agent reinforcement‑learning framework that allocates computation resources across the full ad‑serving chain, modeling user value, compute cost, and action rewards to maximize ad revenue while keeping system load stable under fluctuating traffic.

JD Cloud Developers

Background

As full‑chain optimization of search‑wide ad delivery matures, the marginal benefit of model‑driven growth diminishes. Hundreds of billions of user requests per day must be processed within sub‑hundred‑millisecond latency budgets, making compute a scarce resource. Traffic volume and value vary across time slots, media platforms, and user groups, and most requests generate no revenue, so compute must be allocated at fine granularity toward high‑quality traffic.

Problem Modeling

Definition: At time t, the system state is s_t, and each module m has a load constraint C_m. The goal is to choose an action combination a_t that maximizes the reward R(s_t, a_t) while keeping the compute consumption C(s_t, a_t) within the load constraints.

State space (S) : user features, traffic features, IDC information, etc.

Action space (A) : three categories of full‑chain compute actions—link‑selection, switch, and queue decisions.

Action value : expected ad consumption for a given state‑action pair.

Compute consumption : compute cost of the action.

Action reward: R(s,a) = Q(s,a) − λ·C(s,a), where λ is a Lagrange multiplier (the compute‑weight factor).

The optimization reduces to a constrained linear program:

<code>max        Σ_i Σ_a  x_{i,a}·Q(s_i,a)
subject to Σ_i Σ_a  x_{i,a}·C(s_i,a) ≤ C_m
           Σ_a x_{i,a} ≤ 1 for each request i,   x_{i,a} ∈ {0,1}</code>

Solving via the Lagrangian dual yields the optimal per‑request action a_t* = arg max_a ( Q(s_t,a) − λ_t·C(s_t,a) ).
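The dual‑based arg‑max above can be sketched in a few lines. This is a minimal illustration, not the production implementation; the per‑action estimates `q_values` and `costs` and the function name `select_action` are hypothetical placeholders for the system's real estimators.

```python
def select_action(q_values, costs, lam):
    """Pick a* = argmax_a (Q(s,a) - lam * C(s,a)) over legal actions.

    q_values: expected ad consumption Q(s, a) per candidate action
    costs:    estimated compute cost  C(s, a) per candidate action
    lam:      the compute-weight factor (Lagrange multiplier)
    """
    best_a, best_r = None, float("-inf")
    for a, (q, c) in enumerate(zip(q_values, costs)):
        r = q - lam * c  # lambda-priced reward for action a
        if r > best_r:
            best_a, best_r = a, r
    return best_a, best_r

# Example: three candidate action combinations for one request.
# The highest-Q action is not chosen because its compute cost,
# priced by lambda, outweighs its extra value.
a_star, reward = select_action([5.0, 8.0, 12.0], [1.0, 3.0, 10.0], lam=1.0)
```

Raising λ shifts selections toward cheaper actions; lowering it spends more compute per request, which is exactly the knob the load‑aware feedback loop turns.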

System Architecture

User‑Value Estimation Module

Predicts per‑request ad revenue and assigns value buckets. Uses a Deep Crossing Network (DCN) with Poisson loss to handle long‑tail, sparse data and performs value bucketing based on cumulative spend.
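To make the two ingredients above concrete, here is a minimal sketch of a Poisson loss for long‑tail revenue targets and of bucketing by cumulative spend, so that high‑value requests occupy their own buckets even though they are few. Both function names are illustrative assumptions, not the article's actual code.

```python
import math

def poisson_nll(y_true, y_pred):
    """Mean Poisson negative log-likelihood, up to the constant log(y!).
    Suited to sparse, long-tail, count-like revenue targets."""
    return sum(p - y * math.log(p) for y, p in zip(y_true, y_pred)) / len(y_true)

def value_buckets(values, n_buckets=4):
    """Assign bucket ids so each bucket covers roughly an equal share of
    cumulative spend (not an equal share of requests)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total = sum(values)
    buckets, acc, b = [0] * len(values), 0.0, 0
    for i in order:           # walk requests from highest to lowest value
        acc += values[i]
        buckets[i] = b
        # advance to the next bucket once this one holds its spend share
        if acc >= (b + 1) * total / n_buckets and b < n_buckets - 1:
            b += 1
    return buckets
```

With spend‑balanced buckets, one high‑revenue request can fill a whole bucket while thousands of zero‑revenue requests share another, matching the long‑tail traffic profile described above.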

Compute‑Estimation Module

Predicts compute consumption for each action combination. Because real compute labels are scarce, the problem is split into two sub‑tasks:

Predict action results at request granularity using a DCN+MMoE model.

Estimate compute cost given the predicted results, using queue‑type measurements, switch‑type measurements, and link‑selection aggregation. Monotonic polynomial regression maps queue length to compute cost.
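The monotonic polynomial regression in step 2 can be obtained by constraining the coefficients to be non‑negative: every term a·qᵏ with a ≥ 0 is non‑decreasing for q ≥ 0, so their sum is too. A sketch under that assumption, using non‑negative least squares (the specific fitting procedure is an assumption, not confirmed by the article):

```python
import numpy as np
from scipy.optimize import nnls

def fit_monotone_poly(queue_len, cpu_cost, degree=3):
    """Fit cost ~ sum_k w_k * q^k with w_k >= 0.  Non-negative coefficients
    guarantee the fitted curve is monotone non-decreasing for q >= 0,
    so a longer queue can never predict a lower compute cost."""
    X = np.vander(np.asarray(queue_len, dtype=float), degree + 1, increasing=True)
    w, _residual = nnls(X, np.asarray(cpu_cost, dtype=float))
    return w

def predict_cost(w, q):
    """Evaluate the fitted polynomial at queue length q."""
    return sum(c * q ** k for k, c in enumerate(w))

# Example: measurements that follow cost = 1 + q^2.
w = fit_monotone_poly([0, 1, 2, 3, 4], [1, 2, 5, 10, 17], degree=3)
```

An unconstrained `polyfit` could oscillate and predict costs that drop as queues grow; the non‑negativity constraint rules that out by construction.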

Action‑Value Estimation Module

Models the collaborative relationship between recall and ranking agents. Uses a Multi‑Agent Reinforcement Learning (MARL) approach with two components:

Adaptive Weighted Ensemble DRQN : DRQN (GRU‑based) handles partial observability; multiple Q‑heads are adaptively weighted based on prediction error.

Mixing Network : Aggregates individual agent Q‑values into a global Q using a Softplus‑based network, ensuring monotonicity.
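The monotonicity property of the mixing network can be illustrated with a single mixing layer: passing the raw weights through Softplus keeps them strictly positive, so the global Q is increasing in every agent's Q‑value. This is a simplified one‑layer sketch (the real mixing network is a learned multi‑layer model; `mix_q_values` and its arguments are illustrative names):

```python
import numpy as np

def softplus(x):
    """Softplus(x) = log(1 + e^x), always > 0."""
    return np.log1p(np.exp(x))

def mix_q_values(agent_qs, w_raw, b):
    """One mixing layer: Q_tot = softplus(w_raw) . agent_qs + b.
    Because every mixed weight is positive, dQ_tot/dQ_i > 0: improving
    any single agent's Q can only raise the global Q (the monotonicity
    condition that makes per-agent greedy choices consistent globally)."""
    w = softplus(np.asarray(w_raw, dtype=float))  # positive mixing weights
    return float(w @ np.asarray(agent_qs, dtype=float) + b)

# If the recall agent raises its Q-value, Q_tot increases as well,
# even though one raw weight is negative before Softplus.
q_low  = mix_q_values([1.0, 2.0], [0.5, -0.3], b=0.1)
q_high = mix_q_values([1.5, 2.0], [0.5, -0.3], b=0.1)
```

This is the same structural idea used in QMIX‑style value factorization: monotone mixing lets each agent act greedily on its own Q while remaining consistent with the global objective.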

Load‑Aware Decision Module

Given state s and the set of legal actions A, this module selects the action a* that maximizes reward while keeping each module's CPU load near its target C_m. System pressure is computed as the sum of the elastic‑degradation factor (converted to equivalent CPU load) and the estimated compute cost Ĉ(s,a). A feedback loop adjusts the compute‑weight factor λ using a learning rate α and an exponent k ≥ 1 that adapts the step size to the load deviation.
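The article does not give the exact update rule, so the following is a hedged sketch of one plausible form: scale λ multiplicatively by a step that grows as the k‑th power of the relative load deviation, so small deviations barely move λ while sustained overload reacts quickly. All names and the specific formula are assumptions for illustration.

```python
def update_lambda(lam, load, target, alpha=0.05, k=2):
    """Feedback update of the compute-weight factor lambda.

    load / target: current and target CPU load of a module (e.g. percent)
    alpha:         learning rate
    k >= 1:        exponent shaping the step size vs. load deviation
    """
    dev = (load - target) / target          # relative load deviation
    step = alpha * (abs(dev) ** k)          # super-linear in |deviation|
    # Overload -> raise lambda (price compute higher, pick cheaper actions);
    # underload -> lower lambda (spend spare compute on more value).
    lam_new = lam * (1 + step) if dev > 0 else lam * (1 - step)
    return max(lam_new, 0.0)
```

With k = 2, a 20% overload nudges λ by only α·0.04 per step, while a 100% spike moves it by the full α, which matches the stated goal of stability under normal traffic and fast reaction to surges.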

Experimental Results

MaRCA achieved a 14.93% increase in ad consumption across diverse traffic scenarios without additional system resources. System reliability and intelligence improved markedly, easing pressure during traffic spikes and large‑scale promotions.

Future Outlook

Load‑Aware Decision Optimization: Incorporate Model Predictive Control (MPC) to predict future states and proactively adjust the compute‑weight factor λ under sudden traffic or budget changes.

Action‑Space Expansion : Introduce new decision variables such as model selection, filtering strategies, and execution paths to further refine compute scheduling, acknowledging the increased complexity.

Broader Adoption : The solution is applicable to other recommendation pipelines with tight compute constraints, offering significant revenue potential.

Tags: AI, Load Balancing, Reinforcement Learning, Multi-Agent, Ad Optimization, Computation Allocation
Written by

JD Cloud Developers

JD Cloud Developers (the developer platform of JD Technology) is a JD Technology Group channel for technical sharing and communication among AI, cloud‑computing, IoT, and related developers. It publishes JD product technical information, industry content, and tech‑event news.
