How MARS Uses Risk‑Aware Multi‑Agent RL to Master Portfolio Management
This article reviews MARS, a risk‑aware multi‑agent reinforcement‑learning framework for automated portfolio management. MARS tackles market non‑stationarity and integrates risk control directly into the decision process; the review covers its hierarchical architecture, formal MDP formulation, training procedure, and strong experimental results on DJIA and HSI benchmarks.
Background – Deep reinforcement learning (DRL) has shown promise for automated portfolio management, yet existing methods struggle with two core challenges: (1) non‑stationary market dynamics that violate the Markov decision process (MDP) assumption, causing models to fail when market regimes shift; and (2) insufficient risk handling, where risk is only penalized after the fact rather than being integrated into the decision process.
Problem Definition – The paper aims to solve (a) the inability of traditional DRL models to adapt to changing market conditions and (b) the lack of proactive risk management, which makes agents vulnerable to tail‑risk events.
Method
MARS addresses these issues with a two‑layer architecture consisting of a Heterogeneous Agent Ensemble (HAE) and a Meta‑Adaptive Controller (MAC).
Overall Architecture
The input is a market state vector s_t (cash balance, holdings, technical indicators). The HAE generates diverse action proposals a_t^i from multiple agents, each endowed with a distinct risk preference defined by a safety‑critic network and a risk‑tolerance threshold. The MAC receives the same market state and outputs dynamic weights w_t that coordinate the agents. The final action A_t is a weighted sum of the proposals, which is then passed through a risk‑coverage layer (position‑concentration limits, cash buffers, short‑selling bans) to produce the executable action A_t'.
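The aggregation and risk‑coverage step can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation: the function names are mine, and clipping each component to [0, max_position] is a simplification of the full position‑limit, cash‑buffer, and short‑selling rules.

```python
def aggregate_actions(proposals, weights):
    """Combine per-agent proposals a_t^i into A_t = sum_i w_t^i * a_t^i (per asset)."""
    n_assets = len(proposals[0])
    return [sum(w * a[j] for w, a in zip(weights, proposals))
            for j in range(n_assets)]

def risk_coverage(action, max_position=0.20):
    """Illustrative risk-coverage layer: enforce the 20% position-concentration
    limit and a short-selling ban by clipping each component to [0, max_position]."""
    return [min(max(a, 0.0), max_position) for a in action]

# Two agents propose allocation changes for three assets; the MAC weights them.
proposals = [[0.3, -0.1, 0.5], [0.1, 0.2, -0.3]]
weights = [0.6, 0.4]
A_t = aggregate_actions(proposals, weights)   # weighted sum per asset
A_exec = risk_coverage(A_t)                   # executable action A_t'
```

Note that the first component of A_t (0.22) gets capped at the 0.20 concentration limit, while the others pass through unchanged.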
Problem Formalization
The portfolio management task is modeled as an MDP M=(S,A,P,R,γ) where:
State space S : s_t includes cash b_t, holdings h_t^i, and technical indicators (e.g., MACD, RSI).
Action space A : A_t lies in [-1,1], representing normalized portfolio‑allocation changes.
Reward R_t
The reward combines return, transaction cost C_t, and a risk penalty ρ_t that incorporates 30‑day volatility σ_{30d} and maximum drawdown DD_{30d}. The objective is to maximize the expected discounted cumulative reward.
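A minimal sketch of this reward shape, using the penalty weights reported in the experiment setup (w_vol = 0.5, w_dd = 2.0); the paper's exact functional form may differ, and the additive combination here is an assumption.

```python
def max_drawdown(portfolio_values):
    """Largest peak-to-trough decline over the window, as a negative fraction."""
    peak, mdd = portfolio_values[0], 0.0
    for v in portfolio_values:
        peak = max(peak, v)
        mdd = min(mdd, (v - peak) / peak)
    return mdd

def reward(ret, cost, vol_30d, dd_30d, w_vol=0.5, w_dd=2.0):
    """R_t = return - transaction cost C_t - risk penalty rho_t, where rho_t
    blends 30-day volatility and the magnitude of the 30-day max drawdown."""
    rho = w_vol * vol_30d + w_dd * abs(dd_30d)
    return ret - cost - rho

# A 1% gain is wiped out by the penalty on a 5% drawdown: r ~= -0.101.
r = reward(ret=0.010, cost=0.001, vol_30d=0.02, dd_30d=-0.05)
```

The heavy drawdown weight (2.0 vs. 0.5) reflects the framework's emphasis on tail‑risk avoidance over mere volatility smoothing.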
HAE Details
Each of the N agents A_i follows an extended DDPG architecture with three networks:
Actor π_{φ_i}(s_t) : a 256‑128‑64 MLP with ReLU, outputting the deterministic action a_t^i. The policy gradient includes a Conditional Safety Penalty (CSP) that activates when the predicted risk C_{ξ_i} exceeds the agent's threshold θ_i.
Critic Q_{ψ_i}(s_t,a_t) : a 256‑128‑64 MLP minimizing TD error.
Safety‑Critic C_{ξ_i}(s_t,a_t) : a 256‑128‑64 MLP predicting external risk based on an environment risk function C_{env}, trained with mean‑squared error.
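The CSP gating can be expressed as a hinge term: zero until the safety‑critic's predicted risk crosses the agent's threshold, then growing linearly. The scale factor beta and the linear spread of thresholds below are hypothetical illustrations, not values from the paper.

```python
def csp_penalty(predicted_risk, threshold, beta=1.0):
    """Conditional Safety Penalty: contributes to the actor loss only when the
    safety-critic's predicted risk C_xi(s, a) exceeds the agent's
    risk-tolerance threshold theta_i.  beta is a hypothetical scale factor."""
    return beta * max(0.0, predicted_risk - threshold)

# Hypothetical spread of thresholds theta_i across a 10-agent ensemble,
# from very conservative (0.05) to very aggressive (0.50).
thresholds = [0.05 + 0.05 * i for i in range(10)]

# For the same predicted risk, conservative agents are penalized,
# aggressive ones are not -- which is what makes the ensemble heterogeneous.
penalties = [csp_penalty(0.30, theta) for theta in thresholds]
```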
The HAE also aggregates portfolio‑concentration (HHI), leverage, and simulated volatility to provide comprehensive risk signals.
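The concentration signal mentioned here is the Herfindahl‑Hirschman Index over portfolio weights, which is a one‑liner:

```python
def hhi(portfolio_weights):
    """Herfindahl-Hirschman Index: sum of squared portfolio weights.
    Equals 1/N for an equal-weight N-asset portfolio and 1.0 when the
    entire portfolio sits in a single asset."""
    return sum(w * w for w in portfolio_weights)
```

An equal‑weight four‑asset portfolio scores 0.25; piling everything into one asset scores 1.0, so higher HHI flags concentration risk for the ensemble.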
MAC Details
The MAC acts as a high‑level coordinator: it takes s_t and outputs the agent weights w_t. It is trained to maximize a risk‑adjusted utility that combines a Sharpe‑ratio‑like return term with the risk penalty, using the weighted averages of the agents' Q‑values (\bar{Q}_t) and risk estimates (\bar{C}_t) together with a meta‑parameter λ_{meta}.
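Under the simplifying assumption that the return term is already folded into the Q‑values, the MAC's objective reduces to a weighted trade‑off; this is a sketch of that reduced form, using the λ_{meta} = 0.5 reported in the experiment setup, not the paper's full utility.

```python
def mac_utility(weights, q_values, risk_estimates, lam_meta=0.5):
    """Risk-adjusted utility the MAC maximizes: Q_bar_t - lambda_meta * C_bar_t,
    where Q_bar_t and C_bar_t are the weight-averaged critic and safety-critic
    outputs across the agent ensemble."""
    q_bar = sum(w * q for w, q in zip(weights, q_values))
    c_bar = sum(w * c for w, c in zip(weights, risk_estimates))
    return q_bar - lam_meta * c_bar

# Shifting weight toward the low-risk agent raises utility when lam_meta
# makes its lower Q worth the risk reduction.
u_even = mac_utility([0.5, 0.5], q_values=[1.0, 2.0], risk_estimates=[0.2, 0.4])
```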
Trading Flow
Construct market state s_t.
HAE generates action suggestions a_t^i; MAC generates weights w_t.
Aggregate actions into A_t.
Apply risk‑coverage adjustments (position limits, cash buffer, short‑selling ban) to obtain executable action A_t'.
Execute A_t' and update to state s_{t+1}.
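The five steps above can be tied together in a single decision function. The agents, MAC, and risk layer below are toy stand‑ins (constant policies, uniform weights, clip‑only coverage) purely to show the data flow, not the trained components.

```python
def trading_step(state, agents, mac, risk_layer):
    """One decision step of the MARS trading flow: propose, weight,
    aggregate, then apply risk coverage."""
    proposals = [agent(state) for agent in agents]   # HAE: a_t^i
    weights = mac(state)                             # MAC: w_t
    n_assets = len(proposals[0])
    A_t = [sum(w * a[j] for w, a in zip(weights, proposals))
           for j in range(n_assets)]                 # aggregate into A_t
    return risk_layer(A_t)                           # executable A_t'

# Toy stand-ins: two constant-policy agents, a uniform MAC, a clip-only risk layer.
agents = [lambda s: [0.4, -0.2], lambda s: [0.0, 0.6]]
mac = lambda s: [0.5, 0.5]
risk_layer = lambda a: [min(max(x, 0.0), 0.20) for x in a]
A_exec = trading_step(state=None, agents=agents, mac=mac, risk_layer=risk_layer)
```

In a live loop this function would run once per rebalancing period, with the executed A_t' feeding the environment transition to s_{t+1}.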
Experiments
Setup – Datasets: a 50‑stock universe drawn from constituents of the Dow Jones Industrial Average (DJIA) and the Hang Seng Index (HSI), covering 2022 (bear market) and 2024 (bull market). Evaluation metrics: cumulative return (CR), annualized return (AR), annualized volatility (AVol), maximum drawdown (MDD), Sharpe ratio (SR). Baselines: market index (buy‑and‑hold), DeepTrader (risk‑aware DRL), HRPM (hierarchical RL), AlphaStock (attention‑based policy). Ablation variants: MARS‑Static (fixed weights), MARS‑Homogeneous (identical agents), MARS‑Divergence (5/15 agents). Implementation details: HAE with 10 agents, each network 256‑128‑64, risk‑penalty weights w_{vol}=0.5, w_{dd}=2.0, λ_{meta}=0.5, position‑concentration limit 20%.
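Of these metrics, the Sharpe ratio is the least self‑explanatory; a standard annualized computation is shown below, assuming daily returns and 252 trading days per year (the paper's exact convention may differ).

```python
import math

def sharpe_ratio(daily_returns, risk_free_daily=0.0, periods_per_year=252):
    """Annualized Sharpe ratio from a series of daily returns:
    mean excess return over sample standard deviation, scaled by sqrt(252)."""
    n = len(daily_returns)
    mean = sum(daily_returns) / n
    var = sum((r - mean) ** 2 for r in daily_returns) / (n - 1)  # sample variance
    return (mean - risk_free_daily) / math.sqrt(var) * math.sqrt(periods_per_year)
```

A negative SR (as in the 2022 bear‑market results) means the strategy lost money on average; the comparison then favors whichever method is *least* negative.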
Results
DJI 2022 (bear): MARS CR = ‑0.86% (baseline worst = ‑36.37%), MDD = ‑16.77% (baseline worst = ‑46.17%), SR = ‑0.05 (baseline best = ‑1.03).
DJI 2024 (bull): MARS CR = 29.50% (baseline best = 22.13%), SR = 2.84 (baseline best = 1.41), MDD = ‑5.39% (baseline worst = ‑10.24%).
HSI 2022 (bear): MARS CR = ‑14.50% (baseline worst = ‑26.69%), volatility = 22.56% (baseline worst = 39.32%), MDD = ‑32.72% (baseline worst = ‑54.60%).
HSI 2024 (bull): MARS SR = 1.49 (baseline best = 1.10), AR = 17.84% (outperforming DRL baselines).
Ablation Studies
MAC effectiveness : MARS‑Static drops CR from 29.50% to 17.10% and SR from 2.84 to 1.71, confirming the importance of dynamic coordination.
HAE heterogeneity : MARS‑Homogeneous yields CR = 22.21% (vs. 29.50% for full MARS) and MDD = ‑7.81% (vs. ‑5.39%), showing diverse risk preferences improve performance.
Ensemble size : MARS‑Div5 (5 agents) CR = 12.02%, MARS‑Div15 (15 agents) CR = 19.70%; both lower than the 10‑agent setting (29.50%), indicating 10 agents balance diversity and coordination.
Adaptive Strategy Analysis
During the 2022 bear market, the MAC rapidly shifts weight toward conservative agents (daily weight volatility of roughly 70%) to avoid risk. In the 2024 bull market, weight adjustments become smoother, and the negative correlation between conservative and aggressive agents strengthens from ‑0.788 to ‑0.968, demonstrating the MAC's ability to adapt to market regimes.