How Heuristic‑Guided Inverse Reinforcement Learning Boosts Portfolio Optimization

The article presents a heuristic‑guided inverse reinforcement learning framework that generates expert strategies respecting industry diversification and correlation constraints, employs a multi‑objective reward to balance return and risk, and uses a heterogeneous graph attention network to model stock relationships. The approach achieves superior risk‑adjusted returns on the CSI‑300, CSI‑500, NASDAQ‑100 and S&P‑500 benchmarks.


Background

Portfolio optimization in dynamic financial markets faces three fundamental challenges: (1) balancing risk and return under uncertainty, (2) incorporating domain‑specific financial knowledge, and (3) modeling nonlinear asset dependencies. Modern portfolio theory expresses total portfolio variance as \(\sigma_{p}^{2}=\frac{1}{N}\sigma^{2}+\frac{N-1}{N}\rho\sigma^{2}\), where \(N\) is the number of assets, \(\sigma^{2}\) the average variance, and \(\rho\) the average correlation. Traditional reinforcement learning (RL) suffers from manual reward engineering, static assumptions, and an inability to capture industry diversification and micro‑structure relationships.
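
To make the diversification intuition concrete, here is a quick numerical illustration (values chosen arbitrarily, not taken from the paper). With \(\sigma^{2}=0.04\), \(\rho=0.3\) and \(N=50\):

\[
\sigma_{p}^{2}=\frac{0.04}{50}+\frac{49}{50}\cdot 0.3\cdot 0.04\approx 0.0008+0.0118=0.0126 .
\]

As \(N\to\infty\) the first term vanishes and \(\sigma_{p}^{2}\to\rho\sigma^{2}=0.012\): adding assets diversifies away idiosyncratic risk but leaves the correlation‑driven floor, which is why correlation management matters alongside diversification.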

Problem Definition

Existing methods either rely on static assumptions that are sensitive to estimation error, or on handcrafted rewards that cannot reflect market complexity. A framework is needed that (i) systematically integrates financial knowledge, (ii) adapts to non‑stationary market conditions, and (iii) jointly optimizes return and risk.

Method

3.1 Heuristic‑Guided Greedy Expert Policy

1. Rank all stocks by historical return in descending order.

2. Cap the number of selected stocks per industry at \(\alpha K\), where \(K\) is the total number of stocks to pick.

3. Exclude a candidate if its average correlation with the current selection exceeds a threshold \(\gamma\).

4. Output a binary action vector \(a_t\) indicating the selected stocks (a minimal sketch follows).
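
A minimal Python sketch of this greedy procedure, assuming NumPy inputs; the function and argument names are illustrative, and the paper's exact tie‑breaking and constraint handling may differ:

```python
import numpy as np

def greedy_expert_policy(returns, industries, corr, K, alpha, gamma):
    """Heuristic expert: pick K stocks by historical return, subject to an
    industry cap (at most alpha*K per industry) and an average-correlation
    threshold gamma. Signature and names are illustrative."""
    order = np.argsort(returns)[::-1]            # rank by return, descending
    cap = max(1, int(alpha * K))                 # per-industry limit
    selected, counts = [], {}
    for i in order:
        if len(selected) == K:
            break
        ind = industries[i]
        if counts.get(ind, 0) >= cap:
            continue                             # industry cap reached
        if selected and np.mean(corr[i, selected]) > gamma:
            continue                             # too correlated with current picks
        selected.append(i)
        counts[ind] = counts.get(ind, 0) + 1
    a_t = np.zeros(len(returns), dtype=int)      # binary action vector
    a_t[selected] = 1
    return a_t
```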

3.2 Multi‑Objective Reward Learning

Return maximization: log‑return of the portfolio (exact formula given in the paper).

Industry diversification: entropy of the industry weight distribution.

Correlation management: penalize positive correlations and reward negative correlations.

Adaptive reward: weighted sum of the above components, with the weights \(\lambda_i\) updated by Lagrangian dual dynamics (see the sketch below).
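
A compact sketch of how these components might combine, assuming NumPy inputs; the exact per‑term formulas are in the paper, and the Lagrangian dual update of \(\lambda_i\) is only indicated in a comment:

```python
import numpy as np

def multi_objective_reward(v_t, v_prev, industry_weights, stock_weights, corr, lam):
    """Weighted sum of return, diversification, and correlation terms.
    `lam` holds the adaptive weights lambda_i; all names are illustrative."""
    # 1) Return maximization: log-return of portfolio value (standard form;
    #    the paper's exact formula may differ)
    r_ret = np.log(v_t / v_prev)
    # 2) Industry diversification: entropy of the industry weight distribution
    p = industry_weights / industry_weights.sum()
    r_div = -np.sum(p * np.log(p + 1e-12))
    # 3) Correlation management: penalize positive and reward negative
    #    pairwise correlations among held stocks
    off_diag = corr - np.diag(np.diag(corr))
    r_corr = -stock_weights @ off_diag @ stock_weights
    # Adaptive weighting; lam would be updated by Lagrangian dual dynamics,
    # e.g. lam_i <- max(0, lam_i + eta * constraint_violation_i)
    return lam[0] * r_ret + lam[1] * r_div + lam[2] * r_corr
```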

3.3 Maximum‑Entropy Inverse Reinforcement Learning

The gap between expert and agent behavior is modeled with a negative log‑likelihood loss: the reward function \(R_{\theta}\) is learned from expert trajectories \(\tau_E\), and the gradient of the loss with respect to the reward parameters \(\theta\) is derived analytically and used to update \(\theta\), with gradient clipping applied for stability.
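
A simplified PyTorch sketch of the reward update. In generic maximum‑entropy IRL the loss gradient reduces to the difference between expected rewards under agent and expert samples; the paper derives its own analytic gradient, so treat this as an illustrative surrogate with hypothetical names:

```python
import torch

def irl_reward_step(reward_net, expert_batch, agent_batch, optimizer, clip=1.0):
    """One MaxEnt-IRL-style update: increase reward on expert (state, action)
    samples, decrease it on agent samples. Batching is illustrative."""
    r_expert = reward_net(expert_batch).mean()   # reward on expert trajectories tau_E
    r_agent = reward_net(agent_batch).mean()     # reward on agent trajectories
    loss = r_agent - r_expert                    # negative log-likelihood surrogate
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(reward_net.parameters(), clip)  # gradient clipping
    optimizer.step()
    return loss.item()
```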

3.4 Graph‑Based Policy Network

Graph construction: three heterogeneous graphs are built, namely an industry graph \(A_{ind}\) (binary adjacency linking same‑industry stocks) and positive‑ and negative‑correlation graphs \(A_{pos}\) and \(A_{neg}\) (obtained by discretizing monthly Pearson correlations).

Multi‑head graph attention encoding: heterogeneous graph attention (HGAT) produces embeddings \(H_{ind}\), \(H_{pos}\), \(H_{neg}\), one per graph.

Heterogeneous fusion attention: the three embeddings are concatenated with the original feature matrix and weighted by learnable coefficients \(\beta_k\) (softmax‑normalized), yielding a fused representation (see the sketch below).

Policy generation: the fused embedding is passed through a fully‑connected layer and flattened to produce normalized portfolio weights.
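
A compact PyTorch sketch of the fusion and policy steps, assuming the three embeddings \(H_{ind}\), \(H_{pos}\), \(H_{neg}\) have already been produced by graph attention layers (e.g., GAT layers from PyTorch Geometric). Whether the paper weights before or after concatenation is a reading choice here, and all module names are illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionPolicy(nn.Module):
    """Fuse three graph embeddings with the raw features, then map the
    fused representation to normalized portfolio weights (illustrative)."""
    def __init__(self, feat_dim, emb_dim):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(3))        # one coefficient per graph
        self.head = nn.Linear(feat_dim + emb_dim, 1)    # fully-connected policy layer

    def forward(self, x, h_ind, h_pos, h_neg):
        # x: (N, feat_dim) raw features; h_*: (N, emb_dim) graph embeddings
        w = F.softmax(self.beta, dim=0)                 # softmax-normalized beta_k
        h = w[0] * h_ind + w[1] * h_pos + w[2] * h_neg  # weighted graph fusion
        z = torch.cat([x, h], dim=-1)                   # concatenate with features
        logits = self.head(z).squeeze(-1)               # (N,) per-stock score
        return F.softmax(logits, dim=0)                 # normalized portfolio weights
```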

Experimental Setup

Real‑world data from the Chinese (CSI‑300, CSI‑500) and US (NASDAQ‑100, S&P‑500) markets covering Jan 2018 – Dec 2024 are used, with each dataset split into training (2018‑2022), validation (2023) and test (2024) periods. Features are daily OHLCV prices; each feature is normalized over a 20‑day rolling window, followed by intra‑cluster normalization based on k‑means clustering. The three relation graphs are rebuilt from monthly Pearson correlations.
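
A sketch of this preprocessing pipeline; the z‑score statistic, the number of clusters, and the per‑cluster standardization are assumptions, since the article only names the steps:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def rolling_zscore(df: pd.DataFrame, window: int = 20) -> pd.DataFrame:
    """20-day rolling z-score per feature (illustrative normalization)."""
    roll = df.rolling(window)
    return ((df - roll.mean()) / (roll.std() + 1e-8)).dropna()

def intra_cluster_normalize(features: np.ndarray, k: int = 8) -> np.ndarray:
    """Cluster stocks with k-means, then standardize each feature within its
    cluster; the cluster count k is an assumption."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(features)
    out = features.astype(float).copy()
    for c in range(k):
        idx = labels == c
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0) + 1e-8
        out[idx] = (features[idx] - mu) / sd
    return out
```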

Performance Analysis

Across all four datasets the proposed method achieves the highest risk‑adjusted returns. Notable results:

CSI‑300: ARR = 0.491, 31.9 % higher than the strongest baseline, iTransformer (ARR = 0.372).

NASDAQ‑100: ARR = 0.432, substantially above Transformer (ARR = 0.258).

CSI‑500: ARR = 0.710, the best result among all compared methods.

S&P‑500: ARR slightly lower than GPT4TS but with the lowest annualized volatility (AVol) and competitive maximum drawdown (MDD).

Ablation Study

Removing the reward network reduces ARR by 38.9 % and Sharpe ratio (SR) by 34.4 %.

Omitting industry constraints raises ARR by 2.7 % but worsens maximum drawdown (‑15.0 % vs. ‑10.1 %) and lowers the Calmar ratio.

Disabling correlation management decreases ARR and SR by 12.9 %.

Replacing HGAT with a plain MLP drops ARR by 19.1 %, confirming the importance of graph‑based attention.

Case Study

Portfolio industry allocation and stock‑correlation heatmaps are visualized for a bull market (Sept 2024) and a bear market (Oct 2024). The full model maintains balanced exposure to defensive and cyclical sectors in both regimes, while ablated variants over‑concentrate in dominant sectors. In the bull market the model sustains momentum exposure; in the bear market it shifts toward low‑beta assets, demonstrating dynamic risk‑adjusted decision making.

Paper: https://www.ijcai.org/proceedings/2025/1054.pdf

Code repository: https://github.com/ChloeWenyiZhang/SmartFolio

Tags: graph neural network, financial AI, portfolio optimization, inverse reinforcement learning, heuristic expert policy, multi‑objective reward, risk‑adjusted return
Written by Bighead's Algorithm Notes
Focused on AI applications in the fintech sector
