How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading

This article reviews MM‑DREX, a framework that addresses the non‑stationarity of financial markets by modeling trading as a POMDP and using a vision‑language‑model‑driven dynamic router to weight four heterogeneous experts. It reports higher returns and Sharpe ratios and tighter drawdown control than 15 strong baselines across stocks, futures, and crypto.

Bighead's Algorithm Notes

Background

Financial markets exhibit strong non‑stationarity and multimodal information (price series, technical charts, textual analysis), challenging traditional quantitative models that rely on fixed structures and single‑modal data. Deep reinforcement learning improves temporal feature learning but still uses static policies and ignores visual patterns. Existing large language model (LLM) solutions add multimodal understanding but remain limited to question‑answer style signals and suffer from static expert designs.

Problem Definition

The trading task is formalized as a partially observable Markov decision process (POMDP) defined by the six‑tuple (S, A, T, R, Ω, O):

State space S : hidden market state sₜ at time step t.

Action space A : trading sub‑strategies (not raw buy/sell orders).

Transition kernel T : probability T(sₜ₊₁ | sₜ, aₜ) of moving to the next state.

Reward R : immediate profit‑risk utility R(sₜ, aₜ).

Observation space Ω : partial observations ωₜ that only partially reveal sₜ.

Observation kernel O : likelihood O(ωₜ | sₜ).

The objective is to learn a policy π(aₜ | ωₜ) that maximizes the expected discounted cumulative reward E[Σₖ γᵏ R(sₖ, aₖ)], where γ is the discount factor.
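As a minimal sketch of the quantity inside that objective, the snippet below computes the discounted cumulative reward for one finite trajectory. The function name `discounted_return` and the default value of `gamma` are assumptions made for illustration; the article does not state which discount factor MM‑DREX uses.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_k gamma**k * rewards[k] for one finite trajectory."""
    total = 0.0
    # Walk backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical per-step rewards, chosen only to make the arithmetic easy to follow.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```

The policy is trained to maximize the expectation of this value over trajectories, which is why later rewards count for less as γ shrinks.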

Method

3.1 Multimodal Observation Encoding

Each observation ωₜ is represented as a three‑modal tuple:

Vₜ: visual modality (candlestick charts, technical indicator plots).

Tₜ: temporal modality (OHLCV series, derived indicators).

Lₜ: textual modality (market trend summaries, analysis).
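One way to picture the three‑modal tuple is as a small record type. The sketch below is an assumption about how such an observation could be held in memory, not the authors' data format; the class name `Observation`, its field names, and the `is_consistent` check are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple

# One bar of the temporal modality: (open, high, low, close, volume).
Bar = Tuple[float, float, float, float, float]


@dataclass(frozen=True)
class Observation:
    chart_png: bytes  # V_t: rendered candlestick / indicator chart
    bars: List[Bar]   # T_t: OHLCV series (derived indicators omitted here)
    summary: str      # L_t: textual trend summary or analysis

    def is_consistent(self) -> bool:
        """Check low <= min(open, close), max(open, close) <= high, volume >= 0."""
        return all(
            low <= min(open_, close) and max(open_, close) <= high and volume >= 0
            for open_, high, low, close, volume in self.bars
        )


# Placeholder values for illustration only.
obs = Observation(
    chart_png=b"",
    bars=[(10.0, 11.0, 9.5, 10.5, 1200.0)],
    summary="Price holds above the 20-day moving average.",
)
print(obs.is_consistent())  # True
```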

3.2 Dynamic Router

A vision‑language large model (VLLM) receives ωₜ and simultaneously performs image understanding and causal reasoning. It detects chart patterns such as head‑and‑shoulders, double‑bottoms, and moving‑average crossovers, then combines these cues with temporal features to produce a weight vector for the four experts.

Figure: Dynamic router weight computation
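The exact weight formula is not recoverable from the text here, so the following is only a sketch under a common assumption: the router emits one raw score per expert and a softmax turns those scores into a weight vector. The names `EXPERTS` and `softmax_weights` and the `temperature` parameter are hypothetical.

```python
import math
from typing import Dict

EXPERTS = ("trend", "reversal", "breakout", "position")


def softmax_weights(scores: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    """Map one raw score per expert to non-negative weights that sum to 1."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores[name] for name in EXPERTS)
    exps = {name: math.exp((scores[name] - top) / temperature) for name in EXPERTS}
    norm = sum(exps.values())
    return {name: value / norm for name, value in exps.items()}


# Hypothetical router scores for a market that looks strongly trending.
weights = softmax_weights({"trend": 2.0, "reversal": 0.0, "breakout": 0.5, "position": 0.0})
print(max(weights, key=weights.get))  # trend
```

A lower temperature concentrates weight on the top‑scoring expert, while a higher one moves the allocation toward uniform.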

3.3 Heterogeneous Expert Layer

Four independent experts share the same multimodal observation but have separate parameters to preserve strategy diversity:

Trend expert : captures sustained up/down trends; action space A_trend = {moving‑average crossover, momentum follow, turtle breakout}.

Reversal expert : identifies over‑bought/over‑sold reversal points; A_revert = {Bollinger‑band reversal, RSI swing, KDJ oscillation}.

Breakout expert : detects price breakouts from quiet periods; A_break = {volume breakout, ATR range breakout}.

Position expert : sets long‑term baseline positions (e.g., 90‑day horizon); A_passive = {long‑only, short‑only, cash}.
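To make the idea of a sub‑strategy action concrete, here is a minimal sketch of one entry in the trend expert's action space, the moving‑average crossover. The window lengths (5 and 20) and the function names are assumptions for illustration; the article does not give the parameters MM‑DREX uses.

```python
from typing import Sequence


def simple_moving_average(values: Sequence[float], window: int) -> float:
    """Average of the last `window` values."""
    return sum(values[-window:]) / window


def ma_crossover_signal(closes: Sequence[float], fast: int = 5, slow: int = 20) -> int:
    """Return +1 (long) if the fast SMA is above the slow SMA, -1 if below, else 0."""
    if not 0 < fast < slow:
        raise ValueError("require 0 < fast < slow")
    if len(closes) < slow:
        return 0  # not enough history to form the slow average
    fast_ma = simple_moving_average(closes, fast)
    slow_ma = simple_moving_average(closes, slow)
    return (fast_ma > slow_ma) - (fast_ma < slow_ma)


# Synthetic closing prices: a steadily rising series should give a long signal.
rising = [float(p) for p in range(1, 31)]
print(ma_crossover_signal(rising))  # 1
```

In this framing an expert picks which rule to run, and the rule, not the expert, produces the actual long or short signal.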

3.4 Joint SFT‑RL Optimization

The training follows a two‑stage supervised‑fine‑tuning (SFT) plus reinforcement‑learning (RL) paradigm.

Stage 1 (SFT) : fine‑tune the VLLM with supervision to classify market conditions (up, down, sideways), initializing the LoRA adapters with financial trend knowledge.

Stage 2 (RL) : initialize the router with SFT parameters and train end‑to‑end. The loss combines a clipped advantage term L^CLIP, a value‑function loss L^VF, and an entropy regularizer S, while also incorporating expert‑specific risk‑adjusted rewards (excess return R_extra and maximum drawdown penalty P_drawdown).

Figure: Joint loss formulation
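The three named terms match the standard PPO objective, so the sketch below implements that standard form as a stand‑in; it is not the article's exact joint loss, which is not reproduced in the text. The coefficients `clip_eps`, `vf_coef`, and `ent_coef` are assumed defaults. How R_extra and P_drawdown are combined into the reward is not specified here; in this sketch they would enter only through the `advantages` and `returns` inputs.

```python
import math
from typing import Sequence


def ppo_loss(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    returns: Sequence[float],
    entropies: Sequence[float],
    clip_eps: float = 0.2,
    vf_coef: float = 0.5,
    ent_coef: float = 0.01,
) -> float:
    """Standard PPO loss to minimize: -L_CLIP + vf_coef * L_VF - ent_coef * S."""
    n = len(advantages)
    clip_sum = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        ratio = math.exp(lp_new - lp_old)  # pi_new(a|o) / pi_old(a|o)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        clip_sum += min(ratio * adv, clipped * adv)  # pessimistic surrogate
    l_clip = clip_sum / n
    l_vf = sum((v - g) ** 2 for v, g in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return -l_clip + vf_coef * l_vf - ent_coef * entropy


# One sample whose probability ratio (2.0) exceeds the clip range, so it is clipped to 1.2.
loss = ppo_loss([math.log(2.0)], [0.0], [1.0], [0.0], [0.0], [0.0])
print(round(loss, 6))  # -1.2
```

Clipping caps how much a single update can move the routing policy away from the one that collected the data, which matters when starting from SFT‑initialized parameters.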

Experiments

4.1 Dataset Construction

A cross‑market multimodal dataset covering US stocks, A‑shares, ETFs, cryptocurrencies, and futures (62 assets, 2017‑2025) is built. It includes:

Temporal modality : OHLCV, MA, MACD, RSI, etc.

Visual modality : candlestick and indicator charts.

Textual modality : technical analysis summaries.

The dataset contains 22,638 images and 127,474 time points with 13 feature dimensions, drawn from 10 exchanges and 5 asset classes. The article reports that this scale exceeds the datasets used by prior SOTA work such as FinAgent, FinMem, and PIXIU.

Figure: Dataset statistics

4.2 Evaluation Metrics & Baselines

Metrics: total return (TR), Sharpe ratio (SR), and maximum drawdown (MDD). Baselines include rule‑based strategies (buy‑and‑hold (B&H), MACD, KDJ‑RSI), machine‑learning models (LGBM, LSTM, Transformer), reinforcement‑learning agents (SAC, PPO, DQN), and LLM‑based methods (FinAgent, FinMem).
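For readers less familiar with the three metrics, the sketch below shows their usual definitions. The annualization factor of 252 trading periods and the zero risk‑free rate are assumptions; the article does not state which conventions the evaluation follows.

```python
import math
from typing import Sequence


def total_return(equity: Sequence[float]) -> float:
    """Fractional gain from the first to the last point of an equity curve."""
    return equity[-1] / equity[0] - 1.0


def sharpe_ratio(returns: Sequence[float], periods_per_year: int = 252,
                 risk_free: float = 0.0) -> float:
    """Annualized mean excess return divided by its sample standard deviation.

    Needs at least two returns with non-zero variance.
    """
    excess = [r - risk_free / periods_per_year for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)


def max_drawdown(equity: Sequence[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the running peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Synthetic equity curve for illustration only.
equity = [100.0, 120.0, 90.0, 110.0]
print(round(total_return(equity), 4))  # 0.1
print(round(max_drawdown(equity), 4))  # 0.25
```

Note that the example ends 10 % up yet still records a 25 % drawdown, which is why TR and MDD are reported side by side.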

4.3 Main Results

Across market conditions, MM‑DREX outperforms all baselines:

US bull market : TR = 47.5 % (FinAgent = 39.34 %), SR = 1.83 (PPO = 1.73).

Policy‑sensitive A‑shares : TR = 24.36 % (11.9 % higher than FinAgent), SR = 2.15.

Noisy futures : TR = 27.31 % (CR = 32.3 %, PPO = 47.4 %).

Extreme events : Across six black‑swan events (COVID‑19, rate‑hike cycles, etc.), MM‑DREX limits losses; in Q1 2020, for example, it loses only 4.2 % versus a 21.16 % loss for the S&P 500.

Figure: Performance comparison

4.4 Ablation Studies

Dynamic routing effectiveness : the full dynamic router achieves TR = 25.75 % (MDD = 14.76 %), compared with uniform weighting (TR = 15.94 %, MDD = 21.45 %), single‑expert selection (TR = 20.11 %, MDD = 19.82 %), and random routing (TR = 9.3 %, MDD = 31.55 %).

Figure: Routing ablation
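The three ablation schemes differ only in how the expert weight vector is produced. The sketch below shows one plausible reading of each; the article does not describe how they were implemented, so the function names, the argmax rule for single‑expert selection, and the `blend` helper are assumptions.

```python
import random
from typing import List, Sequence


def uniform_weights(n: int) -> List[float]:
    """Equal weight on every expert."""
    return [1.0 / n] * n


def single_expert_weights(scores: Sequence[float]) -> List[float]:
    """All weight on the highest-scoring expert (one-hot)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def random_weights(n: int, rng: random.Random) -> List[float]:
    """Random positive weights normalized to sum to 1."""
    raw = [rng.random() + 1e-9 for _ in range(n)]  # offset avoids an all-zero draw
    total = sum(raw)
    return [x / total for x in raw]


def blend(weights: Sequence[float], positions: Sequence[float]) -> float:
    """Weighted sum of the experts' target positions (in [-1, 1])."""
    return sum(w * p for w, p in zip(weights, positions))


# Hypothetical expert positions and router scores, in the order
# trend, reversal, breakout, position.
positions = [1.0, -1.0, 0.0, 0.5]
scores = [2.0, 0.1, -0.5, 0.3]
print(blend(uniform_weights(4), positions))             # 0.125
print(blend(single_expert_weights(scores), positions))  # 1.0
```

Uniform weighting lets opposing experts cancel each other out, while one‑hot selection gives up diversification entirely; dynamic routing sits between the two.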

Multimodal contribution : Removing the visual modality drops TR from 25.75 % to 20.11 % and SR from 1.63 to 1.21, while MDD rises from 14.76 % to 19.37 %. Using only visual + textual modalities yields TR = 18.37 % and MDD = 22.48 %.

Figure: Modality ablation

Extreme risk control : In all six black‑swan events, MM‑DREX’s maximum drawdown stays below that of the S&P 500, which the article attributes to diversified experts and dynamic hedging.

Figure: Risk control results

Conclusion

MM‑DREX demonstrates that decoupling market‑state perception from policy execution via a VLLM‑driven dynamic router and heterogeneous experts enables adaptive sequential decision‑making in highly non‑stationary financial environments, achieving consistent gains and robust risk management across diverse assets.
