How MM‑DREX Uses Multimodal LLMs for Dynamic Expert Routing in Financial Trading

This article reviews MM‑DREX, a framework that tackles the non‑stationarity of financial markets by modeling trading as a POMDP and using a dynamic router driven by a vision‑language model to weight four heterogeneous experts. Compared with 15 strong baselines, it reports higher returns, higher Sharpe ratios, and better drawdown control across stocks, futures, and crypto.

Bighead's Algorithm Notes

Sep 14, 2025

Background

Financial markets are strongly non‑stationary and produce multimodal information (price series, technical charts, textual analysis). Both properties challenge traditional quantitative models, which rely on fixed structures and single‑modal data. Deep reinforcement learning improves temporal feature learning but still uses static policies and ignores visual patterns. Existing large language model (LLM) solutions add multimodal understanding but remain limited to question‑answer style signals and suffer from static expert designs.

Problem Definition

The trading task is formalized as a partially observable Markov decision process (POMDP) defined by the six‑tuple (S, A, T, R, Ω, O):

State space S : hidden market state sₜ at time step t.

Action space A : trading sub‑strategies (not raw buy/sell orders).

Transition kernel T : probability T(sₜ₊₁ | sₜ, aₜ) of moving to the next state.

Reward R : immediate profit‑risk utility R(sₜ, aₜ).

Observation space Ω : partial observations ωₜ that only partially reveal sₜ.

Observation kernel O : likelihood O(ωₜ | sₜ).

The objective is to learn a policy π(aₜ | ωₜ) that maximizes the expected discounted cumulative reward E[Σₖ γᵏ R(sₖ, aₖ)], where γ is the discount factor.
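As a minimal sketch of the quantity being maximized, the discounted cumulative reward of a single trajectory can be computed as below. The function name and the default γ = 0.99 are illustrative assumptions, not values from the article.

```python
from typing import Sequence


def discounted_return(rewards: Sequence[float], gamma: float = 0.99) -> float:
    """Return sum_k gamma**k * rewards[k] for one trajectory (k starts at 0).

    gamma = 0.99 is an assumed default; the article does not state a value.
    """
    total = 0.0
    # Walk backwards so each step needs only one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

The policy's objective is the expectation of this value over trajectories, which in practice is estimated by averaging over sampled episodes.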

Method

3.1 Multimodal Observation Encoding

Each observation ωₜ is represented as a three‑modal tuple:

Vₜ: visual modality (candlestick charts, technical indicator plots).

Tₜ: temporal modality (OHLCV series, derived indicators).

Lₜ: textual modality (market trend summaries, analysis).
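One way to picture the three‑modal tuple is as a small record type. The sketch below is an assumption about how such an observation could be held in memory; the field names and the idea of storing the chart as a file path are hypothetical and do not come from the article.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Observation:
    """Hypothetical container for one observation (V_t, T_t, L_t)."""

    chart_path: str           # V_t: path to a rendered candlestick/indicator image
    ohlcv: List[List[float]]  # T_t: rows of [open, high, low, close, volume]
    summary: str              # L_t: textual trend summary or analysis

    def closes(self) -> List[float]:
        """Extract the close column from the temporal modality."""
        return [row[3] for row in self.ohlcv]


obs = Observation(
    chart_path="chart.png",
    ohlcv=[[1.0, 2.0, 0.5, 1.5, 100.0], [1.5, 2.5, 1.0, 2.0, 120.0]],
    summary="short-term uptrend",
)
```

Keeping the three modalities in one immutable record makes it easy to hand the same observation to the router and to every expert.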

3.2 Dynamic Router

A vision‑language large model (VLLM) receives ωₜ and simultaneously performs image understanding and causal reasoning. It detects chart patterns such as head‑and‑shoulders, double‑bottoms, and moving‑average crossovers, then combines these cues with temporal features to produce a weight vector for the four experts.
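The article says only that the router outputs a weight vector over the four experts. The sketch below assumes, for illustration, that the router emits one raw score per expert, that the scores are normalized with a softmax, and that each expert proposes a target position in [-1, 1] which is blended by the weights. All three choices are assumptions, not details from the source.

```python
import math
from typing import Dict


def softmax_weights(scores: Dict[str, float]) -> Dict[str, float]:
    """Turn raw per-expert scores into weights that sum to 1."""
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {name: math.exp(score - top) for name, score in scores.items()}
    norm = sum(exps.values())
    return {name: value / norm for name, value in exps.items()}


def blend_positions(weights: Dict[str, float], positions: Dict[str, float]) -> float:
    """Weighted average of the positions proposed by each expert."""
    return sum(weights[name] * positions[name] for name in weights)


# Hypothetical router scores for one time step.
scores = {"trend": 2.0, "reversal": 0.0, "breakout": 0.0, "position": 0.0}
weights = softmax_weights(scores)
```

With these scores the trend expert receives the largest weight, so the blended position leans toward whatever the trend expert proposes.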

3.3 Heterogeneous Expert Layer

Four independent experts share the same multimodal observation but have separate parameters to preserve strategy diversity:

Trend expert : captures sustained up/down trends; action space A_trend = {moving‑average crossover, momentum follow, turtle breakout}.

Reversal expert : identifies over‑bought/over‑sold reversal points; A_revert = {Bollinger‑band reversal, RSI swing, KDJ oscillation}.

Breakout expert : detects price breakouts from quiet periods; A_break = {volume breakout, ATR range breakout}.

Position expert : sets long‑term baseline positions (e.g., 90‑day horizon); A_passive = {long‑only, short‑only, cash}.
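To make one of these sub‑strategies concrete, here is a minimal sketch of a moving‑average rule of the kind the trend expert could select. It reports the current relationship between a fast and a slow simple moving average rather than the exact bar where they cross. The window lengths and the +1/-1/0 encoding are assumptions for illustration; the article does not give the experts' internal rules.

```python
from typing import Sequence


def sma(values: Sequence[float], window: int) -> float:
    """Simple moving average of the last `window` values."""
    return sum(values[-window:]) / window


def ma_trend_signal(closes: Sequence[float], fast: int = 3, slow: int = 5) -> int:
    """Return +1 (long) if the fast SMA is above the slow SMA, -1 (short) if
    below, and 0 if they are equal or there is not enough history."""
    if len(closes) < slow:
        return 0
    fast_ma, slow_ma = sma(closes, fast), sma(closes, slow)
    if fast_ma > slow_ma:
        return 1
    if fast_ma < slow_ma:
        return -1
    return 0
```

For example, on a steadily rising series the fast average sits above the slow one and the rule stays long, while a steadily falling series flips it short.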

3.4 Joint SFT‑RL Optimization

The training follows a two‑stage supervised‑fine‑tuning (SFT) plus reinforcement‑learning (RL) paradigm.

Stage 1 (SFT) : pre‑train the VLLM to classify market conditions (up, down, sideways) and initialize LoRA adapters with financial trend knowledge.

Stage 2 (RL) : initialize the router with SFT parameters and train end‑to‑end. The loss combines a clipped advantage term L^CLIP, a value‑function loss L^VF, and an entropy regularizer S, while also incorporating expert‑specific risk‑adjusted rewards (excess return R_extra and maximum drawdown penalty P_drawdown).
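The three loss terms named above match the standard PPO objective, so a plain‑Python sketch of that objective is given here for orientation. It assumes the usual PPO form (maximize L^CLIP - c1 · L^VF + c2 · S); the coefficient values and clipping range are common defaults, not values reported for MM‑DREX. The risk‑adjusted reward terms are left out because the article does not say how they are combined.

```python
import math
from typing import Sequence


def ppo_objective(
    new_logp: Sequence[float],
    old_logp: Sequence[float],
    advantages: Sequence[float],
    values: Sequence[float],
    returns: Sequence[float],
    entropies: Sequence[float],
    clip_eps: float = 0.2,  # assumed clipping range
    c1: float = 0.5,        # assumed value-loss coefficient
    c2: float = 0.01,       # assumed entropy coefficient
) -> float:
    """Batch-averaged PPO objective to maximize: L_clip - c1 * L_vf + c2 * S."""
    n = len(advantages)
    l_clip = 0.0
    l_vf = 0.0
    for lp, lp_old, adv, value, ret in zip(new_logp, old_logp, advantages, values, returns):
        ratio = math.exp(lp - lp_old)  # pi_new(a|o) / pi_old(a|o)
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic bound: take the smaller of the raw and clipped surrogate.
        l_clip += min(ratio * adv, clipped * adv)
        l_vf += (value - ret) ** 2
    entropy = sum(entropies) / n
    return l_clip / n - c1 * l_vf / n + c2 * entropy
```

The clipping keeps a single update from moving the router's policy too far from the one that collected the data, which matters when the router starts from SFT weights that already encode useful trend knowledge.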

Experiments

4.1 Dataset Construction

A cross‑market multimodal dataset covering US stocks, A‑shares, ETFs, cryptocurrencies, and futures (62 assets, 2017‑2025) is built. It includes:

Temporal modality : OHLCV, MA, MACD, RSI, etc.

Visual modality : candlestick and indicator charts.

Textual modality : technical analysis summaries.

The dataset contains 22,638 images, 10 exchanges, 5 asset classes, 127,474 time points, and 13 feature dimensions, which the article presents as exceeding the scale of the data used by prior work such as FinAgent, FinMem, and PIXIU.

4.2 Evaluation Metrics & Baselines

Metrics: total return (TR), Sharpe ratio (SR), maximum drawdown (MDD). Baselines include rule‑based strategies (buy‑and‑hold (B&H), MACD, KDJ‑RSI), machine‑learning models (LGBM, LSTM, Transformer), reinforcement‑learning agents (SAC, PPO, DQN), and LLM‑based methods (FinAgent, FinMem).
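For readers unfamiliar with the three metrics, the sketch below shows their textbook definitions on an equity curve and a series of per‑period returns. The annualization factor of 252 trading days and the zero risk‑free rate are common conventions assumed here; the article does not state which conventions its results use.

```python
import math
from typing import Sequence


def total_return(equity: Sequence[float]) -> float:
    """TR: relative change from the first to the last equity value."""
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity: Sequence[float]) -> float:
    """MDD: largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity[0]
    worst = 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def sharpe_ratio(returns: Sequence[float], risk_free: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """SR: annualized mean excess return over its sample standard deviation.

    Needs at least two returns that are not all identical.
    """
    excess = [r - risk_free for r in returns]
    mean = sum(excess) / len(excess)
    variance = sum((x - mean) ** 2 for x in excess) / (len(excess) - 1)
    return mean / math.sqrt(variance) * math.sqrt(periods_per_year)
```

For an equity curve of 100, 120, 90, 110 the total return is 10 % while the maximum drawdown is 25 % (the fall from 120 to 90), which illustrates why TR and MDD are reported together.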

4.3 Main Results

Across market conditions, MM‑DREX outperforms all baselines:

US bull market : TR = 47.5 % (FinAgent = 39.34 %), SR = 1.83 (PPO = 1.73).

Policy‑sensitive A‑shares : TR = 24.36 % (11.9 % higher than FinAgent), SR = 2.15.

Noisy futures : TR = 27.31 % (CR = 32.3 %, PPO = 47.4 %).

Extreme events : The evaluation covers six black‑swan events (COVID‑19, rate‑hike cycles, etc.). In Q1 2020, for example, MM‑DREX loses only 4.2 % versus a 21.16 % loss for the S&P 500.

4.4 Ablation Studies

Dynamic routing effectiveness : the full dynamic router reaches TR = 25.75 % (MDD = 14.76 %) vs. uniform weighting (TR = 15.94 %, MDD = 21.45 %), single‑expert selection (TR = 20.11 %, MDD = 19.82 %), and random routing (TR = 9.3 %, MDD = 31.55 %).
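The three ablated routing schemes can be pictured as different ways of filling the same four‑element weight vector. The sketch below is an illustration under assumptions: the article does not say how the single expert is chosen or how random weights are drawn, so picking the highest‑scoring expert and normalizing uniform random draws are hypothetical choices.

```python
import random
from typing import List, Sequence


def uniform_weights(n_experts: int) -> List[float]:
    """Every expert gets the same weight."""
    return [1.0 / n_experts] * n_experts


def single_expert_weights(scores: Sequence[float]) -> List[float]:
    """All weight on the highest-scoring expert (assumed selection rule)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1.0 if i == best else 0.0 for i in range(len(scores))]


def random_weights(n_experts: int, rng: random.Random) -> List[float]:
    """Random weights that sum to 1 (assumed: normalized uniform draws)."""
    draws = [rng.random() for _ in range(n_experts)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
ablation_weights = {
    "uniform": uniform_weights(4),
    "single": single_expert_weights([0.1, 0.7, 0.1, 0.1]),
    "random": random_weights(4, rng),
}
```

Framing the ablations this way makes it clear that only the weight vector changes between runs, while the experts themselves stay fixed.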

Multimodal contribution : Removing the visual modality drops TR from 25.75 % to 20.11 % and SR from 1.63 to 1.21, while MDD rises from 14.76 % to 19.37 %. Using only visual + textual modalities yields TR = 18.37 % and MDD = 22.48 %.

Extreme risk control : In all six black‑swan events, MM‑DREX’s maximum drawdown stays below that of the S&P 500, which the article attributes to diversified experts and dynamic hedging.

Conclusion

MM‑DREX demonstrates that decoupling market‑state perception from policy execution via a VLLM‑driven dynamic router and heterogeneous experts enables adaptive sequential decision‑making in highly non‑stationary financial environments, achieving consistent gains and robust risk management across diverse assets.
