How MERA’s Retrieval‑Augmented MoE Boosts Stock Selection Performance by 11%
The article introduces MERA, a Retrieval‑Augmented Mixture‑of‑Experts module that addresses the inability of single‑branch deep‑learning models to capture diverse stock market patterns, describes its self‑supervised pretraining, gating and expert mechanisms, and shows that it improves stock‑selection metrics by up to 11% on major Chinese indices.
Stock market prediction is a critical research area, but the market’s dynamic and complex nature makes accurate forecasting difficult. Conventional deep‑learning approaches use a single‑branch model that tries to fit all samples, which fails to capture the diverse patterns that appear in real markets (e.g., pandemic‑driven sector swings).
Problem Definition
The goal is to learn a function f_{\theta} that predicts future stock returns from minute‑level intraday data x_{t}^{s}. Each minute feature vector includes price, volume, turnover, etc. The target label is the daily return, normalized by a z‑score to encode ranking information.
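To make the label construction concrete, here is a minimal Python sketch of the z-score normalization. It assumes the z-score is taken cross-sectionally, across all stocks on each day, which is what makes the label encode ranking information; the function and variable names are illustrative, not from the paper.

```python
import numpy as np
import pandas as pd

def make_labels(daily_returns: pd.DataFrame) -> pd.DataFrame:
    """Cross-sectionally z-score each day's returns across stocks.

    daily_returns: rows = trading days, columns = stock codes.
    After normalization, each label expresses how a stock ranked
    relative to its peers on that day (mean ~0, std ~1 per row).
    """
    mean = daily_returns.mean(axis=1)
    std = daily_returns.std(axis=1)
    return daily_returns.sub(mean, axis=0).div(std, axis=0)

# Example: 3 days x 4 stocks of raw daily returns.
rets = pd.DataFrame(np.random.randn(3, 4) * 0.02,
                    columns=["s1", "s2", "s3", "s4"])
labels = make_labels(rets)
```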
Method Overview
The proposed MERA (Mixture of Experts with Retrieval‑Augmented Representation) combines a sparse Mixture‑of‑Experts (SMoE) module with a retrieval‑augmented (RA) representation. SMoE consists of a GateNet and several independent experts; each expert focuses on a specific stock pattern, while GateNet routes data to the most suitable expert. Because explicit pattern identifiers are unavailable, RA first converts raw noisy data into compact high‑level embeddings via self‑supervised pretraining, then retrieves similar samples to provide label signals for routing.
Self‑Supervised Pretraining
A masked auto‑encoder built on a vanilla transformer encoder learns to reconstruct randomly masked portions of the input sequence. Given an input x and mask ratio m, the model minimizes the MSE between the reconstructed segment x_{recon} and the original masked segment x_{mask}, producing a compact representation for each stock sample.
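The following PyTorch sketch illustrates this pretraining objective. It assumes masking is applied at the time-step level, that a learned mask token replaces masked steps, and that mean-pooling over time yields the compact representation; module and parameter names are hypothetical, not the paper's.

```python
import torch
import torch.nn as nn

class MaskedPretrainer(nn.Module):
    """Minimal masked auto-encoder over minute-level features (sketch)."""
    def __init__(self, n_feats: int, d_model: int = 128, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Linear(n_feats, d_model)
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.decoder = nn.Linear(d_model, n_feats)

    def forward(self, x: torch.Tensor, mask_ratio: float = 0.3):
        # x: (batch, minutes, n_feats)
        h = self.embed(x)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio
        h[mask] = self.mask_token                 # replace masked time steps
        h = self.encoder(h)
        x_recon = self.decoder(h)
        loss = ((x_recon - x)[mask] ** 2).mean()  # MSE on masked part only
        return loss, h.mean(dim=1)                # loss + pooled embedding
```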
Retrieval‑Augmented Module
During training, a retrieval pool P stores the pre-trained embeddings together with their discretized label categories (B = 10 buckets). At inference, the query embedding e_{t}^{s} is extracted and the top-N (N = 10) most similar samples are retrieved, using MSE as the distance metric. An attention-based aggregation then computes weighted sums of the retrieved feature embeddings s_{i}^{f} and their label embeddings s_{i}^{l}, producing the aggregated feature r_{t}^{s} and label embedding l_{t}^{s}:

r_{t}^{s} = \sum_{i=1}^{N} \alpha_{i} s_{i}^{f}, \quad l_{t}^{s} = \sum_{i=1}^{N} \alpha_{i} s_{i}^{l},

where the attention weights \alpha_{i} are a softmax over the similarity between the query e_{t}^{s} and each retrieved embedding.
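A compact NumPy sketch of the retrieval-and-aggregation step follows. It assumes the negative MSE distance doubles as the attention score; all names are illustrative.

```python
import numpy as np

def retrieve_and_aggregate(query: np.ndarray,
                           pool_feats: np.ndarray,
                           pool_labels: np.ndarray,
                           n: int = 10):
    """Top-N retrieval by MSE distance, then attention-weighted pooling.

    query:       (d,)    pre-trained embedding e_t^s of the query stock
    pool_feats:  (P, d)  embeddings stored in the retrieval pool
    pool_labels: (P, b)  one-hot label-bucket vectors (B = 10 buckets)
    """
    dist = ((pool_feats - query) ** 2).mean(axis=1)  # MSE distance
    idx = np.argsort(dist)[:n]                       # top-N neighbors
    scores = -dist[idx]                              # similarity = -MSE (assumed)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                             # softmax attention weights
    r = alpha @ pool_feats[idx]                      # aggregated feature r_t^s
    l = alpha @ pool_labels[idx]                     # aggregated label  l_t^s
    return r, l
```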
Prediction Module
The SMoE layer receives the aggregated representation. GateNet G uses the label embedding l_{t}^{s} as a strong signal to select the top‑k experts (k=1 in experiments). The selected experts (implemented as GRU units) process the combined features (e_{t}^{s}, r_{t}^{s}). A residual connection adds the original stock embedding, and a final MLP predictor outputs the return prediction, trained with MSE loss.
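A hedged PyTorch sketch of this prediction head is below. It treats the fused (e_{t}^{s}, r_{t}^{s}) vector as a length-one sequence for the GRU experts, a simplification; the paper's exact fusion and gate design may differ, and all names are hypothetical.

```python
import torch
import torch.nn as nn

class SMoEPredictor(nn.Module):
    """Sketch: GateNet routes each sample to the top-k GRU experts
    based on the aggregated label embedding l_t^s."""
    def __init__(self, d: int, n_experts: int = 4, k: int = 1):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d, n_experts)          # GateNet G
        self.experts = nn.ModuleList(
            [nn.GRU(2 * d, d, batch_first=True) for _ in range(n_experts)])
        self.predictor = nn.Sequential(nn.Linear(d, d), nn.ReLU(),
                                       nn.Linear(d, 1))

    def forward(self, e, r, l):
        # e, r, l: (batch, d) stock embedding, aggregated feature, label emb.
        logits = self.gate(l)                        # route on l_t^s
        topv, topi = logits.topk(self.k, dim=-1)
        w = torch.softmax(topv, dim=-1)              # weights over top-k
        x = torch.cat([e, r], dim=-1).unsqueeze(1)   # (batch, 1, 2d) sequence
        out = torch.zeros_like(e)
        for j, expert in enumerate(self.experts):
            sel = (topi == j).any(dim=-1)            # samples routed to expert j
            if sel.any():
                h, _ = expert(x[sel])
                wj = w[sel][topi[sel] == j].unsqueeze(-1)
                out[sel] = out[sel] + wj * h[:, -1]
        y = out + e                                  # residual connection
        return self.predictor(y).squeeze(-1)         # return prediction
```

With k = 1, the softmax over a single logit yields weight 1, so exactly one expert processes each sample, matching the sparse routing used in the experiments.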
Experimental Setup
Real minute-level data from three major Chinese A-share indices (CSI 300, CSI 500, CSI 1000) are used. The data are split chronologically into training (2018-06-01 to 2022-05-31), validation (2022-06-01 to 2022-12-31), and testing (2023-01-01 to 2024-03-31). The backbone transformer has two encoder layers with hidden dimension 128. MERA employs four experts (M = 4) with one active expert (K = 1) and retrieves ten nearest samples (N = 10). Baselines include Transformer+TRA and Transformer+CISP. Evaluation metrics cover predictive ability (IC, RankIC, ICIR, RankICIR) and portfolio performance (Return+, Return-, Return).
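For reference, the predictive metrics can be computed as in the sketch below, assuming IC is the per-day cross-sectional Pearson correlation between predictions and realized returns, RankIC its Spearman counterpart, and the IR variants the mean of the daily series divided by its standard deviation.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def daily_ic(preds: np.ndarray, rets: np.ndarray) -> dict:
    """Cross-sectional IC / RankIC per day, plus their information ratios.

    preds, rets: (days, stocks) arrays of model scores and realized returns.
    """
    ic = np.array([pearsonr(p, r)[0] for p, r in zip(preds, rets)])
    rank_ic = np.array([spearmanr(p, r)[0] for p, r in zip(preds, rets)])
    return {"IC": ic.mean(), "RankIC": rank_ic.mean(),
            "ICIR": ic.mean() / ic.std(),
            "RankICIR": rank_ic.mean() / rank_ic.std()}
```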
Results
MERA consistently outperforms all baselines. On CSI 300 it achieves IC 0.107, RankIC 0.106, and Return+ 0.726; on CSI 500, IC 0.117, RankIC 0.112, and Return+ 1.018; CSI 1000 shows similar superiority. These gains translate into higher portfolio returns, confirming the module's practical value.
Further Analysis
Visualization of expert assignment shows highly sparse routing: certain label categories consistently activate specific experts, indicating that the module learns meaningful pattern‑expert mappings.
Ablation studies confirm the importance of each component: removing the SMoE structure, the RA representation, or the label embedding leads to a noticeable performance drop in each case, while the full MERA (SMoE + RA + label) achieves the highest IC, ICIR, and Return scores.