Paper Review: THEME – Thematic Investing via Stock Semantic Embeddings and Temporal Dynamics

This article reviews THEME, a framework that tackles the static nature and limited coverage of traditional thematic investing. THEME constructs a large Thematic Representation Set (TRS) and applies a two‑stage hierarchical contrastive learning process: the first stage aligns stock text embeddings with theme semantics, and the second refines them with short‑term return dynamics. Across extensive experiments, the framework achieves superior retrieval and portfolio performance.

Bighead's Algorithm Notes

Background

Thematic investing aims to build portfolios aligned with structural trends such as AI or renewable energy. Existing approaches rely on static ETF constituents or expert lists, which leads to poor adaptability, limited coverage of emerging or niche themes, and inadequate semantic representation of financial texts.

Problem Definition

Three core issues are identified: (1) Static nature – inability to reflect new companies or shifting relevance; (2) Coverage bias – ETF components favor popular sectors and neglect small or emerging themes; (3) Semantic‑dynamic separation – generic embeddings (e.g., BERT) fail to capture finance‑specific semantics, and models do not jointly consider thematic relevance and short‑term return dynamics.

Method

The proposed THEME framework addresses these problems through two main components.

3.1 Thematic Representation Set (TRS)

TRS starts from 1,153 real thematic ETFs covering ~3,000 U.S. stocks and expands coverage by:

Industry classification augmentation: integrates standard industry taxonomies to reach ~200 themes, including niche topics.

Financial news enrichment: aggregates SEC filings and news to create multi‑theme textual portraits for each stock.

Dynamic updates: continuously incorporates new thematic information.

Each record contains theme tags, a text summary, a component‑stock list, and each stock links to multiple textual portraits.
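The record structure described above can be pictured as a small sketch; every field name, ticker, and portrait string below is a hypothetical illustration, not data from the paper.

```python
# Hypothetical shape of one TRS record; field names are illustrative.
trs_record = {
    "theme_tags": ["artificial-intelligence", "semiconductors"],
    "theme_summary": "Companies enabling AI compute and tooling.",
    "constituents": ["NVDA", "AMD", "TSM"],
}

# Each component stock links to multiple textual portraits
# (e.g., aggregated from SEC filings and news).
stock_portraits = {
    "NVDA": [
        "Designs GPUs widely used for training large AI models.",
        "10-K highlights data-center revenue growth.",
    ],
}

def portraits_for(record, portraits, ticker):
    """Return the textual portraits for one component stock."""
    if ticker not in record["constituents"]:
        raise KeyError(f"{ticker} is not a component of this theme")
    return portraits.get(ticker, [])
```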

3.2 Hierarchical Contrastive Learning

The framework consists of two training stages.

3.2.1 Semantic Alignment

A pretrained financial text encoder (e.g., Fin‑E5 or SFR‑Embedding‑Mistral) is adapted with LoRA parameters θ to produce an adapted model f'₍θ₎. Theme descriptions tᵢ are encoded into embeddings zᵢ, and stock portraits sⱼ into embeddings hⱼ. A contrastive loss pulls positive stock embeddings (components of the theme) toward the theme anchor and pushes non‑components away, using cosine similarity sim and temperature τ.
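The stage-one objective can be sketched as an InfoNCE-style loss over one theme anchor. The NumPy function below is our illustration of that idea, not the paper's implementation; the exact form of the denominator and the default τ = 0.07 are assumptions.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def theme_contrastive_loss(z, H_pos, H_neg, tau=0.07):
    """InfoNCE-style loss for one theme anchor.

    z     : theme embedding z_i, shape (d,)
    H_pos : embeddings of component stocks (positives), shape (P, d)
    H_neg : embeddings of non-component stocks (negatives), shape (N, d)
    tau   : temperature
    Averages the per-positive InfoNCE terms over all components.
    """
    pos = np.array([np.exp(cosine(z, h) / tau) for h in H_pos])
    neg_sum = sum(np.exp(cosine(z, h) / tau) for h in H_neg)
    return float(np.mean(-np.log(pos / (pos + neg_sum))))
```

As expected of a contrastive objective, the loss is near zero when components align with the theme anchor and large when they point away from it.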

3.2.2 Temporal Refinement

A lightweight two‑layer adapter A₍φ₎ takes the semantic stock embedding hⱼ and its past L‑day return series rⱼ, outputting a time‑refined embedding that incorporates short‑term return signals. A triplet loss selects, for each theme, a positive stock sₚ with higher future H‑day return and a negative stock sₙ with lower return, enforcing margin m.
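A minimal sketch of the adapter and triplet objective, assuming a tanh MLP over the concatenation of h and r, and cosine similarity as the distance; the weight shapes, activation, and concatenation scheme are our assumptions.

```python
import numpy as np

def two_layer_adapter(h, r, W1, W2):
    """Illustrative two-layer adapter A_phi: concatenate the semantic
    embedding h with the L-day return series r, then map back to
    embedding space through a tanh hidden layer."""
    x = np.concatenate([h, r])
    return W2 @ np.tanh(W1 @ x)

def triplet_loss(z_theme, h_pos, h_neg, m=0.2):
    """Margin triplet loss: the stock with the higher future H-day
    return (h_pos) should sit closer to the theme anchor than the
    lower-return stock (h_neg) by at least margin m."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(0.0, m - cos(z_theme, h_pos) + cos(z_theme, h_neg))
```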

3.3 Inference

Given a user query q, the adapted encoder generates a query embedding z_q. Pre‑computed time‑refined stock embeddings h'_j are ranked by cosine similarity, and the top‑K stocks are returned for portfolio construction.
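Since the stock embeddings are pre-computed, inference reduces to cosine-similarity ranking; a minimal sketch (function and variable names are ours):

```python
import numpy as np

def retrieve_top_k(z_q, stock_embs, tickers, k=3):
    """Rank pre-computed (time-refined) stock embeddings by cosine
    similarity to the query embedding and return the top-K tickers."""
    E = stock_embs / np.linalg.norm(stock_embs, axis=1, keepdims=True)
    q = z_q / np.linalg.norm(z_q)
    order = np.argsort(-(E @ q))[:k]  # indices of the k most similar
    return [tickers[i] for i in order]
```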

3.4 System Implementation

The system uses a modular cloud‑native architecture that scales linearly with the number of stocks. It exposes a REST API for integration with research platforms and automated portfolio engines, supporting real‑time signals such as ESG events or earnings releases.

Experiments

4.1 Experimental Setup

Datasets: TRS (969 ETFs split into train/validation/test, expanded to 196 themes) and two years of U.S. stock history (L = 60‑day return window, H = 14‑day prediction horizon). Evaluation metrics cover retrieval quality (HR@k, P@k) and portfolio performance (cumulative return (CR), Sharpe ratio (SR), maximum drawdown (MDD)). Baselines comprise text‑embedding models (Fin‑E5, SFR‑Embedding‑Mistral), large language models (GPT‑4.1, Gemini‑2.5), and models trained on ETF‑only data.
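The retrieval metrics follow the standard definitions; since the paper does not spell out its formulas, treat this sketch as our reading of HR@k and P@k:

```python
def hit_rate_at_k(retrieved, relevant, k=3):
    """HR@k: 1.0 if any of the top-k retrieved stocks is a true
    component of the queried theme, else 0.0."""
    return float(any(s in relevant for s in retrieved[:k]))

def precision_at_k(retrieved, relevant, k=3):
    """P@k: fraction of the top-k retrieved stocks that are true
    components of the queried theme."""
    return sum(s in relevant for s in retrieved[:k]) / k
```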

4.2 Retrieval Performance

THEME consistently outperforms baselines. Using Linq‑Embed‑Mistral, HR@3 improves from 0.5155 to 0.8196 and P@3 from 0.3522 to 0.6289. Small models (e.g., bge‑small‑en‑v1.5) enhanced by THEME surpass larger LLM baselines, confirming the benefit of supervised contrastive learning.

4.3 Portfolio Construction

Equal‑weight portfolios (K = 3/5/10) are evaluated over a rolling window (2024‑04‑23 to 2025‑04‑29). THEME‑augmented portfolios achieve higher Sharpe ratios, cumulative returns, and lower drawdowns than baselines and even outperform real ETF portfolios (average SR = 0.4845, CR = 0.0672, MDD = ‑0.2368). With gte‑Qwen2‑7B‑instruct, SR@3 rises from 0.5014 to 0.7592 and CR@3 from 0.0917 to 0.1645.
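The three portfolio metrics can be computed from a daily return series as below; the annualization convention (252 trading days, sample standard deviation) is our assumption, as the paper does not spell out its exact formulas.

```python
import numpy as np

def portfolio_metrics(returns, periods_per_year=252):
    """Cumulative return (CR), annualized Sharpe ratio (SR), and
    maximum drawdown (MDD) for a periodic return series."""
    r = np.asarray(returns, dtype=float)
    equity = np.cumprod(1.0 + r)             # growth of $1
    cr = float(equity[-1] - 1.0)
    sr = float(np.sqrt(periods_per_year) * r.mean() / r.std(ddof=1))
    peak = np.maximum.accumulate(equity)     # running high-water mark
    mdd = float(np.min(equity / peak - 1.0)) # worst peak-to-trough drop
    return cr, sr, mdd
```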

4.4 Ablation Studies

Anchor strategy: using themes as anchors yields significantly higher precision than using stocks (e.g., P@3 improves from 0.2216 to 0.4124 with gte‑Qwen2‑1.5B).

Dataset comparison: models trained on TRS outperform those trained on ETF‑only data; bge‑small‑en‑v1.5's P@5 gains 0.1186 and multilingual‑e5‑large‑instruct's P@3 gains 0.0705, demonstrating the importance of expanded thematic coverage.

Overall, THEME provides a scalable, adaptive solution for thematic stock retrieval and portfolio construction by jointly modeling semantic relevance and short‑term return dynamics.

Tags: financial AI, portfolio optimization, hierarchical contrastive learning, stock semantic embeddings, temporal dynamics, thematic investing
Written by Bighead's Algorithm Notes, focused on AI applications in the fintech sector.