
Exploring CSMD: A China‑Specific Multimodal Stock Dataset and the LightQuant Quantitative Framework

The article introduces CSMD, a high‑quality multimodal dataset built from Chinese financial news for the CSI‑300 and SSE‑50 stocks, describes LLM‑enhanced factor extraction and rigorous data validation, presents the modular LightQuant framework, and shows through extensive experiments that CSMD and LightQuant outperform existing resources such as CMIN‑CN in stock trend prediction and backtesting.

Bighead's Algorithm Notes

Apr 20, 2026


Background

Stock market analysis traditionally relies on fundamental analysis, which suffers from lag and subjectivity, or technical analysis, which is noisy and struggles to predict future trends. Combining price information with financial text as multimodal data offers a more direct way to capture market signals.

Problem Definition

Publicly available datasets focus on the U.S. market and English sources, making them unsuitable for Chinese stock analysis. Chinese financial news is scarce, noisy, and requires labor‑intensive preprocessing. Existing open‑source backtesting frameworks are complex and have steep learning curves, hindering rapid strategy development.

Method

Dataset Construction

Financial News Collection: All articles from Securities Times were harvested with a scalable automated pipeline that parses and normalizes each article and retries failed requests with exponential backoff, yielding clean, structured data. Two subsets were created: CSMD-300, covering the 300 constituent stocks of the CSI-300 index, and CSMD-50, covering the 50 constituent stocks of the SSE-50 index.
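One common way to make a collection pipeline tolerant of transient failures is to retry each request with exponentially growing waits. The sketch below assumes that is what the pipeline's error handling amounts to; the function names, retry limits, and the whitespace-only normalization step are illustrative assumptions, not details from the article.

```python
import random
import time
from typing import Callable


def fetch_with_backoff(
    fetch: Callable[[str], str],
    url: str,
    max_retries: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch(url), retrying failures with exponentially growing delays."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            # The delay doubles on each attempt (1s, 2s, 4s, ...), capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter keeps many workers from retrying in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
    raise RuntimeError("unreachable")


def normalize(raw: str) -> str:
    """Collapse runs of whitespace so downstream parsing sees clean text."""
    return " ".join(raw.split())
```

Passing `fetch` and `sleep` in as parameters keeps the retry logic independent of any particular HTTP client and makes it testable without real network calls or real waiting.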

LLM-Enhanced Factor Extraction: Domain-specific background knowledge was incorporated into prompts for a large language model, producing human-readable, interpretable, and influential factors from each news item. These factors were validated and prioritized, yielding higher readability and explanatory power than raw text.
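To make the idea of a background-augmented prompt concrete, here is a minimal sketch of how such a prompt could be assembled and how a numbered-list reply could be parsed into factors. The article does not publish its prompt or output format, so the background text, the prompt wording, the function names, and the numbered-list convention are all assumptions. No model call is made here.

```python
import re
from typing import List

# Hypothetical background knowledge; the article does not publish its prompt text.
DOMAIN_BACKGROUND = (
    "Policy changes, earnings surprises, and large shareholder transactions "
    "often move prices of China A-share stocks."
)


def build_factor_prompt(
    news_text: str,
    ticker: str,
    background: str = DOMAIN_BACKGROUND,
    max_factors: int = 5,
) -> str:
    """Assemble a prompt that pairs domain background with one news item."""
    return (
        "You are a financial analyst covering the Chinese stock market.\n"
        f"Background knowledge: {background}\n"
        f"News about {ticker}:\n{news_text.strip()}\n"
        f"List at most {max_factors} factors from this news that could affect "
        "the stock price, most influential first, one per numbered line."
    )


def parse_factors(response: str, max_factors: int = 5) -> List[str]:
    """Keep only numbered lines ('1. ...' or '1) ...') from a model reply."""
    factors = []
    for line in response.splitlines():
        match = re.match(r"\s*\d+[.)]\s*(.+)", line)
        if match:
            factors.append(match.group(1).strip())
    return factors[:max_factors]
```

Asking for the most influential factor first means the prioritization step can be approximated by truncating the parsed list.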

Data Quality Validation: Quality was assessed across five dimensions (denoising, financial sentiment expression, text density, human readability, and LLM readability) using a combination of expert evaluation (five finance experts rating coherence, relevance, and accuracy) and automated scoring (MiniLM-L6-v2 for text ranking and GPT-4 for coherence, information content, and topic depth).

LightQuant Framework

LightQuant is a lightweight, user‑friendly simulation platform with a modular architecture comprising three layers:

Data Layer: Unified interface for extracting, processing, and storing heterogeneous market data and financial news, integrating feature engineering and a factor library.

Model Layer: Supports development and integration of various prediction models, offering standardized APIs for loading, training, and inference.

Evaluation Layer: Provides backtesting and performance-analysis tools, delivering metrics for efficient strategy verification and optimization.
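The three-layer split can be illustrated with a toy pipeline. This is not LightQuant's actual API, which the article does not show. Every class and function name below is hypothetical, and the momentum rule stands in for a real prediction model.

```python
from typing import List, Protocol, Sequence, Tuple

Sample = Tuple[List[float], int]  # (return window, label: 1 = up, 0 = not up)


class Model(Protocol):
    """Interface the model layer expects from any predictor."""

    def fit(self, samples: Sequence[Sample]) -> None: ...
    def predict(self, features: List[float]) -> int: ...


class DataLayer:
    """Turns a close-price series into (window, next-day direction) samples."""

    def __init__(self, closes: Sequence[float], window: int = 3) -> None:
        self.closes = list(closes)
        self.window = window

    def samples(self) -> List[Sample]:
        out: List[Sample] = []
        for t in range(self.window, len(self.closes) - 1):
            # Daily returns over the lookback window ending at day t.
            feats = [
                self.closes[i] / self.closes[i - 1] - 1.0
                for i in range(t - self.window + 1, t + 1)
            ]
            label = int(self.closes[t + 1] > self.closes[t])
            out.append((feats, label))
        return out


class MomentumModel:
    """Toy predictor: says 'up' when the mean recent return is positive."""

    def fit(self, samples: Sequence[Sample]) -> None:
        return None  # nothing to learn in this baseline

    def predict(self, features: List[float]) -> int:
        return int(sum(features) / len(features) > 0)


class EvaluationLayer:
    """Scores a fitted model on held-out samples."""

    def accuracy(self, model: Model, samples: Sequence[Sample]) -> float:
        hits = sum(model.predict(f) == y for f, y in samples)
        return hits / len(samples)


def run_pipeline(closes: Sequence[float], window: int = 3) -> float:
    samples = DataLayer(closes, window).samples()
    split = len(samples) // 2  # chronological split: earlier half trains
    train, test = samples[:split], samples[split:]
    model = MomentumModel()
    model.fit(train)
    return EvaluationLayer().accuracy(model, test)
```

Because the evaluation layer depends only on the `fit`/`predict` interface, swapping the baseline for an LSTM or a multimodal model would not require changes to the data or evaluation code.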

Experiments

Experimental Setup

Two downstream tasks, stock-trend prediction and backtesting, were run on CSMD through the LightQuant framework. The evaluated models fall into two groups: single-modal models and multimodal models that combine price and text.

Models

Single‑modal: LSTM, BiLSTM, ALSTM (attention‑augmented), Adv‑LSTM (adversarial training), SCINet (recursive down‑sampling), DTML (Transformer‑based temporal modeling).

Multimodal: StockNet (price + text joint modeling), HAN (hierarchical attention), PEN (shared‑representation fusion).

Metrics

Accuracy (ACC), Matthews Correlation Coefficient (MCC), Annualized Return Rate (ARR), Sharpe Ratio (SR), Maximum Drawdown (MDD), and Calmar Ratio (CR) were used to assess predictive accuracy, profitability, and risk control.
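The article does not spell out how these metrics are computed, and conventions vary between papers. The sketch below uses common definitions under stated assumptions: binary up/down labels, 252 trading days per year, a zero risk-free rate, geometric compounding for ARR, and the Calmar Ratio taken as ARR divided by MDD.

```python
import math
from typing import Sequence

TRADING_DAYS = 252  # assumed number of trading days per year


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews Correlation Coefficient for binary labels (1 = up, 0 = down)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def annualized_return(daily_returns: Sequence[float]) -> float:
    """Compound the daily returns, then rescale the growth to one year."""
    growth = math.prod(1.0 + r for r in daily_returns)
    return growth ** (TRADING_DAYS / len(daily_returns)) - 1.0


def sharpe_ratio(daily_returns: Sequence[float]) -> float:
    """Annualized mean over sample standard deviation, risk-free rate 0."""
    n = len(daily_returns)
    if n < 2:
        return 0.0
    mean = sum(daily_returns) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in daily_returns) / (n - 1))
    return mean / std * math.sqrt(TRADING_DAYS) if std else 0.0


def max_drawdown(daily_returns: Sequence[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a positive fraction."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in daily_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst


def calmar_ratio(daily_returns: Sequence[float]) -> float:
    mdd = max_drawdown(daily_returns)
    return annualized_return(daily_returns) / mdd if mdd else float("inf")
```

ACC and MCC score the direction predictions themselves, while ARR, SR, MDD, and CR score the daily returns of a strategy that trades on those predictions. A model can therefore rank differently on the two groups.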

Results

In stock-trend prediction, most models scored higher when trained on CSMD-300 or CSMD-50 than on the widely used CMIN-CN dataset. The authors attribute this to higher data quality, richer multimodal features, and better representations obtained through careful curation.

In backtesting, no single model led on every metric: StockNet achieved the highest ARR and Calmar Ratio, ALSTM obtained the best Sharpe Ratio, and HAN recorded the lowest maximum drawdown. These results suggest that the dataset can support effective trading strategies.

Overall, the experiments confirm that the CSMD dataset and LightQuant framework provide strong support for Chinese stock‑market analysis, improving research efficiency and prediction accuracy compared with existing datasets and frameworks.



