Can Large Vision‑Language Models Really Understand Candlestick Charts?
This paper builds a multi‑scale candlestick‑chart dataset and a standardized evaluation framework to measure how well vision‑language models (VLMs) extract price information from chart images. Using confusion‑matrix diagnostics and Information Coefficient (IC) metrics, it finds that VLMs excel only on monotonic trends and struggle with precise time‑based predictions.
Background
Vision‑language models (VLMs) have been applied to stock‑price prediction, but existing multimodal benchmarks mix visual and textual signals. This makes it impossible to isolate the visual contribution, and the benchmarks also ignore the multi‑time‑scale analysis used by professional traders.
Problem Definition
Limitations of existing benchmarks: they cannot isolate the visual contribution, lack rigorous ablation, and ignore multi‑scale analysis.
Missing multi‑scale analysis: without short‑ and long‑term visual market signals, a model’s ability to capture the full spectrum of market dynamics remains unknown.
Method
Dataset construction
Daily OHLCV records from TuShare (Chinese A‑share) and Yahoo Finance (US S&P 500) covering 2015‑2025 are used. For each stock and cutoff date, multi‑frequency candlestick images are generated by reconstructing OHLCV data, selecting the desired frequency (daily or weekly), truncating at the cutoff, limiting the number of candles displayed, and rendering the chart. Aligned numeric time‑series features are also built to support baseline models.
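The windowing steps above (truncate at the cutoff, select the frequency, limit the number of candles) can be sketched in plain Python. This is a minimal illustration, not the authors' pipeline: the function names, the tuple layout, and the default of 60 candles are assumptions, and rendering the chart image is left out. Truncation is applied before weekly aggregation here so that a partial week never includes data past the cutoff.

```python
from datetime import date, timedelta

def week_start(d):
    """Return the Monday of the week containing date d."""
    return d - timedelta(days=d.weekday())

def to_weekly(daily):
    """Aggregate daily (date, open, high, low, close, volume) rows into weekly candles.

    Rows must be sorted by date in ascending order.
    """
    weekly = []
    for d, o, h, l, c, v in daily:
        key = week_start(d)
        if weekly and weekly[-1][0] == key:
            _, wo, wh, wl, _, wv = weekly[-1]
            # Keep the first open, extend high/low, take the latest close, sum volume.
            weekly[-1] = (key, wo, max(wh, h), min(wl, l), c, wv + v)
        else:
            weekly.append((key, o, h, l, c, v))
    return weekly

def chart_window(daily, cutoff, freq="daily", max_candles=60):
    """Truncate at the cutoff, pick the frequency, keep the last max_candles (assumed default)."""
    visible = [row for row in daily if row[0] <= cutoff]  # nothing after the cutoff
    candles = to_weekly(visible) if freq == "weekly" else visible
    return candles[-max_candles:]

# Toy data: six trading days spanning two calendar weeks.
rows = [
    (date(2024, 1, 2), 10.0, 11.0, 9.5, 10.5, 100),
    (date(2024, 1, 3), 10.5, 12.0, 10.0, 11.5, 150),
    (date(2024, 1, 4), 11.5, 11.8, 10.8, 11.0, 120),
    (date(2024, 1, 5), 11.0, 11.2, 10.2, 10.4, 90),
    (date(2024, 1, 8), 10.4, 10.9, 10.1, 10.8, 110),
    (date(2024, 1, 9), 10.8, 11.5, 10.6, 11.3, 130),
]
weekly = chart_window(rows, cutoff=date(2024, 1, 8), freq="weekly")
```

With the cutoff on 2024‑01‑08, the row for 2024‑01‑09 is dropped and the result holds two weekly candles, which could then be passed to any charting library.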
Benchmark task definition
Regression: predict the 30‑day forward return r = (P_{t+30} - P_t) / P_t from a pair of daily and weekly candlestick images of the same stock on the same date.
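As a small worked example of the target, the forward return can be computed directly from a price series. The sketch below assumes the horizon is counted in rows of the series (trading days); the summary does not say whether 30 days means trading or calendar days, and the function name is illustrative.

```python
def forward_return(prices, t, horizon=30):
    """Return (P[t+horizon] - P[t]) / P[t], or None if the future price is unavailable."""
    if t < 0 or t + horizon >= len(prices):
        return None
    return (prices[t + horizon] - prices[t]) / prices[t]

closes = [100.0 + i for i in range(40)]  # toy series: 100, 101, ..., 139
r = forward_return(closes, t=0)          # (130 - 100) / 100
```

Returning None near the end of the series keeps samples without a full 30‑step future from being labeled.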
Prompt engineering
The model is assigned the role of a “high‑discrimination stock‑trend analyst” and instructed to output a single numeric score in the range [-0.5, 1.0], rounded to three decimals and wrapped in <score> tags. Few‑shot examples are provided to guide the model.
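Because the prompt fixes the output format, a reply can be validated mechanically before scoring. The following is a hedged sketch of such a parser; the regular expression, the function name, and the choice to reject (rather than clamp) out‑of‑range values are assumptions, not details from the paper.

```python
import re

# Matches an optionally signed decimal number wrapped in <score> tags.
SCORE_RE = re.compile(r"<score>\s*([-+]?\d*\.?\d+)\s*</score>")

def parse_score(reply, lo=-0.5, hi=1.0):
    """Extract the numeric score from a model reply; None if absent or out of range."""
    match = SCORE_RE.search(reply)
    if match is None:
        return None
    value = round(float(match.group(1)), 3)
    return value if lo <= value <= hi else None

ok = parse_score("The weekly chart shows an uptrend. <score>0.137</score>")
missing = parse_score("I cannot determine a trend.")
too_high = parse_score("<score>1.5</score>")
```

Treating malformed replies as missing values, instead of silently coercing them, makes format failures visible in the evaluation.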
Evaluation metrics
Confusion‑matrix metrics: TP, TN, FP, FN, accuracy, precision, recall, specificity, F1.
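To make these quantities concrete, the sketch below derives them from predicted scores and realized returns. It assumes that a score above 0 counts as a predicted "up" and a return above 0 as an actual "up"; the summary does not state the threshold used in the paper, so treat that as an illustrative choice.

```python
def confusion_metrics(scores, returns, threshold=0.0):
    """Binarize scores and returns at threshold (assumed 0) and compute standard metrics."""
    tp = tn = fp = fn = 0
    for s, r in zip(scores, returns):
        pred_up, actual_up = s > threshold, r > threshold
        if pred_up and actual_up:
            tp += 1
        elif pred_up:
            fp += 1
        elif actual_up:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Guard against empty classes instead of raising ZeroDivisionError.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }

m = confusion_metrics(
    scores=[0.2, 0.1, -0.1, 0.3, -0.2, -0.05],
    returns=[0.05, -0.02, -0.03, 0.10, 0.04, -0.01],
)
```

In this toy input there are two true positives, two true negatives, one false positive, and one false negative, so accuracy, precision, recall, specificity, and F1 all come out to 2/3.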
Information Coefficient (IC): Spearman rank correlation between predicted scores and actual returns, reported as average IC, median IC, IC‑IR, and proportion of statistically significant ICs.
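A minimal, dependency‑free sketch of the IC computation follows. It computes Spearman correlation as the Pearson correlation of tie‑averaged ranks, and summarizes a list of per‑date ICs. IC‑IR is taken here as mean IC divided by the sample standard deviation of IC, which is the common definition but an assumption about the paper; the significance proportion is omitted because it requires a p‑value per IC.

```python
from statistics import mean, median, stdev

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_ic(scores, returns):
    """Spearman rank correlation between predicted scores and realized returns."""
    return pearson(average_ranks(scores), average_ranks(returns))

def summarize_ic(ics):
    """Summarize per-date ICs; needs at least two values with non-zero spread."""
    return {"mean_ic": mean(ics), "median_ic": median(ics), "ic_ir": mean(ics) / stdev(ics)}

scores = [0.1, 0.2, 0.3, 0.4]
ic_same = spearman_ic(scores, [0.01, 0.02, 0.03, 0.04])     # identical ordering
ic_partial = spearman_ic(scores, [0.02, 0.01, 0.04, 0.03])  # adjacent pairs swapped
ic_reversed = spearman_ic(scores, [0.04, 0.03, 0.02, 0.01]) # opposite ordering
summary = summarize_ic([ic_same, ic_partial, ic_reversed])
```

Because only ranks enter the calculation, the IC rewards ordering stocks correctly and is insensitive to the scale of the predicted scores.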
Experiments
Dataset statistics
The final dataset contains 69,744 candlestick samples and 2,005,279 raw OHLCV records, covering 300 CSI 300 constituents and 500 S&P 500 constituents across 32 cutoff dates.
Experimental setup
Test window: 2023‑01‑01 to 2025‑01‑01. Evaluation is performed primarily on CSI 300 stocks with a subset on S&P 500. Seven commercial VLMs are compared against an XGBoost baseline trained on the aligned numeric features.
Results
Classification performance: Claude‑Haiku achieves high precision (suitable for conservative strategies), while GPT‑5mini achieves high recall (suitable for aggressive strategies).
Prediction bias analysis: Different architectures exhibit distinct directional biases; integrating models with complementary biases mitigates overall bias.
Information Coefficient analysis: VLMs obtain higher average and median IC, as well as higher IC‑IR, than XGBoost, indicating a stronger ability to rank stocks by their subsequent returns.
Extreme‑condition performance: All models attain higher accuracy in bear markets, suggesting better detection of downside risk. Claude‑Haiku excels at identifying down‑trending stocks, GPT‑5mini excels at up‑trending stocks, and XGBoost performs worst on up‑trends.
Time‑sensitivity analysis: IC results show that current VLMs mainly capture short‑term market signals and lack robust long‑term trend understanding, making them more appropriate as short‑term assistants than as primary long‑term investment tools.
Discussion
Multi‑scale candlestick analysis aligns structural fundamentals with dynamic price action but inherits lag because patterns are confirmed retrospectively. Despite this lag, candlestick charts consistently outperform equivalent tabular data. Empirically, VLMs surpass XGBoost in IC significance and median IC improvement, demonstrating better prediction stability, reduced tail‑risk failures, and larger optimization potential. The advantage stems from the visual representation’s pre‑structured encoding of centuries‑old technical heuristics, allowing VLMs to transfer visual priors (shape recognition, hierarchical attention) to financial patterns, whereas tabular models must learn temporal rules from raw numeric vectors.
