Can Large Vision‑Language Models Really Understand Candlestick Charts?

This paper builds a multi‑scale candlestick‑chart dataset and a standardized evaluation framework to measure how well vision‑language models (VLMs) extract price information from chart images. It evaluates them with confusion‑matrix diagnostics and Information Coefficient (IC) metrics, and finds that VLMs perform well only on monotonic trends and struggle with precise, time‑sensitive predictions.

Bighead's Algorithm Notes

Background

Vision‑language models (VLMs) have been applied to stock‑price prediction, but existing multimodal benchmarks mix visual and textual signals, which makes it impossible to isolate the visual contribution. They also ignore the multi‑time‑scale analysis that professional traders rely on.

Problem Definition

Limitations of existing benchmarks: they cannot isolate the visual contribution, lack rigorous ablations, and ignore multi‑scale analysis.

Missing multi‑scale analysis: without short‑ and long‑term visual market signals, a model’s ability to capture the full spectrum of market dynamics remains unknown.

Method

Dataset construction

Daily OHLCV records from TuShare (Chinese A‑share) and Yahoo Finance (US S&P 500) covering 2015‑2025 are used. For each stock and cutoff date, multi‑frequency candlestick images are generated by reconstructing OHLCV data, selecting the desired frequency (daily or weekly), truncating at the cutoff, limiting the number of candles displayed, and rendering the chart. Aligned numeric time‑series features are also built to support baseline models.
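The construction steps above can be sketched with the standard library alone. This is a minimal illustration, not the paper's pipeline: the tuple layout of a bar, the function names `to_weekly` and `chart_window`, and the `max_candles` default are all assumptions made for the example, and the final rendering step is left out.

```python
from datetime import date
from itertools import groupby

# Assumed bar layout: (date, open, high, low, close, volume).

def to_weekly(daily_bars):
    """Aggregate date-sorted daily bars into one bar per ISO week."""
    weekly = []
    # Group by (ISO year, ISO week) so weeks that span New Year stay intact.
    for _, group in groupby(daily_bars, key=lambda b: b[0].isocalendar()[:2]):
        bars = list(group)
        weekly.append((
            bars[-1][0],               # label the week by its last trading day
            bars[0][1],                # open of the first day
            max(b[2] for b in bars),   # highest high
            min(b[3] for b in bars),   # lowest low
            bars[-1][4],               # close of the last day
            sum(b[5] for b in bars),   # total volume
        ))
    return weekly

def chart_window(daily_bars, cutoff, frequency="daily", max_candles=60):
    """Truncate at the cutoff, resample if requested, keep the last max_candles bars."""
    visible = [b for b in sorted(daily_bars, key=lambda b: b[0]) if b[0] <= cutoff]
    if frequency == "weekly":
        visible = to_weekly(visible)
    return visible[-max_candles:]
```

Truncating before resampling matters here: a weekly candle built from the full week and then cut at the date would leak prices from after the cutoff into the chart.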

Benchmark task definition

Regression: predict the 30‑day forward return r = (P_{t+30} - P_t) / P_t from a pair of daily and weekly candlestick images of the same stock on the same date.
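As a small worked example of the target definition, the sketch below computes the forward return from a list of closing prices. It assumes that "30 days" means 30 positions ahead in the price series; the summary does not say whether trading days or calendar days are meant, and the function name `forward_return` is hypothetical.

```python
def forward_return(prices, t, horizon=30):
    """Return (P[t+horizon] - P[t]) / P[t], or None when the future price is not available."""
    if t < 0 or t + horizon >= len(prices):
        return None  # the label cannot be computed near the end of the series
    return (prices[t + horizon] - prices[t]) / prices[t]
```

For instance, a stock that closes at 100 on the cutoff date and at 110 thirty positions later has a forward return of 0.10.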

Prompt engineering

The model is assigned the role of a “high‑discrimination stock‑trend analyst” and instructed to output a single numeric score in the range [-0.5, 1.0], rounded to three decimals and wrapped in <score> tags. Few‑shot examples are provided to guide the model.
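A reply in this format still has to be parsed and validated before scoring. The following is one possible sketch; the regular expression, the function name `parse_score`, and the choice to reject out‑of‑range values instead of clamping them are assumptions, not details taken from the paper.

```python
import re

# Matches an optionally signed decimal number wrapped in <score> tags.
SCORE_PATTERN = re.compile(r"<score>\s*([-+]?\d*\.?\d+)\s*</score>")

def parse_score(response, low=-0.5, high=1.0):
    """Extract the score from a model reply; return None if it is missing or out of range."""
    match = SCORE_PATTERN.search(response)
    if match is None:
        return None
    score = float(match.group(1))
    if not low <= score <= high:
        return None  # treat out-of-range output as a failed response
    return round(score, 3)
```

Returning `None` for malformed replies keeps failures visible, so they can be counted or retried instead of silently entering the metrics.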

Evaluation metrics

Confusion‑matrix metrics: TP, TN, FP, FN, accuracy, precision, recall, specificity, F1.
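These diagnostics require turning continuous scores and returns into up/down labels. The sketch below assumes the simplest rule, that a value above a threshold of 0 counts as "up" for both the prediction and the realized return; the summary does not state the threshold used, and `confusion_metrics` is a hypothetical name.

```python
def confusion_metrics(scores, returns, threshold=0.0):
    """Binarize predictions and outcomes at the threshold and compute standard metrics."""
    tp = tn = fp = fn = 0
    for score, ret in zip(scores, returns):
        predicted_up, actual_up = score > threshold, ret > threshold
        if predicted_up and actual_up:
            tp += 1
        elif predicted_up:
            fp += 1
        elif actual_up:
            fn += 1
        else:
            tn += 1

    def ratio(numerator, denominator):
        # Undefined ratios (empty denominators) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }
```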

Information Coefficient (IC): Spearman rank correlation between predicted scores and actual returns, reported as average IC, median IC, IC‑IR, and proportion of statistically significant ICs.
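To make the IC statistics concrete, here is a dependency‑free sketch of a per‑date Spearman IC and its summary. It assumes the conventional definition of IC‑IR as the mean IC divided by the sample standard deviation of IC across cutoff dates, which the summary does not spell out. The share of statistically significant ICs is omitted because it needs a p‑value for each correlation. All function names are hypothetical.

```python
from statistics import mean, median, stdev

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman_ic(scores, returns):
    """Spearman correlation, computed as the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(scores), average_ranks(returns)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either input is constant

def summarize_ic(ic_by_date):
    """Summarize a {cutoff_date: IC} mapping; needs at least two dates."""
    ics = list(ic_by_date.values())
    return {"mean_ic": mean(ics), "median_ic": median(ics), "ic_ir": mean(ics) / stdev(ics)}
```

In use, `spearman_ic` would be called once per cutoff date over all stocks scored on that date, and the resulting values passed to `summarize_ic`.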

Experiments

Dataset statistics

The final dataset contains 69,744 candlestick samples and 2,005,279 raw OHLCV records, covering 300 CSI 300 constituents and 500 S&P 500 constituents across 32 cutoff dates.

Experimental setup

Test window: 2023‑01‑01 to 2025‑01‑01. Evaluation is performed primarily on CSI 300 stocks with a subset on S&P 500. Seven commercial VLMs are compared against an XGBoost baseline trained on the aligned numeric features.

Results

Classification performance: Claude‑Haiku achieves high precision (suitable for conservative strategies), while GPT‑5mini achieves high recall (suitable for aggressive strategies).

Prediction bias analysis: Different architectures exhibit distinct directional biases; combining models whose biases are complementary mitigates the overall bias.

Information Coefficient analysis: VLMs obtain higher average and median IC, as well as higher IC‑IR, than XGBoost. Because IC is a rank correlation, this indicates a stronger and more stable ability to rank stocks by their subsequent returns.

Extreme‑condition performance: All models attain higher accuracy in bear markets, suggesting better detection of downside risk. Claude‑Haiku excels at identifying down‑trending stocks, GPT‑5mini excels at up‑trending stocks, and XGBoost performs worst on up‑trends.

Time‑sensitivity analysis: IC results show that current VLMs mainly capture short‑term market signals and lack robust long‑term trend understanding, making them more appropriate as short‑term assistants than as primary long‑term investment tools.

Discussion

Multi‑scale candlestick analysis aligns structural fundamentals with dynamic price action but inherits lag, because patterns are only confirmed retrospectively. Despite this lag, candlestick‑chart inputs consistently outperform equivalent tabular inputs in these experiments. Empirically, VLMs surpass XGBoost in IC significance and median IC improvement, which suggests better prediction stability, fewer tail‑risk failures, and larger optimization potential. The advantage is attributed to the visual representation’s pre‑structured encoding of centuries‑old technical heuristics: VLMs can transfer visual priors (shape recognition, hierarchical attention) to financial patterns, whereas tabular models must learn temporal rules from raw numeric vectors.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
