IKNet: Explainable Stock Price Forecasting with News Keywords and Technical Indicators
IKNet combines FinBERT‑derived news keywords with technical‑indicator time series and uses SHAP to quantify each feature's impact on the forecast. On the S&P 500 (2015‑2024) it achieves a 32.9% RMSE reduction and 18.5% higher cumulative returns than RNN and Transformer baselines, while providing fine‑grained, context‑aware explanations of price movements.
Background
Accurate stock‑price prediction is essential for profit maximisation, asset‑allocation optimisation and risk management. Market dynamics are driven by geopolitics, macro‑economics and investor sentiment, creating non‑linear relationships that traditional linear models (ARIMA, GARCH, technical indicators) cannot capture. Machine‑learning models (SVM, Random Forest) handle non‑linearity but ignore temporal dependencies, while deep‑learning models (RNN, Transformer) focus on structured numeric inputs and struggle to integrate unstructured news text. Existing news‑driven approaches typically use document‑level sentiment scores or averaged embeddings, which hide the contribution of individual words and limit interpretability.
Problem Definition
Insufficient interpretability: document‑level sentiment or averaged embeddings prevent quantifying the effect of single keywords.
Inefficient information integration: fusion of structured technical indicators with unstructured news does not exploit semantic links between keywords and price movements.
Weak dynamic adaptability: difficulty in reliably capturing the impact of sudden external events (e.g., policy announcements, breaking news) during high‑volatility periods.
Method
FinBERT Keyword Extraction
FinBERT, a pre‑trained financial language model, processes each news article. For every token a significance score is computed as the gradient norm of the model output with respect to the token embedding: s_i = \|\partial p / \partial e_i\|. Tokens are ranked by average significance and the top‑n tokens are selected as keywords; their embeddings are recomputed with FinBERT.
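The gradient‑norm scoring can be sketched with a toy stand‑in for FinBERT (an embedding layer plus a pooled linear head, with hypothetical sizes); the real pipeline would use the FinBERT checkpoint via Hugging Face, but the scoring step is the same:

```python
import torch
import torch.nn as nn

# Toy stand-in for FinBERT: an embedding layer plus a pooled linear head.
# Sizes are illustrative assumptions, not the real model's dimensions.
torch.manual_seed(0)
vocab_size, emb_dim = 100, 16
embedding = nn.Embedding(vocab_size, emb_dim)
head = nn.Linear(emb_dim, 1)               # scalar output p standing in for the model score

token_ids = torch.tensor([[3, 17, 42, 7]])  # one "article" of 4 tokens
emb = embedding(token_ids)                  # (1, seq_len, emb_dim)
emb.retain_grad()                           # keep per-token embedding gradients
p = head(emb.mean(dim=1)).squeeze()         # pooled scalar output
p.backward()

# Significance s_i = ||dp/de_i||, then rank tokens and keep the top-n.
scores = emb.grad.norm(dim=-1).squeeze(0)   # (seq_len,)
top_n = 2
keywords = token_ids[0][scores.topk(top_n).indices]
print(scores.tolist(), keywords.tolist())
```

The selected token IDs would then be fed back through FinBERT to obtain the keyword embeddings used downstream.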
Keyword Encoding Module
Each selected keyword embedding passes through an independent non‑linear projection layer (linear transformation → ReLU → Dropout) to preserve feature separability. A GRU then captures temporal dependencies among the keyword sequence, producing the news feature vector h_{news}.
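A minimal sketch of this module (dimensions are illustrative assumptions, and the projection here is shared across keyword positions rather than one layer per keyword):

```python
import torch
import torch.nn as nn

class KeywordEncoder(nn.Module):
    """Per-keyword projection (linear -> ReLU -> dropout) followed by a GRU."""
    def __init__(self, emb_dim=768, proj_dim=64, hidden_dim=64, p_drop=0.2):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(emb_dim, proj_dim), nn.ReLU(), nn.Dropout(p_drop))
        self.gru = nn.GRU(proj_dim, hidden_dim, batch_first=True)

    def forward(self, kw_emb):        # kw_emb: (batch, n_keywords, emb_dim)
        z = self.proj(kw_emb)         # project each keyword independently
        _, h_last = self.gru(z)       # final hidden state summarises the sequence
        return h_last.squeeze(0)      # h_news: (batch, hidden_dim)

h_news = KeywordEncoder()(torch.randn(2, 17, 768))
print(h_news.shape)  # torch.Size([2, 64])
```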
Technical Indicator Encoding Module
Technical indicators are derived from Yahoo Finance OHLCV data (17 indicators such as SMA, RSI, MACD, Bollinger Bands) over the past T days. A bidirectional LSTM processes the indicator sequence, and average pooling yields the technical feature vector h_{price}.
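A corresponding sketch of the indicator encoder, assuming the 17 indicators from the text and an illustrative hidden size:

```python
import torch
import torch.nn as nn

class IndicatorEncoder(nn.Module):
    """Bidirectional LSTM over a T-day indicator window, then average pooling."""
    def __init__(self, n_indicators=17, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(n_indicators, hidden_dim,
                            batch_first=True, bidirectional=True)

    def forward(self, x):             # x: (batch, T, n_indicators)
        out, _ = self.lstm(x)         # (batch, T, 2 * hidden_dim)
        return out.mean(dim=1)        # h_price: (batch, 2 * hidden_dim)

h_price = IndicatorEncoder()(torch.randn(2, 30, 17))  # T = 30 days, assumed
print(h_price.shape)  # torch.Size([2, 128])
```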
Feature Fusion and Prediction
The news vector h_{news} and technical vector h_{price} are concatenated into h_{combined}, passed through a non‑linear projection layer (linear → ReLU → Dropout), and fed to a regression head that outputs the next‑day closing price.
SHAP Explainability Analysis
Kernel SHAP approximates the model output as a linear combination of input features (keywords + technical indicators): f(z) \approx \phi_0 + \sum_i \phi_i \cdot z_i, where \phi_i denotes the SHAP value indicating the contribution direction and magnitude of feature i. This provides fine‑grained attribution of price predictions to individual keywords and indicator components.
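The additive decomposition can be checked directly in the one case where Shapley values have a closed form: for a linear model f(x) = w·x + b with independent background features, φ_i = w_i (x_i − E[x_i]). Kernel SHAP recovers the same values for arbitrary models; this numpy sketch just verifies the additivity property f(x) = φ_0 + Σ_i φ_i:

```python
import numpy as np

# For a linear model f(x) = w @ x + b, Shapley values are
# phi_i = w_i * (x_i - E[x_i]); Kernel SHAP generalises this to any model.
rng = np.random.default_rng(0)
w, b = np.array([0.5, -1.2, 2.0]), 0.3
background = rng.normal(size=(100, 3))   # reference (background) data set
x = np.array([1.0, 0.5, -0.7])           # instance to explain

phi0 = w @ background.mean(axis=0) + b   # E[f(X)], the base value
phi = w * (x - background.mean(axis=0))  # per-feature contributions

# Additivity: base value plus contributions reproduces the prediction.
assert np.isclose(phi0 + phi.sum(), w @ x + b)
print(phi)
```

In practice one would call a Kernel SHAP implementation on the trained IKNet model with the keyword and indicator inputs flattened into a single feature vector.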
Experiments
Dataset and Settings
The evaluation uses S&P 500 data from 2015‑2024, covering stable, trade‑war and pandemic periods. News articles are collected from Google News, filtered to roughly 2,500 articles, and full texts are extracted via HTML parsing. Technical indicators are computed from Yahoo Finance OHLCV data. Rolling‑window validation (3 years of training plus 1 year of testing, 7 folds) prevents data leakage.
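The rolling-window scheme follows directly from the date range: with 3 training years and 1 test year over 2015-2024, the test years 2018-2024 give exactly 7 folds, and no fold ever trains on data from its test period. A minimal sketch:

```python
from datetime import date

# Rolling-window folds: 3 training years, then the following test year.
def rolling_folds(first_year=2015, last_year=2024, train_years=3):
    for test_year in range(first_year + train_years, last_year + 1):
        train = (date(test_year - train_years, 1, 1),
                 date(test_year - 1, 12, 31))
        test = (date(test_year, 1, 1), date(test_year, 12, 31))
        yield train, test

folds = list(rolling_folds())
print(len(folds))  # 7 folds: test years 2018 through 2024
```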
Baselines
Compared baselines include traditional Ridge regression, sequence models (LSTM, Transformer, TCN), and news‑fusion models (FinBERT‑Attention‑LSTM, FinBERT‑Sentiment‑LSTM) that rely on document‑level sentiment or embeddings.
Results
Keyword Quantity Optimisation
Using 17 keywords yields the best RMSE (61.107) and SMAPE (1.340); other keyword counts perform worse (e.g., at 21 keywords the reported RMSE shifts to 48.906), indicating a trade‑off between information richness and model complexity.
Prediction Performance Comparison
IKNet consistently outperforms baselines across years; for 2024, RMSE = 58.006 and SMAPE = 0.850, a 32.9% RMSE reduction versus Ridge and a 33.8% reduction versus FinBERT‑Sentiment‑LSTM. Visualisation shows IKNet closely tracks actual prices during high‑volatility periods.
Input Ablation Study
Models using only technical data or only keywords perform worse. The full model (technical + keywords) reduces 2024 RMSE from 126.405 (single‑input) to 58.006. Diebold‑Mariano tests (p < 0.05) confirm the statistical significance of the fusion advantage.
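The Diebold‑Mariano test compares two forecast error series through their loss differential; a minimal version (squared‑error loss, no autocorrelation correction, synthetic data rather than the paper's errors) looks like this:

```python
import numpy as np

# Minimal Diebold-Mariano statistic: d_t = e1_t^2 - e2_t^2,
# DM = mean(d) / sqrt(var(d) / n), approximately N(0, 1) under H0.
def dm_statistic(err_a, err_b):
    d = np.asarray(err_a) ** 2 - np.asarray(err_b) ** 2
    return d.mean() / np.sqrt(d.var(ddof=1) / len(d))

rng = np.random.default_rng(1)
e_fusion = rng.normal(0, 1.0, 250)   # smaller errors (full model), synthetic
e_single = rng.normal(0, 1.5, 250)   # larger errors (single-input model)
dm = dm_statistic(e_fusion, e_single)
print(dm)  # strongly negative: the first model's losses are systematically lower
```

A |DM| above roughly 1.96 corresponds to significance at p < 0.05, matching the threshold quoted in the ablation.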
Investment Return Analysis
IKNet achieves the highest cumulative profit and Sharpe ratio in most years (e.g., 2024 profit = 23.18%, Sharpe = 1.806). During the 2022 high‑volatility period (HV = 73.79%), IKNet loses only 5.579% versus a long‑only loss of 13.698%.
SHAP Explainability Validation
Keywords such as “tumbled” and “plunged” obtain higher SHAP values than technical indicators like SMA deviation, confirming that news keywords drive predictions. Negative words (e.g., “layoffs”, “hurt”) substantially depress the predicted price, matching observed market drops (S&P 500 down 3.0%).
