How to Evaluate Time Series Forecasting Models: Baselines, MAPE & MSE
This lesson explains how to set up baseline reference models and use MAPE and MSE as quantitative evaluation metrics, then applies them to quarterly EPS data and simulated random walks to reveal the limits of predictability in non-stationary series.
Learning objectives – Before building complex models, establish a baseline (historical mean, last value, seasonal naive) and quantify performance with MAPE and MSE. These tools replace subjective visual checks.
Baseline models – Historical mean suits series that are stable over the long run; last value suits series with strong inertia; seasonal naive reuses the value from the same quarter of the previous year. The baseline acts as a "passing line" that any ARIMA or deep-learning model must beat to justify its complexity.
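The three baselines can be sketched as one small helper. This is a minimal illustration rather than the article's own code: the function name `baseline_forecasts`, its parameters `horizon` and `season_length`, and the toy series are all assumptions introduced here.

```python
import numpy as np
import pandas as pd


def baseline_forecasts(train: pd.Series, horizon: int, season_length: int) -> pd.DataFrame:
    """Return three naive forecasts for the next `horizon` steps (hypothetical helper)."""
    last_season = train.iloc[-season_length:].to_numpy()
    # Repeat the last observed season until it covers the whole horizon.
    repeats = int(np.ceil(horizon / season_length))
    seasonal = np.tile(last_season, repeats)[:horizon]
    return pd.DataFrame({
        "historical_mean": np.full(horizon, train.mean()),
        "last_value": np.full(horizon, train.iloc[-1]),
        "seasonal_naive": seasonal,
    })


# Toy quarterly series (made-up numbers): two years of history.
toy_train = pd.Series([1.0, 2.0, 3.0, 2.0, 2.0, 3.0, 4.0, 3.0])
toy_forecasts = baseline_forecasts(toy_train, horizon=6, season_length=4)
print(toy_forecasts)
```

Tiling the last season is one simple way to extend the seasonal naive forecast beyond a single year; other conventions are possible.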
Evaluation metrics
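MAPE (Mean Absolute Percentage Error) expresses the average deviation as a percentage of the actual value, so it is scale-independent and comparable across business metrics. Smaller values indicate better accuracy; a common rule of thumb treats MAPE < 10 % as excellent, though the threshold depends on the domain. It becomes unstable when actual values approach zero and is mathematically undefined when an actual value is exactly zero.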
MSE (Mean Squared Error) squares errors before averaging, which makes it sensitive to large deviations and lets it handle zero actuals. Its unit is the square of the original data's unit, so it is harder to interpret directly.
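To make the two definitions concrete, here is a minimal NumPy sketch. The helper names `mape` and `mse` and the sample numbers are illustrative assumptions; the scikit-learn functions used later in the lesson compute the same quantities, with MAPE returned as a fraction rather than a percentage.

```python
import numpy as np


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean(np.abs((actual - forecast) / actual)) * 100


def mse(actual, forecast):
    """Mean squared error, in squared units of the data."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.mean((actual - forecast) ** 2)


actual = np.array([100.0, 200.0, 400.0])
forecast = np.array([110.0, 180.0, 400.0])
print(f"MAPE: {mape(actual, forecast):.2f}%")  # (10% + 10% + 0%) / 3
print(f"MSE:  {mse(actual, forecast):.2f}")    # (100 + 400 + 0) / 3

# Rescaling both series (say, dollars to cents) leaves MAPE unchanged
# but multiplies MSE by the square of the scale factor.
print(mape(actual * 100, forecast * 100), mse(actual * 100, forecast * 100))

# An actual value close to zero dominates MAPE even for a small absolute error.
near_zero_actual = np.array([0.001, 200.0, 400.0])
near_zero_forecast = np.array([0.101, 200.0, 400.0])
print(f"MAPE with a near-zero actual: {mape(near_zero_actual, near_zero_forecast):.0f}%")
```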
Case 1: Quarterly EPS baseline comparison
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, mean_absolute_percentage_error
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf
# EPS data
eps_values = np.array([
    0.71, 0.63, 0.85, 0.44, 0.61, 0.69, 0.92, 0.55,
    0.72, 0.77, 0.92, 0.60, 0.83, 0.80, 1.00, 0.77,
    0.92, 1.00, 1.24, 1.00, 1.16, 1.30, 1.45, 1.25,
    1.26, 1.38, 1.86, 1.56, 1.53, 1.59, 1.83, 1.86,
    1.53, 2.07, 2.34, 2.25, 2.16, 2.43, 2.70, 2.25,
    2.79, 3.42, 3.69, 3.60, 3.60, 4.32, 4.32, 4.05,
    4.86, 5.04, 5.04, 4.41, 5.58, 5.85, 6.57, 5.31,
    6.03, 6.39, 6.93, 5.85, 6.93, 7.74, 7.83, 6.12,
    7.74, 8.91, 8.28, 6.84, 9.54, 10.26, 9.54, 8.73,
    11.88, 12.06, 12.15, 8.91, 14.04, 12.96, 14.85, 9.99,
    16.20, 14.67, 16.02, 11.61,
])
eps_index = pd.period_range("1960Q1", periods=len(eps_values), freq="Q").to_timestamp()
eps = pd.Series(eps_values, index=eps_index, name="EPS")
# Hold out the last four quarters as the test set
train = eps.iloc[:80]
test = eps.iloc[80:84]
forecasts = pd.DataFrame(index=test.index)
forecasts["historical mean"] = train.mean()
forecasts["last value"] = train.iloc[-1]
forecasts["seasonal naive"] = train.iloc[-4:].to_numpy()
forecasts["actual"] = test
# Score each baseline against the held-out quarters
rows = []
for name in ["historical mean", "last value", "seasonal naive"]:
    rows.append({
        "baseline": name,
        "MAPE": mean_absolute_percentage_error(test, forecasts[name]) * 100,
        "MSE": mean_squared_error(test, forecasts[name]),
    })
metrics = pd.DataFrame(rows).set_index("baseline")
print(metrics)
The resulting metrics show that the seasonal naive method achieves the lowest MAPE (≈ 11.56 %) and MSE (≈ 2.90), while the historical mean performs worst (MAPE ≈ 70 %). The historical mean fails because the series trends strongly upward: the training-period mean (about 4.3) sits far below the test values, which range from 11.61 to 16.20. Visual comparison confirms the seasonal pattern in the EPS data.
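The seasonal naive figures can be checked by hand. The sketch below repeats the arithmetic with plain NumPy, using only the last training year and the four test quarters copied from the EPS array above; the variable names are illustrative.

```python
import numpy as np

# Last training year (the seasonal naive forecast) and the four test quarters.
seasonal_naive = np.array([14.04, 12.96, 14.85, 9.99])
actual = np.array([16.20, 14.67, 16.02, 11.61])

errors = actual - seasonal_naive            # 2.16, 1.71, 1.17, 1.62
pct_errors = np.abs(errors) / actual * 100  # per-quarter percentage errors

mape = pct_errors.mean()
mse = np.mean(errors ** 2)
print(np.round(pct_errors, 2))
print(f"MAPE = {mape:.2f}%, MSE = {mse:.2f}")  # MAPE = 11.56%, MSE = 2.90
```

Every quarter is under-forecast, which is what one would expect when last year's values are reused on a series that is still growing.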
Random walk simulation
rng = np.random.default_rng(42)
steps = rng.standard_normal(1000)
steps[0] = 0
random_walk = pd.Series(np.cumsum(steps), name="random_walk")
plt.figure(figsize=(12, 4.8))
plt.plot(random_walk, color="#2563eb", linewidth=1.2)
plt.title("Simulated random walk")
plt.xlabel("Time step")
plt.ylabel("Value")
plt.grid(True, alpha=0.25)
plt.show()
The ACF of the raw series decays slowly, indicating non-stationarity. The Augmented Dickey-Fuller test returns p = 0.5787 > 0.05, so the null hypothesis of a unit root cannot be rejected.
After first‑order differencing:
diff_series = pd.Series(np.diff(random_walk), name="diff")
fig, ax = plt.subplots(figsize=(12, 4))
plot_acf(diff_series, lags=30, ax=ax)
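ax.set_title("ACF after first-order differencing (white-noise pattern)")
plt.grid(True, alpha=0.2)
plt.show()
result_diff = adfuller(diff_series)
print(f"ADF p-value after differencing: {result_diff[1]:.4f}")
The differenced series yields p ≈ 0.0000 < 0.05, so the unit-root null is rejected and the series can be treated as stationary. Its ACF shows the white-noise pattern: the increments of a random walk carry no structure a model could exploit, which explains why random walks are inherently unpredictable.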
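Why differencing produces white noise here can be seen without any statistical test: it simply undoes the cumulative sum. The sketch below, which assumes the same seed and construction as the simulation above, checks that the differences are exactly the original noise steps and compares their lag-1 sample autocorrelation with the rough 95 % band expected for white noise.

```python
import numpy as np

rng = np.random.default_rng(42)
steps = rng.standard_normal(1000)
steps[0] = 0
walk = np.cumsum(steps)

# Differencing undoes the cumulative sum: what is left is the raw noise.
diff = np.diff(walk)
print(np.allclose(diff, steps[1:]))

# Lag-1 sample autocorrelation of the differences versus the rough 95% band
# (+/- 1.96 / sqrt(n)) expected for white noise.
centered = diff - diff.mean()
lag1 = np.sum(centered[1:] * centered[:-1]) / np.sum(centered ** 2)
band = 1.96 / np.sqrt(len(diff))
print(f"lag-1 autocorrelation: {lag1:.4f}, band: +/-{band:.4f}")
```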
Experiment 3: Predicting a random walk with its last value
train_rw = random_walk.iloc[:800]
test_rw = random_walk.iloc[800:]
pred_value = train_rw.iloc[-1]
pred_rw = pd.Series(np.full(len(test_rw), pred_value), index=test_rw.index)
mse_rw = mean_squared_error(test_rw, pred_rw)
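print(f"Random walk test-set MSE: {mse_rw:.4f}")
The test-set MSE is 88.58. For a random walk the last observed value is the best available point forecast, because future steps have zero mean and are independent of the past. The remaining error is irreducible noise that accumulates with the forecast horizon, so even sophisticated ML/DL models struggle to surpass this baseline.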
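A single simulated path gives only one MSE value. As a rough check of the claim, the sketch below repeats the experiment over many independent walks and compares the last-value forecast with the historical mean. The seed, the number of simulations, and the variable names are arbitrary choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
n_sims, n_train, n_test = 500, 800, 200

# Each row is one independent random walk of 1000 steps.
walks = np.cumsum(rng.standard_normal((n_sims, n_train + n_test)), axis=1)
train, test = walks[:, :n_train], walks[:, n_train:]

# Forecast every test point with the last training value or the training mean.
last_value_pred = train[:, -1][:, None]
mean_pred = train.mean(axis=1)[:, None]

mse_last = np.mean((test - last_value_pred) ** 2)
mse_mean = np.mean((test - mean_pred) ** 2)
print(f"last value MSE: {mse_last:.1f}, historical mean MSE: {mse_mean:.1f}")

# The last-value error grows with the horizon h, since h noise steps accumulate.
per_horizon_mse = np.mean((test - last_value_pred) ** 2, axis=0)
print(np.round(per_horizon_mse[[0, 49, 99, 199]], 1))  # horizons 1, 50, 100, 200
```

With unit-variance steps, the expected squared error of the last-value forecast at horizon h is h, so averaging over horizons 1 to 200 gives about 100.5. The single-path figure of 88.58 is in the same range.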
Key takeaways
Baseline models (historical mean, last value, seasonal naive) provide a necessary performance floor for any forecasting project.
MAPE offers an intuitive, scale‑independent error measure; MSE captures large deviations and works with zero values. Using both gives a balanced assessment.
Random walks are non‑stationary; after differencing they become white noise, confirming the theoretical prediction limit. Simple baselines often outperform complex models on such data.
xkx's Tech General Store
